Concerns about privacy have changed over time, and the Census Bureau has changed its approach in response. The earliest (1790) census takers were required to post census lists in the town square for people to check and revise.1 Until 1929, census records could be bought and shared. Until 1976, the Census Bureau’s director had the discretion to grant disclosure exemptions.
The Census Bureau first began making promises of confidentiality to businesses in 1840, as a way of addressing poor response rates, and the 1850 Census was the first in which responses were not posted publicly. Still, promises were not always kept. During World War I, census records were used to support the military draft, despite President Taft’s earlier promises of confidentiality. Protections added in 1940 were overridden by the Second War Powers Act of 1942 (the act expired in 1947). To establish the right to confidentiality more securely, Congress passed Title 13 in 1954 (amended in 1962 and 1990), which both places requirements on the Census Bureau and protects the data from others: that is, “it is against the law to publish any private information that identifies an individual or business such as names, addresses (including GPS coordinates), Social Security Numbers, and telephone numbers.”2
___________________
1 https://www.census.gov/library/visualizations/2019/comm/history-privacy-protection.html
2 https://www.census.gov/history/www/reference/privacy_confidentiality/title_13_us_code.html#:~:text=It%20is%20against%20the%20law,Security%20Numbers%2C%20and%20telephone%20numbers
Over time, the Census Bureau has also changed its statistical procedures for protecting data, moving from the suppression or compression of potentially disclosive data, first used in 1920, to the progressive adoption of techniques such as data swapping, whole-table suppression, rounding, top-coding, and bottom-coding. Most recently, for the 2020 decennial census, the Census Bureau adopted differential privacy (Abowd et al., 2022), a change that has provoked debate.
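Each of these traditional techniques is a simple, mechanical transformation of the data. The following Python sketch is a minimal illustration only: the caps, floors, and noise parameters are arbitrary values chosen for the example, not taken from any Census Bureau procedure, and the Laplace step shows just the basic mechanism underlying many differentially private releases:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def bottom_code(values, floor):
    """Replace values below `floor` with `floor` itself."""
    return np.maximum(values, floor)

def top_code(values, cap):
    """Replace values above `cap` with `cap` (protects extreme,
    highly identifiable values such as very large incomes)."""
    return np.minimum(values, cap)

def round_to(values, base):
    """Coarsen values to the nearest multiple of `base`."""
    return base * np.round(values / base)

def noisy_total(values, sensitivity, epsilon):
    """Release a sum with Laplace noise of scale sensitivity/epsilon,
    the basic mechanism behind many differentially private releases."""
    return values.sum() + rng.laplace(scale=sensitivity / epsilon)

incomes = np.array([12_000, 48_500, 61_250, 250_000, 3_400_000])
protected = round_to(top_code(bottom_code(incomes, 1_000), 300_000), 500)
print(protected)  # coarsened, capped values for a public file
print(noisy_total(top_code(incomes, 300_000), sensitivity=300_000, epsilon=1.0))
```

Note that top-coding first bounds each record’s contribution to the total, which is what makes the Laplace noise scale (sensitivity divided by epsilon) well defined in the last step.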
American society also has changed, with growth in the ways data are collected and used, including the combination of multiple databases to create more complete records. Nissenbaum (2010) notes some of the important changes:
The rapid growth of artificial intelligence has the potential to further raise disclosure risks by expanding the capacity to search across multiple data sources for information that identifies a respondent.
These changes place the Census Bureau’s Survey of Income and Program Participation (SIPP) in an increasingly difficult position. The survey collects data that are highly personal and confidential, including data on income, financial assets and liabilities, and household structure. Such data would be of interest to many, including data aggregators, lenders, advertisers, and identity thieves. Under Title 13, the Census Bureau is legally and ethically bound to protect the confidentiality of SIPP survey respondents. As a practical matter, the promise of confidentiality is also a valuable tool for reassuring survey respondents that they may safely provide accurate information, which affects both survey response rates and data quality.
Yet what is needed to protect confidentiality? The simplest and most traditional approach has been to strip clearly identifying information, such as names and addresses, from the data. Beyond that, the Census Bureau has used tools such as top-coding, bottom-coding, data coarsening, and perturbation to help protect the privacy of the data. Even with these measures, a problem remains, because identification may come not from any single piece of data but from clusters of information taken together: “it is well known that a small set of attributes can single out an individual in a population” (Privacy Preserving Techniques Task Team, 2023, p. 9). Nissenbaum (2010) writes the following (a brief computational illustration appears after the quotation):
Data subjects and third-party harvesters alike are keenly aware of qualitative shifts that can occur when bits of data are combined into collages. This is, surely, one of the most alluring transformations yielded by information sciences and technologies. It is anything but the case that an assemblage of bland bits yields a bland assemblage. The isolated bits may not be particularly revealing, but the assemblages may expose people quite profoundly. (p. 123)
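A toy computation makes the quoted point concrete. In the hedged Python sketch below (invented records, not SIPP data, with columns chosen purely for illustration), the share of records singled out by their combination of attributes rises rapidly as attributes accumulate:

```python
import pandas as pd

# Toy stand-in microdata; the columns are illustrative, not actual SIPP variables.
df = pd.DataFrame({
    "state": ["ID", "ID", "FL", "FL", "FL", "TX"],
    "sex": ["F", "F", "M", "F", "M", "F"],
    "race": ["Black", "White", "Black", "Asian", "White", "Black"],
    "birth_year": [1980, 1980, 1946, 1941, 1968, 1980],
})

# As attributes accumulate, more records are singled out by their combination.
for k in range(1, len(df.columns) + 1):
    cols = list(df.columns[:k])
    combo_counts = df.value_counts(subset=cols)   # frequency of each combination
    share_unique = (combo_counts == 1).sum() / len(df)
    print(cols, f"-> {share_unique:.0%} of records unique")
```

In this toy file, state alone singles out one record in six, state and sex together single out a third, and three attributes suffice to make every record unique; real files with dozens of attributes reach the same point even faster.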
Through re-identification studies of the 2010 Census, the Census Bureau found that one in six individuals in the U.S. population could be re-identified using publicly available data in combination with the Census Bureau’s Protected Identification Key, census block, sex, and age (Bowen, 2022, p. 37):
Our simulated reconstruction-abetted reidentification attack demonstrated that the tabular summaries from the 2010 Census can be converted into a 100% microdata file with geographic precision to the census block-level. Our simulated attack demonstrated that, depending on the quality of the external data used, between 52 and 179 million respondents to the 2010 Census can be correctly re-identified from the reconstructed microdata. (Hawes, 2021)
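The mechanics behind such a reconstruction can be shown at toy scale. The Python sketch below uses invented numbers for a hypothetical three-person block, not real census tabulations: it brute-forces every combination of age and sex consistent with a handful of published statistics. Even four statistics eliminate all but a few dozen of roughly 1.4 million candidate blocks:

```python
from itertools import combinations_with_replacement, product

AGES = range(101)       # ages 0 through 100
SEXES = ("F", "M")

# Hypothetical published statistics for one three-person block
# (invented numbers; real blocks have far more published tables).
published = {"n": 3, "n_female": 2, "mean_age": 30, "median_age": 24}

def consistent(block):
    """Does a candidate block match every published statistic?"""
    ages = sorted(age for age, _ in block)
    n_female = sum(1 for _, sex in block if sex == "F")
    return (
        n_female == published["n_female"]
        and sum(ages) == published["mean_age"] * published["n"]
        and ages[1] == published["median_age"]   # median of three
    )

# Brute-force every unordered combination of three (age, sex) records.
candidates = combinations_with_replacement(product(AGES, SEXES), published["n"])
solutions = [b for b in candidates if consistent(b)]
print(f"{len(solutions)} candidate blocks consistent with the published tables")
```

Real attacks replace brute force with integer programming, but the logic is the same: every published statistic is a constraint on the unknown microdata, and enough constraints can leave a unique solution, which is what the Census Bureau’s simulated attack exploited at national scale.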
SIPP differs from the decennial census in many ways, but one core distinction is that it is based on statistical sampling. In a census, it is readily apparent which combinations of characteristics uniquely identify individuals; in a sample survey such as SIPP, a small set of attributes can uniquely identify an individual among the survey respondents, but it is less clear whether the same combination is unique within the entire population. The use of sampling does not change whether a particular combination of characteristics is unique; it changes whether one can know that the combination uniquely identifies an individual in the population. There are, however, tools for estimating the likely success of an attempted re-identification: one study, using generative models, estimated that “99.98 percent of Americans would be correctly re-identified in any dataset using 15 demographic attributes” (Rocher et al., 2019). Furthermore, if external data are available for the entire population, the potential for determining that an individual is uniquely identified is greatly increased.
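This distinction can be made precise under a simple model. If every member of the population is sampled independently with probability f, then a combination of attributes shared by F people in the population appears exactly once in the sample with probability F x f x (1 - f)^(F-1). The sketch below applies this formula with an illustrative sampling fraction; independent Bernoulli sampling is a deliberate simplification of SIPP’s actual multistage design:

```python
def prob_sample_unique(pop_count, f):
    """Probability that a combination held by `pop_count` people in the
    population shows up exactly once in the sample, assuming each person
    is sampled independently with probability f (a simplification of
    SIPP's actual multistage design)."""
    return pop_count * f * (1 - f) ** (pop_count - 1)

f = 1 / 3000  # illustrative sampling fraction, not SIPP's official rate
for pop_count in (1, 10, 100, 1000):
    p = prob_sample_unique(pop_count, f)
    print(f"population count {pop_count:>4}: P(appears once in sample) = {p:.4%}")
```

Because larger population cells are more likely to contribute exactly one sampled record, a record that is unique in the sample is usually not unique in the population; estimating population uniqueness requires a model of the attribute distribution, which is the role the generative models of Rocher et al. (2019) play.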
Two practical examples illustrate the issues that SIPP presents with respect to protecting confidentiality. In SIPP’s 2020 data, there was only one respondent with the following characteristics: a Black female in Idaho. However, given that 0.9 percent of people in Idaho in 2020 were Black,3 it seems highly unlikely that such a person is unique in the entire population of Idaho. On the other hand, SIPP data also show a household in Florida with a Black male born in 1946 married to, and sharing the household with, an Asian female born in 1941, with a child born in 1968 also in the household; given the multiracial nature of the household and the combination of ages, such a combination might very well be unique in Florida. Moreover, many other characteristics, such as occupation, educational level, and homeowner/renter status, could be added from SIPP to shrink the set of potential matches for a respondent. Thus, even a household that is not uniquely identified by a few characteristics in the sample becomes increasingly likely to be identifiable as additional characteristics are included. This is one way in which SIPP may be more disclosive than the decennial census, despite being based on a statistical sample: it contains a substantial amount of highly detailed information, including changes from one year to the next. The potential ability to precisely identify unique households is good reason to examine disclosure avoidance protections carefully.
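The Idaho example amounts to an expected-count calculation. Combining the 0.9 percent figure cited above with two rough assumptions that are not in the text (Idaho’s 2020 population of about 1.84 million, and an even sex split) gives the following back-of-the-envelope sketch:

```python
idaho_pop = 1_840_000  # Idaho's approximate 2020 census population (assumption)
share_black = 0.009    # 0.9 percent, the figure cited in the text
share_female = 0.5     # rough assumption of an even sex split

expected = idaho_pop * share_black * share_female
print(f"expected number of Black females in Idaho: ~{expected:,.0f}")
# prints roughly 8,300, so the sample-unique SIPP record is
# almost certainly not unique in Idaho's population
```

For the Florida household, the analogous calculation multiplies several much smaller shares jointly across three co-residents (race, sex, and exact birth year for each person), driving the expected count toward or below one and making population uniqueness plausible.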
Though SIPP faces challenges with regard to protecting confidentiality, these challenges can be met. The Commission on Evidence-Based Policymaking (2017, p. 1) wrote the following:
Traditionally, increasing access to confidential data presumed significantly increasing privacy risk. The Commission rejects that idea. The Commission believes there are steps that can be taken to improve data security and privacy protections beyond what exists today, while increasing the production of evidence. Modern technology and statistical methods, combined with transparency and a strong legal framework, create the opportunity to use data for evidence-building in ways that were not possible in the past.
To determine the most appropriate disclosure avoidance procedures for SIPP, the Census Bureau asked the National Academies of Sciences, Engineering, and Medicine to convene a panel of experts in statistics, survey methodology, economics, computer and data science, policy evaluation, and sociology.
The following charge (Box 1-1) was given to the expert panel. Numbering has been added to make it easier to refer to different parts of the statement of task and is not meant to imply either a sequential process or prioritization of the different elements.
___________________
3 https://www.census.gov/library/stories/state-by-state/idaho-population-change-between-census-decade.html#:~:text=Race%20and%20ethnicity%20(White%20alone,%25%2C%20up%20from%2054.9%25)
The panel’s first step was to examine the statement of task to determine what information would be needed. The panel met with the Census Bureau to discuss its view of the task statement. After internal discussions about the statement, the panel decided there were four key questions to answer:
The panel next examined what information it needed to address the statement of task. Collectively, the panel members brought experience and expertise in working with SIPP data and in disclosure avoidance approaches (including traditional techniques, controlled virtual access, synthetic data, and differential privacy) with different panel members having different specialties. In part, the panel engaged in internal, mutual instruction, so that all panel members would share a common level
of understanding; this included briefings and discussions on differential privacy, controlled virtual data access, and balancing privacy and usability.
The panel also required extensive information about SIPP and how it is used. The Census Bureau provided briefings giving the panel an overview of SIPP and describing key topics that would be relevant. These covered how SIPP works, what its major issues are, what decisions about SIPP and data protection had already been made, and SIPP’s content, products, and concerns. They also covered SIPP’s uses of administrative data during production processing; background on SIPP synthetic data and the numbers and types of users; SIPP’s current and desired level of security; new developments within the Census Bureau on disclosure limitation; SIPP small area estimation, key estimates, and data quality; and a recent re-identification study conducted by the Census Bureau relating to SIPP.
The panel conducted multiple literature searches to identify the uses of SIPP, focusing on four search methods: (1) a bibliography provided by the Census Bureau on its website of research based on SIPP; (2) a search of publications within the past five years citing SIPP and focused on the topics addressed and the methods used; (3) a search of articles within the past three years on disclosure avoidance, as well as articles on the impact that promising privacy has on survey response rates; and (4) a Scopus citation search of articles listing SIPP in the title or abstract, looking particularly at the 50 publications with the most citations.
The panel downloaded considerable material about SIPP from the Census Bureau website, including the history and design of SIPP, documentation about SIPP’s public-use file, and SIPP’s 2020 public-use file itself. Selected tabulations were produced from the downloaded data file and then reviewed for accuracy by analysts at the Census Bureau.
Finally, the panel issued a call for information, asking SIPP data users to complete a short online questionnaire about how they used the data and what problems they experienced (see Appendix E). These data from 65 SIPP data users cannot be considered a nationally representative sample, but they provide greater detail about how SIPP’s data were used than would otherwise be available. All statistical findings were confirmed by a second National Academies staff member.
Table 1-1 provides a list of all the briefings provided to the panel by both outside and internal experts.
In some areas the panel relied on its own internal expertise where judgment calls were required or the information, by its nature, could not be clearly documented. For example, this report states that no software for creating synthetic data files is currently ready for handling a file with the size and complexity of SIPP. Such judgments are noted in the text where they occur.
TABLE 1-1 List of Briefings Provided to the Panel
| Date | Topic | Presenter(s) |
|---|---|---|
| 6/6/22 | Introduction to Survey of Income and Program Participation (SIPP) and Expectations of the Panel | Jason Fields—Census Bureau |
| 6/30/22 | The SIPP Synthetic Beta | Rachel Shattuck—Census Bureau |
| 6/30/22 | Model-based Imputation and Administrative Records in SIPP Processing | Benjamin Gurrentz—Census Bureau |
| 6/30/22 | An Introduction to the SIPP Content, Products, and Concerns | Adriana Hernández-Viver, Robert Munk, Yerís H. Mayol-García—Census Bureau |
| 6/30/22 | Small Area Estimates for the SIPP | Benjamin Gurrentz, Sam Szelepka—Census Bureau |
| 6/30/22 | SIPP Key Estimates and Data Quality | Ashley Westra—Census Bureau |
| 6/30/22 | SIPP’s Current Level of Security | Holly Fee—Census Bureau |
| 6/30/22 | Protecting Respondent Confidentiality in the SIPP | Gary Benedetto, Rolando Rodriguez—Census Bureau |
| 9/7/22 | Balancing Data Privacy and Usability in the Federal Statistical System | V. Joseph Hotz—Duke University, Robert A. Moffitt—Johns Hopkins University |
| 9/7/22 | A Penny Synthesized is a Penny Earned? An Analysis of Synthetic Earnings Using Survey Responses and Administrative Records | Jordan Stanley, Evan Totty—Census Bureau |
| 9/7/22 | Synthetic Data and Census Bureau Directions for Privacy Protection | Jerry Reiter—Duke University |
| 10/3/22 | Statistics and Privacy | danah boyd—Microsoft Research and Georgetown University |
| 10/3/22 | SIPP User Experiences | Bradford Chaney—National Academies |
| 11/14/22 | A Modern Container-based Approach for Development of and Access to Confidential Data | Lars Vilhuber—Cornell University |
| 11/14/22 | Differential Privacy | Salil Vadhan—Harvard University |
| 12/12/22 | Restricted Data Access at Inter-university Consortium of Political and Social Research | Amy Pienta—University of Michigan |
| 3/21/23 | SIPP 2014 Panel Re-identification (Re-id) Study Findings and Recommendations | Aref Dajani, Steve Clark, Phyllis Singer—Census Bureau |
The panel divided into three teams to consider what the report’s content should be and draft the report itself. Each panel member served on two teams. The following topics delineated the teams:
Each team reviewed and assessed the report outline to ensure that all important topics were addressed and that the report was well organized. Each team approached its topic by conducting literature reviews to identify the most up-to-date research, considering its nuances, and assessing the published research against analyses of SIPP data. Each team met three times in separate, closed, remote meetings, with each meeting followed by a closed meeting of the full panel to review and summarize progress.
In preparation for the final closed hybrid panel meeting, the teams met a fourth time to plan the drafting of their sections of the report and to prepare presentations for the panel meeting. At the panel meeting, the draft report was reviewed chapter by chapter, with each team presenting the parts it had prepared. Content was discussed and revisions were agreed to. The documents were then revised and updated to produce the first draft of the report.
The panel met again in closed session to finalize the conclusions and recommendations. After further edits to the text, the report was sent out for review by six independent experts with expertise in SIPP, disclosure avoidance approaches, demography, and small area estimation. The panel met one final time in closed session to discuss the comments received from the outside review.
This report is organized in the following manner. Chapter 2 provides an overall summary of SIPP, describing how the survey is conducted, what data are collected, how the data are used, and what disclosure avoidance protections are currently in place. Chapter 3 examines how disclosure risks arise and how disclosure risk is, or should be, measured. Chapters 4 through 8 collectively discuss the disclosure avoidance approaches that are available. Chapter 4 provides a general overview of disclosure limitation approaches, discussing what approaches are available and how they might be combined as a package of different approaches for different situations. Chapter 5 begins a more detailed examination of individual disclosure approaches, looking at secure online data access, while Chapter 6 discusses partially synthetic datasets, Chapter 7 discusses a table generator
and remote analysis platform, and Chapter 8 examines the special challenges presented by geographic variables and how they might be addressed.
Differential privacy is both a metric used to guide disclosure avoidance and a framework for developing tools that limit disclosure with respect to this metric. Thus, it is discussed primarily in Chapters 3 (as a measurement tool) and 4 (in the overview of disclosure avoidance approaches), along with references in other chapters as appropriate, and again in Appendix C. Chapter 9 examines ways to create a balance between promoting usability of SIPP data and preserving confidentiality. It also describes the experiences of SIPP users, based on 65 responses from those users who completed a questionnaire about their experiences. Finally, Chapter 10 provides a summary of the panel’s conclusions and recommendations.
Seven appendices provide supplementary and generally technical information, particularly including formulas where appropriate. Appendix A provides information on measuring disclosure risk, complementing Chapter 3. Appendix B provides information on making inferences based on synthetic data, as referenced in Chapter 6. Appendix C provides technical information on differential privacy in table generators, as referenced in Chapter 7. Appendix D concerns geography variables, as discussed in Chapter 8. Appendix E provides a description of how data were collected from SIPP users who responded to the call for information and provides more detailed statistical results than are contained in the main text. Appendix F provides a list of references for the literature review that is discussed in Chapter 9 and summarized in Figures 9-1 and 9-2. Appendix G provides biographical sketches of the panel members.
This report does not address how future malicious actions might be anticipated and prevented by identifying suspicious behavior that seems directed toward identifying respondents. It is difficult to identify actions that might be taken with regard to public-use files, since the downloading of multiple files by itself is not a suspicious action; also, the Census Bureau does not require people to register to download a public-use file, and thus does not have records of data users. When operating within a user agreement, such as through a Federal Statistical Research Data Center, there are mechanisms in place to prevent the release of confidential information, though those mechanisms could potentially be expanded, such as through the use of artificial intelligence.
The current study is one of several conducted by the National Academies over the past 34 years. Following is a brief description of the earlier studies, along with the conclusions and recommendations from them that are most relevant to the current study.
Five years after the initiation of SIPP, the Census Bureau and the U.S. Office of Management and Budget asked the Committee on National Statistics at the National Academies to perform an independent study of SIPP. In an initial interim report, the committee examined the goals of SIPP, how well the survey was meeting the goals, and the quality and utility of SIPP data products. The committee found that “SIPP is making a vital contribution to understanding the characteristics and dynamics of the population at economic risk, and the ways in which federal programs meet—or fail to meet—economic needs” and that SIPP “provides data not elsewhere available that is integral to policy analysis of income maintenance programs” (National Research Council, 1989, p. ix). As part of the study, the committee conducted interviews with selected federal agencies, finding that six of them made major use of SIPP: Food and Nutrition Service, U.S. Department of Agriculture (USDA); Census Bureau, U.S. Department of Commerce; Assistant Secretary for Planning and Evaluation, U.S. Department of Health and Human Services (DHHS); Social Security Administration, DHHS; Congressional Budget Office; and Congressional Research Service. Eight other agencies made occasional use of SIPP: Economic Research Service, USDA; National Center for Health Services Research, DHHS; Family Support Administration, DHHS; U.S. Department of Education; U.S. Department of Housing and Urban Development; Bureau of Labor Statistics, U.S. Department of Labor; U.S. Office of Management and Budget; and U.S. Commission on Civil Rights. Finally, two agencies were found to have a potential use of SIPP: Bureau of Economic Analysis, U.S. Department of Commerce; and the U.S. Department of the Treasury.
Following are the recommendations from the 1989 report that are most relevant to the current study:
This 1993 study was part of a reassessment and redesign effort conducted by the Census Bureau after roughly nine years of SIPP’s operation (National Research Council, 1993). The panel reviewed the survey’s goals, content, and relationship to other data collections; survey and sample design; data collection and processing; publications and other data products; analytical methods for using the complex longitudinal data; methodological research; and management and oversight of the SIPP program. Much of the report concerned the content and methodology of SIPP, which are outside of the charge to this panel, but a few of the report’s recommendations are particularly relevant:
This study was part of a re-engineering of SIPP started in 2006 by the Census Bureau, with the panel focusing on the linking of administrative records and SIPP data (National Research Council, 2009). Following are some of the most relevant conclusions and recommendations coming from that study.
Following the 2014 redesign of SIPP, this study was designed as an independent evaluation of the new design relative to the old one: comparing key estimates, evaluating the content, evaluating the impact on respondent burden, and considering content changes for future improvement of SIPP (National Academies of Sciences, Engineering, and Medicine, 2018). Some relevant recommendations follow.