Agencies seeking to blend data need to manage potentially heightened risks of disclosure and subsequent harms at various points of the data-blending lifecycle, while ensuring the ultimate data products have sufficient usefulness to justify the risks. Agencies can use a variety of technical and policy approaches, possibly in combination, to achieve disclosure risk/usefulness trade-offs that are acceptable to various stakeholders. Confronted with myriad options, how can an agency arrive at an ideal approach?
Answers to this question are necessarily context specific. However, some decision points are common to many data-blending scenarios. This chapter suggests a framework for disclosure risk/usefulness trade-off decisions, built on findings presented in Chapters 2 and 3. The framework is intended to provide a minimal set of questions to address at specific stages of the data-blending lifecycle, with the goal of facilitating technical and policy approaches to manage disclosure risk/usefulness trade-offs. The framework does not aim to cover all data-blending situations and challenges, nor does it stipulate precise technical or policy approaches for certain data-blending tasks. Building a new data infrastructure will take time, during which new technical and policy approaches will likely be developed by research and practitioner communities; new legal and regulatory landscapes, as well as new societal attitudes toward privacy and confidentiality, are also likely to arise. Any framework would do well to allow for such possibilities.
Technical and policy approaches for nonblended data have evolved over time, as have statistical agencies’ perspectives on managing disclosure
risk/usefulness trade-offs. The Census Bureau’s decision-making process for the disclosure-avoidance system of the 2020 Census tabular data products is a case in point (see Abowd et al., 2022; Hawes, 2023; National Academies of Sciences, Engineering, and Medicine, 2023d). As is well documented, the Census Bureau used data swapping as the primary protection mechanism for 2010 Census tabular data products, which was viewed as the best-available method at the time (State of Alabama, et al., v. United States Department of Commerce, et al., 2021). However, leading up to the 2020 Census, the Census Bureau performed a variety of reconstruction and reidentification attacks that convinced the Census Bureau that data swapping was insufficient to protect 2020 Census data products (Long, 2020; State of Alabama, et al., v. United States Department of Commerce, et al., 2021). As a result, the Census Bureau engineered and applied a noise-addition system that satisfied a variant of differential privacy. The Census Bureau assessed the disclosure risk/usefulness trade-offs by engaging stakeholder communities, ultimately settling on a higher privacy-loss budget than originally proposed, to allow for higher data usefulness (U.S. Census Bureau, 2021a). Other examples of innovative approaches to data sharing include secure multiparty computation (Varia, 2023), synthetic data (Kinney et al., 2011), virtual secure data enclaves (National Opinion Research Center, 2023), and applications of differentially private algorithms (Dauberman & Arnesberger, 2023).1
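To make the mechanics of such noise addition concrete, the sketch below implements the textbook Laplace mechanism in Python. It is not the Census Bureau's actual disclosure-avoidance system (the TopDown Algorithm is far more elaborate), and the counts and budgets shown are hypothetical; the point is only to show how the privacy-loss budget (epsilon) controls the noise scale, and thus one point on the disclosure risk/usefulness trade-off.

```python
import numpy as np

def laplace_mechanism(true_count, epsilon, sensitivity=1.0):
    """Release a count with Laplace noise of scale sensitivity/epsilon.

    For a counting query (sensitivity 1), this satisfies
    epsilon-differential privacy: adding or removing one person changes
    the output distribution by at most a factor of e**epsilon.
    """
    return true_count + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

# A smaller privacy-loss budget buys stronger protection at the cost of accuracy.
for epsilon in (0.1, 1.0, 10.0):
    print(f"epsilon={epsilon}: noisy count = {laplace_mechanism(1000, epsilon):.1f}")
```

Under this kind of mechanism, the Census Bureau's ultimate choice of a higher privacy-loss budget corresponds directly to a smaller noise scale and hence more accurate published counts.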
There are no zero-risk or “one-size-fits-all” privacy- and confidentiality-protection methods for blended data, and trade-offs among disclosure risks and potential usefulness will always exist. Ultimately, the determination of acceptable risks given potential harms and anticipated usefulness is a policy decision. Ideally, this decision is informed by technical methods and experts, as well as input from stakeholder communities.
Conclusion 4-1: Technical and policy approaches in combination are necessary for effective management of disclosure risks.
The proposed decision-making framework is summarized in Box S-1. In this chapter, we expand policy approaches described in Chapter 3 to include mechanisms like data enclaves and licensing, as well as incentives for responsible data use. The framework begins with a simple but critical question for agencies considering a data-blending project: What do we want to accomplish with blended data, and why?
___________________
1 See also https://www.opportunityatlas.org/
To a large degree, the specific tasks in a data-blending effort, as well as the privacy and confidentiality considerations and tools for disclosure protection, depend on the auspice and purpose of the project. By auspice, we mean the organization or individual that is requesting, funding, or approving the project. By purpose, we mean how the information will be used.
The auspice of the project determines the authority the data holder has available to expend toward accomplishing some of the necessary tasks of data blending. For example, if a federal agency contracts for a particular project, that agency can better facilitate data access, whereas a foundation-supported or grant-funded academic researcher might have to rely on relationships that facilitate data access by sometimes-opaque processes. The auspice can also affect the type of data and how those data can be shared, by statute. As examples, Congress can commission projects and provide the authority to complete them, but the Illinois Department of Employment Security can share unemployment insurance data only with public entities. Auspice may come from federal, state, or local government agencies, or private-sector organizations.
The purpose of the project is driven by the research or policy questions intended to be addressed. The purpose also determines the end products, which in turn affect the confidentiality protections that need to be considered. Many blending purposes involve the dissemination of research data, but others are for information that will potentially be used only by a limited group of individuals. Sometimes, a set of blended data may only be needed for a time-limited purpose. If data are never intended to be disseminated, that may influence the plan for confidentiality protection.
The auspice and purpose can have many implications for the data-blending project, from how data are assembled to how they are analyzed or shared. Thus, it is worthwhile for agencies to begin a data-blending task by asking two questions: Under what auspice is the blending conducted? For what purposes will the blended data be used? Auspices and purposes may occur alone or in combination.
Potential uses of blended data clearly influence the lifecycle of the blending process. For instance, if researchers or policy evaluators will analyze blended data for certain causal questions, data need to include as many potential confounders as possible, to facilitate accurate estimation. If uses of blended data require only summary statistics, it may be sufficient to provide only such statistics rather than widespread access to record-level datasets.
Some applications have multiple objectives—for example, creating public-use files and releasing official statistics. In such cases, agencies need to account for privacy and confidentiality considerations for all products together.
The auspice and purpose can help agencies initially assess potential disclosure risks and the nature of harms that could ensue from the data-blending process. For example, a stronger auspice (i.e., greater legal authority) would likely be required for the blending activity if ingredient data are subject to stringent legal requirements for privacy and confidentiality protection (e.g., U.S. Code Titles 13 or 26, the Family Educational Rights and Privacy Act [FERPA; 1974], state and local laws). Such requirements need to be accounted for in plans for data procurement, blending, analysis, and sharing.
___________________
2 Surveillance studies can also cause privacy concerns if used by enforcement agencies without appropriate legal frameworks.
In some cases, initial disclosure risk/usefulness trade-off assessments of potential ingredient datasets can be made from previously released blended data products (Mueller-Smith, 2023). When potential ingredient data files are also available separately from previously released blended data products, agencies know that using these files will demand careful consideration of the composition effects described in Chapter 2.
Lastly, agencies may be able to characterize potential disclosure-related harms to data subjects (Altman et al., 2015). In particular, agencies can identify socially sensitive variables—such as demographic, geographic, psychological, educational, financial, physical, and health variables—that, if disclosed in the blending process, could cause harm to data subjects or data holders. These likely need special attention in technical and policy approaches.
Auspice and purpose can also help agencies make an initial assessment of the potential usefulness of blended data. For example, an agency may characterize blended data products as highly useful if they facilitate investigation of policy questions that could contribute to major improvements in society, such as better estimates of economic activity, better surveillance of emerging diseases, or better responses to disasters (Kearns, 2023; Waters, 2023).
Agencies can benefit from engaging stakeholders in initial high-level assessments of auspice and purpose. For instance, agencies can consult with data holders and representatives of data subjects to characterize potential harms from disclosures, and they can consult with policymakers and data analysts to characterize data usefulness. Such consultations can aim to identify the benefits (or other incentives) for holders of ingredient datasets, the groups that benefit from the data-blending project, the groups that would bear the disclosure risks, and representatives from stakeholder groups for engagement. As an illustration of such consultations, the Office of Refugee Resettlement runs the Annual Survey of Refugees, which collects data during refugees’ first 5 years after arrival to the United States. Data users would like to blend survey data with additional information such as state of resettlement and occupation. However, data subjects and data holders have high data-privacy and data-confidentiality concerns, especially for refugees who are fleeing from sensitive countries. Responsive to these concerns, the Office of Refugee Resettlement does not presently collect this information. However, some data users have suggested a compromise still could be useful—for example, blending for larger states only (Triplett, 2023). This report’s framework for managing disclosure risk/usefulness trade-offs may assist in identifying the best technical and policy approaches for achieving an acceptable compromise.
Consultations with stakeholders can aim toward developing high-level plans based on disclosure risk/usefulness trade-offs. We provide an example in Figure 4-1, which displays a matrix of high-level characterizations of harm and usefulness, and potential actions that an agency may take based
on those characterizations. Ideally, such high-level characterizations are informed by data subjects, data users, privacy experts, and the decision makers within the participating agencies.
Once the auspice and purpose have been determined, the next question is, “What data files are needed to accomplish the blending task?” The ingredients of a data-blending project are determined by the purpose of the blended data product—ideally, ingredient datasets need to include the variables and adequately representative records needed to make useful end products—and also by what is needed to accomplish the blending. For example, when blending involves linking records across data files, ingredient datasets need commonly defined fields that can be used for matching.
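As a small illustration of what "commonly defined fields" can require in practice, the Python sketch below standardizes a name field and combines it with a date of birth to form a match key. The field names and normalization rules are hypothetical; real harmonization specifications are negotiated between the data holders.

```python
import re

def standardize_name(name):
    """Normalize a name field so the same person matches across files."""
    name = name.strip().upper()
    name = re.sub(r"[^A-Z ]", "", name)   # drop punctuation and digits
    return re.sub(r"\s+", " ", name)      # collapse repeated whitespace

def match_key(record):
    """Commonly defined matching fields: standardized name plus date of birth."""
    return (standardize_name(record["name"]), record["dob"])

print(match_key({"name": " O'Neil,  Ada ", "dob": "1990-01-31"}))  # ('ONEIL ADA', '1990-01-31')
```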
Determining the right ingredients needed to accomplish a particular project can be time consuming, especially when established canonical, curated datasets do not exist. Some datasets desired for blending may have been curated for a purpose other than blending. For example, many state datasets reported to federal agencies are set up for point-in-time analysis and not longitudinal analysis, which requires following entities over time.
Ingredient data files can also have privacy and confidentiality concerns that need to be addressed to enable blending, ranging from legal requirements for the blending approach to the method for data sharing. As an example, blending may require linking source files on data subjects’ (direct or indirect) identifiers as well as bringing together sensitive variables in one combined file; these processes can introduce confidentiality concerns. As such, it is prudent for agencies to consider the following questions as part of determining ingredient data files.
Ingredient data files may come from federal agencies, state/local agencies, private-sector companies, or other parties. Once the ingredient files are identified, confidentiality policy requirements and the proposed disclosure-limitation methods to be applied need to be shared with and explained to stakeholders before blending is attempted. Naturally, some data files may be governed by privacy statutes/regulations that determine the technical and policy approaches acceptable and appropriate to accomplish the blending task.
It also is important to indicate which groups have access to ingredient data files (e.g., only one agency or the general public). For example, one agency may have already released some microdata files or analysis results
from a file chosen as ingredient data for blending. Prior release puts an extra burden on confidentiality-protection strategies due to composition effects as described in Chapter 2.
When compiling ingredient files, agencies can take several steps to reduce disclosure risks and enhance data usefulness. Agencies are required to consider data minimization (Commission on Evidence-Based Policymaking, 2017, p. 49; National Academies, 2017c, p. 83). To determine these minimal datasets, as well as the appropriate units of analysis for blended data and metadata needed for analyses, agencies can enlist stakeholders for input, including stakeholders that hope to receive information from the data and those that provide ingredient data. Even with data minimization, it is typically good practice for agencies to maintain private (e.g., encrypted) versions of identifiers or other variables needed for blending, even if those variables are available only within agency firewalls. This can facilitate future data linkages and data sharing. Relatedly, agencies can enhance data usefulness by preserving paradata3 or other information that speaks to the quality of blended data. Without such information, agencies seeking to reuse existing data sources for blending may need to sift anew through thousands of fields to reassess data quality and data equity.
After selecting files to blend, the next question to address is, “How do we obtain access to those files?” The procurement process could have implications for the privacy- and confidentiality-protection strategies applied to blended data. In particular, the procurement process may depend on several factors, including the laws and regulations governing the privacy and confidentiality of data, the desiderata of the participating data holders and subjects for disclosure avoidance, the technology available to enable data sharing, and the nature of the variables contained within the data. These features may lead agencies to choose certain strategies or discourage the use of others. As examples, with Title 13–protected or Title 26–protected data, data access may need to occur within systems maintained by the Census Bureau or Internal Revenue Service (IRS; e.g., a Federal Statistical Research Data Center [FSRDC] for linkable microdata or data dashboards for aggregated statistics; Dauberman & Arnesberger, 2023; Mueller-Smith, 2023), and participating data holders then need to transfer their data to
___________________
3 Paradata are data generated as a by-product of the data-collection process.
these agencies for blending. Some data holders, such as those in the private sector, may require that access to their data occur only inside company systems, due to security concerns.4 In other settings, the agency may only permit access to redacted microdata or aggregated data, as with EEO-1 data from the U.S. Equal Employment Opportunity Commission (2023). In other contexts, participating agencies may require that only a trusted third party have access to ingredient data files, as may occur with a National Secure Data Service demonstration project.5 Finally, parties may be willing to perform secure multiparty computation without directly sharing records from ingredient data files across parties, as can be seen for certain health data (Mirel, 2023) and for retail trade data from private companies (Committee on Economic Statistics, 2022).
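For readers unfamiliar with secure multiparty computation, the sketch below shows one of its simplest building blocks: additive secret sharing, by which several data holders can compute a total without any of them revealing its own value. The parties and values are hypothetical, and production protocols (e.g., those discussed by Varia, 2023) add substantial machinery for malicious behavior, networking, and richer computations.

```python
import secrets

Q = 2**61 - 1  # large prime modulus for additive secret sharing

def share(value, n_parties):
    """Split a private value into n additive shares; any n-1 shares reveal nothing."""
    shares = [secrets.randbelow(Q) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % Q)
    return shares

def secure_sum(private_values):
    """Each party shares its value; shares are summed locally; only the total is revealed."""
    n = len(private_values)
    all_shares = [share(v, n) for v in private_values]
    # Party j aggregates the j-th share from every party.
    partial_sums = [sum(all_shares[i][j] for i in range(n)) % Q for j in range(n)]
    return sum(partial_sums) % Q

# Three data holders compute total retail sales without revealing their own figure.
print(secure_sum([120, 340, 95]))  # 555
```

Each party learns only the final total; no coalition smaller than the full set of parties can reconstruct another party's input from the shares it sees.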
Given these issues, it is prudent for agencies to weigh several aspects of data procurement, including the governing laws and regulations, the requirements of data holders and data subjects, the available data-sharing technology, and the nature of the variables in the data. These considerations may apply alone or in combination.
___________________
4 Pat Bajari, Amazon Chief Economist, comment during Committee on Economic Statistics Meeting at the 2022 American Economic Association and Allied Social Science Association meetings. The program can be found at https://www.aeaweb.org/about-aea/committees/economic-statistics/annual-meeting/2022
5 See https://ncses.nsf.gov/about/national-secure-data-service-demo#card1850
6 See researchdatagov.org for a catalogue of confidential datasets available from federal statistical agencies and https://manager.researchdatagov.org/RDG_User_Guide.pdf for the SAP application and users’ guide.
If usefulness is not sufficiently high compared to potential risks or if the infrastructure is not up to task, agencies may choose to reconsider whether blending remains worthwhile. As an example, suppose the purpose of a blended data project is to inform policy regarding specific groups. If those groups are not likely to be adequately represented in blended data, then the blended data project could misinform—and even harm—those groups (Bowen & Snoke, 2023). In that case, risks to privacy and confidentiality from blending may simply not be worth the benefit.
The next step is to construct the blended data product. Naturally, this step depends significantly on how ingredient files can be accessed and, for cases in which record linkage is required, what information is available to facilitate linkages. Choices for blending methods are numerous and include techniques for record linkage and methods for secure distributed
___________________
7 COBIT (Arezki & Elhissi, 2018) and ITIL (ITSM Docs, 2021) are examples of industry standards for assessing the soundness of information technology (IT) business functions within organizations; it may be possible to leverage aspects of these standards to assess IT infrastructure adequacy for blended data projects. Specifically, “ITIL is a framework that enables IT services to be managed across a lifecycle or service value chain…COBIT supports enterprise IT governance to derive the business’s maximum value through IT investments while optimizing resources and mitigating risks” (ITSM Docs, 2021).
computations. Regardless of setting, data quality (e.g., data editing, record deduplication) and documentation of blending methods are important. Although challenging, agencies are encouraged to evaluate the quality of record linkages as thoroughly as possible (Herzog et al., 2007). For example, linkage errors (e.g., matching records across databases that do not belong to the same individual), harmonization errors (e.g., data elements in ingredient data files that are measured differently), and modeling errors (e.g., an imputation or linkage model fits poorly for some subpopulations) can each contribute to inaccuracies in linked data products and in the subsequent analysis (National Academies, 2023a, pp. 190–191). Measuring and limiting the impact of such errors on the resulting analysis requires enhanced evaluation practices and can benefit from continued research in record-linkage techniques (National Academies, 2023a, p. 191). At a minimum, transparency of both linkage processes and data-quality evaluations is necessary. Such transparency can support agency communication with stakeholders about the quality of blended data, noting what is achievable given considerations of privacy and confidentiality. In this way, stakeholders can understand why certain decisions were made regarding blending methods.
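As a minimal sketch of what such an evaluation can look like, the Python fragment below compares automated linkage decisions against a clerically reviewed gold-standard sample, a common (though labor-intensive) evaluation design; the record identifiers are hypothetical.

```python
def linkage_error_rates(linked_pairs, true_pairs):
    """Estimate false-match and missed-match rates against a hand-reviewed gold standard."""
    linked, true = set(linked_pairs), set(true_pairs)
    false_matches = linked - true    # pairs linked that do not belong together
    missed_matches = true - linked   # true pairs the linkage failed to find
    return {
        "false_match_rate": len(false_matches) / max(len(linked), 1),
        "missed_match_rate": len(missed_matches) / max(len(true), 1),
    }

# Record pairs are (id_in_file_A, id_in_file_B); the gold standard comes from clerical review.
print(linkage_error_rates({(1, "a"), (2, "b"), (3, "x")},
                          {(1, "a"), (2, "b"), (3, "c")}))
```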
Given these considerations, it is prudent for agencies to select a data-blending strategy that accords with the mode of access to, and the permitted information in, the ingredient files.
If resultant blended data are not useful for the blending objective, an agency might return to the considerations of Step 3 to devise alternative ways to access data that may enhance data usefulness. For example, suppose that after probabilistic record linkage on demographic variables, an agency determines that direct identifiers are essential for accurate and comprehensive record linkage—either for a large portion of the data or for subgroups of interest. Participating parties may need to revisit the confidentiality-protection requirements from Step 3 and establish formal agreements that permit sharing of these identifiers. As another example, suppose the infrastructure for performing a secure computation is found to be inadequate. Agencies may seek policy approaches that allow for increased data sharing. This illustrates the dynamic nature of the decision-making process.
If data are not sufficiently accurate or comprehensive, or if blending cannot be done in a sufficiently timely manner, agencies can reconsider whether the blending project is worthwhile.
After blending data, agencies need to consider how to manage the disclosure risk/usefulness trade-offs for the intended blended data products. In doing so, agencies can define how they will measure disclosure risks, how they will reduce those risks to acceptable levels, and how they will define data usefulness. As described in previous chapters, agencies can engage stakeholders to help determine disclosure-limitation practices that achieve satisfactory trade-offs. Stakeholder engagement can also help engender trust and confidence in agencies’ decisions, especially when agencies clearly communicate the rationales for privacy and confidentiality decisions (Altman et al., 2015)—for example, by referring to a tool akin to Figure 4-1. Although decisions regarding disclosure-protection strategies are necessarily tuned to the specifics of the data-blending context (i.e., the auspice and purpose, the requirements associated with ingredient datasets, and privacy considerations), the panel believes it can be helpful for agencies to consider the issues discussed below. Of course, some of these considerations may be out of an agency’s control (e.g., the legal framework for data sharing), but agencies may still be able to establish technical and policy approaches within those constraints.
Previous chapters in this report detail various technical and policy approaches and when they might be gainfully applied for blending data. In some contexts, agencies might use multiple approaches, such as in tiered access approaches. We do not repeat the summaries of technical and policy approaches here; instead, we highlight several key steps in identifying these best practices.
As a default goal, agencies can seek to use disclosure metrics and protection procedures that account for risks from multiple data releases (at present, formally private methods are the primary tools that do so), especially for data products with potential risks of serious disclosure harms. For data products for which risks of harm are deemed low, agencies may be willing to accept greater disclosure risks—for example, by adding less noise to the data (e.g., a larger privacy-loss budget) or by discounting risks from data releases beyond a specific set. However, agencies need to be aware that such discounting opens the possibility for future disclosure attacks, such as from the release of additional datasets that did not exist at the time blended data were published. As discussed previously and in detail below, it can be beneficial to obtain stakeholder input regarding what constitutes harm and how tolerable risks of harm may be. Additionally, agencies can share and coordinate plans for releasing data products from ingredient data files to better manage risks from compositions. Ideally, such coordination could include describing data-release plans within data-sharing agreements among agencies; establishing disclosure-avoidance strategies to minimize detrimental effects of composition; and the routine, transparent, and accessible reporting of blended data products and their properties to a wide range of stakeholders.
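One simple way to operationalize such coordination is an explicit ledger of privacy loss across releases. The sketch below assumes pure epsilon-differential privacy with basic sequential composition, under which the epsilons of successive releases on the same data add; the product names and budgets are hypothetical, and deployed systems often use tighter composition accounting.

```python
class PrivacyLedger:
    """Track cumulative privacy loss across releases via basic sequential
    composition: epsilons of pure-DP releases on the same data add up."""

    def __init__(self, total_budget):
        self.total_budget = total_budget
        self.spent = 0.0

    def authorize(self, epsilon, product):
        if self.spent + epsilon > self.total_budget:
            print(f"DENIED  {product}: would exceed budget "
                  f"({self.spent + epsilon:.2f} > {self.total_budget})")
            return False
        self.spent += epsilon
        print(f"RELEASE {product}: spent {self.spent:.2f} of {self.total_budget}")
        return True

ledger = PrivacyLedger(total_budget=2.0)
ledger.authorize(0.5, "county tables")
ledger.authorize(1.0, "tract tables")
ledger.authorize(1.0, "public microdata")  # denied: only 0.5 of the budget remains
```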
As discussed in Chapters 1 and 2, data usefulness is multidimensional. Agencies can select the dimensions of highest importance (e.g., accuracy of specific estimands that would be of particular policy interest, equity of data quality across groups, availability of and access to blended data) according to the purpose, engaging with stakeholders to make those determinations.
The purpose of the blended data may demand certain levels of accuracy in end products to realize intended benefits. In turn, this may lead
agencies to rule out certain methods (e.g., suppression is inadvisable when it blanks large fractions of cells needed for accurate statistical analysis) and consider others (e.g., noise addition may be particularly appropriate for tabular data with mostly large counts). Disclosure-limitation technology is changing rapidly, with computer scientists, statisticians, social scientists, and many other researchers and practitioners focusing increasing attention on the field. The panel expects that over the time it takes to establish a new data infrastructure, new methods will be developed that enable more precise quantifications of disclosure risk/usefulness trade-offs for blended data and more favorable disclosure risk/usefulness trade-offs. Agencies can benefit by staying abreast of the latest advances; indeed, the Commission on Evidence-Based Policymaking urges use of state-of-the-art disclosure-limitation methods (Commission on Evidence-Based Policymaking, 2017, p. 69). In the interim, employing approaches described in Chapters 2 and 3, as well as being open to and investing in innovation, can help agencies choose the best scientific methods currently available.
Although not detailed in this report, agencies can implement policies that discourage intentional and unintentional bad behavior, as suggested in previous reports of the National Academies of Sciences, Engineering, and Medicine’s Committee on National Statistics (National Research Council, 2005), among others. Several of these policies are specified in the Standard Application Process (SAP), which implements a portion of Title III of the Foundations for Evidence-Based Policymaking Act of 2018. These policies include identification (e.g., verification of subject matter expertise, intended purpose, secure environment, and feasibility of the proposal); trainings (e.g., data asset characteristics, confidential-access and data-management protocols); use agreements (e.g., documentation of what constitutes prohibited use); auditing of use (Office of Management and Budget, 2022, p. 13); and penalties to the investigator, and in some cases the institution, for misuse (e.g., rescinding licensure, barring future licenses, fines, and felony charges). Depending upon the sensitivity of requested data, applicants may be required to submit to background investigations (Office of Management and Budget, 2022, p. 18).
Implementation of best practices can require resources—people, systems, and policy agreements—each of which can be substantial. Agencies can facilitate assessments of these resources by sharing information on how they accomplished data-blending tasks—for example, using a coordinating mechanism like the Confidentiality and Data Access Committee. To marshal sufficient resources, agencies may need to consider working with outside experts, such as contractors. If this is not feasible, agencies may have to revise the goal of blending, accept greater risks or less usefulness, or lean more heavily on policy approaches to meet blended data objectives.
Agencies that seek to engage stakeholders face several tasks. First, they need to identify the groups to be engaged and their representatives. Relevant stakeholder groups can include those that are part of the blended data as well as downstream users of both ingredient data and blended data. Identifying individuals to represent the various potential stakeholders can be difficult, as stakeholders are often dissimilar. Having more than one representative from a particular group can help provide a broader range of perspectives (Bowen & Snoke, 2023).
Second, once agencies establish relevant groups of stakeholders, they need to consult with these groups while matching disclosure protections with data-blending objectives. Directly contacting stakeholders can be resource and time intensive; it also creates challenges around informed consent (see Chapter 5). It is likely that stakeholders will have differing preferences regarding disclosure risk/usefulness trade-offs. However, policymakers make the ultimate decision, as they shoulder the responsibility for any privacy and confidentiality violations when blended data–derived information is released.
Third, agencies need to communicate the strengths and limitations of various candidate blending and confidentiality-protection methods, especially technical approaches, to stakeholder groups. In doing so, it is beneficial to communicate according to stakeholders’ levels of expertise and familiarity. The same is true for communication with policymakers. To paraphrase from Bowen et al. (2022), “[A]lthough policymakers are the most equipped to understand the consequences of [disclosure risks], they are likely the least equipped to understand what [privacy and usefulness measures] mean” (p. 15). Proper communication and guidance can empower policymakers to understand these trade-offs; for example, agencies can use disclosure risk/usefulness trade-off curves for various parts and purposes of the blended data (Bowen & Snoke, 2023). Effective communication of such concepts is challenging; it may require multiple engagements and strategies. It is the panel’s opinion that continued research on communication practices is important for advancing a blended data infrastructure.
There is no zero-risk data-release strategy, and no data system with meaningful usefulness offers iron-clad guarantees of privacy and confidentiality protection. Agencies need to develop both mitigation strategies (or policy backups) to investigate and manage confidentiality breaches as well as communication strategies to address those breaches. This is common practice in data security and, in the panel’s opinion, should be common practice for a blended data infrastructure as well.
Before a blended data product is disseminated, it is prudent for agencies to develop and execute a maintenance plan. Key questions include whether and when to update blended data as ingredient files change, and who may access the product and for how long.
Changes may be made to ingredient data files, such as error corrections or collection of additional information. Agencies can assess whether such changes affect the quality of downstream analyses sufficiently to justify reblending the data, which could entail additional disclosure risks. The nature of the end products can also affect this decision. For example, when the end product is a research dataset made available to approved users in a data enclave, the agency may deem the additional disclosure risks for an updated file as acceptable, since data users are already expected to uphold confidentiality. When the product is a public-use dataset or a suite of summary statistics, additional release could introduce disclosure risks that need to be weighed against usefulness gains.
Decisions regarding continuing access to a data product are particularly salient when participating agencies intend to release data products derived from ingredient files. Such derived data products may unintentionally create disclosures from compositions. Plans and criteria for who can access blended data, when access terminates, and when data need updating can be developed in advance and revisited periodically. Engaging users of ingredient data and blended data in developing and maintaining these plans improves the transparency and accountability of the decision-making process.
Agencies have both legal (Foundations for Evidence-Based Policymaking Act of 2018, 2019) and ethical responsibilities (American Statistical Association, 2022) to communicate certain information about a blended data product, ranging from explicitly noting the limitations of blended data (e.g., limitations resulting from the collection processes and blending procedures) to providing enough detail to replicate the results of the blending process. Such transparency is also essential to allow user communities to enhance the quality of the data product
and its derivatives by providing feedback for improvements. Transparency also empowers individuals to make decisions about sharing their data.
The panel’s model framework is summarized in Box 4-1.
The remainder of this chapter illustrates how agencies might apply the model framework (or a similar framework) for decision making, using some genuine data-blending scenarios. Of course, panel members are not privy to all details of the blending. Undoubtedly, these illustrative examples will miss some features and incompletely or inaccurately describe others. Our intention is to demonstrate how a set of guidelines can help agencies make decisions about disclosure-protection strategies.
We apply the framework in three case studies that each blend education data with other data, such as tax information. Two of these case studies were featured in the panel’s workshop event, as noted; the third was selected from the panel’s prior experience to complement the first two. We chose the same subject matter—education—across the three case studies to highlight the specific kinds of challenges that arise when managing risk for various data-blending scenarios: federal and federal data, federal and state data, and state and state data. Together they demonstrate a range of auspices and purposes, privacy and confidentiality considerations, and potential approaches to manage disclosure risk/usefulness trade-offs. Although the protocols for blending have already been established in these three scenarios, we use these projects to illustrate how the framework might be used.
At the panel’s workshop event, Dauberman and Arnesberger (2023) described the approach to blending aggregated data from the Statistics of Income (SOI) Program of IRS with College Scorecard data (U.S. Department of Education, 2023) from the Department of Education (ED). Thus, this case study represents a federal-to-federal education data-blending scenario.
1. Determine auspice and purpose: The College Scorecard is intended to help future college students and their families search for and compare colleges by field of study, costs, admissions, economic outcomes, and other statistics. Since 2016, ED has provided the College Scorecard as a web-based search tool (i.e., data product) that users can query repeatedly; public-use microdata files are not required. The unit of analysis is the college, and the data are summaries of the features of the college and its students. ED
desired to add earnings metrics to the College Scorecard. Such blended data could offer the public additional insights into student outcomes at various educational institutions; however, tax records contain very sensitive information and are protected by Title 26, requiring rigorous privacy guarantees. Using the decision guide in Figure 4-1, we might characterize the College Scorecard containing earnings information as being at the intersection of “Modest Impact” usefulness and “Significant and Lasting” disclosure risks.
2. Identify ingredient data files: To create the blended data, ED requires earnings data on college graduates. ED therefore turned to SOI for such data. Education data are subject to FERPA, whereas SOI data are subject to Title 26 rules. This means certain fields may be released as is (e.g., institution), but others require confidentiality protection (e.g., earnings data at the field-of-study or institution level). There is no federal source of average earnings per college, requiring the agencies to blend data at the individual student level. Therefore, the ingredient data file from ED includes recipients of federal student aid,8 and the ingredient data file from IRS includes individuals’ earnings. These files need to be linked to accomplish the blending.
3. Obtain ingredient data files: Because SOI data derive from tax records, individual students’ earnings cannot be shared outside the agency without special agreements. Thus, SOI needs to blend its data with the data supplied by ED. However, Internal Revenue Code (IRC) § 6108(b) permits SOI to release aggregated-level statistics to ED. The IRC authorizes the Secretary of the Treasury to make special statistical studies and compilations involving tax return information as defined in IRC § 6103(b)(2).
4. Blend ingredient data files: ED provides student-level records—that is, recipients of federal student aid—to SOI to conduct the matching at IRS. The data-sharing arrangement allows SOI to match student information to IRS administrative tax records, which are W-2 (i.e., wage and tax statement) and 1040-SE (i.e., net earnings from self-employment) data. Matching is done on social security numbers.
5. Select disclosure-protection approaches: For the first 3 years of the product, the College Scorecard used classical disclosure approaches (e.g., data suppression, rounding, aggregation, and top-coding) to protect blended data. During this time, ED and SOI continued to reassess the disclosure risk/usefulness trade-offs of their dissemination strategy. In 2020, ED and SOI
___________________
8 It is important to note that not all students who are in need receive federal student aid, so there may be some bias in generating the College Scorecard.
wanted to produce a second data file with greater granularity than available at the institution level—one that contained information at the credential and field-of-study level. But SOI became concerned about potential complementary disclosures in creating two datasets from the same ingredient data, as their disclosure approaches did not account for multiple data releases. SOI also determined that using the classical disclosure approaches to release the second, more granular data file would suppress and alter substantial amounts of blended data, thereby degrading the potential usefulness of the data product.9
In an update of disclosure-protection processes, SOI decided to use SafeTables, a software package developed by Tumult Labs10 that produces statistics from a formally private, noise-addition algorithm. SOI saw three benefits of using a formally private framework. First was the potential ability to quantify privacy risks when releasing datasets. Second was establishing a privacy-loss budget and a mechanism to examine specific points on the disclosure risk/usefulness trade-off. SOI’s final reason was composition: SOI could potentially track total privacy loss across multiple releases of outputs.11
Although formal privacy methods are used, the approach of ED and SOI to releasing College Scorecard data is a hybrid between formal privacy and classical disclosure approaches. To enhance the usefulness of published data, ED and SOI suppress outputs from SafeTables that differ excessively (per their definition) from the confidential data values. The end product is therefore not technically formally private; the decision to suppress is based on confidential data values, which violates the formal privacy guarantee. SOI accepted the implied higher disclosure risks, suppressing the noisiest results, because the program deemed it important that students have accurate statistics upon which to make potentially life-changing decisions.
To help determine the privacy-loss budget prior to suppressing values, SOI and ED used an evaluation tool called CSExplorer. Also developed by Tumult Labs, this tool allows the client (i.e., SOI and ED) to review and evaluate the disclosure risk/usefulness trade-offs of the output statistics for various privacy-loss budgets. SOI first used the tool to examine the effects of several privacy-loss budgets on various usefulness metrics. Once SOI identified a set of appropriate privacy-loss budgets, ED had access to a limited version of CSExplorer that was in the range of SOI’s selected privacy-loss budgets. ED then could explore the statistical outputs based on those privacy-loss budgets to examine and decide upon which suppression
___________________
9 Kelly Dauberman, personal communication, October 12, 2023.
11 Additionally, the panel notes that the use of SafeTables allowed suppression of fewer table cells than would have been required had simple cell suppression been used.
thresholds to use and other specifications for the final data product release. This tool allowed ED and SOI to have thoughtful, in-depth discussions about trade-offs between privacy and usefulness.
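CSExplorer’s internals are not public, so the sketch below is only a generic illustration of the kind of exploration such a tool supports: sweeping candidate privacy-loss budgets and reporting a usefulness metric at each, so that decision makers can compare points along the trade-off curve. The counts are simulated, and the simple Laplace mechanism stands in for whatever algorithm SafeTables actually uses.

```python
import numpy as np

rng = np.random.default_rng(2024)
true_counts = rng.integers(10, 500, size=200)  # simulated table cells, not real SOI data

def mean_relative_error(epsilon, n_reps=100):
    """Average |noise|/count for Laplace-noised counts at a given privacy-loss budget."""
    noise = rng.laplace(scale=1.0 / epsilon, size=(n_reps, true_counts.size))
    return float((np.abs(noise) / true_counts).mean())

# One usefulness reading per candidate budget: lower error means higher usefulness.
for epsilon in (0.05, 0.25, 1.0, 4.0):
    print(f"epsilon={epsilon:<5} mean relative error={mean_relative_error(epsilon):.3f}")
```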
6. Develop and execute a maintenance plan: The College Scorecard does not publicize its disclosure-protection processes on its website, and it keeps values of privacy parameters internal.12 Providing such information along with source code for the disclosure-protection algorithms would improve transparency. The panel is also unaware of mitigation strategies for potential breaches.
We conclude this case study by mentioning an application with goals similar to those of the College Scorecard, namely the Census Bureau’s Post-Secondary Employment Outcomes (PSEO) dataset. This data product provides tables of earnings and employment outcomes of graduates from postsecondary institutions (Foote et al., 2019), although it does not involve IRS data. The PSEO also applies a formal privacy method to publish the tables.
At the panel’s workshop event, Penner (2023) described the approach to blending student-level data from educational agencies with earnings data from IRS, which are housed at the Census Bureau for use on approved projects. This case study represents a state-to-federal education data-blending scenario within the Census Bureau.
1. Determine auspice and purpose: Data from educational agencies can provide information about students’ and teachers’ experiences within schools, but these data generally do not include information about the experiences of these individuals outside of schools. To help policymakers and education researchers learn about outside-of-school experiences, university-based researchers partnered with school district education agencies in California, state education agencies in Oregon, and the Census Bureau, with the intent to (a) understand how students fare as they transition to early adulthood (e.g., career outcomes); (b) better measure students’ family backgrounds and their implications for schools; and (c) trace educators’ career trajectories, both before and after working in public schools. The partnership creates research datasets with individual students and teachers as units of analysis. This requires blending individual students’ or teachers’ data with other administrative records (e.g., earnings data from IRS). Thus, blended data are subject to FERPA, Title 13, and Title 26 privacy regulations. Using the decision guide in Figure 4-1, we might characterize the blended data as
___________________
12 Kelly Dauberman, personal communication, January 3, 2024.
being at the intersection of “General Knowledge” usefulness and “Significant and Lasting” disclosure risks.
2. Identify ingredient data files: The source files from the education agencies include “student-level administrative records from all eighth graders enrolled in one midsize California public school district for the 2008–2009 through 2013–2014 school years as well as similar records from the Oregon Department of Education, covering all eighth graders enrolled in Oregon public schools during the 2004–2005 to 2013–2014 period” (Domina et al., 2018, p. 541). Earnings data are obtainable from IRS tax records. “Most notably, IRS records include income information for students’ households as reported on 1040 [tax] forms filed during each year of students’ elementary school and early high school careers” (Domina et al., 2018, p. 542). These files need to be linked to accomplish blending.
Some students are missing IRS household income data. In particular, students “[…] missing IRS household income data are disproportionately low performing, receive free and reduced-price lunch at a higher rate than non-excluded students, and include a high proportion of racial and ethnic minorities” (Domina et al., 2018, p. 542). This creates a potential data-equity concern, in that the quality of blended data likely is not as high for these groups.
3. Obtain ingredient data files: Educational data are owned by the education agencies, while IRS data are housed at the Census Bureau and cannot be shared outside, by statute. Thus, record linkage required for data blending needs to be done by the Census Bureau. To enable linking, students’ and teachers’ personally identifiable information (PII) needs to be transmitted to the Census Bureau. Typically, such information can be released from education records for research purposes only with written consent. However, there are exceptions to this, and one of them is for “studies” (Student Privacy Policy Office, 2021). Under this exception, for this study, researchers need to show that “the disclosure of [PII] from student education records must be for, or on behalf of, an educational agency or institution, in order to a. Develop, validate, or administer predictive tests; b. Administer student aid programs; or c. Improve instruction” (Privacy Technical Assistance Center, 2014, p. 1). Data blending needs to be deemed as meeting this and other criteria (e.g., a written agreement, no disclosure of PII, and a data destruction plan; Penner, 2023).
4. Blend ingredient data files: Data management and blending for this project were done by the Census Bureau, using their Data Linkage Infrastructure (U.S. Census Bureau, 2022a). The Census Bureau transforms PII in all ingredient data files into a Protected Identification Key (PIK) using
the Person Identification Validation System (PVS), which matches ingredient files to reference files created with data from the Social Security Administration (SSA) Numerical Identification (NUMIDENT) file and matches SSA data with addresses obtained from federal files (U.S. Census Bureau, 2022a). After assigning a PIK, the Census Bureau removes PII from files provided to researchers.
5. Select disclosure-protection approaches: The output products of the blending are research data files. Because of the sensitivity of the data (even though PII is not included), data are made available to approved users only via the Census Bureau’s secure computing environment (i.e., FSRDCs). Results of analyses completed inside the FSRDC undergo disclosure-limitation treatments before release, per Census Bureau Disclosure Review Board requirements that ensure the released statistics and other data products are protected by the relevant statutes, such as Title 13 and Title 26 (U.S. Census Bureau, 2021b). Researchers requesting output from the FSRDC need to submit statistics related to disclosure risks, definitions of variables and samples, and descriptions of relationships across output requests, such as the proportion of students in free/reduced lunch at certain poverty rates and the percentage of teachers who supplement their income (Penner, 2023). In this case, outputs were protected using a combination of disclosure-limitation approaches including noise infusion and, subsequently, other disclosure methods, such as applying threshold or count rules, rounding, collapsing, and suppressing cells (depending on the preferences of educational agency partners).13 Given the legal requirements and potential disclosure risks (e.g., education data are available to some individuals outside FSRDCs) and that the data-user community consists primarily of researchers and policymakers, an unrestricted-access, public-use data product, such as synthetic data, was not deemed necessary to meet the data-blending purpose.
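For a concrete sense of how such post-tabulation treatments operate, the sketch below applies a count-threshold rule followed by rounding to a toy table. The threshold, rounding base, and cell labels are illustrative only; the actual rules applied by the Census Bureau Disclosure Review Board and educational agency partners differ and are more extensive.

```python
import pandas as pd

def protect_table(counts, threshold=10, base=5):
    """Apply a count-threshold rule, then round to a base—two classical
    disclosure-limitation treatments for tabular output."""
    suppressed = counts.where(counts >= threshold)   # cells below threshold become NaN
    return (suppressed / base).round() * base        # round surviving cells to nearest base

cells = pd.Series({"district A": 212, "district B": 7, "district C": 48})
print(protect_table(cells))
# district A -> 210.0; district B -> suppressed (NaN); district C -> 50.0
```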
6. Develop and execute a maintenance plan: The panel did not identify communication strategies or mitigation procedures for these linked data files. Use of the data product is limited to those purposes and researchers approved by both the educational agencies and the Census Bureau. This is deemed important to maintain the trust of the educational agencies regarding both data security and the FERPA requirement stating that the analysis needs to improve educational instruction. Therefore, members of the originating research team are the only individuals who can access the data. The panel notes that this could create difficulties for data maintenance and research reproducibility.
___________________
13 Andrew Penner, personal communication, October 14, 2023.
As in the two previous case studies, this final illustration involves blending education and earnings data. However, it does not involve federal agencies or their data; rather, it blends state-level data sources and relies on a nonprofit research organization to manage disclosure risks and data access. It is fair to say that the states believe they receive significant utility from this activity, and that they are aware of the College Scorecard and PSEO. However, the disclosure risk management strategy differs from that applied by the Census Bureau and IRS in those products. The blending project is referred to as the Multi-State Postsecondary Report.
1. Determine auspice and purpose: The purpose of the Multi-State Postsecondary Report is to provide data and information about the employment outcomes of postsecondary experiences. For example, states are interested in how many students are employed within state after graduation, go to another participating state for employment, or have some other outcome. States seek outcomes by institution; time after completion; and credentials that are specific to local, cross-state issues. States deem such outcomes unavailable in other sources, like the College Scorecard.14 Thus, participating states seek to blend educational data with administrative data from state unemployment and wage records; all data are at the state level.
Auspices for this project are the state agencies providing data and the nonprofit research organization. To be precise, the Multi-State Postsecondary Report “is produced by the Kentucky Center for Statistics, utilizing data from the Administrative Data Research Facility (ADRF) in partnership with the Coleridge Initiative, the Center for Human Resources Research at the Ohio State University […] Ohio’s state workforce and education agencies, the Indiana Commission for Higher Education, and the Indiana Department of Workforce Development. [In addition,] Tennessee approved access to state employment and wage records critical to the creation of the report” (Kentucky Center for Statistics, 2023b, p. 1). The agencies are responsible for determining acceptable levels of risk and the nature of the end products.
Given the open-ended nature of the research questions, one could argue that the best way to realize the benefits of blending is to provide researchers, both inside the agencies and externally, access to record-level, blended data. Using the decision guide in Figure 4-1, we could characterize blended data as being at the intersection of “Modest Impact” usefulness and “Significant and Lasting” disclosure risks. To provide access to record-level data, the
___________________
14 Note that some relevant statistics are available for institutions that participate in PSEO, but Kentucky (one of the states participating in the project) does not participate in PSEO.
states need to develop and implement policies to facilitate such sharing, which they elect to do with the assistance of the Coleridge Initiative.
2. Identify ingredient data files: To enable investigation of local issues, the Multi-State Postsecondary Report requires data from multiple state agencies comprising educational and earnings information. These include the Kentucky Longitudinal Data System (KLDS), the Ohio Longitudinal Data Archive (OLDA), the Indiana Commission for Higher Education, the Indiana Department of Workforce Development, and the Tennessee Department of Labor and Workforce Development. KLDS data include data from the Kentucky Council on Postsecondary Education and from the Kentucky Unemployment Insurance system. OLDA data include data from the Ohio Higher Education Information System and the Ohio Unemployment Insurance System (Kentucky Center for Statistics, 2023b, p. 1). In addition to educational and earnings data, which are sensitive, the files include social security numbers. These could be useful for data linkages, although care needs to be taken not to reveal this confidential information during blending or in end products. No federal files are needed for this blending project.
3. Obtain ingredient data files: The states have ingredient data files internally, although the files are not linked. To enable linkages and cross-state pooling, the participating states have agreed to take part in a Multi-State Data Collaborative maintained by the Coleridge Initiative. Each relevant state agency deposits its data with the Coleridge Initiative using its ADRF, which is FedRAMP®-authorized.15 Only agency-identified and agency-authorized personnel are invited to perform data transfers. To avoid sharing social security numbers across agencies or with the Coleridge Initiative, agencies create a hashed version of the social security number using software provided by the Coleridge Initiative prior to depositing data.
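The hashing software itself is not publicly documented in detail, but the general technique is keyed hashing, sketched below under the assumption that all agencies share a secret key exchanged outside the data channel. The key and SSN shown are, of course, hypothetical.

```python
import hashlib
import hmac

SHARED_KEY = b"distributed-to-agencies-out-of-band"  # hypothetical key; never stored with the data

def pseudonymize_ssn(ssn):
    """Keyed hash (HMAC-SHA256) of an SSN. Agencies using the same key produce
    identical pseudonyms, enabling linkage without exchanging raw SSNs."""
    return hmac.new(SHARED_KEY, ssn.encode(), hashlib.sha256).hexdigest()

# Two agencies can later match records on the pseudonym alone.
print(pseudonymize_ssn("123-45-6789"))
```

The key matters: social security numbers have only about one billion possible values, so an unkeyed hash could be reversed by exhaustively hashing every candidate number.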
4. Blend ingredient data files: Because direct identifiers are available, linking records across agencies is straightforward. The Coleridge Initiative matches records on the hashed social security numbers. We are unaware of the procedures used to check linkage quality. The Coleridge Initiative provides the infrastructure for accessing the cross-state data via ADRF.
5. Select disclosure-protection approaches: The state agencies decided to use mainly a policy-oriented approach to managing disclosure risks, a strategy consistent with Figure 4-1 for the posited usefulness and risk of
___________________
15 The Federal Risk and Authorization Management Program (FedRAMP®) provides a standardized approach to security authorizations for Cloud Service Offerings. See https://www.fedramp.gov/
harm. In particular, agencies restrict data access. Approved analysts from the state agencies as well as approved researchers can access blended data files. The user base may grow in the future as data-governance policies develop further. Approved users can produce statistics from the data files for approved projects that serve state agency purposes (Kentucky Center for Statistics, 2023a). Any results or data to be exported from ADRF need to be authorized by the agencies; unauthorized exports are prohibited.
Alternatively, the framework suggests a tiered-access solution. For example, in addition to the current restricted data access for state analysts, agencies could develop synthetic data files for researchers, coupled with validation or verification services; or the researcher tier could be a remote-access server that provides disclosure-protected (e.g., formally private) summary statistics or outputs of statistical models. These alternative approaches would be substantial undertakings and would likely require significant investment in resources, both to establish and maintain the services.
Even with restricted-data solutions, releasing results generates potential disclosure risks. In recognition of these risks, state agency staff review all results before dissemination. In other words, state agencies are ultimately responsible for managing the disclosure risk/usefulness trade-offs from release of statistical results. Staff from the Coleridge Initiative conduct an initial review of results to be exported, but final procedures are determined and implemented by states. If requested, Coleridge Initiative staff ensure that classical disclosure-limitation methods, such as cell suppression, rounding, and noise addition, are applied to statistics prior to release from ADRF. The Coleridge Initiative provides services and tools that state agencies can use to implement these methods. The panel was unable to determine the exact details of these disclosure treatments.
It is the panel’s understanding that the state agencies generally do not employ rigorous disclosure-limitation approaches at this time. The ability of the state agencies to implement such approaches either has not been assessed or may be outside their current expertise.
6. Develop and execute a maintenance plan: Since this project is completely voluntary in that state agencies decide when and how to participate, the state agencies’ executive leadership are the key decision-maker stakeholders. These decision makers share information with elected leaders. Additionally, state agency staff work with external researchers, and with each other, to align researcher and agency goals. We note that the states and the Coleridge Initiative have established processes to continually update data and to add new states. The policies and rationales used by states for specifying acceptable disclosure risk/usefulness trade-offs do not appear to be publicly available.
In sum, the Multi-State Postsecondary Report is an example of data blending in service of the needs and interests of state agencies. The disclosure risk/usefulness trade-offs as assessed by the states are qualitatively different than the trade-offs as assessed by federal agencies in Case Studies A and B, in that states have different purposes and concerns relating to within- and across-state policy and operational issues.