Significant technical advances and policy changes have increased the availability of data that can be used to inform evidence building. Blended data—that is, combined sources of previously collected data—can improve the quality of analyses, enable new analyses, and reduce burden and cost to the public. Recent federal legislation, regulation, and guidance have broadly described the roles and responsibilities for the stewardship of blended data. Yet, questions remain as the country strives to create a modern national data infrastructure. In particular, “blending data raises new concerns about protecting privacy, which may affect decisions about data access” (National Academies of Sciences, Engineering, and Medicine, 2023a, p. 46).
The Panel on Approaches to Sharing Blended Data in a 21st Century Data Infrastructure was formed to study these concerns. The Statement of Task directed the panel to identify privacy and confidentiality aspects of sharing and analyzing blended data that can be addressed by technical approaches, to identify research gaps in technical approaches, and to identify aspects that may require policy approaches. In addition, the panel was directed to develop frameworks for designing and evaluating integrated technical and policy approaches that can guide best practices for protecting privacy and confidentiality while sharing, using, and analyzing blended data. The frameworks are to be illustrated using case studies. Last, the Statement of Task directed the panel to identify areas in which approaches are not known. To accomplish these tasks, the panel was assembled to include expertise spanning statistics, survey methodology, demography, disclosure analysis, equity analysis, information security, privacy law, and public policy.
This report, Toward a 21st Century National Data Infrastructure: Managing Privacy and Confidentiality Risks with Blended Data, is the result of the panel’s work. It is the third in a series of reports intended to describe the opportunities and challenges in achieving a new national data infrastructure based on blended data. The first report, Toward a 21st Century National Data Infrastructure: Mobilizing Information for the Common Good, provided an overall vision. The second report, Toward a 21st Century National Data Infrastructure: Enhancing Survey Programs by Using Multiple Data Sources, described the ways in which blended data can support survey programs and the policies they are meant to inform.
The overarching theme of this third report is the management of disclosure risks inherent in blended data. To that end, the report reviews current technical and policy approaches that can be used to protect privacy and confidentiality in the case of blended data. It highlights how decisions about data blending involve managing trade-offs between disclosure risks and data usefulness. It also highlights the importance of a dynamic model to manage risk: one that allows evolution in technical and policy approaches over time, is responsive to stakeholders¹ throughout the blended data lifecycle, and permits a level of data access commensurate with anticipated usefulness and acceptable risks. The report provides an example of a tiered approach, combining technical and policy measures to manage trade-offs among privacy risks, anticipated usefulness, and potential harms (see Figure 4-1). Drawing from these observations, the report suggests a model framework to guide decision makers in considering the disclosure risk/usefulness trade-offs when preparing and sharing blended data (see Box S-1).
The opportunities and challenges in managing disclosure risks in blended data are central to the development of a new national data infrastructure. Accordingly, as with the prior two reports in this series, this third report is written to engage a wide audience—agency policymakers, program staff, data holders and users, and data subjects. Public feedback on the report’s findings and conclusions will promote further development of model frameworks like the one described in this report, particularly as the data environment continues to change.
The report was informed by a virtual public workshop organized by the panel to seek input from external experts about ways that researchers and government agencies manage disclosure risks when building and sharing blended data. Panel members selected case studies to feature during the workshop, to illustrate differences in (a) intended uses of blended data; (b) sources of ingredient data files (i.e., the input data elements that feed into the blending procedure); (c) potential disclosure risks and associated harms; and (d) technical and policy approaches applied to manage disclosure risks. In addition, the workshop featured presentations from thought leaders who shared their views on disclosure risk/usefulness trade-offs and communicating with stakeholders.
___________________
1 For the purposes of this report, a stakeholder is defined as “[...] a person or organization who has a vested interest in the activities, a decision, or an outcome of the agency. A stakeholder may be a program beneficiary, decision-maker, partner, researcher, or even a program implementer” (The Data Foundation and the Center for Open Data Enterprise, 2023, p. 4).
The blended data lifecycle spans the initial conceptualization of blended data, identifying and accessing ingredient data sources, blending the data from those sources, and sharing² the resulting data products. Each of these stages presents potential risks to privacy and confidentiality, and subsequent harms to data subjects³ and data holders.⁴
As a case in point, data blending often requires linking subjects from multiple data sources. Effective linkage may require identification numbers, names, or other confidential fields to be shared across data holders. In addition, the holders of ingredient data files have partially complete information on individuals in the blended data. Ingredient file holders and other parties may be able to extract confidential information from blended data if those data are shared without adequate disclosure protection. What may seem like a safe data-sharing strategy to the creators of blended data products may be undone by the actions of ingredient file holders. Finally, disclosure harms may be magnified in the case of blended data, particularly when blending involves sensitive variables (e.g., education, health, income).
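One way to picture the linkage concern described above is a sketch in which each data holder replaces a raw identifier with a keyed hash before records are joined. This is a simplified illustration, not a technique named in this report and not a form of secure multiparty computation. The function names, the field names, the shared key, and the sample records are all hypothetical. The approach only matches identifiers that are identical after normalization, and anyone who holds the key can still test guessed identifiers against the tokens, so policy controls on the key remain necessary.

```python
import hashlib
import hmac


def normalize(identifier: str) -> str:
    """Strip whitespace and lowercase so trivially different spellings match."""
    return "".join(identifier.split()).lower()


def pseudonym(identifier: str, shared_key: bytes) -> str:
    """Keyed hash (HMAC-SHA256) of a normalized identifier."""
    message = normalize(identifier).encode("utf-8")
    return hmac.new(shared_key, message, hashlib.sha256).hexdigest()


def pseudonymize(records: list[dict], id_field: str, shared_key: bytes) -> list[dict]:
    """Run locally by each data holder: drop the raw identifier, keep a token."""
    output = []
    for record in records:
        safe = {k: v for k, v in record.items() if k != id_field}
        safe["token"] = pseudonym(record[id_field], shared_key)
        output.append(safe)
    return output


def blend(file_a: list[dict], file_b: list[dict]) -> list[dict]:
    """Join two pseudonymized files on the token (exact matches only)."""
    index = {record["token"]: record for record in file_b}
    return [
        {**record, **index[record["token"]]}
        for record in file_a
        if record["token"] in index
    ]


if __name__ == "__main__":
    key = b"hypothetical-key-agreed-under-a-data-use-agreement"
    holder_a = pseudonymize(
        [{"id": "000-00-0001", "income": 50000}, {"id": "000-00-0002", "income": 72000}],
        "id",
        key,
    )
    holder_b = pseudonymize([{"id": " 000-00-0001 ", "enrolled": True}], "id", key)
    for row in blend(holder_a, holder_b):
        print(row)
```

Even in this sketch, the blended rows carry attributes from both holders. Either holder could therefore recognize its own records in the output and learn the other holder's fields, which is the risk the paragraph above attributes to ingredient file holders.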
Data holders have a variety of technical and policy tools to manage disclosure risks associated with blended data. Methods or frameworks like secure multiparty computation (SMC), synthetic data, and differential privacy offer ways to reduce risks. Some of these methods are deployable now for a given context and scale; others require more research. Decisions on disclosure risk protection and data access are best made by using the best-available methods matched with the intended use of the data.
___________________
2 For the purposes of this report, data sharing refers to a wide range of access scenarios for data users and forms of data products. See Chapters 2 and 3.
3 For the purposes of this report, data subjects are defined as the people, entities, or organizations described by data files (National Academies, 2023c).
4 For the purposes of this report, data holders are defined as organizations that hold information of possible use in a national data infrastructure. These include federal, state, local, and tribal agencies, and other public and private-sector organizations (National Academies, 2023c).
Assessing and communicating the trade-offs inherent in blended data products involve evaluating disclosure risks and harms as well as data usefulness. Quantification of disclosure risks and data usefulness is particularly challenging for blended data due to the magnified disclosure risks, complexities of multiple data holders, and the potential for multiple uses of the final products. However, effective approaches found in the research literature are being applied by some agencies today.
No data-release method that provides (nontrivial) data usefulness can guarantee zero risks to privacy and confidentiality. As a general rule, enhancing the usefulness of blended data products requires accepting greater disclosure risks.
Conclusion 2-1: Trade-offs in disclosure risks, disclosure harms, and data usefulness are unavoidable and are central considerations when planning data-release strategies, particularly for blended data. Effective technical approaches to manage disclosure risks prioritize the usefulness of some analyses over others.
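The trade-off stated above can be illustrated with a minimal sketch using the Laplace mechanism, a standard building block in the differential privacy literature. This is not presented as a method used or endorsed by the panel. The function names, the count of 120, and the epsilon values are assumptions chosen for illustration. A smaller epsilon corresponds to stronger protection and, on average, a less accurate released count.

```python
import random
import statistics


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with Laplace noise of scale 1/epsilon.

    Adding or removing one record changes a count by at most 1,
    so the sensitivity of the query is 1.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(1.0 / epsilon, rng)


def mean_abs_error(true_count: int, epsilon: float, trials: int = 5000, seed: int = 0) -> float:
    """Average absolute error of the released count over repeated releases."""
    rng = random.Random(seed)
    return statistics.fmean(
        abs(noisy_count(true_count, epsilon, rng) - true_count) for _ in range(trials)
    )


if __name__ == "__main__":
    # Hypothetical count of data subjects in one cell of a blended table.
    for eps in (0.1, 1.0, 10.0):
        print(f"epsilon={eps}: mean absolute error ~ {mean_abs_error(120, eps):.2f}")
```

In this sketch the expected absolute error of a released count is 1/epsilon, so tightening the privacy guarantee by a factor of ten makes the count roughly ten times noisier. An analysis that relies on small cells loses more usefulness than one that relies on large totals, which is one concrete sense in which a protection method favors some analyses over others.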
Engagement with stakeholders, including data holders, data users, and decision makers, is important for effective management of trade-offs. Engagement best occurs throughout the design and implementation of privacy- and confidentiality-protection strategies. Communication plans may differ depending on the needs of relevant groups. For the public, plans ideally use plain language to describe context-specific protections. For data users, plans are most helpful when they include methods for demonstrating data quality after privacy protections are applied.
Conclusion 2-2: Effective communication with data holders and data users can help agencies understand and better manage disclosure risk/usefulness trade-offs.
Policy approaches to manage disclosure risks include, for example, laws and regulations, data enclaves and other methods that restrict data access to approved users, and incentives for responsible use of confidential data. Because policy approaches describe relationships of trust in the uses of data products, they are an essential component of all stages of the blended data lifecycle. Transparent policy processes can legitimize the use of blended data that have moved outside the original contexts in which initial consent or other mechanisms limited their purpose.
As policy priorities change, data availability can change. As more data are made available, the potential for disclosure also increases. Additionally, technical approaches to limit disclosure risks are advancing. Even when regulatory guidance and policy procedures for managing disclosure risks in blended data are established, social acceptance of sharing and use of blended data will change. For these reasons, policy frameworks for decision making regarding disclosure risk/usefulness trade-offs need to be dynamic.
Conclusion 3-1: The effectiveness of a framework for making decisions about acceptable disclosure risks given expected usefulness of data depends on whether that framework is dynamic. A dynamic framework allows for changing policy needs and data availability over time, in a way that accounts for the interests of data subjects, data holders, and data users.
The degree of acceptable disclosure risks is a policy decision. As uses and users of blended data may have differing needs, agencies can establish tiered access protocols, describing levels of potential disclosure risks, disclosure harms, and data usefulness as well as the procedures in place to regulate data access. The Foundations for Evidence-Based Policymaking Act of 2018 requires the establishment of such standards to describe levels of data asset sensitivity and corresponding access (2019).⁵ Several recent reports, including those from the Advisory Committee on Data for Evidence Building (2022) and National Academies (2023b,c), have called for tiered access as a way to further promote data sharing with trusted data users, such that technical and policy approaches to limit disclosure risks are commensurate with disclosure harms and anticipated data usefulness. The inclusion and exclusion criteria proposed to designate such tiers would likely need to vary upon implementation to reflect differences in acceptable risk given agency policy, resources, and data characteristics.
Conclusion 3-2: Tiered access for data users and agencies is a key component of a dynamic disclosure risk/usefulness trade-off framework, to reflect differences in acceptable risks given policy priorities.
Given recent sweeping changes to data-access laws and agency responsibilities, and the increased disclosure risks (and potential usefulness) of blended data, it is essential for agencies engaged in statistical, data science, and evidence-building activities to coordinate policy approaches to facilitate best practices for risk management. Involving stakeholder groups at each stage of the blended data lifecycle enables better decision making when managing trade-offs and improves engagement with affected groups.
___________________
5 See 44 U.S.C. § 3582(b)(1) “Standards for each statistical agency or unit to assess each data asset owned or accessed by the statistical agency or unit for purposes of categorizing the sensitivity level of each such asset and identifying the corresponding level of accessibility to each such asset.”
A productive dialogue among data holders and data users across disciplines requires a shared language reflecting the concepts of risk, harm, and usefulness. Shared language also supports quantification of these concepts, so that they can be weighed when managing trade-offs. Technical approaches that meet particular privacy and confidentiality requirements established by policy can provide a way forward.
Conclusion 3-3: A common, cross-disciplinary language and lexicon describing privacy and confidentiality risks and harms, as well as data usefulness, is needed. Interpretable and measurable terms can promote meaningful discussions among stakeholders, including data subjects and decision makers.
As discussed in Chapters 2 and 3, the question of appropriate access for a blended data product⁶ cannot be answered with a simple “yes” or “no.” Instead, the answer depends on the acceptability of the disclosure risks for the blended data product, given anticipated usefulness and potential harm. Therefore, decisions about privacy- and confidentiality-protection methods are necessarily context specific and driven by the disclosure risk/usefulness trade-offs for the problem at hand (see Figure 4-1).
However, there are common decision points in data-blending scenarios. Drawing from the panel’s review of technical and policy approaches described in Chapters 2 and 3, Chapter 4 presents a framework for making decisions about data-protection methods. The framework encourages agencies to answer a set of questions at each stage of the data-blending lifecycle, with the goal of arriving at technical and policy approaches to manage disclosure risk/usefulness trade-offs. The framework does not cover all data-blending situations and challenges, nor does it stipulate precise technical or policy approaches for certain data-blending tasks. In the panel’s opinion, the framework is most useful as a lens that prompts careful consideration of key questions rather than as a static structure attempting to provide specific answers that could become out of date. The proposed framework is included in Box S-1. Agencies need to leverage both policy approaches and technical approaches based on the best-available science to facilitate blending of confidential data.
___________________
6 For example, whether a data product resulting from blending of restricted-use and public-use data can be sufficiently treated for disclosure risks and subsequently made available as a public-use file.
Conclusion 4-1: Technical and policy approaches in combination are necessary for effective management of disclosure risks.