Suggested Citation: "1 Introduction." National Academies of Sciences, Engineering, and Medicine. 2024. Toward a 21st Century National Data Infrastructure: Managing Privacy and Confidentiality Risks with Blended Data. Washington, DC: The National Academies Press. doi: 10.17226/27335.

1

Introduction

National statistical agencies, researchers, the private sector, and the public increasingly rely on blended data to produce the evidence needed for informed policymaking and to understand various aspects of society. Blended data (i.e., data combined from multiple sources of previously collected data) have numerous benefits. They can enable rich analyses impossible with any one dataset alone; increase the accuracy, granularity, and timeliness of analyses; improve data equity; and reduce response burden and cost to the public. The benefits are increasingly attainable as computational and statistical methodologies advance, making data integration feasible even with massive and distributed data sources. Indeed, the central role of blended data for evidence building has been a focus of recent federal legislation, regulation, and guidance seeking to clarify key roles and responsibilities in the stewardship of blended data. Yet, despite opportunity and progress, many challenges remain as the United States strives to achieve the promise of a modern national data infrastructure based on blended data.

To advance the national conversation on confronting these challenges, the Committee on National Statistics (CNSTAT) in the Division of Behavioral and Social Sciences and Education of the National Academies of Sciences, Engineering, and Medicine convened three panels under the collective title Toward a Vision for a New Data Infrastructure for Federal Statistics and Social and Economic Research in the 21st Century. Each panel comprised experts in statistics, economics, social science research, survey methodology, privacy, public policy, and computer science. The work was funded by the National Science Foundation, with the goal of producing three reports. The first report, Toward a 21st Century National Data Infrastructure: Mobilizing Information for the Common Good, provides an overall vision for a new national data infrastructure. It describes the key role of blended data in realizing the vision and notes both the many benefits and remaining challenges involved. The second report, Toward a 21st Century National Data Infrastructure: Enhancing Survey Programs by Using Multiple Data Sources, describes the ways blended data can support survey programs and the policies they are meant to inform.

These prior reports identify a crucial challenge inherent to creating and sharing blended data products, namely that blending data “raises new concerns about protecting privacy, which may affect decisions about data access” (National Academies of Sciences, Engineering, and Medicine, 2023a, p. 60). Accordingly, the Statement of Task for this third panel, the Panel on Approaches to Sharing Blended Data in a 21st Century Data Infrastructure, directed the panel to identify aspects of sharing and analyzing private and confidential data that can be addressed by technical approaches; to identify corresponding research gaps; and to identify aspects that may require policy approaches. In addition, the Statement of Task directed the panel to develop frameworks for designing and evaluating integrated technical and policy approaches that can guide best practices for sharing, using, and analyzing blended private and confidential data. Furthermore, the frameworks are to be illustrated using selected case studies and assessed for fit to intended outcomes of a new data infrastructure. Last, the Statement of Task directed the panel to identify areas in which approaches are not yet known.

To meet this charge, a diverse panel with expertise spanning statistics, survey methodology, demography, disclosure analysis, equity analysis, information security, privacy law, and public policy was formed to study these issues. The panel convened a virtual public workshop to seek input from external experts about ways that researchers and government agencies manage disclosure risks when building and sharing blended data. The workshop informed this report, Toward a 21st Century National Data Infrastructure: Managing Privacy and Confidentiality Risks with Blended Data. In this third report of the series, the panel discusses specific privacy and confidentiality concerns with blended data, describes approaches to managing those concerns, and highlights areas in which more research is needed.

To set the stage, we begin by summarizing the critical role that technical and policy approaches to protecting privacy and confidentiality play in any data-sharing setting, regardless of whether blended data are involved. Blended data can magnify privacy and confidentiality concerns, as we mention here and describe in detail in later chapters.

PRIVACY AND CONFIDENTIALITY IN DATA USED FOR EVIDENCE-BASED POLICYMAKING

Disclosure Risks and Disclosure Harms

Often, data holders “[…] are obligated ethically and often legally to protect the [privacy and] confidentiality of data subjects’ identities and attributes” (Karr & Reiter, 2014, p. 276). Failure to do so can have serious consequences for data subjects, whose personal information is disclosed without permission, as well as for agencies and researchers, who could be sued and/or fined as a result (Confidential Information Protection and Statistical Efficiency Act, Title 13, and Title 26) and whose reputations as trusted data collectors and analysts could be damaged. Threats to privacy and confidentiality can arise at various stages of the data-collection lifecycle, including when agencies first construct data, when they share data beyond the original data holders, and even when they disseminate analysis results. For example, when agencies seek to share record-level data with the public, it is well known that stripping direct identifiers, like names and addresses, is necessary but insufficient for confidentiality protection. Ill-intentioned users (henceforth referred to as adversaries throughout this report) may be able to link records in the released file to records in external files, by matching on characteristics common to both files (Gymrek et al., 2013; Narayanan & Shmatikov, 2006; Sweeney, 1997, 2000; Sweeney et al., 2013). As another example, when agencies release large numbers of analysis results, such as many summary counts, adversaries may be able to reconstruct the underlying record-level data by solving systems of equations (Dinur & Nissim, 2003; Dwork et al., 2017; Garfinkel et al., 2018).
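
The linkage threat described above can be made concrete with a small sketch. The following Python example is illustrative only: the two files, the field names (zip, birth_year, sex, diagnosis), and the function link_unique are hypothetical and are not drawn from any of the cited studies. It assumes an adversary who holds an external file with names and matches it to a released file on the characteristics common to both, treating a match as a claimed re-identification only when exactly one released record fits.

```python
from collections import defaultdict

# Hypothetical released file: names and addresses removed,
# but demographic characteristics retained.
released = [
    {"zip": "02138", "birth_year": 1945, "sex": "F", "diagnosis": "asthma"},
    {"zip": "02138", "birth_year": 1972, "sex": "M", "diagnosis": "diabetes"},
    {"zip": "02139", "birth_year": 1972, "sex": "M", "diagnosis": "flu"},
    {"zip": "02139", "birth_year": 1972, "sex": "M", "diagnosis": "migraine"},
]

# Hypothetical external file that still carries names.
external = [
    {"name": "A. Example", "zip": "02138", "birth_year": 1945, "sex": "F"},
    {"name": "B. Example", "zip": "02138", "birth_year": 1972, "sex": "M"},
    {"name": "C. Example", "zip": "02139", "birth_year": 1972, "sex": "M"},
]

# Characteristics assumed to be common to both files.
KEYS = ("zip", "birth_year", "sex")


def link_unique(released_records, external_records, keys=KEYS):
    """Return (name, diagnosis) pairs for external records that match
    exactly one released record on the shared characteristics."""
    index = defaultdict(list)
    for rec in released_records:
        index[tuple(rec[k] for k in keys)].append(rec)

    matches = []
    for person in external_records:
        candidates = index.get(tuple(person[k] for k in keys), [])
        if len(candidates) == 1:  # ambiguous matches are not claimed
            matches.append((person["name"], candidates[0]["diagnosis"]))
    return matches


matches = link_unique(released, external)
for name, diagnosis in matches:
    print(name, "->", diagnosis)
```

In this toy example, two of the three named individuals are linked to a diagnosis even though the released file contains no direct identifiers; the third is shielded only because another released record shares the same characteristics.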

It is useful to separate the concepts of disclosure risks and disclosure harms. The former refers to the risk that adversaries learn the identities or attributes of individuals in the confidential data from information released by an agency (Lambert, 1993; Skinner et al., 2012). The latter refers to any damage to the data subjects or the agency caused by the revelation of confidential information to the adversary. For example, a data-release strategy may not sufficiently prevent an adversary from learning a data subject’s age, but the data subject might be only minimally damaged by the revelation of such information, depending on the circumstances. This would represent a disclosure with low harm. On the other hand, a data-release strategy that allows an adversary to learn sensitive health information could be a disclosure with significant harm. There can also be specific forms of disclosure risks and disclosure harms that affect individuals and groups (Afnan et al., 2022), data holders (Wong et al., 2023), and statistical agencies (Lambert, 1993).

The literature on disclosure risks suggests that they can be challenging to measure (Keller et al., 2016; Reiter, 2005a; Willenborg & de Waal, 2001). Classical approaches to disclosure risk assessment consider the threats from certain types of adversaries and disclosure-attack scenarios. For example, agencies may attempt to determine if sampled individuals are unique in the population on a set of measured key variables, such as demographic variables, and consider any such unique records at high risk of disclosure (Elamir & Skinner, 2004; Skinner & Shlomo, 2008). Alternatively, agencies may attempt to match records in a released data file to records in external databases, considering linked records to be at high risk of disclosure (Domingo-Ferrer et al., 2016). Of course, agencies could incorrectly specify the adversary’s information set or attack strategy, in which case the risk evaluation may present an incomplete picture of the disclosure risks. Unfortunately, it is generally difficult for agencies to know what information adversaries possess or how that information correlates with the data being protected, especially in blended data tasks in which confidential information comes from multiple data sources. Furthermore, disclosure risks can accumulate from multiple releases. This is particularly salient for blended data, as individual data holders may release their own products in addition to any blended data. Indeed, the ability to account for cumulative information leakage is one reason several statistical agencies are turning to differential privacy as a disclosure risk criterion.
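
A first step in the uniqueness-based assessment described above can be sketched in a few lines of Python. The records, the key variables (age_group, sex, county), and the function flag_sample_uniques are hypothetical. Note the limitation: the sketch only identifies records that are unique in the sample; the cited approaches go further and estimate whether such records are also unique in the population, which this example does not attempt.

```python
from collections import Counter


def flag_sample_uniques(records, key_vars):
    """Return the positions of records whose combination of key
    variables appears exactly once in the file."""
    combos = [tuple(rec[k] for k in key_vars) for rec in records]
    counts = Counter(combos)
    return [i for i, combo in enumerate(combos) if counts[combo] == 1]


# Hypothetical sample file with three demographic key variables.
sample = [
    {"age_group": "30-39", "sex": "F", "county": "A"},
    {"age_group": "30-39", "sex": "F", "county": "A"},
    {"age_group": "80+", "sex": "M", "county": "B"},
    {"age_group": "30-39", "sex": "M", "county": "A"},
    {"age_group": "80+", "sex": "M", "county": "B"},
    {"age_group": "50-59", "sex": "F", "county": "C"},
]

flagged = flag_sample_uniques(sample, ("age_group", "sex", "county"))
print(flagged)
```

Adding key variables, as often happens when sources are blended, can only split existing combinations further, so the number of flagged records cannot decrease as the key set grows.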

Technical and Policy Strategies for Reducing Disclosure Risks and Harms

Typically, agencies take steps to reduce disclosure risks and some types of harm. In this report, we consider both technical approaches and policy approaches. By technical approaches, we mean methods from statistical science, computer science, mathematics, and other disciplines. These include, for example, statistical disclosure-control techniques like data perturbation (Hundepool et al., 2012), synthetic data (Little, 1993; Reiter, 2003; Rubin, 1993), differentially private algorithms (Dwork & Roth, 2014), and secure multiparty computation (SMC) (Cramer & Damgård, 2015; Garfinkel, 2015; Garfinkel et al., 2023). By policy approaches, we mean legal and regulatory requirements imposed on the use of the data. These include, for example, laws or coordination mechanisms that facilitate interagency data sharing, data-use agreements between agencies and users, and penalties for violations of data-use requirements. We also classify secure data enclaves (physical or remote) and other data-access restrictions as policy approaches. Often, agencies use both types of strategies (Duncan et al., 2011). For example, in tiered access approaches (National Academies, 2023b; National Research Council, 2005), an agency might provide public-use access to a synthetic data file coupled with remote or physical enclave access for approved users, and also apply disclosure-protection methods and auditing procedures to outputs from the enclave.
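
To give a sense of what a differentially private algorithm looks like in practice, the following Python sketch adds Laplace noise to counting queries and tracks a cumulative privacy-loss budget under basic sequential composition (the epsilon values of successive releases add). It is a teaching sketch under stated assumptions, not a production mechanism: the class BudgetedCounter, its parameters, and the toy data are hypothetical, each query is assumed to have sensitivity 1 (adding or removing one person changes a count by at most 1), and naive floating-point noise sampling of this kind is known to be unsuitable for real releases.

```python
import random


class BudgetedCounter:
    """Answer counting queries with Laplace noise while tracking
    cumulative privacy loss across releases."""

    def __init__(self, records, total_epsilon, seed=None):
        self._records = list(records)
        self.remaining = total_epsilon  # unspent privacy-loss budget
        self._rng = random.Random(seed)

    def noisy_count(self, predicate, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self.remaining + 1e-12:
            raise RuntimeError("privacy-loss budget exhausted")
        self.remaining -= epsilon

        true_count = sum(1 for rec in self._records if predicate(rec))
        scale = 1.0 / epsilon  # sensitivity 1 divided by epsilon
        # The difference of two independent exponentials with mean
        # `scale` follows a Laplace distribution with that scale.
        noise = (self._rng.expovariate(1.0 / scale)
                 - self._rng.expovariate(1.0 / scale))
        return true_count + noise


# Hypothetical confidential ages; two releases exhaust the budget.
ages = [23, 35, 41, 67, 70, 82]
counter = BudgetedCounter(ages, total_epsilon=1.0, seed=0)
older = counter.noisy_count(lambda age: age >= 65, epsilon=0.5)
younger = counter.noisy_count(lambda age: age < 30, epsilon=0.5)
print(older, younger, counter.remaining)
```

Smaller epsilon values mean more noise per answer and stronger protection, so the budget makes explicit the trade-off between the number and accuracy of releases and the cumulative disclosure risk.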

In Chapter 2, we discuss how some current technical approaches can be used effectively for certain data-blending tasks, whereas others are insufficient or need additional research to be broadly implemented. We also describe how blended data can have greater risks of disclosure than single-source data and how these risks can appear throughout the lifecycle of blended data, from construction to dissemination. In Chapter 3, we focus on opportunities and gaps in current legal and regulatory frameworks intended to facilitate data blending and suggest characteristics for such frameworks going forward. Building on the findings from Chapters 2 and 3, Chapter 4 presents a model framework for decision making about disclosure risk/usefulness trade-offs when preparing and sharing blended data. Chapter 5 identifies areas in which additional research is needed to support technical and policy approaches to managing disclosure risks, as well as areas that warrant future in-depth study.

Characterizing Usefulness of Blended Data Can Assist Decision Making

Any data-release method that produces useful data inherently comes with some nonzero risk to privacy and confidentiality. Even when agencies apply the best-available methods, some disclosure risks remain. Technical and policy approaches necessarily involve trade-offs among disclosure risks, disclosure harms, and data usefulness (see Chapters 2 and 3). For example, agencies typically perturb or otherwise limit access to confidential data to reduce disclosure risks. Both methods can affect the quality of data and their usefulness to the public. Thus, a central question arises: How much risk can be accepted for the usefulness resulting from increased data quality and ease of access?

While answering this question clearly involves assessing the potential for disclosure risks and harms, evaluating the usefulness of a proposed data release is equally important. A data-protection strategy is arguably pointless when it results only in unreliable statistical inferences. With federal data in particular, the mandate for usefulness is clear. Federal data are collected to provide a public good.1 They contribute to the foundation of a democratic society in which public policy is informed by data (National Academies, 2021a). When making disclosure risk/usefulness trade-offs, it is beneficial to prioritize blended data products that could be leveraged to address public policy issues or to directly improve social well-being.

___________________

1 The panel recognizes that the public provides data to the federal government through a variety of methods, including mandatory and voluntary surveys and administrative records. The federal government collects this information to provide a public good—to individuals and to groups. Use of data for objectives other than the original purpose for which they were collected is a complex issue and, in the panel’s view, warrants focused study.

The data-confidentiality literature describes several other dimensions of data usefulness (Reiter, 2012; Wu, 2013).2 One dimension falls under the general topic of data quality: the released data allow analysts to accurately learn about the true phenomena or finite population characteristics under study. Note that learning about quantities in the population is not necessarily the same as learning about quantities from confidential data, as the latter may be subject to a variety of sampling and nonsampling errors. Indeed, several researchers have found that data perturbations to protect confidentiality can have less impact on conclusions from data analyses than measurement and nonresponse errors can (Michler et al., 2022; Steed et al., 2022).

Conceptually, dimensions of quality—and expectations for conformance—have been well established in federal guidance.3 Generally, quality encompasses elements of relevance, accuracy, and objectivity, and describes data obtained from data subjects and data holders in the context of a relationship of trust. Evaluating data quality or fitness for use is context specific, reflecting the needs of the data user and the purposes of use (Juran, 1988; Neely, 2005). Data equity is another critical dimension of quality. An agency’s privacy-protection methods could differentially affect the quality of data from various groups as well as impose greater privacy risks for certain groups.

Accessibility is a less obvious aspect of usefulness. The intent behind laws and regulations like the Foundations for Evidence-Based Policymaking Act of 2018 (hereafter, Evidence Act) is to improve access to data, including enabling blended data for statistical purposes (Foundations for Evidence-Based Policymaking Act of 2018, 2019). Therefore, access for research and policy evaluation can be one component of usefulness. Some experts have argued that datasets that are used by many people, or used to make a large number of decisions, are inherently of higher usefulness than datasets with limited purposes (Lane et al., 2023). Furthermore, understanding more about how and for what purpose datasets are accessed can help agencies improve both their data products and outreach (Lane et al., 2022; Potok, 2023).4 Finally, agencies that provide access to informative data products while protecting privacy and confidentiality can enhance their reputations as trusted promoters of societal betterment.

___________________

2 See Chapter 3 for a more thorough discussion of data usefulness.

3 See, for example, the Information Quality Act (also known as the Data Quality Act [Consolidated Appropriations Act, 2000]); Federal Information Quality Guidelines (Office of Management and Budget, 2002); Statistical Policy Directive 1 (Office of Management and Budget, 2014); and Principles and Practices for Federal Statistical Agencies (National Academies, 2021a).

4 An example of this approach is The 5 Ws of NASS Data Usage Dashboard, which uses data downloaded from the Democratizing Data Platform to describe the who, what, when, where, and why in terms of the data being used (National Agricultural Statistics Service, 2023).

EXISTING FEDERAL REGULATIONS AND GUIDELINES

Ultimately, disclosure risk/usefulness trade-offs are policy decisions, with potential input from agencies contributing ingredient data, data subjects, subject matter experts, and data users for the specific context at hand. Nonetheless, extant federal policy guidelines provide a foundation to guide agency and researcher decisions when evaluating trade-offs between disclosure risks, harms, and data usefulness. Examples include the following:

  • Fair Information Practice Principles (FIPPs) describe initial standards for guiding federal management of private, personally identifiable information. They formed the basis for privacy policy across federal agencies for decades (Teufel III, 2008). FIPPs were foundational in the formulation of the Privacy Act of 1974. One noteworthy feature of FIPPs is that they pair increased data access with increased obligations to uphold confidentiality.
  • Privacy Impact Assessments (PIAs) are required when federal agencies develop or procure new information technology involving the collection, maintenance, or dissemination of information in identifiable form or when making substantial changes to existing information technology that manages information in identifiable form. Particularly relevant to the discussion of blended data, PIAs need to be reassessed when “agencies adopt or alter business processes so that government databases holding information in identifiable form are merged, centralized, matched with other databases or otherwise significantly manipulated” (Office of Management and Budget, 2003, p. 4).
  • The Five Safes framework has been described as “a set of principles which enable data services to provide safe research access to data” (Desai et al., 2016, p. 4). Each “safe” (safe projects, safe people, safe settings, safe data, and safe outputs) refers to a separate but related aspect of disclosure risks. The Five Safes “seeks to provide a common language and framework for data access irrespective of the particular circumstances” (Desai et al., 2016, p. 4; Ritchie & Green, 2020).
  • Statistical Policy Directive 1, also known as the Trust Directive, was issued under the authority of the U.S. Chief Statistician. It articulates long-held expectations for federal statistical agencies and designated statistical units (Office of Management and Budget, 2014). In 2019, Statistical Policy Directive 1 was embedded within § 3563 of the Evidence Act.5
  • De-Identification of Personal Information and De-Identifying Government Datasets, issued by the National Institute of Standards and Technology, provide overviews of deidentification issues and describe the use of technical approaches and reviews to limit disclosure risks while still allowing for the production of meaningful statistical analysis (Garfinkel, 2015; Garfinkel et al., 2023).
  • Circular A-130, issued by the Office of Management and Budget, “[…] establishes general policy for the planning, budgeting, governance, acquisition, and management of Federal information, personnel, equipment, funds, information technology resources and supporting infrastructure and services. The appendices to this Circular also include responsibilities for protecting Federal information resources and managing personally identifiable information” (Office of Management and Budget, 2016, p. 2).
  • Data Ethics Tenets were developed in response to the 2019 U.S. Federal Data Strategy and its 2020 Action Plan to advance access, interoperability, and usefulness of federal data. The Data Ethics Tenets are intended to help agency employees, managers, and leaders make ethical decisions as they acquire, manage, and use data throughout the data lifecycle (General Services Administration, 2020).
  • Principles and Practices for Federal Statistical Agencies, also known as “the purple book,” is issued by the National Academies’ CNSTAT to communicate to policymakers “characteristics of statistical agencies that enable them to serve the public good” (National Academies, 2021a, p. 5). To maintain its relevance, the report is reviewed and revised as necessary to reflect changes to national data infrastructure as well as legal and regulatory frameworks.
  • Information Security, Cybersecurity and Privacy Protection—Privacy Enhancing Data De-identification Framework was issued by the International Organization for Standardization and the International Electrotechnical Commission. It “provides a framework for identifying and mitigating re-identification risks and risks associated with the lifecycle of de-identified data” (International Organization for Standardization, 2022, p. 1).
  • Data Protection Toolkit: Report and Resources on Statistical Disclosure Limitation Methodology and Tiered Data Access (formerly Statistical Policy Working Paper #22) was issued by the Federal Committee on Statistical Methodology. Adopting the general framework of The Five Safes, the Data Protection Toolkit updates Statistical Policy Working Paper #22 to account for changes in federal data access policy instituted by the Evidence Act, as well as the proliferation of public data assets and advances in computing capability (Federal Committee on Statistical Methodology, 2022). Reorganizing and updating Statistical Policy Working Paper #22 as a toolkit reflects the dynamic nature of disclosure-limitation practices. Envisioned as a “living document,” the toolkit “[…] includes an online repository of resources to assist agencies with the implementation of these statistical techniques including checklists, risk frameworks, templates, and automated tools” (Federal Committee on Statistical Methodology, 2022, p. 1).

___________________

5 At the time of this report’s writing, a rule has been proposed to implement the Trust Directive as a key component of the Evidence Act. See Office of Management and Budget (2023).

The 2019 U.S. Federal Data Strategy (M-19-18) urges federal agencies to “plan for secondary data uses from the outset, through reidentification risk assessments, stakeholder engagement, and sufficient information to assess fitness for use” (Office of Management and Budget, 2019, p. 1). At the time of this report’s writing, a proposed rule to strengthen the foundation of federal data governance has underscored the prominence of M-19-18 for providing guidance, particularly the expectation for agencies to anticipate and plan for future use of collected data (Office of Management and Budget, 2023). Prior National Academies reports (National Academies, 2017a,c, 2020; National Research Council, 2005) began describing these issues in broad terms as well. This report fits squarely with these others, focusing explicitly on the trade-offs among disclosure risks, harms, and data usefulness, specifically for blended data.

STATEMENT OF TASK

Box 1-1 displays the Statement of Task for each of the three panels in the Toward a Vision for a New Data Infrastructure for Federal Statistics and Social and Economic Research in the 21st Century series. Each panel was charged with convening a workshop on aspects of a vision for a new data infrastructure and writing a consensus panel report on those aspects.

The first panel’s virtual workshop, The Scope, Components, and Key Characteristics of a 21st Century Data Infrastructure, held on December 9 and 16, 2021,6 provided an overview of current data-infrastructure initiatives as well as the use of private-sector data by international and federal statistical agencies. Participants from federal agencies and nonprofit organizations described their experiences in acquiring and using private-sector health data, as well as the current state of data sharing among private companies, researchers, and the federal government. The workshop informed the panel’s report, Toward a 21st Century National Data Infrastructure: Mobilizing Information for the Common Good (National Academies, 2023c), which provides a vision and roadmap for building a 21st century national data infrastructure, including core components of governance; the capabilities, techniques, and methods required; and the sharing of data assets (e.g., federal, state, and local government; institutional; and private-sector data). The report describes how the United States can improve statistical information critical to shaping the nation’s future by mobilizing data assets and blending them with survey data.

___________________

6 The virtual workshop video and slides can be found at https://www.nationalacademies.org/event/12-09-2021/the-scope-components-and-key-characteristics-of-a-21st-century-data-infrastructure-workshop-1a and https://www.nationalacademies.org/event/12-16-2021/the-scope-components-and-key-characteristics-of-a-21st-century-data-infrastructure-workshop-1b

The second panel’s virtual workshop, The Implications of Using Multiple Data Sources for Major Survey Programs, held on May 16 and 18, 2022,7 described current practices and potential for using data originating from administrative records, private-sector organizations, sensors and satellites, and other sources, to enhance the timeliness, detail, and accuracy of information collected through surveys. The workshop informed the panel’s report, Toward a 21st Century National Data Infrastructure: Enhancing Survey Programs by Using Multiple Data Sources (National Academies, 2023a), which notes potential trade-offs between preserving individuals’ privacy and the confidentiality of their information versus data usefulness and data equity. Furthermore, the report notes the complexities of informed consent when sharing data initially collected for a particular use. That panel found that the use of multiple data sources can promote data equity through more accurate representation of population subgroups that have historically been underrepresented or misrepresented in the data ecosystem.

APPROACH

As noted in Box 1-1, this third report offers conclusions regarding the technical and policy approaches available to manage the disclosure risk/usefulness trade-offs necessary for blended data.8 This report builds upon prior work examining considerations for sharing blended data, including the U.S. Commission on Evidence-Based Policymaking (2017) report, the Advisory Committee on Data for Evidence Building (2021, 2022) reports, and prior CNSTAT reports.9 As with the prior two panels in this series, the panel convened a virtual workshop, Approaches to Sharing Blended Data in a 21st Century Data Infrastructure, on May 22, 23, and 25, 2023,10 to gather input from practitioners and researchers regarding their experiences, best practices, challenges, and potential approaches to protecting privacy throughout the blended data lifecycle. Panel members selected case studies to be featured during the workshop event, to illustrate differences in (a) intended uses of blended data; (b) sources of ingredient data files; (c) technical and policy approaches applied to manage disclosure risks; and (d) potential usefulness of analyses and harm from disclosure. In addition, the workshop featured presentations from thought leaders who shared their views on disclosure risk/usefulness trade-offs and communicating with stakeholders. The workshop informed the findings and conclusions reached in this report.

___________________

7 The virtual workshop video and slides can be found at https://www.nationalacademies.org/event/05-16-2022/the-implications-of-using-multiple-data-sources-for-major-survey-programs-workshop

8 The Statement of Task for this report was adjusted in 2022 according to National Academies procedure, to account for the findings and conclusions in the first two reports in the series.

As part of this report, we propose a model framework for decision making about managing disclosure risk/usefulness trade-offs in blended data products. The framework is intentionally and necessarily general. As described throughout this report, generation of ingredient and blended data, advancements in technical approaches, policy priorities, and regulations continue to evolve. So, too, does social acceptance of use of and access to blended data. Even if these parameters were static, the specific trade-offs involved in calibrating privacy protections considering disclosure risks, harms, and data usefulness depend upon the specific ingredient data and the intended use of the blended data. Therefore, in the panel’s opinion, the framework is most useful as a lens that prompts careful consideration of key questions rather than as a static structure attempting to provide specific answers that could become out of date.

___________________

9 These include (but are not limited to) the following: Private Lives and Public Policies: Confidentiality and Accessibility of Government Statistics (National Research Council, 1993); Modernizing Crime Statistics (National Academies, 2016, 2018); Innovations in Federal Statistics: Combining Data Sources While Protecting Privacy (National Academies, 2017a); Federal Statistics, Multiple Data Sources, and Privacy Protection: Next Steps (National Academies, 2017c); Improving Crop Estimates by Integrating Multiple Data Sources (National Academies, 2017b); A Satellite Account to Measure the Retail Transformation: Organizational, Conceptual, and Data Foundations (National Academies, 2021b); A Vision and Roadmap for Education Statistics (National Academies, 2022a); Transparency in Statistical Information for the National Center for Science and Engineering Statistics and All Federal Statistical Agencies (National Academies, 2022b); Modernizing the Consumer Price Index for the 21st Century (National Academies, 2022c); Toward a 21st Century National Data Infrastructure: Mobilizing Information for the Common Good (National Academies, 2023c); and Toward a 21st Century National Data Infrastructure: Enhancing Survey Programs by Using Multiple Data Sources (National Academies, 2023a).

10 The virtual workshop video and slides can be found at https://www.nationalacademies.org/event/05-22-2023/approaches-to-sharing-blended-data-in-a-21st-century-data-infrastructure-public-workshop-3-days-virtual

There are also some topics important to the discussion of managing risk in blended data that this report acknowledges but does not address in depth. Rather than attempting to address these complex issues within the scope of this report, the panel believes that a focused study of each of these topics is warranted. Chapter 5 discusses these topics in greater detail.

In particular, the topic of informed consent is fundamental to the discussion of blended data. This report describes managing risk in blended data products that may be generated from ingredient data files and used for purposes other than those specified at the time of collection, whether those purposes were specified formally through informed consent or through common understanding of administrative record use (such as to receive federal benefits). As described in Chapter 3, the Evidence Act and its anticipated regulations are changing how federal data can be accessed and increasing opportunities for blended data products. These opportunities also bring additional responsibility to properly manage disclosure risks as public expectations about control and use of their data change. The anticipated presumed access regulation (described in detail in Chapter 3) may address some of these concerns, but a fuller examination of the issues involved will likely be needed to maintain and strengthen the public’s trust regarding the use of reclaimed data.

A second topic important to the discussion of managing risk in blended data products is the development and support of a skilled and stable research computing and data (RCD) workforce. Agencies need to recognize the growing need for RCD professionals given the volume of data, the rapid evolution of computing resources, and some data users’ lack of knowledge about or experience with emerging tools and techniques. As described in Chapter 5, agencies also face challenges in recruiting and retaining the RCD workforce, in part because of the limited availability of appropriate job series, certification programs, and competitive starting salaries.

REPORT STRUCTURE

Chapter 2 identifies core concepts to inform decision making regarding sharing of blended data, reviews current technical approaches and their suitability for managing privacy and confidentiality risks in the blended data lifecycle, and describes needed improvements to measurement and methods to inform decision making.

Chapter 3 reviews current policy approaches and their suitability for managing risk in blended data; identifies areas in which policy approaches are particularly important when managing risk; identifies areas requiring greater clarity in policy direction, roles, and responsibilities; and describes ways to manage risk.

Chapter 4 presents the model framework to assist decision making when managing privacy and confidentiality risks in blended data. Using selected case studies, it illustrates how the framework can be used to inform risk management in blended data.

Chapter 5 summarizes some key takeaways from the report and priority areas requiring further work. It also highlights two topics that, in the panel’s opinion, would benefit from dedicated, in-depth studies, namely informed consent for future data use and the development and support of the RCD workforce.
