Suggested Citation: "Glossary of Selected Terms." National Academies of Sciences, Engineering, and Medicine. 2024. Toward a 21st Century National Data Infrastructure: Managing Privacy and Confidentiality Risks with Blended Data. Washington, DC: The National Academies Press. doi: 10.17226/27335.

Glossary of Selected Terms

Advisory Committee on Data for Evidence Building Established as part of the Foundations for Evidence-Based Policymaking Act of 2018 (2019) “to review, analyze, and make recommendations to the White House Office of Management and Budget [...] Director on how to promote the use of federal data for evidence building” (Advisory Committee on Data for Evidence Building, 2021, p. 1).
Blended data lifecycle The process of designing, constructing, analyzing, and sharing blended data.
Blended data product Combined sources of previously collected data, including data in nontabular formats.
Commission on Evidence-Based Policymaking A commission created by the Evidence-Based Policymaking Commission Act of 2016 and tasked with examining ways to increase the availability and use of government data to build evidence while protecting data privacy and confidentiality (Evidence-Based Policymaking Commission Act, 2016). The Commission’s report (Commission on Evidence-Based Policymaking, 2017) informed the Foundations for Evidence-Based Policymaking Act of 2018 (2019).
Composition The cumulative disclosure risks that result from multiple releases of confidential information, even when disclosure-avoidance methods are applied. Disclosure risk assessments of each data release in isolation may be insufficient to estimate overall disclosure risks. Disclosure risk assessment methods that do allow estimations of overall risks from separate analyses are said to compose well. Currently, only differential privacy and its common variations are known to compose well (Fluitt et al., 2019; Ganta et al., 2008).
Confidentiality “[P]reserving authorized restrictions on access and disclosure, including means for protecting personal privacy and proprietary information” (E-Government Act, 2002).1
Data enclave A physical or virtual segregated space that “restricts the export of the original data and instead accepts queries from qualified researchers, runs the queries on the de-identified data, and responds with results. Enclaves can be physical or virtual and can operate under a variety of different models. For example, vetted researchers may travel to the enclave to perform their research, as is done with the Federal Statistical Research Data Centers operated by the U.S. Census Bureau. Enclaves may be used to implement the verification step of the synthetic data with validation model. Queries made in the enclave model may be vetted automatically or manually” (Garfinkel et al., 2023, p. 33).
Data equity Data that “[…] allow for rigorous assessment of the extent to which programs and policies yield consistently fair, just, and impartial treatment of all individuals. Equitable data illuminate opportunities for targeted actions that will result in demonstrably improved outcomes for underserved communities” (Equitable Data Working Group, 2022, p. 3).

___________________

1 See § 3542(a)(1)(B).

Data holders Organizations that hold information of possible use in a national data infrastructure. These include federal, state, local, and tribal agencies, and other public and private-sector organizations (National Academies of Sciences, Engineering, and Medicine, 2023c).
Data infrastructure This term includes data assets; the technologies used to discover, access, share, process, use, analyze, manage, store, preserve, protect, and secure those assets; the people, capacity, and expertise needed to manage, use, interpret, and understand data; the guidance, standards, policies, and rules that govern data access, use, and protection; the organizations and entities that manage, oversee, and govern the data infrastructure; and the communities and data subjects whose data are shared and used for statistical purposes and may be impacted by decisions made using those data assets (National Academies, 2023c).
Data minimization A disclosure risk management approach that emphasizes sharing only the data necessary for the data product purpose (Commission on Evidence-Based Policymaking, 2017, p. 49).
Data subjects The people, entities, or organizations described by data files (National Academies, 2023c).
Data visitation An approach to data management in which sensitive data stay under the control of the data holder, and which allows data users (e.g., analysts or applications) to come to the data to work with them. This differs from settings in which data users can store, retrieve, move, or manipulate stored data (Hanisch et al., 2021; Weise et al., 2022).
Deidentification A general term for any process of removing the association between a set of identifying data and the data subject (Garfinkel et al., 2023).
Differential privacy A rigorous mathematical definition of disclosure that considers the increase in accuracy with which an individual’s confidential data may be estimated as a result of a mathematical analysis based on those data being made publicly available (Garfinkel et al., 2023, p. 15).
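As an illustration only, the following minimal sketch shows one well-known way to satisfy differential privacy for a counting query: adding Laplace noise scaled to the privacy parameter. The function names (`laplace_noise`, `dp_count`) and the example records are hypothetical and do not come from the report. The sketch also ignores known floating-point weaknesses of naive noise sampling, so it should not be read as a production mechanism.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential draws with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records satisfying `predicate`.

    Assumption: adding or removing one record changes the true count
    by at most 1 (sensitivity 1), so Laplace noise with scale
    1 / epsilon is used.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    # Hypothetical records, for illustration only.
    people = [{"age": 25}, {"age": 40}, {"age": 67}, {"age": 70}]
    noisy = dp_count(people, lambda r: r["age"] >= 65, epsilon=0.5)
    print(f"Noisy count of people aged 65 or older: {noisy:.2f}")
```

Smaller values of `epsilon` add more noise and give stronger protection; larger values give more accurate counts and weaker protection.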
Disclosure When confidential information is learned from a data release. Identity disclosure (or identification) occurs when “[a] respondent is linked to a particular record in a released file. Identification, sometimes called reidentification, is equivalent to inadvertent release of an identifiable record. With microdata, only respondents whose records are released can be correctly reidentified. Identifications are also possible from tabular data and inquiries about groups, however” (Lambert, 1993, p. 315). Attribute disclosure “occurs when the intruder believes something new has been learned about the respondent. An attribute disclosure may occur with or without an identification” (Lambert, 1993, p. 315).
Disclosure harms The negative consequences that result from disclosures, which can be to data subjects or data holders.
Disclosure risks The possibility that confidential information might be learned from a data release.
Evidence Act The Foundations for Evidence-Based Policymaking Act of 2018. This statute requires agency data to be accessible and requires agencies to plan to develop statistical evidence to support policymaking (Foundations for Evidence-Based Policymaking Act of 2018, 2019).
Federal statistical agencies The principal U.S. federal statistical agencies are the Bureau of Economic Analysis (Department of Commerce); Bureau of Justice Statistics (Department of Justice); Bureau of Labor Statistics (Department of Labor); Bureau of Transportation Statistics (Department of Transportation); Census Bureau (Department of Commerce); Economic Research Service (Department of Agriculture); Energy Information Administration (Department of Energy); National Agricultural Statistics Service (Department of Agriculture); National Center for Education Statistics (Department of Education); National Center for Health Statistics (Department of Health and Human Services); National Center for Science and Engineering Statistics (National Science Foundation); Office of Research, Evaluation, and Statistics (Social Security Administration); and Statistics of Income (Department of the Treasury). There are also three recognized federal statistical units: the Microeconomic Surveys Unit (Federal Reserve Board), the Center for Behavioral Health Statistics and Quality (Substance Abuse and Mental Health Services Administration, Department of Health and Human Services), and the National Animal Health Monitoring System (Animal and Plant Health Inspection Service, Department of Agriculture) (National Academies, 2021a).
Fitness for use A term coined by Juran (1988) for an approach to evaluating data that determines quality by considering the needs of the user for a particular use (Neely, 2005).
Formal privacy A general term that includes certain types of disclosure-protection methods, notably differential privacy and its variants. There are four criteria to be met: (a) it models the process of creating data products, regardless of what specific format they take; (b) its confidentiality guarantee is invariant to postprocessing—for example, statistical analysis of the data does not decrease protections; (c) it supports composition; and (d) it supports transparency.
Harm See Disclosure harms.
Informed consent The communication to human subjects of the purposes, potential usefulness, and potential risks of involvement in a data collection. This includes the voluntary nature of participation; alternative procedures available; the extent, if any, to which confidential records identifying the subject will be maintained; and how to obtain more information about the study (Basic HHS Policy for Protection of Human Research Subjects, 2018).
Ingredient data files The datasets comprising the elements that feed into the blending procedure.
Metadata “Information describing the characteristics of data including, for example, structural metadata describing data structures (e.g., data format, syntax, and semantics) and descriptive metadata describing data contents (e.g., information security labels)” (Johnson et al., 2016).
Output data files The end products produced from blended data (Varia, 2023).
Personally identifiable information “Any representation of information that permits the identity of an individual to whom the information applies to be reasonably inferred by either direct or indirect means” (Ferraiolo et al., 2015, p. 46).
Policy approaches Approaches to establish rules for individuals and organizations to access data. They include, for example, laws, regulations, incentives for responsible data use, and access mechanisms like data enclaves and licensing arrangements.
Postprocessing invariance A requirement in disclosure management approaches that making disclosure-avoidance plans available to the public does not increase disclosure risks. Instead, the disclosure risk measure selected needs to account for this knowledge (Cohen, 2022; Wong et al., 2007).
Privacy “Privacy refers to freedom from intrusion into one’s personal matters and personal information” (National Academies, 2023a, p. 75).
Privacy-loss budget Parameters in differential privacy (or its variants). The parameters measure the disclosure risks of a set of statistics generated from a differentially private algorithm. Larger privacy-loss budgets indicate potentially greater chances that the released statistic, in combination with others previously released, results in a disclosure. Differential privacy methods need to be interpreted based on best practices for the specific variant of differential privacy being used.
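To make the idea of a budget concrete, the sketch below tracks cumulative privacy loss across releases. It assumes pure epsilon-differential privacy with basic sequential composition, under which the epsilons of successive releases add; other variants use different accounting rules. The class and exception names (`PrivacyBudget`, `BudgetExceeded`) are hypothetical and are not taken from the report.

```python
class BudgetExceeded(Exception):
    """Raised when a release would push total privacy loss past the budget."""


class PrivacyBudget:
    """Track cumulative epsilon under basic sequential composition.

    Assumption: each release satisfies pure epsilon-differential
    privacy, so the total privacy loss is the sum of the epsilons.
    """

    def __init__(self, total_epsilon):
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.total = float(total_epsilon)
        self.spent = 0.0

    @property
    def remaining(self):
        return self.total - self.spent

    def spend(self, epsilon):
        """Record a release that consumes `epsilon` of the budget."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # A small tolerance absorbs floating-point rounding in the running sum.
        if self.spent + epsilon > self.total + 1e-12:
            raise BudgetExceeded(
                f"requested {epsilon}, but only {self.remaining:.4f} remains"
            )
        self.spent += epsilon


if __name__ == "__main__":
    budget = PrivacyBudget(total_epsilon=1.0)
    budget.spend(0.4)  # first statistical release
    budget.spend(0.4)  # second statistical release
    print(f"Remaining budget: {budget.remaining:.2f}")
```

In this sketch a third release at epsilon 0.4 would be refused, which mirrors the point that each release must be judged in combination with those made previously.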
Privacy-preserving record linkage A class of techniques for linking records across two or more databases that use secure protocols to avoid directly sharing records’ identifiers (Hall & Fienberg, 2010, p. 277).
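The following sketch illustrates the general idea with a deliberately simple stand-in technique: each data holder replaces identifiers with keyed hashes (HMAC-SHA-256) before sharing, and a linking party matches records on the resulting tokens. This is an assumption made for illustration, not the protocol described in the cited source. It handles only exact matches after normalization, and it offers no protection if the shared key is exposed. All function names and sample records are hypothetical.

```python
import hashlib
import hmac


def tokenize(identifier, secret_key):
    """Replace an identifier with a keyed hash so the raw value is not shared."""
    # Normalize case and whitespace so trivially different spellings match.
    normalized = " ".join(identifier.lower().split())
    digest = hmac.new(secret_key, normalized.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


def tokenize_file(records, secret_key):
    """Run by each data holder: map token -> payload for (identifier, payload) pairs."""
    return {tokenize(ident, secret_key): payload for ident, payload in records}


def link_tokens(tokens_a, tokens_b):
    """Run by the linking party, which sees tokens and payloads but no identifiers."""
    shared = sorted(tokens_a.keys() & tokens_b.keys())
    return [(tokens_a[token], tokens_b[token]) for token in shared]


if __name__ == "__main__":
    key = b"shared-secret-agreed-by-data-holders"  # hypothetical key
    file_a = [("Ada  Lovelace", {"income": 100}), ("Alan Turing", {"income": 90})]
    file_b = [("ada lovelace", {"benefit": True}), ("Grace Hopper", {"benefit": False})]
    linked = link_tokens(tokenize_file(file_a, key), tokenize_file(file_b, key))
    print(linked)
```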
Private set intersection A type of secure multiparty computation that can be used to share information needed for linking records across ingredient data files (Freedman et al., 2004).
Query auditing A process that tracks which data a researcher accesses and checks access patterns for potential disclosure of confidential information (Nabar et al., 2008).
Reclaimed data Data used for purposes other than specified at the time of collection—whether those purposes were specified formally through informed consent, or through a public perception of use for administrative records.
Reidentification A process by which information is attributed to deidentified data to identify the individual to whom the deidentified data relate (Garfinkel et al., 2023).
Risk See Disclosure risks.
Secure multiparty computation “A computerized system that enables different participating entities in possession of private sets of data to link and aggregate their data sets for the exclusive purpose of performing a finite number of pre-approved computations without transferring or otherwise revealing any private data to each other or anyone else” (Student Right to Know Before You Go Act, 2022).2
Stakeholder “A person or organization who has a vested interest in the activities, a decision, or an outcome of the agency. A stakeholder may be a program beneficiary, decision-maker, partner, researcher, or even a program implementer” (The Data Foundation and the Center for Open Data Enterprise, 2023, p. 4).
Standard Application Process A uniform method for accessing federal confidential data assets to systematically provide permission to use protected data from any of the 16 federal statistical agencies and designated units for evidence building (Office of Management and Budget, 2022).
Synthetic data “[A] dataset that is similar to the original [sensitive] data, but where some or all of the resulting data elements are generated and do not map to actual individuals” (Garfinkel et al., 2023, p. 8).
Technical approaches Methods for disclosure risk limitation that operate on the data values to reduce disclosure risks. These include data perturbation methods and secure multiparty computation.
Transparency Also known as Kerckhoffs’s principle, a disclosure-management requirement stating that implementation details of a disclosure-avoidance mechanism, along with its privacy parameters, be made public. See Postprocessing invariance.
Usefulness The value of a data product for its intended purpose. Usefulness can have multiple dimensions, including statistical accuracy of estimates, fitness for use, accessibility of data products, and data equity.

___________________

2 See § 2(12).

Validation server A disclosure management approach in which the secondary data analyst requests that the agency run its statistical analysis on confidential data, and the agency provides outputs from the analysis, with disclosure protection, back to the analyst.
Verification server A disclosure management approach in which the agency provides disclosure-protected measures of the similarity of the confidential and synthetic data results (e.g., how much confidence intervals from the confidential and synthetic data analyses overlap) back to the analyst.
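As a rough illustration of the kind of similarity measure a verification server might report, the sketch below computes how much two confidence intervals overlap. It averages the shared length as a fraction of each interval, which is one possible choice among several. The function name `interval_overlap` is hypothetical, and the sketch omits the disclosure protection that would be applied to the measure before it is returned to the analyst.

```python
def interval_overlap(ci_confidential, ci_synthetic):
    """Return an overlap score in [0, 1] for two confidence intervals.

    Each interval is a (lower, upper) pair. The score averages the
    shared length as a fraction of each interval's own length, so
    identical intervals score 1.0 and disjoint intervals score 0.0.
    """
    lo_c, hi_c = ci_confidential
    lo_s, hi_s = ci_synthetic
    if hi_c <= lo_c or hi_s <= lo_s:
        raise ValueError("each interval must have lower < upper")
    shared = min(hi_c, hi_s) - max(lo_c, lo_s)
    if shared <= 0:
        return 0.0
    return 0.5 * (shared / (hi_c - lo_c) + shared / (hi_s - lo_s))


if __name__ == "__main__":
    # Hypothetical intervals for the same estimate from the two data sources.
    score = interval_overlap((0.0, 2.0), (1.0, 3.0))
    print(f"Interval overlap: {score:.2f}")
```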