Foundational Research Gaps and Future Directions for Digital Twins (2024)

Suggested Citation: "4 The Physical Counterpart: Foundational Research Needs and Opportunities." National Academies of Sciences, Engineering, and Medicine. 2024. Foundational Research Gaps and Future Directions for Digital Twins. Washington, DC: The National Academies Press. doi: 10.17226/26894.

4

The Physical Counterpart: Foundational Research Needs and Opportunities

Digital twins rely on observation of the physical counterpart in conjunction with modeling to inform the virtual representation (as discussed in Chapter 3). In many applications, these data will be multimodal, coming from disparate sources, and of varying quality. Only when high-quality, integrated data are combined with advanced modeling approaches can the synergistic strengths of data- and model-driven digital twins be realized. This chapter addresses data acquisition and data integration for digital twins. While significant literature has been devoted to the science and best practices around gathering and preparing data for use, this chapter focuses on the most important gaps and opportunities that are crucial for robust digital twins.

DATA ACQUISITION FOR DIGITAL TWINS

Data collection for digital twins is a continual process that plays a critical role in the development, refinement, and validation of the models that comprise the virtual representation.

The Challenges Surrounding Data Acquisition for Digital Twins

Undersampling in complex systems with large spatiotemporal variability is a significant challenge for acquiring the data needed to characterize and quantify the dynamic physical and biological systems for digital twin development.

The complex systems that may make up the physical counterpart of a digital twin often exhibit intricate patterns, nonlinear behaviors, feedback, and emergent phenomena that require comprehensive sampling in order to develop an understanding of system behaviors. Systems with significant spatiotemporal variability may also exhibit heterogeneity because of external conditions, system dynamics, and component interactions. However, constraints in resources, time, or accessibility may hinder the gathering of data at an adequate frequency or resolution to capture the complete system dynamics. This undersampling could result in an incomplete characterization of the system and lead to overlooking critical events or significant features, thus risking the accuracy and predictive capabilities of digital twins. Moreover, undersampling introduces a level of uncertainty that could propagate through a digital twin’s predictive models, potentially leading to inaccurate or misleading outcomes. Understanding and quantifying this uncertainty is vital for assessing the reliability and limitations of the digital twin, especially in safety-critical or high-stakes applications. To minimize the risk and effects of undersampling, innovative sampling approaches can be used to optimize data collection. Additionally, statistical methods designed for undersampled data may be leveraged to mitigate the effects of limited data.
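One way to make the idea of optimizing data collection concrete is a simple budget-allocation sketch. The function below (a hypothetical illustration, not a method from this report) distributes a fixed sampling budget across regions in proportion to each region's observed standard deviation, a Neyman-style allocation that directs measurements toward the most variable parts of the system:

```python
import math

def adaptive_sample_allocation(region_variances, total_budget):
    """Allocate a fixed sampling budget across regions in proportion to
    each region's observed standard deviation (Neyman-style allocation).

    Regions with more variable dynamics receive more samples, reducing
    the risk that undersampling misses critical events there.
    """
    weights = [math.sqrt(max(v, 0.0)) for v in region_variances]
    total = sum(weights)
    if total == 0.0:
        # No variability information: fall back to uniform allocation.
        base = total_budget // len(region_variances)
        return [base] * len(region_variances)
    raw = [w / total * total_budget for w in weights]
    alloc = [int(r) for r in raw]
    # Distribute any leftover samples to the largest fractional remainders.
    leftover = total_budget - sum(alloc)
    order = sorted(range(len(raw)), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc
```

With three regions of variance 1, 4, and 9 and a budget of 60 samples, the allocation is 10, 20, and 30 samples, respectively. Real adaptive sampling would re-estimate variances as data arrive and re-allocate over time.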

Finally, data acquisition efforts are often enhanced by a collaborative and multidisciplinary approach, combining expertise in data acquisition, modeling, and system analysis, to address the task holistically and with an understanding of how the data will move through the digital twin.

Data Accuracy and Reliability

Digital twin technology relies on the accuracy and reliability of data, which requires tools and methods to ensure data quality, efficient data storage, management, and accessibility. Standards and governance policies are critical for data quality, accuracy, and integrity, and frameworks play an important role in providing standards and guidelines for data collection, management, and sharing while maintaining data security and privacy (see Box 4-1). Efficient and secure data flow is essential for the success of digital twin technology, and research is needed to develop cybersecurity measures; methods for verifying trustworthiness, reliability, and accuracy; and standard methods for data flow to ensure compatibility between systems. Maintaining confidentiality and privacy is also vital.

Data quality assurance is a subtle problem that will need to be addressed differently in different contexts. For instance, a key question is how a digital twin should handle outlier or anomalous data. In some settings, such data may be the result of sensor malfunctions and should be detected and ignored, while in other settings, outliers may correspond to rare events that are essential to create an accurate virtual representation of the physical counterpart. A key research challenge for digital twins is the development of methods for data quality assessment that ensure digital twins are robust to spurious outliers while accurately representing salient rare events. Several technical challenges must be addressed here. Anomaly detection is central to identifying potential issues with data quality. While anomaly detection has been studied by the statistics and signal processing communities, unique challenges arise in the bidirectional feedback loop between virtual and physical systems that is inherent to digital twins, including the introduction of statistical dependencies among samples; the need for real-time processing; and heterogeneous, large-scale, multiresolution data. Another core challenge is that many machine learning (ML) and artificial intelligence (AI) methods that might be used to update virtual models from new physical data focus on maximizing average-case performance; that is, they may yield large errors on rare events. Developing digital twins that do not ignore salient rare events requires rethinking the loss functions and performance metrics used in data-driven contexts.
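As a rough illustration of the distinction between spurious outliers and salient rare events, the sketch below flags a reading that is far from a rolling median (in median-absolute-deviation units) and treats an isolated flag as a suspected sensor glitch but a persistent run of flags as a possible rare event. The class name, thresholds, and persistence rule are illustrative assumptions, not an established method:

```python
from collections import deque
from statistics import median

class StreamAnomalyFlagger:
    """Rolling robust anomaly detector (illustrative sketch).

    A reading far from the rolling median (in MAD units) is flagged.
    A single flagged reading is treated as a suspected sensor glitch;
    several consecutive flags suggest a genuine regime change, i.e., a
    salient rare event the digital twin should *not* discard.
    """
    def __init__(self, window=25, threshold=5.0, persist=3):
        self.buf = deque(maxlen=window)
        self.threshold = threshold
        self.persist = persist
        self.run = 0  # consecutive flagged readings

    def update(self, x):
        if len(self.buf) < 5:
            # Warm-up: too little history to judge anomalies.
            self.buf.append(x)
            return "normal"
        med = median(self.buf)
        mad = median(abs(v - med) for v in self.buf) or 1e-9
        # 1.4826 * MAD approximates a standard deviation for Gaussian noise.
        flagged = abs(x - med) / (1.4826 * mad) > self.threshold
        self.run = self.run + 1 if flagged else 0
        self.buf.append(x)
        if not flagged:
            return "normal"
        return "rare_event" if self.run >= self.persist else "glitch"
```

Because the buffer keeps flagged readings, a sustained shift gradually pulls the median toward the new regime, so the detector stops flagging once the change is absorbed. Production systems would add calibration of the threshold and persistence parameters per sensor.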

A fundamental challenge in decision-making may arise from discrepancies between the data streamed from the physical counterpart and the values predicted by the digital twin. In the case of an erroneous sensor on the physical counterpart, how can a human operator trust the output of the virtual representation, given that the supporting data were, at some point, obtained from the physical counterpart? While sensors and other data collection devices have reliability ratings, additional measures such as how reliability degrades over time may need to be taken into consideration. For example, a relatively new physical sensor whose output diverges from the digital twin’s prediction may point to errors in the virtual representation rather than in the physical sensor. One potential cause may be that the digital twin’s models did not have enough training data under the diverse operating conditions that capture the changing environment of the physical counterpart.

Data quality (e.g., ensuring that the data set is accurate, complete, valid, and consistent) is another major concern for digital twins. Consider data assimilation for the artificial pancreas or closed-loop pump (insulin and glucagon). The continuous glucose monitor has an error range, as does the glucometer check, which itself depends on compliance from the human user (e.g., washing hands before the glucose check). Data assimilation techniques for digital twins must be able to handle challenges with multiple inputs from the glucose monitor and the glucometer, especially if they provide very different glucose levels, report in different units used in different countries (e.g., mmol/L vs. mg/dL), or lack regular calibration of the glucometer. Assessing and documenting data quality, including completeness, the measures taken to curate the data, the tools used at each step of the way, and the benchmarks against which any model is evaluated, are integral parts of developing and maintaining a library of reproducible models that can be embedded in a digital twin system.
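The unit and discrepancy issues in this example can be illustrated with a small sketch. The conversion factor below is standard (glucose has a molar mass of about 180 g/mol), but the helper names and the 20 percent tolerance are illustrative assumptions rather than clinical rules:

```python
MGDL_PER_MMOLL = 18.016  # 1 mmol/L of glucose is approximately 18 mg/dL

def to_mgdl(value, unit):
    """Normalize a glucose reading to mg/dL (hypothetical helper)."""
    if unit == "mg/dL":
        return value
    if unit == "mmol/L":
        return value * MGDL_PER_MMOLL
    raise ValueError(f"unknown unit: {unit}")

def reconcile(cgm_mgdl, meter_mgdl, tolerance_pct=20.0):
    """Compare CGM and glucometer readings, both already in mg/dL.

    If they agree within tolerance, average them; otherwise defer to
    the fingerstick meter (typically the more accurate reference) and
    flag the discrepancy for the assimilation layer to handle.
    """
    gap_pct = abs(cgm_mgdl - meter_mgdl) / meter_mgdl * 100.0
    if gap_pct <= tolerance_pct:
        return (cgm_mgdl + meter_mgdl) / 2.0, False
    return meter_mgdl, True
```

For example, `reconcile(to_mgdl(5.0, "mmol/L"), 95.0)` first normalizes the CGM reading to about 90 mg/dL before comparing. A real assimilation pipeline would weight the inputs by their error models rather than averaging.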

Finding 4-1: Documenting data quality and the metadata that reflect the data provenance is critical.

Without clear guidelines for defining the objectives and use cases of digital twin technology, it can be challenging to identify critical components that significantly impact the physical system’s performance (VanDerHorn and Mahadevan 2021). The absence of standardized quality assurance frameworks makes it difficult to compare and validate results across different organizations and systems.

Finding 4-2: The absence of standardized quality assurance frameworks makes it difficult to compare and validate results across different organizations and systems. This is important for cybersecurity and information and decision sciences. Integrating data from various sources, including Internet of Things devices, sensors, and historical data, can be challenging due to differences in data format, quality, and structure.

Considerations for Sensors

Sensors provide timely data on the condition of the physical counterpart. Improvements in sensor integrity, performance, and reliability will all play a crucial role in advancing the reliability of digital twin technology; this requires research into sensor calibration, performance, maintenance, and fusion methods. Detecting and mitigating adversarial attacks on sensors, such as tampering or false data injection, is essential for preserving system integrity and prediction fidelity. Finally, multimodal sensors that combine multiple sensing technologies may enhance the accuracy and reliability of data collection. Data integration is explored further in the next section. A related set of research questions around optimal sensor placement, sensor steering, and sensor dynamic scheduling is discussed in Chapter 6.

DATA INTEGRATION FOR DIGITAL TWINS

Increased access to diverse and dynamic streams of data from sensors and instruments can inform decision-making and improve model reliability and robustness. The digital twin of a complex physical system often gets data in different formats from multiple sources with different levels of verification and validation (e.g., visual inspection, record of repairs and overhauls, and quantitative sensor data from a limited number of locations). Integrating data from various sources—including Internet of Things devices, sensors, and historical data—can be challenging due to differences in data format, quality, and structure. Data interoperability (i.e., the ability for two or more systems to exchange and use information from other systems) and integration are important considerations for digital twins, but current efforts toward semantic integration are not scalable. Adequate metadata are critical to enabling data interoperability, harmonization, and integration, as well as informing appropriate use (Chung and Jaffray 2021). The transmission and level of key information needed and how to incorporate it in the digital twin are not well understood, and efforts to standardize metadata exist but are not yet sufficient for the needs of digital twins. Developers and end users would benefit from collaboratively addressing the needed type and format of data prior to deployment.

Handling Large Amounts of Data

In some applications, data may be streaming at full four-dimensional resolution and coupled with applications on the fly. This produces very large volumes of data for processing. Due to the large and streaming nature of some data sets, all operations must run in continuous or on-demand modes (e.g., ML models need to be trained and applied on the fly, applications must operate in fully immersive data spaces, and data assimilation and data-handling architectures must be scalable). Specific challenges around data assimilation and the associated verification, validation, and uncertainty quantification efforts are discussed further in Chapter 5. Historically, data assimilation methods have been model-based and developed independently from data-driven ML models. In the context of digital twins, however, these two paradigms will require integration. For instance, ML methods used within digital twins need to be optimized to facilitate data assimilation with large-scale streaming data, and data assimilation methods that leverage ML models, architectures, and computational frameworks need to be developed.
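The core of model-based data assimilation can be illustrated at its smallest scale by a scalar Kalman update, which blends a model forecast with a new observation according to their relative uncertainties. The function below is a textbook sketch of this building block, not a digital-twin-ready implementation:

```python
def kalman_assimilate(x_prior, p_prior, obs, obs_var):
    """One scalar Kalman update.

    x_prior, p_prior: model forecast and its error variance.
    obs, obs_var:     new observation and its error variance.

    The gain weights the observation by the relative confidence in the
    forecast versus the sensor: a noisy sensor (large obs_var) moves the
    estimate only slightly, a trusted one moves it strongly.
    """
    gain = p_prior / (p_prior + obs_var)
    x_post = x_prior + gain * (obs - x_prior)
    p_post = (1.0 - gain) * p_prior  # assimilation always reduces variance
    return x_post, p_post
```

With equal forecast and observation variances the posterior lands halfway between forecast and observation, and the variance halves. Operational systems extend this to high-dimensional states (e.g., ensemble Kalman filters), which is where the scalability and ML-integration challenges described above arise.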

The scalability of data storage, movement, and management solutions becomes an issue as the amount of data collected from digital twin systems increases. In some settings, the digital twin will face computational resource constraints (e.g., as a result of power constraints); in such cases, low-power ML and data assimilation methods are required. Approaches based on subsampling data (i.e., using only a subset of the available data to update the digital twin’s virtual models) necessitate statistical and ML methods that operate reliably and robustly with limited data. Foundational research on the sample complexity of ML methods, as well as on pretrained and foundation models that require only limited data for fine-tuning, is essential to this endeavor. Additional approaches requiring further research and development include model compression, which facilitates the efficient evaluation of deployed models; dimensionality reduction (particularly in dynamic environments); and low-power hardware or firmware deployments of ML and data assimilation tools.
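Subsampling a high-rate stream under a fixed memory budget is a well-studied primitive. The classic reservoir sampling algorithm (Algorithm R) keeps a uniform random sample of k items from a stream of unknown length using only O(k) memory:

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Uniform random sample of k items from a stream of unknown length
    (Algorithm R). Memory use is O(k) regardless of stream length, which
    suits resource-constrained digital twin deployments.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Item i survives with probability k / (i + 1), which keeps
            # every item seen so far equally likely to be in the sample.
            j = rng.randint(0, i)  # inclusive bounds
            if j < k:
                reservoir[j] = item
    return reservoir
```

Note that uniform subsampling is exactly the regime where the rare-event concerns discussed earlier apply: a rare but salient reading is likely to be dropped, so weighted or anomaly-aware variants may be needed in practice.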

In addition, when streaming data are being collected and assimilated continuously, models must be updated incrementally. Online and incremental learning methods play an important role here. A core challenge is setting the learning rate in these models. The learning rate controls the extent to which the model retains its memory of past system states as opposed to adapting to new data. This rate, as well as other model hyperparameters, must be set and tuned on the fly, in contrast to the standard paradigm of offline tuning using holdout data from the same distribution as the training data. Methods for adaptively setting a learning rate, so that it is low enough to provide robustness to noise and other data errors when the underlying state is slowly varying yet can be increased when the state changes sharply (e.g., in hybrid or switched dynamical systems), are a critical research challenge for digital twins. Finally, note that the data quality challenges outlined above are present in the large-scale streaming setting as well, making adaptive model training in the presence of anomalies and outliers, which may correspond to either sensor failures or salient rare events, particularly challenging.
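A minimal sketch of such an adaptive learning rate is shown below: the tracker normally uses a small rate to average out noise, but boosts the rate when the prediction error is far above the typical error scale, suggesting a sharp state change. The class name, rates, and boost rule are illustrative assumptions, not a published method:

```python
class AdaptiveTracker:
    """Online mean tracker with an adaptive learning rate (sketch).

    Normally the rate is small (slow), so sensor noise averages out
    while the underlying state varies slowly. When the prediction
    error greatly exceeds the running error scale, a sharp state change
    is suspected and the larger rate (fast) is used so the model
    re-converges quickly.
    """
    def __init__(self, slow=0.05, fast=0.5, boost_factor=4.0):
        self.slow, self.fast, self.boost_factor = slow, fast, boost_factor
        self.estimate = None
        self.err_scale = None  # running scale of typical prediction errors

    def update(self, x):
        if self.estimate is None:
            self.estimate, self.err_scale = x, 1e-6
            return self.estimate
        err = abs(x - self.estimate)
        rate = self.fast if err > self.boost_factor * self.err_scale else self.slow
        self.estimate += rate * (x - self.estimate)
        self.err_scale += 0.1 * (err - self.err_scale)
        return self.estimate
```

The unresolved research issue the text points to is visible even here: a single large error triggers the boost, so this sketch cannot distinguish a sensor glitch from a genuine regime change without the kind of anomaly handling discussed earlier.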

Data Fusion and Synchronization

Digital twins can integrate data from different data streams, which provides a means to address missing data or data sparsity, but there are specific concerns regarding data synchronization (e.g., across scales) and data interoperability. For example, the heterogeneity of data sources (e.g., data from diverse sensor systems) can present challenges for data assimilation in digital twins. Specific challenges include the need to estimate the impact of missing data as well as the need to integrate data uncertainties and errors in future workflows. The integration of heterogeneous data requires statistical synthesis that spans macro to micro levels, multiple scales, and multiple fidelities. Moreover, approaches must be able to handle mismatched digital representations. Recent efforts in the ML community on multiview learning and joint representation learning of data from disparate sources (e.g., learning a joint representation space for images and their text captions, facilitating the automatic captioning of new images) provide a collection of tools for building models based on disparate data sources.

For example, in tumor detection using magnetic resonance imaging (MRI), results depend on the radiologist identifying the tumor and measuring the linear diameter manually (which is susceptible to inter- and intra-observer variability). There are efforts to automate the detection, segmentation, and/or measurement of tumors (e.g., using AI and ML approaches), but these are still vulnerable to upstream variability in image acquisition (e.g., a very small 2 mm tumor may be detected on a high-quality MRI but may not be visible on a poorer quality machine). Assimilating serial tumor measurement data is a complex challenge due to patients being scanned in different scanners with different protocols over time.

Challenges in data fusion and synchronization are further exacerbated by disparate sampling rates, complete or partial duplication of records, and different data collection contexts, which may result in seemingly contradictory data. The degree to which data collection is done in real time (or near real time) depends on the intended purpose of the digital twin system as well as on available resources. For example, an ambulatory care system has sporadic electronic health record data, while intensive care unit sensor data are acquired at a much faster sampling rate. Additionally, in some systems, data imputation to mitigate the effects of missing data will also require the development of imputation models learned from data.
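A simple baseline for the imputation problem is linear interpolation between the nearest observed neighbors in an irregularly sampled series. The sketch below assumes time stamps are sorted and uses `None` to mark missing readings; as the text notes, real imputation models for digital twins would be learned from data rather than fixed rules like this:

```python
def impute_linear(times, values):
    """Fill None gaps in an irregularly sampled series (times sorted
    ascending) by linear interpolation between the nearest observed
    neighbors. Leading/trailing gaps carry the nearest observed value,
    a common fallback when no second anchor point exists.
    """
    known = [(t, v) for t, v in zip(times, values) if v is not None]
    if not known:
        raise ValueError("no observed values to interpolate from")
    out = []
    for t, v in zip(times, values):
        if v is not None:
            out.append(v)
            continue
        before = [(kt, kv) for kt, kv in known if kt < t]
        after = [(kt, kv) for kt, kv in known if kt > t]
        if before and after:
            (t0, v0), (t1, v1) = before[-1], after[0]
            out.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
        else:
            # Edge of the series: carry the nearest observation.
            out.append(before[-1][1] if before else after[0][1])
    return out
```

Interpolation quietly assumes smooth dynamics between observations, which is exactly what fails around the rare events discussed earlier; estimating the uncertainty introduced by imputed values is part of the research gap the text identifies.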


Lack of standardization creates interoperability issues when integrating data from different sources.

Conclusion 4-1: The lack of adopted standards in data generation hinders the interoperability of data required for digital twins. Fundamental challenges include aggregating uncertainty across different data modalities and scales as well as addressing missing data. Strategies for data sharing and collaboration must address challenges such as data ownership and intellectual property issues while maintaining data security and privacy.

Challenges with Data Access and Collaboration

Digital twins are an inherently multidisciplinary and collaborative effort. Data from multiple stakeholders may be integrated and/or shared across communities. Strategies for data collaboration must address challenges such as data ownership, responsibility, and intellectual property issues prior to data usage and digital twin deployment.

Some of these challenges can be seen in Earth science research, which has been integrating data from multiple sources for decades. Since the late 1970s, Earth observing satellites have been taking measurements that provide a nearly simultaneous global estimate of the state of the Earth system. When combined through data assimilation with in situ measurements from a variety of platforms (e.g., surface stations, ships, aircraft, and balloons), they provide global initial conditions for a numerical model to produce forecasts and also provide a basis for development and improvement of models (Ackerman et al. 2019; Balsamo et al. 2018; Fu et al. 2019; Ghil et al. 1979). The combination of general circulation models of the atmosphere, coupled models of the ocean–atmosphere system, and Earth system models that include biogeochemical models of the carbon cycle together with global, synoptic observations and a data assimilation method represent a digital twin of the Earth system that can be used to make weather forecasts and simulate climate variability and change. Numerical weather prediction systems are also used to assess the relative value of different observing systems and individual observing stations (Gelaro and Zhu 2009).

KEY GAPS, NEEDS, AND OPPORTUNITIES

In Table 4-1, the committee highlights key gaps, needs, and opportunities for managing the physical counterpart of a digital twin. There are many gaps, needs, and opportunities associated with data management more broadly; here the committee focuses on those for which digital twins bring unique challenges. This is not meant to be an exhaustive list of all opportunities presented in the chapter. For the purposes of this report, prioritization of a gap is indicated by 1 or 2. While the committee believes all of the gaps listed are of high priority, gaps marked 1 may benefit from initial investment before moving on to gaps marked with a priority of 2.

TABLE 4-1 Key Gaps, Needs, and Opportunities for Managing the Physical Counterpart of a Digital Twin

Early and Preliminary Stages

- Standards to facilitate interoperability of data and models for digital twins (e.g., by regulatory bodies) are lacking. (Priority 1)
- Undersampling in complex systems with large spatiotemporal variability is a significant challenge for acquiring the data needed for digital twin development. This undersampling could result in an incomplete characterization of the system and lead to overlooking critical events or significant features. It also introduces uncertainty that could propagate through the digital twin’s predictive models, potentially leading to inaccurate or misleading outcomes. Understanding and quantifying this uncertainty is vital for assessing the reliability and limitations of the digital twin, especially in safety-critical or high-stakes applications. (Priority 2)
- Data imputation approaches for high-volume and multimodal data are needed. (Priority 2)

Some Research Base Exists But Additional Investment Required

- Tools are needed for data and metadata handling and management to ensure that data and metadata are gathered, stored, and processed efficiently. (Priority 1)
- There is a gap in the mathematical tools available for assessing data quality, determining appropriate utilization of all available information, understanding how data quality affects the performance of digital twin systems, and guiding the choice of an appropriate algorithm. (Priority 2)

REFERENCES

Ackerman, S.A., S. Platnick, P.K. Bhartia, B. Duncan, T. L’Ecuyer, A. Heidinger, G.J. Skofronick, N. Loeb, T. Schmit, and N. Smith. 2019. “Satellites See the World’s Atmosphere.” Meteorological Monographs 59(1):1–53.

Balsamo, G., A.A. Parareda, C. Albergel, C. Arduini, A. Beljaars, J. Bidlot, E. Blyth, et al. 2018. “Satellite and In Situ Observations for Advancing Global Earth Surface Modelling: A Review.” Remote Sensing 10(12):2038.

Chung, C., and D. Jaffray. 2021. “Cancer Needs a Robust ‘Metadata Supply Chain’ to Realize the Promise of Artificial Intelligence.” Cancer Research 81(23):5810–5812.

Fu, L.L., T. Lee, W.T. Liu, and R. Kwok. 2019. “50 Years of Satellite Remote Sensing of the Ocean.” Meteorological Monographs 59(1):1–46.

Gelaro, R., and Y. Zhu. 2009. “Examination of Observation Impacts Derived from Observing System Experiments (OSEs) and Adjoint Models.” Tellus A: Dynamic Meteorology and Oceanography 61(2):179–193.

Ghil, M., M. Halem, and R. Atlas. 1979. “Time-Continuous Assimilation of Remote-Sounding Data and Its Effect on Weather Forecasting.” Monthly Weather Review 107(2):140–171.

VanDerHorn, E., and S. Mahadevan. 2021. “Digital Twin: Generalization, Characterization and Implementation.” Decision Support Systems 145:113524.
