In the digital twin feedback flow from physical to virtual, inverse problem methodologies and data assimilation are required for combining physical observations and virtual models in a rigorous, systematic, and scalable way. This chapter addresses specific challenges for digital twins including calibration and updating on actionable time scales. These challenges represent foundational gaps in inverse problem and data assimilation theory, methodology, and computational approaches.
Digital twin calibration is the process of estimating numerical model parameters for individualized digital twin virtual representations. This task of estimating numerical model parameters and states that are not directly observable can be posed mathematically as an inverse problem, but the problem may be ill posed. Bayesian approaches can be used to incorporate expert knowledge that constrains solutions and predictions. It must be noted, however, that for some settings, specification of prior distributions can greatly impact the inferences that a digital twin is meant to provide—for better or for worse. Digital twins present specific challenges to Bayesian approaches, including the need for good priors that capture tails of distributions, the need to incorporate model errors and updates, and the need for robust and scalable methods under uncertainty and for high-consequence decisions. This presents a new class of open problems in the realm of inverse problems for large-scale complex systems.
The process of estimating numerical model parameters from data is an ill-posed problem, whereby the solution may not exist, may not be unique, or may not depend continuously on the data. The first two conditions are related to identifiability of solutions. The third condition is related to the stability of the problem; in some cases, small errors in the data may result in large errors in the reconstructed parameters. Bayesian regularization, in which priors are encoded using probability distribution functions, can be used to handle missing information, ill-posedness, and uncertainty. A specific challenge for digital twins is that standard priors—such as those based on simple Gaussian assumptions—may not be informative and representative for making high-stakes decisions. Also, due to the continuous feedback loop, updated models need to be included on the fly (without restarting from scratch). Moreover, the prior for one problem may be taken as the posterior from a previous problem, so it is important to assign probabilities to data and priors in a rigorous way such that the posterior probability is consistent when using a Bayesian framework.
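As a minimal illustration of how a prior supplies the missing information, the sketch below applies a Gaussian prior (equivalent to Tikhonov regularization at the maximum a posteriori point) to a small, ill-conditioned deconvolution-style problem; the forward operator, noise level, and prior scale are illustrative assumptions rather than features of any particular digital twin.

```python
# Minimal sketch: Bayesian (MAP) regularization of an ill-posed linear inverse problem.
# The forward operator, noise level, and prior scale are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n = 60
x_grid = np.linspace(0, 1, n)

# Forward operator: a smoothing (convolution-like) kernel, which is severely ill-conditioned.
A = np.exp(-((x_grid[:, None] - x_grid[None, :]) ** 2) / (2 * 0.03**2))
A /= A.sum(axis=1, keepdims=True)

m_true = np.sin(2 * np.pi * x_grid) + 0.5 * (x_grid > 0.5)   # "true" parameters
sigma = 0.01                                                 # observation noise level
d = A @ m_true + sigma * rng.standard_normal(n)              # noisy data

# Unregularized least squares amplifies the noise because A is ill-conditioned.
m_naive = np.linalg.lstsq(A, d, rcond=None)[0]

# A Gaussian prior m ~ N(0, tau^2 I) yields the MAP estimate
#   argmin ||A m - d||^2 / sigma^2 + ||m||^2 / tau^2,
# i.e., Tikhonov regularization; the prior stabilizes the reconstruction.
tau = 1.0
m_map = np.linalg.solve(A.T @ A / sigma**2 + np.eye(n) / tau**2, A.T @ d / sigma**2)

print("condition number of A :", f"{np.linalg.cond(A):.2e}")
print("error, least squares  :", f"{np.linalg.norm(m_naive - m_true):.2e}")
print("error, MAP estimate   :", f"{np.linalg.norm(m_map - m_true):.2e}")
```

The same structure carries over to nonlinear and large-scale settings, where the quality of the reconstruction depends directly on how informative and representative the prior is.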
Approaches that learn priors from existing data (e.g., machine learning–informed bias correction) can work well in data-rich environments but may not accurately represent or predict extreme events because relevant training data are limited. Bayesian formulations require priors for the unknown parameters, which may depend on expensive-to-tune hyperparameters. Data-driven regularization approaches that incorporate more realistic priors are necessary for digital twins.
Another key challenge is to perform optimization of numerical model parameters (and any additional hyperparameters) under uncertainty—any computational model must be calibrated to meet its requirements and be fit for purpose. In general, optimization under uncertainty is challenging because the cost functions are stochastic and must be able to incorporate different types of uncertainty and missing information. Bayesian optimization and stochastic optimization approaches (e.g., online learning) can be used, and some fundamental challenges—such as obtaining sensitivity information from legacy code with missing adjoints—are discussed in Chapter 6.
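As a hedged illustration of the stochastic-optimization idea, the sketch below calibrates a single model parameter from streaming noisy observations with an online stochastic-gradient update; the toy model, noise level, and step-size schedule are illustrative assumptions.

```python
# Minimal sketch: online (streaming) stochastic-gradient calibration of one model
# parameter under observation noise. The model, noise level, and step-size schedule
# are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
theta_true = 1.3   # "true" decay rate of the physical counterpart
theta = 0.5        # initial guess held by the digital twin
sigma = 0.05       # observation noise

def model(theta, t):
    return np.exp(-theta * t)

for k in range(1, 5001):
    # One new (t, y) observation arrives from the physical counterpart.
    t = rng.uniform(0.0, 2.0)
    y = model(theta_true, t) + sigma * rng.standard_normal()

    # Stochastic gradient of the per-sample squared-error loss 0.5 * (model - y)^2.
    resid = model(theta, t) - y
    grad = resid * (-t * np.exp(-theta * t))

    # Robbins-Monro (decaying) step size keeps the online update stable.
    theta -= (1.0 / k**0.6) * grad

print("estimated theta:", round(theta, 3), " true theta:", theta_true)
```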
These challenges are compounded for digital twin model calibration, especially when models are needed at multiple resolutions. Methods are needed for fast sampling of parametric and structural uncertainty. For digital twins to support high-consequence decisions, methods may need to be tuned to risk and extreme events, accounting for worst-case scenarios. Risk-adaptive loss functions and data-informed prior distribution functions for capturing extreme events and for incorporating risk during inversion merit further exploration. Non-differentiability also becomes a significant concern, as mathematical models may demonstrate discontinuous behavior or numerical artifacts may result in models that appear non-differentiable. Moreover, models may even be chaotic, which can be intractable for adjoint and tangent linear models. Standard loss functions, such as the least-squares loss, are not able to model chaotic behavior in the data (Royset 2023) and are not able to represent complex statistical distributions of model errors that arise from issues such as using a reduced or low-fidelity forward model. Robust and stable optimization techniques (beyond gradient-based methods) to handle new loss functions and to address high displacements (e.g., the upper tail of a distribution) that are not captured using only the mean and standard deviation are needed.
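As a hedged sketch of what a risk-adaptive loss can look like, the example below contrasts the mean of squared residuals with their empirical superquantile (conditional value at risk), which explicitly weights the upper tail that a least-squares objective largely ignores; the residual distribution and risk level are illustrative assumptions.

```python
# Minimal sketch: a risk-adaptive loss based on the superquantile (CVaR) of squared
# residuals, contrasted with the mean-squared-error loss. The residual distribution
# and the risk level alpha are illustrative assumptions.
import numpy as np

def superquantile(values, alpha):
    """Empirical superquantile via the Rockafellar-Uryasev formula:
    CVaR_alpha(X) = min_t { t + E[(X - t)_+] / (1 - alpha) }."""
    t = np.quantile(values, alpha)   # the alpha-quantile (approximately) minimizes over t
    return t + np.mean(np.maximum(values - t, 0.0)) / (1.0 - alpha)

rng = np.random.default_rng(2)
# Heavy-tailed residuals mimic occasional extreme model-data mismatches.
residuals = rng.standard_t(df=3, size=10_000)
losses = residuals**2

print("mean squared residual            :", round(np.mean(losses), 3))
print("superquantile (alpha = 0.95) loss:", round(superquantile(losses, 0.95), 3))
```

Minimizing such a superquantile-based objective, rather than the mean, is one way to incorporate risk and extreme events directly into the inversion.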
Data assimilation tools have been used heavily in numerical weather forecasting, and they can be critical for digital twins broadly, including to improve model states based on current observations. Still, there is more to be exploited in the bidirectional feedback flow between physical and virtual beyond standard data assimilation (Blair 2021).
First, existing data assimilation methods rely heavily on assumptions of high-fidelity models. However, due to the continual and dynamic nature of digital twins, the validity of a model’s assumptions—and thus the model’s fidelity—may evolve over time, especially as the physical counterpart undergoes significant shifts in condition and properties. A second challenge is the need to perform uncertainty quantification for high-consequence decisions on actionable time scales. This becomes particularly challenging for large-scale complex systems with high-dimensional parameter and state spaces. Direct simulations and inversions (e.g., in the case of variational methods) needed for data assimilation are no longer feasible. Third, with different digital technologies providing data at unprecedented rates, there are few mechanisms for integrating artificial intelligence, machine learning, and data science tools for updating digital twins.
Digital twins require continual feedback from the physical to virtual, often using partial and noisy observations. Updates to the twin should be incorporated in a timely way (oftentimes immediately), so that the updated digital twin may be used for further forecasting, prediction, and guidance on where to obtain new data. These updates may be initiated when something in the physical counterpart evolves or in response to changes in the virtual representation, such as improved model parameters, a higher-fidelity model that incorporates new physical understanding, or improvements in scale/resolution. Due to the continual nature of digital twins as well as the presence of errors and noise in the models, the observations, and the initial conditions, sequential data assimilation approaches (e.g., particle-based approaches and ensemble Kalman filters) are the natural choice for state and parameter estimation. However, these probabilistic approaches have some disadvantages compared to variational approaches, such as sampling errors, rank deficiency, and inconsistent assimilation of asynchronous observations.
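As a hedged illustration, the sketch below performs a single analysis step of a stochastic (perturbed-observation) ensemble Kalman filter on a toy three-dimensional state; the observation operator, error statistics, and ensemble size are illustrative assumptions.

```python
# Minimal sketch: one analysis step of a stochastic (perturbed-observation) ensemble
# Kalman filter. The toy state, observation operator, and ensemble size are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(3)
n_state, n_obs, n_ens = 3, 2, 50

H = np.array([[1.0, 0.0, 0.0],    # observe the first two state components
              [0.0, 1.0, 0.0]])
R = 0.1**2 * np.eye(n_obs)        # observation-error covariance

x_true = np.array([1.0, -0.5, 2.0])
y_obs = H @ x_true + 0.1 * rng.standard_normal(n_obs)

# Forecast ensemble: in practice this comes from propagating the model; here it is a
# biased, spread-out prior around the (unknown) truth.
X_f = x_true[:, None] + 0.3 + 0.5 * rng.standard_normal((n_state, n_ens))

# Sample covariances from the ensemble anomalies.
A = X_f - X_f.mean(axis=1, keepdims=True)
P_xy = A @ (H @ A).T / (n_ens - 1)               # state-observation cross covariance
P_yy = (H @ A) @ (H @ A).T / (n_ens - 1) + R     # innovation covariance

K = P_xy @ np.linalg.inv(P_yy)                   # Kalman gain

# Perturbed observations give each ensemble member its own consistent update.
Y = y_obs[:, None] + 0.1 * rng.standard_normal((n_obs, n_ens))
X_a = X_f + K @ (Y - H @ X_f)

print("forecast mean:", X_f.mean(axis=1).round(3))
print("analysis mean:", X_a.mean(axis=1).round(3))
print("truth        :", x_true)
```

Operational implementations add ingredients such as covariance localization and inflation precisely to mitigate the sampling-error and rank-deficiency issues noted above.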
Data assimilation techniques need to be adapted for continuous streams of data from different sources and need to interface with numerical models with potentially varying levels of uncertainty. These methods need to be able to infer system state under uncertainty when a system is evolving and be able to integrate model updates efficiently. Moreover, navigating discrepancies between predictions and observed data requires the development of tools for model update documentation and hierarchy tracking.
Conclusion 5-1: Data assimilation and model updating play central roles in the physical-to-virtual flow of a digital twin. Data assimilation techniques are needed for data streams from different sources and for numerical models with varying levels of uncertainty. A successful digital twin will require the continuous assessment of models. Traceability of model hierarchies and reproducibility of results are not fully considered in existing data assimilation approaches.
Most literature focuses on offline data assimilation, but the assimilation of real-time sensor data for digital twins to be used on actionable time scales will require advancements in data assimilation methods and tight coupling with the control or decision-support task at hand (see Chapter 6).
For example, the vast, global observing system of the Earth’s atmosphere and numerical models of its dynamics and processes are combined in a data assimilation framework to create initial conditions for weather forecasts. For a weather forecast to have value, it must be delivered within a short interval of real time, which requires a huge computational and communications apparatus for gathering, ingesting, processing, and assimilating global observations within a window of a few hours. High-performance computing implementations of state-of-the-art data assimilation codes and new data assimilation approaches that can exploit effective dimensionality within an optimization/outer-loop approach for obtaining optimal solutions (e.g., latent data assimilation to reduce the dimensionality of the data) are needed.
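As an illustrative sketch of the latent (reduced-dimension) idea, the example below performs the analysis in a low-dimensional space spanned by the leading principal components of an ensemble of model states; the state dimension, ensemble, and observation setup are assumptions chosen for illustration.

```python
# Minimal sketch: "latent" assimilation in a reduced space spanned by the leading
# principal components of an ensemble of model states. The state dimension, ensemble,
# and observation setup are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(4)
n_state, n_ens, r = 500, 40, 5     # full state size, ensemble size, latent dimension

# Ensemble of model states (a stand-in for expensive model output), built from a few
# smooth modes so that a low-dimensional latent space captures most of the variability.
grid = np.linspace(0.0, 1.0, n_state)
modes = np.stack([np.sin((k + 1) * np.pi * grid) for k in range(r)])
X = rng.standard_normal((n_ens, r)) @ modes + 0.01 * rng.standard_normal((n_ens, n_state))

x_mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - x_mean, full_matrices=False)
basis = Vt[:r]                      # leading PCA modes: map between latent and full state

# Sparse, noisy observations of an unknown true state.
x_true = np.array([1.0, -0.8, 0.5, 0.3, -0.2]) @ modes
obs_idx = rng.choice(n_state, size=30, replace=False)
y = x_true[obs_idx] + 0.05 * rng.standard_normal(obs_idx.size)

# Solve the small r-dimensional least-squares problem in latent space:
#   min_z || (x_mean + z @ basis)[obs_idx] - y ||^2
H_lat = basis[:, obs_idx].T         # latent-to-observation operator
z, *_ = np.linalg.lstsq(H_lat, y - x_mean[obs_idx], rcond=None)
x_analysis = x_mean + z @ basis

print("background error:", round(np.linalg.norm(x_mean - x_true), 3))
print("analysis error  :", round(np.linalg.norm(x_analysis - x_true), 3))
```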
Data assimilation provides a framework for combining model-based predictions and their uncertainties with observations, but it lacks the decision-making interface—including measures of risk—needed for digital twins. Bayesian estimation and inverse modeling provide the mathematical tools for quantifying uncertainty about a system. Given data, Bayesian parameter estimation can be used to select the best model and to infer posterior probability distributions for numerical model parameters. Forward propagation of these distributions then leads to a posterior prediction, in which the digital twin aids decision-making by providing an estimate of the prediction quantities of interest and their uncertainties. This process provides predictions and credible intervals for quantities of interest but relies heavily on prior assumptions and risk-informed likelihoods, as well as advanced computational techniques such as Gaussian process emulators for integrating various sources of uncertainty. For solving the Bayesian inference problem, sampling approaches such as Markov chain Monte Carlo are prohibitive because of the many thousands of forward-problem solves (i.e., model simulations) that would be needed. Machine learning has the potential to support uncertainty quantification through approaches such as diffusion models or other generative artificial intelligence methods that can capture uncertainties, but the lack of theory and the need for large ensembles and data sets provide additional challenges. Increasing computational capacity alone will not address these issues.
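The sketch below illustrates this workflow on a deliberately small problem: a grid-based posterior for a single model parameter (a stand-in for the expensive samplers and emulators discussed above) is propagated through the forward model to produce a credible interval for a predicted quantity of interest; the model, prior, and noise level are illustrative assumptions.

```python
# Minimal sketch: grid-based Bayesian calibration of one model parameter, followed by
# forward propagation of the posterior to a predicted quantity of interest. The model,
# prior, and noise level are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(5)

def forward(theta, t):
    return np.exp(-theta * t)        # toy forward model

theta_true, sigma = 0.8, 0.02
t_obs = np.linspace(0.1, 1.0, 10)
d = forward(theta_true, t_obs) + sigma * rng.standard_normal(t_obs.size)

# Posterior on a parameter grid: Gaussian likelihood times a Gaussian prior N(1, 0.5^2).
theta_grid = np.linspace(0.0, 2.0, 2001)
misfit = ((forward(theta_grid[:, None], t_obs) - d) ** 2).sum(axis=1)
log_post = -0.5 * misfit / sigma**2 - 0.5 * ((theta_grid - 1.0) / 0.5) ** 2
post = np.exp(log_post - log_post.max())
post /= post.sum()                   # normalized posterior weights on the grid

# Forward-propagate posterior samples to the quantity of interest: the model at t = 2.
samples = rng.choice(theta_grid, size=5000, p=post)
qoi = forward(samples, 2.0)
lo, hi = np.percentile(qoi, [2.5, 97.5])
print(f"predicted QoI: {qoi.mean():.3f}   95% credible interval: [{lo:.3f}, {hi:.3f}]")
print(f"true QoI     : {forward(theta_true, 2.0):.3f}")
```

For realistic digital twins the parameter space is far too large for a grid, which is exactly why scalable sampling, emulation, and machine learning surrogates are needed.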
For many digital twins, the sheer number of numerical model parameters that need to be estimated and updated can present computational issues of tractability and identifiability. For example, a climate model may have hundreds of millions of spatial degrees of freedom. Performing data assimilation and optimization under uncertainty for such large-scale complex systems is not feasible. Strategies include reducing the dimensionality of the numerical model parameters via surrogate models (see Chapter 3), imposing structure or more informative priors (e.g., using Bayesian neural networks or sparsity-promoting regularizers), and developing goal-oriented approaches for problems where quantities of interest from predictions can be identified and estimated directly from the data. Goal-oriented approaches for optimal design, control, and decision support are addressed in Chapter 6.
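As a minimal sketch of dimension reduction in the inversion itself, the example below estimates only the best-informed parameter directions of a linear forward operator via a truncated singular value decomposition; the operator, truncation rank, and noise level are illustrative assumptions.

```python
# Minimal sketch: reducing a high-dimensional calibration problem by inverting only in
# the dominant low-dimensional subspace of the forward operator (truncated SVD). The
# operator, truncation rank, and noise level are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(6)
n_params, n_data, rank = 2000, 300, 30

# Forward operator G = U diag(s) V^T with decaying singular values: only a modest
# number of parameter directions are well informed by the data.
U, _ = np.linalg.qr(rng.standard_normal((n_data, n_data)))
V, _ = np.linalg.qr(rng.standard_normal((n_params, n_data)))
s = 10.0 ** (-0.05 * np.arange(n_data))
G = (U * s) @ V.T

m_true = V[:, :10] @ rng.standard_normal(10)          # truth lies in informed directions
d = G @ m_true + 1e-3 * rng.standard_normal(n_data)   # noisy data

# Truncated-SVD inversion: estimate only the `rank` best-informed parameter modes
# instead of all 2,000 parameters; weakly informed directions stay at their
# background value of zero.
Ur, sr, Vrt = np.linalg.svd(G, full_matrices=False)
m_hat = Vrt[:rank].T @ ((Ur[:, :rank].T @ d) / sr[:rank])

rel_err = np.linalg.norm(m_hat - m_true) / np.linalg.norm(m_true)
print("relative error of truncated-SVD estimate:", round(rel_err, 3))
```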
For data-rich scenarios, there are fundamental challenges related to the integration of the massive amounts of observational data being collected. For example, novel atmospheric observational platforms (e.g., smart devices that can sense atmospheric properties such as temperature) provide a diversity of observational frequencies, densities, and error characteristics. This provides an opportunity for more rapid and timely updating of the state of the atmosphere in the digital twin, but it also presents a challenge for existing data assimilation techniques, which are not able to utilize all of the information from these various types of instruments.
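As a simple, hedged illustration of combining observations with different error characteristics, the sketch below fuses a sparse, accurate source with a dense, noisy source using inverse-variance weights; the sources and error levels are illustrative assumptions.

```python
# Minimal sketch: combining two observation sources with different (assumed known)
# error variances via inverse-variance weighting. The sources and error levels are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(7)
x_true = 2.0   # scalar atmospheric quantity being estimated

# Source A: sparse but accurate; Source B: dense but noisy (e.g., crowd-sourced
# smart-device readings).
sigma_a, sigma_b = 0.05, 0.5
y_a = x_true + sigma_a * rng.standard_normal(5)
y_b = x_true + sigma_b * rng.standard_normal(500)

# Inverse-variance (precision) weighting gives the minimum-variance unbiased combination.
w_a, w_b = y_a.size / sigma_a**2, y_b.size / sigma_b**2
x_hat = (w_a * y_a.mean() + w_b * y_b.mean()) / (w_a + w_b)

print("source A alone:", round(y_a.mean(), 3))
print("source B alone:", round(y_b.mean(), 3))
print("combined      :", round(x_hat, 3), " (true value 2.0)")
```

Real observing systems add correlated errors, biases, and quality-control issues, which is where the limitations of existing assimilation techniques become apparent.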
Conclusion 5-2: Data assimilation alone lacks the learning ability needed for a digital twin. The integration of data science with tools for digital twins (including inverse problems and data assimilation) will provide opportunities to extract new insights from data.
In Table 5-1, the committee highlights key gaps, needs, and opportunities for enabling the feedback flow from the physical counterpart to the virtual representation of a digital twin. This is not meant to be an exhaustive list of all opportunities presented in the chapter. For the purposes of this report, prioritization of a gap is indicated by 1 or 2. While the committee believes all of the gaps listed are of high priority, gaps marked 1 may benefit from initial investment before moving on to gaps marked with a priority of 2.
TABLE 5-1 Key Gaps, Needs, and Opportunities for Enabling the Feedback Flow from the Physical Counterpart to the Virtual Representation of a Digital Twin
| Maturity | Priority |
|---|---|
| Early and Preliminary Stages | |
| Tools for tracking model and related data provenance (i.e., maintaining a history of model updates and tracking model hierarchies) to handle scenarios where predictions do not agree with observed data are limited. Certain domains and sectors have had more success, such as the climate and atmospheric sciences. | 1 |
| New uncertainty quantification methods for large-scale problems that can capture extreme behavior and provide reliable uncertainty and risk analysis are needed. New data assimilation methods that can handle more channels of data and data coming from multiple sources at different scales with different levels of uncertainty are also needed. | 1 |
| Some Research Base Exists But Additional Investment Required | |
| Risk-adaptive loss functions and data-informed prior distribution functions for capturing extreme events and for incorporating risk during inversion are needed. Also needed are robust and stable optimization techniques (beyond gradient-based methods) to handle new loss functions and to address high displacements (e.g., the upper tail of a distribution) that are not captured using only the mean and standard deviation. | 1 |
| High-performance computing implementations of state-of-the-art data assimilation codes (ranging from high-dimensional particle filters to well-studied ensemble Kalman filters, or emulators) and new data assimilation approaches that can exploit effective dimensionality within an optimization/outer-loop approach for obtaining optimal solutions (e.g., latent data assimilation to reduce the dimensionality of the data) are needed. | 2 |
| Machine learning has the potential to support uncertainty quantification through approaches such as diffusion models or other generative artificial intelligence methods that can capture uncertainties, but the lack of theory and the need for large ensembles and data sets provide additional challenges. | 2 |
| Standards and governance policies are critical for data quality, accuracy, security, and integrity, and frameworks play an important role in providing standards and guidelines for data collection, management, and sharing while maintaining data security and privacy. | 1 |
| Research Base Exists with Opportunities to Advance Digital Twins | |
| New approaches that incorporate more realistic prior distributions or data-driven regularization are needed. Since uncertainty quantification is often necessary, fast Bayesian methods will need to be developed to make solutions operationally practical. | 2 |
Blair, G.S. 2021. “Digital Twins of the Natural Environment.” Patterns 2(10):1–3.
Royset, J.O. 2023. “Risk-Adaptive Decision-Making and Learning.” Presentation to the Committee on Foundational Research Gaps and Future Directions for Digital Twins. February 13. Washington, DC.