The panel’s experiment was successful in demonstrating that microsimulation model validation is feasible. Methods currently exist that enable analysts to measure the degree of variability attributable to the use of alternative components. Such information helps analysts assess overall model uncertainty and identify which components to examine further in order to improve the model. Thus, sensitivity analysis methods, especially when augmented with comparison values in an external validation, provide a great deal of information with which to direct model development efforts as well as to measure model uncertainty.
These methods demonstrated that there is a great deal of uncertainty due to changes in these three modules in TRIM2; the choice of which model version to use therefore makes a great difference. On the other hand, it is not clear that any one of the 16 versions has an advantage over the others. (If there is a winner, it is the current version of TRIM2.) Certainly, for individual responses, certain versions fared better. However, given that the experiment represents only a single replication, we hesitate to conclude that these results confirm any real modeling advantage for a particular version.
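To make the notion of variability attributable to alternative components concrete, the sketch below shows one way such a decomposition could be computed: each version is treated as a combination of one alternative per module, and the share of between-version variance associated with each module choice is reported. The module names, the 2 × 2 × 4 layout used to obtain 16 versions, and the output values are hypothetical placeholders, not results from the TRIM2 experiment.

```python
import random
from itertools import product
from statistics import mean, pvariance

random.seed(0)

# Hypothetical design: each model version combines one alternative per module.
# Module names, alternatives, and the 2 x 2 x 4 layout (16 versions) are
# placeholders, not the actual TRIM2 modules or results.
modules = {
    "aging": ["A1", "A2"],
    "imputation": ["I1", "I2"],
    "participation": ["P1", "P2", "P3", "P4"],
}

# One output value per assembled version (invented numbers standing in for
# a simulated program estimate).
versions = list(product(*modules.values()))
output = {v: 100.0 + random.gauss(0.0, 5.0) for v in versions}

total_var = pvariance(output.values())

# Main-effect share for each module: variance of the per-alternative means,
# relative to the total variance across all 16 versions.
for i, (name, alternatives) in enumerate(modules.items()):
    group_means = [
        mean(out for v, out in output.items() if v[i] == alt)
        for alt in alternatives
    ]
    share = pvariance(group_means) / total_var
    print(f"{name}: {share:.1%} of between-version variance")
```

In a balanced layout of this kind, the variance of the per-alternative means is a simple main-effect summary of how much a given module choice moves the output.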
Because the experiment did not attempt to measure the intramodel variance of the outputs of the various versions of TRIM2, we cannot compare the uncertainty attributable to the choice of modules with the variance within any single version. Therefore, it is difficult to assign a priority to the development of variance estimates vis-à-vis the use of sensitivity analysis. We do believe that both are sufficiently important to warrant investigation.
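To illustrate why the missing intramodel variance matters, a minimal sketch of the comparison it would permit is given below: with replicate stochastic runs of each version, within-version variance could be set against between-version variance. The replicate values are invented placeholders; the experiment itself produced no such replicates, which is exactly the gap noted above.

```python
from statistics import mean, pvariance

# Hypothetical replicate outputs: several stochastic runs of each model version.
# (The actual experiment produced only one run per version, so these numbers
# are placeholders meant to show the decomposition, not real TRIM2 output.)
runs_by_version = {
    "version_01": [101.2, 99.8, 100.5],
    "version_02": [104.1, 103.6, 104.9],
    "version_03": [97.3, 98.0, 96.8],
}

version_means = [mean(runs) for runs in runs_by_version.values()]

# Between-version variance: how much the versions disagree on average.
between = pvariance(version_means)

# Within-version (intramodel) variance: average spread of replicates per version.
within = mean(pvariance(runs) for runs in runs_by_version.values())

print(f"between-version variance: {between:.2f}")
print(f"within-version variance:  {within:.2f}")
print(f"ratio (between / within): {between / within:.1f}")
```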
It should be stressed that the present experiment was purely illustrative. The benefits to be obtained from a continued process of validation are rarely evident from the study of a single situation. There is an important question about the degree to which studies of this sort, applied to the same model in different modeling situations, would represent replications in any sense. However, even if the studies are not replications, use of these methods will provide evidence of general trends in model performance. Their use will generate a great deal of information about the situations in which a model performs well and can be trusted to provide accurate results.
While a convincing case for the feasibility of sensitivity analyses and external validation has been made here, the experiment was not cheap. (See Table 8 for a description of the costs of the experiment.) The Urban Institute estimated that loaded staff costs to conduct the experiment were about $60,000 for 1,400 person-hours of effort or, roughly, 35 person-weeks. These estimates are certainly open to some question, because it was difficult for the Urban Institute to separate activities that were needed for the experiment from its own day-to-day work. In addition, the time required to specify the experiment and analyze the data was not included, and there are no estimates of computer costs.
TABLE 8 Direct Time and Cost of Urban Institute Staff on Experiment
However, the overall impression from the cost and time estimates is not open to question. The way in which TRIM2 (and most other microsimulation models) is currently configured can make a sensitivity analysis very costly. These costs were dramatically affected by the interest in trying out different forms of aging; the overall cost would have been substantially reduced, possibly by 20 percent, had an easier module been selected for the experiment. On the other hand, other factors were not investigated because the costs of working with them would have been even higher. It is obvious that, for model validation to become a routine part of the model development and policy analysis process, the structure of the next generation of models must facilitate the type of module substitution that is necessary for sensitivity analyses.
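As a sketch of what facilitating module substitution might look like, the example below registers interchangeable module implementations behind a common calling convention so that alternative versions can be assembled and run in a loop for a sensitivity analysis. The interface, names, and state representation are invented for illustration and do not describe TRIM2's actual structure.

```python
from itertools import product
from typing import Callable, Dict

# A "module" here is any callable that transforms the simulation state.
# The state representation (a plain dict) and all names are illustrative only.
ModuleFn = Callable[[dict], dict]

registry: Dict[str, Dict[str, ModuleFn]] = {
    "aging": {
        "static": lambda state: {**state, "aged": "static"},
        "dynamic": lambda state: {**state, "aged": "dynamic"},
    },
    "participation": {
        "baseline": lambda state: {**state, "participation": 0.60},
        "revised": lambda state: {**state, "participation": 0.65},
    },
}

def run_version(choices: Dict[str, str], base_state: dict) -> dict:
    """Assemble one model version from named module alternatives and run it."""
    state = dict(base_state)
    for slot, alternative in choices.items():
        state = registry[slot][alternative](state)
    return state

# A sensitivity analysis is then a loop over the cross-product of alternatives.
slots = list(registry)
for combo in product(*(registry[s] for s in slots)):
    choices = dict(zip(slots, combo))
    result = run_version(choices, {"population": 10_000})
    print(choices, "->", result)
```

When module boundaries are expressed as swappable components of this kind, the marginal cost of each additional version in a sensitivity analysis is a run of the loop rather than a reconfiguration of the model.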