
Suggested Citation: "CONCLUDING NOTE." National Research Council. 1991. Improving Information for Social Policy Decisions -- The Uses of Microsimulation Modeling: Volume II, Technical Papers. Washington, DC: The National Academies Press. doi: 10.17226/1853.

matching, although it is possible, but difficult, to apply this technique to the case of constrained statistical matching.

Rather than selecting the closest match in file B to each record in file A, identify the closest k records. It is unclear what k should be; it would depend on the size of the classes within which matching is permitted, with larger k’s chosen for larger classes. It is likely that setting k to values close to 5 would work most of the time. Three statistically matched files can then be created: (1) the usual unconstrained statistical match, using the closest match in file B to every record in file A and assuming conditional independence; (2) a negative conditional correlation statistical match, in which one of the k nearest records in file B is matched to each record in file A, chosen so that “high” values of Y are paired with “low” values of Z, and vice versa; and (3) a positive conditional correlation statistical match, constructed as in (2) but pairing “high” values of Y with “high” values of Z and “low” with “low.” If one has primary interest in a particular variable contained in Y and another contained in Z, “high” and “low” can simply mean above and below that variable’s mean. However, if several variables contained in Y and Z are important and the conditional independence assumption is a concern, then one could either repeat this process for each pair of interest or use a multivariate notion of “high” and “low.”

After forming these three statistically merged data files, one would repeat the analysis on each file. If the results were similar, the assumption of conditional independence probably is not crucial; otherwise, the results are open to question.
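The three-file procedure described above can be sketched in a few lines of Python with NumPy. This is only an illustration under stated assumptions, not a method given in the text: the function name `knn_sensitivity_match`, the Euclidean distance on the matching variables X, the synthetic data, and the rule of taking the lowest or highest Z among the k candidates (rather than comparing Z with its overall mean) are all hypothetical choices made for the example. A single Y variable and a single Z variable are assumed, and "high" Y means above the mean of Y.

```python
import numpy as np


def knn_sensitivity_match(x_a, y_a, x_b, z_b, k=5):
    """Pick a donor from file B for each file A record in three ways.

    Returns three integer arrays of row indices into file B:
    (closest, negative, positive).
    """
    y_a = np.asarray(y_a, dtype=float)
    z_b = np.asarray(z_b, dtype=float)
    x_a = np.asarray(x_a, dtype=float).reshape(len(y_a), -1)
    x_b = np.asarray(x_b, dtype=float).reshape(len(z_b), -1)
    k = min(k, len(z_b))

    # Euclidean distance on X (an assumed distance function).
    # Builds a full n_a x n_b matrix, so it suits small files only.
    dist = np.linalg.norm(x_a[:, None, :] - x_b[None, :, :], axis=2)
    # Indices of the k nearest donors for each recipient, nearest first.
    nearest = np.argsort(dist, axis=1, kind="stable")[:, :k]

    rows = np.arange(len(y_a))
    z_cand = z_b[nearest]  # candidate Z values, shape (n_a, k)
    lowest_z = nearest[rows, z_cand.argmin(axis=1)]
    highest_z = nearest[rows, z_cand.argmax(axis=1)]

    high_y = y_a > y_a.mean()  # "high" Y = above the mean of Y
    closest = nearest[:, 0]    # (1) usual unconstrained match
    negative = np.where(high_y, lowest_z, highest_z)  # (2) high Y with low Z
    positive = np.where(high_y, highest_z, lowest_z)  # (3) high Y with high Z
    return closest, negative, positive


# Synthetic illustration: Y and Z each depend on X plus independent noise.
rng = np.random.default_rng(0)
x_a, x_b = rng.normal(size=300), rng.normal(size=400)
y_a = x_a + rng.normal(size=300)
z_b = x_b + rng.normal(size=400)

matches = knn_sensitivity_match(x_a, y_a, x_b, z_b, k=5)
for name, donors in zip(("unconstrained", "negative", "positive"), matches):
    # Repeat the same analysis (here, a simple correlation) on each file.
    corr = np.corrcoef(y_a, z_b[donors])[0, 1]
    print(f"{name:>13}: corr(Y, Z) = {corr:.3f}")
```

If the quantity of interest barely moves across the three files, the conditional independence assumption is probably not driving it; a wide spread suggests the opposite. For files of the size typical in microsimulation, the full distance matrix used here would have to be replaced by matching within classes or by a nearest-neighbor index.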

CONCLUDING NOTE

The specific application of statistical matching as input to microsimulation models (possibly the most extensive use of the methodology, but certainly not the only one) makes certain demands on the data set that must be recognized when producing statistically matched files for this purpose. Microsimulation models often operate on fairly large data sets. If the model is of national scope and is based on individuals or households, files on the order of 50,000 records or more are typical. Data sets of this size or larger make constrained statistical matching computationally intensive, especially given the cost of repeating the matching process when its variance is estimated with a sample reuse technique. In addition, the complexity of the policy issues (for example, eligibility for various welfare programs, income taxes, and health expenditures) requires that the data sets cover a wide range of variables. If there are many matching variables, say, more than five or six, matching error increases. If there are many Y or Z variables, there are likely to be several uncorrelated pairs, which complicates the choice of a distance function in the match.

Furthermore, the extensive use of controlling to accepted totals on the statistically matched files needs to be considered. Rubin’s point about the relative efficacy of constrained versus unconstrained statistical matching depends strongly on whether various control totals are going to be used after the statistical match. Also, Klevmarken’s points about the limits of statistical operations that one can safely apply to a statistically matched data set have only been considered in the regression context. His points should also be considered for other models such as logistic regression (found in some participation models of microsimulation models) and iterative proportional fitting.

Finally, it is not at all clear what impact processes such as static or dynamic aging of the data, or the use of various behavioral models, have on a statistically matched data set. The sensitivity of the results to the conditional independence assumption may well be heightened by such data-intensive procedures.

The use of what one might call “classical” statistical matching in microsimulation models, that is, adopting the conditional independence assumption without evidence, is very likely to misinform. At the very least, some of the sensitivity analysis described above should be performed to assess the likely effect of a failure of the assumption. If the results are not sensitive to the conditional independence assumption, and the bias introduced through the matching process is also tested and found to be small, then the results are likely to be useful. If the results are sensitive to the conditional independence assumption, to the matching bias, or to both, a “classical” statistical match should not be used. These conclusions hold (almost) regardless of the application of the statistical match. They are even more crucial for statistical matching as input to microsimulation models, since these files are further manipulated by aging routines, monthly allocation routines, behavioral models, various sorts of controlling to independent totals, etc.

Rodgers (1984:101) summarized:

On the basis of these simulations, which confirm the caution arising from the absence of any mathematical justification for statistical matching, it seems clear that statistical matching may not in general be an acceptable procedure for estimating relationships between Y and Z variables, or for any type of multivariate analysis involving both Y and Z variables.

Paass (1985:9.3–15) summarized:

At the current state of knowledge SM [statistical matching] is more an art than an exact and reliable technique. Therefore SM methods should be employed only if the CIA [conditional independence assumption] can be verified or replaced by additional information and the demands on the data are not very high.

It seems as if microsimulation models place very high demands on data, and those words of caution should be heeded.

However, it is important to remember the important function statistical
