Improving Information for Social Policy Decisions -- The Uses of Microsimulation Modeling: Volume II, Technical Papers (1991)

Chapter: Limitations in Modeling

Suggested Citation: "Limitations in Modeling." National Research Council. 1991. Improving Information for Social Policy Decisions -- The Uses of Microsimulation Modeling: Volume II, Technical Papers. Washington, DC: The National Academies Press. doi: 10.17226/1853.

where (for ease of notation) Y represents Y_i, Z represents Z_j, and X represents X_k(A) or X_k(B). Often, as discussed above, the X(A) variables are selected so that variables in Y and Z will be well explained by X(A). Implicitly it is reasoned that if both ρ_YX and ρ_ZX are close to 1, then the numerator of ρ_YZ.X will be close to 0, or, what here amounts to the same thing, ρ_YZ will be close to 1. To some extent this reasoning is valid, but it is surprising how variable the correlation between Y_i and Z_j, ρ_YZ, can be even when ρ_YX and ρ_ZX are fairly close to 1. This variability is disturbing since the estimation of these correlations is presumably a major reason the statistical match was performed.

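The variability of ρ_YZ can be seen from the above formula. Setting ρ_YZ.X equal to −1 and 1 gives the bounds

$$\rho_{YX}\rho_{ZX} - \sqrt{(1-\rho_{YX}^2)(1-\rho_{ZX}^2)} \;\le\; \rho_{YZ} \;\le\; \rho_{YX}\rho_{ZX} + \sqrt{(1-\rho_{YX}^2)(1-\rho_{ZX}^2)}.$$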

To take an example from Rodgers (1984), assume that ρ_YX and ρ_ZX both equal 0.8. Then ρ_YZ ranges from 0.28 to 1.0. More generally, we see that

$$\rho_{YZ} = \rho_{YX}\rho_{ZX} + \rho_{YZ.X}\sqrt{(1-\rho_{YX}^2)(1-\rho_{ZX}^2)},$$

and the correlation between Y_i and Z_j is completely determined by ρ_YX and ρ_ZX only when at least one of them is essentially 1, or when ρ_YZ.X equals 0. Thus, knowledge of the relationships between X(B) and Z_j and between X(A) and Y_i, drawn from different files, is typically not sufficient to determine the relationship between Y_i and Z_j. Armstrong (1990:1) points out:

Distortion of type (iii) [distortion in the multivariate distribution of X, Y, and Z] is often unavoidable when statistical matching methods are employed. Statistical matching methods involve the assumption that Y and Z are independent conditional on X. When this assumption is violated, type (iii) distortion is inevitable.
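The distortion Armstrong describes can be illustrated with a small simulation. The sketch below is not taken from the report. It assumes standardized normal variables, a single matching variable X, and nearest-neighbor matching with replacement; the function names and parameter values are hypothetical choices made for illustration. Y and Z are generated with a nonzero partial correlation given X, the two files are matched on X alone, and the correlation in the matched file is compared with the correlation in a complete sample.

```python
import bisect
import math
import random


def pearson(u, v):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    s_uv = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    s_uu = sum((a - mean_u) ** 2 for a in u)
    s_vv = sum((b - mean_v) ** 2 for b in v)
    return s_uv / math.sqrt(s_uu * s_vv)


def draw_xyz(rng, rho_yx, rho_zx, rho_yz_x):
    """Draw one standardized (X, Y, Z) triple with the given correlations."""
    x, e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)
    # Residuals of Y and Z given X are built to have correlation rho_yz_x.
    res_y = e1
    res_z = rho_yz_x * e1 + math.sqrt(1 - rho_yz_x ** 2) * e2
    y = rho_yx * x + math.sqrt(1 - rho_yx ** 2) * res_y
    z = rho_zx * x + math.sqrt(1 - rho_zx ** 2) * res_z
    return x, y, z


def simulate_match(n=4000, rho_yx=0.8, rho_zx=0.8, rho_yz_x=0.5, seed=1):
    """Return (full-sample corr(Y, Z), matched-file corr(Y, Z))."""
    rng = random.Random(seed)
    params = (rho_yx, rho_zx, rho_yz_x)

    # A complete sample, used only as the benchmark for the true correlation.
    full = [draw_xyz(rng, *params) for _ in range(n)]
    true_r = pearson([y for _, y, _ in full], [z for _, _, z in full])

    # File A keeps (X, Y); file B is an independent sample that keeps (X, Z).
    file_a = [(x, y) for x, y, _ in full]
    file_b = sorted((x, z) for x, _, z in
                    (draw_xyz(rng, *params) for _ in range(n)))
    b_x = [x for x, _ in file_b]

    # Match each file A record to the file B record with the closest X.
    matched_z = []
    for x, _ in file_a:
        i = bisect.bisect_left(b_x, x)
        candidates = [j for j in (i - 1, i) if 0 <= j < n]
        best = min(candidates, key=lambda j: abs(b_x[j] - x))
        matched_z.append(file_b[best][1])

    matched_r = pearson([y for _, y in file_a], matched_z)
    return true_r, matched_r


if __name__ == "__main__":
    true_r, matched_r = simulate_match()
    print(f"full-sample corr(Y, Z): {true_r:.3f}")
    print(f"matched-file corr(Y, Z): {matched_r:.3f}")
```

With these settings the full-sample correlation should come out near 0.82 (that is, 0.64 + 0.5 × 0.36) and the matched-file correlation near ρ_YX ρ_ZX = 0.64, up to sampling noise. The matched file reproduces the value implied by conditional independence rather than the true value.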

Moreover, when one of the correlations, ρ_YX or ρ_ZX, is essentially equal to 1, what is the benefit of statistical matching? In that case one could use the linear combination of X(A) as a surrogate for the missing covariate.
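The range of ρ_YZ consistent with given values of ρ_YX and ρ_ZX follows from the standard partial correlation identity, ρ_YZ = ρ_YX ρ_ZX + ρ_YZ.X √[(1 − ρ_YX²)(1 − ρ_ZX²)], with ρ_YZ.X between −1 and 1. The following minimal sketch computes that range; the function names are hypothetical and are introduced here only for illustration.

```python
import math


def rho_yz_from_partial(rho_yx, rho_zx, rho_yz_x):
    """Correlation of Y and Z implied by the partial correlation identity."""
    for r in (rho_yx, rho_zx, rho_yz_x):
        if not -1.0 <= r <= 1.0:
            raise ValueError("correlations must lie in [-1, 1]")
    spread = math.sqrt((1 - rho_yx ** 2) * (1 - rho_zx ** 2))
    return rho_yx * rho_zx + rho_yz_x * spread


def rho_yz_bounds(rho_yx, rho_zx):
    """Lower and upper bounds on rho_YZ, from rho_YZ.X = -1 and +1."""
    return (rho_yz_from_partial(rho_yx, rho_zx, -1.0),
            rho_yz_from_partial(rho_yx, rho_zx, 1.0))


if __name__ == "__main__":
    # Rodgers (1984) example quoted in the text: both correlations are 0.8.
    print(rho_yz_bounds(0.8, 0.8))
    # Degenerate case: one correlation equals 1, so the interval collapses.
    print(rho_yz_bounds(1.0, 0.8))
```

For ρ_YX = ρ_ZX = 0.8 the bounds agree with the range of 0.28 to 1.0 quoted from Rodgers (1984). When either correlation equals 1 the square-root term vanishes and the two bounds coincide, which is the case in which matching adds nothing beyond what X already provides.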

Paass (1985) thinks that the conditional independence assumption is almost inextricably linked with the distance measure used. This view makes sense because one can make the matches that are consistent with an assumed probabilistic structure more likely through the choice of the distance measure. For example, if one believes that Z and Y are negatively correlated conditioned on X, a distance measure can encourage the joining of records when this obtains. Paass (1985) mentions a variety of ways this can be accomplished, along with some simulation results (see also discussion below).

Limitations in Modeling

Even after a statistically matched data set is created, statistical models cannot be
