where (for ease of notation) Y represents Yi, Z represents Zj, and X represents Xk(A) or Xk(B). Often, as discussed above, the X(A) variables are selected so that variables in Y and Z will be well explained by X(A). Implicitly it is reasoned that if both ρYX and ρZX are close to 1, then the numerator of ρYZ.X will be close to 0, or, what here amounts to the same thing, ρYZ will be close to 1. To some extent this reasoning is valid, but it is surprising how variable the correlation between Yi and Zj, ρYZ, can be even when ρYX and ρZX are fairly close to 1. This variability is disturbing since the estimation of these correlations is presumably a major reason the statistical match was performed.
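The variability of ρYZ can be seen from the above formula. By setting ρYZ.X equal to −1 and 1, we obtain the bounds
\[
\rho_{YX}\,\rho_{ZX} - \sqrt{(1-\rho_{YX}^{2})(1-\rho_{ZX}^{2})} \;\le\; \rho_{YZ} \;\le\; \rho_{YX}\,\rho_{ZX} + \sqrt{(1-\rho_{YX}^{2})(1-\rho_{ZX}^{2})}.
\]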
To take an example from Rodgers (1984), assume that ρYX equals 0.8 and ρZX equals 0.8. Then ρYZ ranges from 0.28 to 1.0. More generally, we see that
\[
\rho_{YZ} = \rho_{YX}\,\rho_{ZX} + \rho_{YZ.X}\,\sqrt{(1-\rho_{YX}^{2})(1-\rho_{ZX}^{2})},
\]
and the correlation between Yi and Zj is completely determined by ρYX and ρZX only when at least one of them is essentially 1, or when ρYZ.X equals 0. Thus, knowledge of the relationships between X(B) and Zj and between X(A) and Yi, obtained from different files, is typically not sufficient to determine the relationship between Yi and Zj. Armstrong (1990:1) points out:
Distortion of type (iii) [distortion in the multivariate distribution of X, Y, and Z] is often unavoidable when statistical matching methods are employed. Statistical matching methods involve the assumption that Y and Z are independent conditional on X. When this assumption is violated, type (iii) distortion is inevitable.
Moreover, when one of the correlations, ρYX or ρZX, is essentially equal to 1, what is the benefit of statistical matching? In that case one could use the linear combination of X(A) as a surrogate for the missing covariate.
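The range of ρYZ discussed above can be checked numerically. The following is a minimal sketch, assuming the standard partial correlation formula ρYZ.X = (ρYZ − ρYX ρZX) / √((1 − ρYX²)(1 − ρZX²)) and solving it for ρYZ. The function names `rho_yz_from_partial` and `rho_yz_bounds` are illustrative and do not come from the source.

```python
import math


def rho_yz_from_partial(rho_yx: float, rho_zx: float, rho_yz_given_x: float) -> float:
    """Solve the (assumed) partial correlation formula for rho_YZ."""
    for r in (rho_yx, rho_zx, rho_yz_given_x):
        if not -1.0 <= r <= 1.0:
            raise ValueError("correlations must lie in [-1, 1]")
    # Half-width of the admissible interval for rho_YZ.
    scale = math.sqrt((1.0 - rho_yx ** 2) * (1.0 - rho_zx ** 2))
    return rho_yx * rho_zx + rho_yz_given_x * scale


def rho_yz_bounds(rho_yx: float, rho_zx: float) -> tuple[float, float]:
    """Bounds on rho_YZ obtained by setting the partial correlation to -1 and 1."""
    lower = rho_yz_from_partial(rho_yx, rho_zx, -1.0)
    upper = rho_yz_from_partial(rho_yx, rho_zx, 1.0)
    return lower, upper


if __name__ == "__main__":
    # Same correlation with X for both Y and Z, approaching 1.
    for r in (0.8, 0.9, 0.99, 1.0):
        lower, upper = rho_yz_bounds(r, r)
        # Value implied when Y and Z are independent conditional on X.
        implied = rho_yz_from_partial(r, r, 0.0)
        print(f"rho_YX = rho_ZX = {r}: "
              f"bounds = ({lower:.4f}, {upper:.4f}), "
              f"conditional independence implies {implied:.4f}")
```

With both correlations equal to 0.8, the bounds reproduce the range of roughly 0.28 to 1.0 in the Rodgers (1984) example, while conditional independence (a partial correlation of 0) pins ρYZ at the product ρYX ρZX. The interval collapses to a single point only when one of the two correlations reaches 1.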
Paass (1985) thinks that the conditional independence assumption is almost inextricably linked with the distance measure used. This view makes sense because one can make the matches that are consistent with an assumed probabilistic structure more likely through the choice of the distance measure. For example, if one believes that Z and Y are negatively correlated conditioned on X, a distance measure can encourage the joining of records when this obtains. Paass (1985) mentions a variety of ways this can be accomplished, along with some simulation results (see also discussion below).
Even after a statistically matched data set is created, statistical models cannot be