that contains both groups of variables is often difficult to accomplish, given budget constraints, the interest in reducing respondent burden, and the need to protect the privacy and confidentiality of respondents. Yet information to inform decision making is needed. One technique for addressing this problem, which has been used for over two decades, is statistical matching. This chapter presents a general critique of statistical matching and some possible alternatives for overcoming the identified problems. (For a broad overview of statistical matching, see Radner et al. [1980].)
Mathematically, statistical matching is defined as follows. Let data set A contain variables Y and X(A), where both Y and X(A) can denote several variables. The Y variables are the variables of interest, and the X(A) variables are used for matching. The second data set, B, contains variables Z and X(B). The Z variables are the variables of interest, and the X(B) variables are used for matching with data set A. Statistical matching creates complete records of the form {Y, X(A), Z} (or possibly with some combination of X(A) and X(B) in place of X(A)) by joining records when X(A) is “close” to X(B), for some definition of close. The process of statistical matching makes rather strong assumptions about the relationship between the variables Y and Z. This issue is addressed below.
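As a rough illustration of the definition, the following minimal sketch joins each record in A to the record in B whose matching variables are nearest. It rests on several assumptions that are not taken from the chapter: "close" is taken to be Euclidean distance, the function name `statistical_match` and the record layout (dictionaries with keys `y`, `x`, `z`) are hypothetical, the toy values are invented, and records in B may be reused as donors (an unconstrained match).

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def statistical_match(file_a, file_b):
    """Build records {Y, X(A), Z} by giving each A record the Z values
    of the B record whose X(B) is nearest to its X(A).

    Assumption: "close" means smallest Euclidean distance; other
    distance definitions could be substituted.
    """
    matched = []
    for rec_a in file_a:
        # Donor = B record with the smallest distance between X(B) and X(A).
        donor = min(file_b, key=lambda rec_b: dist(rec_a["x"], rec_b["x"]))
        matched.append({"y": rec_a["y"], "x": rec_a["x"], "z": donor["z"]})
    return matched


# Invented toy files: X holds two matching variables per record.
file_a = [
    {"y": 10.0, "x": (30.0, 1.0)},
    {"y": 25.0, "x": (55.0, 0.0)},
]
file_b = [
    {"z": 100.0, "x": (28.0, 1.0)},
    {"z": 200.0, "x": (60.0, 0.0)},
    {"z": 300.0, "x": (41.0, 1.0)},
]

matched = statistical_match(file_a, file_b)
for record in matched:
    print(record)
```

No record in the output was ever observed with both Y and Z. Each pairing comes entirely from the closeness of the X values, which is why the relationship between Y and Z in the matched file depends on the assumptions the matching makes.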
Like imputation, statistical matching is a form of nonparametric regression used to fill in missing data values. However, statistical matching is more extreme in two important ways. First, imputation is typically used to fill in a relatively small percentage of the data; statistical matching is typically used on 100 percent of the records. Second, imputation typically makes use of complete records to fill in missing values for other records; because no complete records exist, statistical matching must instead rely on a conditional independence assumption (typically, that Y and Z are independent given the matching variables). The validity of this conditional independence assumption is often untestable.
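The conditional independence assumption can be written out explicitly. The block below uses the usual formulation, stated here as an assumption since the chapter does not give the formula; X denotes the common matching variables and f denotes a (conditional) density, both notational choices made for this sketch.

```latex
% Conditional independence of Y and Z given the matching variables X
% (notation assumed for illustration):
f(y, z \mid x) = f(y \mid x)\, f(z \mid x)
% Equivalently, once X is known, Z carries no further information about Y:
f(y \mid x, z) = f(y \mid x)
```

Because Y and Z are never observed together on the same record, the two files alone give no data with which to check either equality, which is why the assumption is often untestable.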
Before two files can be statistically matched, the files may require some treatment. First, the variables X(A) and X(B) may not be immediately comparable. For example, a variable representing income may include some components on one file, say, income from interest and dividends, that are not included on the