for the common variables leads to reduced distortion in the joint distribution of (X, Z) on files created by matching.
A related idea, proposed by Singh (1988), constructs categorical variables X*, Y*, and Z*, related to X, Y, and Z, for which the conditional independence assumption is taken to hold; these variables are then used to define equivalence classes for matching. For details, see Singh (1988).
In this section I describe some applications of statistical matching, including the reasons for the match and the particular matching techniques used.
It is well known that estimates of the distribution of family money income from household surveys contain serious bias. This bias can be reduced through the use of information from federal individual income tax returns. Radner (1983) describes a statistical match that begins with the March 1973 CPS-Internal Revenue Service-Social Security Administration exact match file (EM). This file was considered to have three limitations: (1) serious response errors in the CPS, (2) few high-income observations, and (3) not enough detail by income type. To address these limitations in the EM, it was statistically matched to the augmentation file (AF), a subsample of the 1972 Statistics of Income (SOI) sample of federal individual income tax returns that had been exact matched with Social Security Administration records containing earnings and demographic data.
The EM-AF statistical match can be separated into three fairly distinct steps. First, there was an initial match using 22 matching variables, including adjusted gross income, interest, dividends, social security taxable earnings, sex, race, age, number of exemptions, and the use of various schedules. Some of these characteristics were used to define cells within which distances between records were computed and outside of which no matches were permitted. These cells included an acceptable age range. The distance measure consisted of a sum of weighted discrepancies between the two files' values for the 22 variables. The AF record closest to the EM record was chosen for the statistical match unless the minimal distance exceeded a specified maximum, in which case some cells were collapsed and the age range was eliminated.
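The logic of this initial match (cells, a weighted distance, and a fallback when the closest record is too far away) can be illustrated with a minimal Python sketch. It is not Radner's implementation: the variable names, weights, cell definitions, age range, maximum distance, and the use of absolute differences as the "discrepancy" are all hypothetical choices made for illustration, and only two of the 22 matching variables are represented.

```python
# Illustrative sketch only: all names, weights, and thresholds are hypothetical.

def weighted_distance(em, af, weights):
    """Sum of weighted absolute discrepancies over the matching variables."""
    return sum(w * abs(em[v] - af[v]) for v, w in weights.items())


def closest(em, candidates, weights):
    """Return (distance, AF record) for the nearest candidate, or None if there is none."""
    best = None
    for af in candidates:
        d = weighted_distance(em, af, weights)
        if best is None or d < best[0]:
            best = (d, af)
    return best


def statistical_match(em_records, af_records, weights, cell_key,
                      collapsed_key, age_range, max_distance):
    """Match each EM record to one AF record (AF records may be reused)."""
    result = {}
    for em in em_records:
        # First pass: same cell and within the acceptable age range.
        in_cell = [af for af in af_records
                   if cell_key(af) == cell_key(em)
                   and abs(af["age"] - em["age"]) <= age_range]
        hit = closest(em, in_cell, weights)
        if hit is None or hit[0] > max_distance:
            # Fallback: collapse cells and drop the age restriction.
            collapsed = [af for af in af_records
                         if collapsed_key(af) == collapsed_key(em)]
            hit = closest(em, collapsed, weights)
        result[em["id"]] = hit[1]["id"] if hit else None
    return result


# Hypothetical toy data (income amounts in thousands of dollars).
af_file = [
    {"id": "a1", "sex": "M", "sched": True,  "age": 40, "agi": 30, "interest": 1},
    {"id": "a2", "sex": "M", "sched": True,  "age": 42, "agi": 20, "interest": 0},
    {"id": "a3", "sex": "M", "sched": False, "age": 70, "agi": 50, "interest": 5},
    {"id": "a4", "sex": "F", "sched": False, "age": 30, "agi": 15, "interest": 0},
]
em_file = [
    {"id": "e1", "sex": "M", "sched": True, "age": 41, "agi": 21, "interest": 0},
    {"id": "e2", "sex": "M", "sched": True, "age": 68, "agi": 49, "interest": 5},
    {"id": "e3", "sex": "F", "sched": True, "age": 31, "agi": 15, "interest": 0},
]

matches = statistical_match(
    em_file, af_file,
    weights={"agi": 1.0, "interest": 2.0},
    cell_key=lambda r: (r["sex"], r["sched"]),   # full cell definition
    collapsed_key=lambda r: (r["sex"],),         # coarser cell used in the fallback
    age_range=5,
    max_distance=5.0,
)
print(matches)
```

In this toy example the first EM record finds a close AF record inside its own cell, while the other two have no acceptable candidate there and are matched only after the cells are collapsed and the age range is dropped. Because the sketch is an unconstrained match, the same AF record can be assigned to more than one EM record.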
Next, Radner (1983:137) describes:
About 6,900 EM records that were considered to have an inconsistent initial match were rematched with the AF because we were not fully satisfied