Suggested Citation: "The EM-AF Statistical Match." National Research Council. 1991. Improving Information for Social Policy Decisions -- The Uses of Microsimulation Modeling: Volume II, Technical Papers. Washington, DC: The National Academies Press. doi: 10.17226/1853.

for the common variables leads to reduced distortion in the joint distribution of (X, Z) on files created by matching.

A related idea, proposed by Singh (1988), constructs categorical variables X*, Y*, and Z*, related to X, Y, and Z, for which the conditional independence assumption is taken to hold. These categorical variables are then used to define the equivalence classes for matching; see Singh (1988) for details.

EXAMPLES OF STATISTICAL MATCHES IN MICROSIMULATION MODELS

In this section I describe some applications of statistical matching, including the reasons for the match and the particular matching techniques used.

The EM-AF Statistical Match

It is well known that estimates of the distribution of family money income from household surveys contain serious bias. This bias can be reduced through the use of information from federal individual income tax returns. Radner (1983) describes a statistical match that begins with the March 1973 CPS-Internal Revenue Service-Social Security Administration exact match file (EM). This file was considered to have three limitations: (1) serious response errors in the CPS, (2) few high-income observations, and (3) not enough detail by income type. To address these limitations in the EM, it was statistically matched to the augmentation file (AF), a subsample of the 1972 Statistics of Income (SOI) sample of federal individual income tax returns that had been exact matched with Social Security Administration records containing earnings and demographic data.

The EM-AF statistical match can be separated into three fairly distinct steps. First, there was an initial match using 22 matching variables, including adjusted gross income, interest, dividends, social security taxable earnings, sex, race, age, number of exemptions, and the use of various schedules. Certain of these characteristics were used to define cells within which distances between records were computed and outside of which no matches were permitted. These cells included an acceptable age range. The distance measure was a sum of weighted discrepancies between the values of the 22 variables on the two files. The AF record closest to the EM record was chosen for the statistical match unless the minimal distance exceeded a specified maximum, in which case some cells were collapsed and the age range was eliminated.
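The logic of the initial match can be illustrated with a minimal Python sketch. It is not Radner's implementation: the record layout, the use of absolute differences as the "discrepancy," the two-variable weights, the cell key, and the `collapse` function are all hypothetical choices made for illustration, and the sketch assumes that after cells are collapsed the closest remaining AF record is accepted.

```python
from dataclasses import dataclass


@dataclass
class Record:
    rec_id: str
    cell: tuple    # hypothetical cell key, e.g. (sex, race)
    age: int
    values: dict   # matching variable name -> numeric value


def distance(em, af, weights):
    """Sum of weighted discrepancies (assumed here to be absolute differences)."""
    return sum(w * abs(em.values[v] - af.values[v]) for v, w in weights.items())


def nearest(em, candidates, weights):
    """Return (distance, record) for the closest candidate, or None if none exist."""
    best = None
    for af in candidates:
        d = distance(em, af, weights)
        if best is None or d < best[0]:
            best = (d, af)
    return best


def initial_match(em, af_file, weights, age_range, max_distance, collapse):
    """Match one EM record to its closest AF record.

    First pass: same cell and within the acceptable age range.
    If the minimal distance exceeds max_distance (or no candidate exists),
    cells are collapsed via the hypothetical `collapse` function and the
    age restriction is dropped.
    """
    strict = [af for af in af_file
              if af.cell == em.cell and abs(af.age - em.age) <= age_range]
    best = nearest(em, strict, weights)
    if best is not None and best[0] <= max_distance:
        return best[1]
    relaxed = [af for af in af_file if collapse(af.cell) == collapse(em.cell)]
    best = nearest(em, relaxed, weights)
    return best[1] if best is not None else None


# Illustrative data only; two variables stand in for the 22 used in the match.
weights = {"agi": 1.0, "interest": 2.0}
em = Record("em1", ("M", "W"), 40, {"agi": 100.0, "interest": 10.0})
af_file = [
    Record("af1", ("M", "W"), 42, {"agi": 110.0, "interest": 10.0}),
    Record("af2", ("M", "W"), 41, {"agi": 100.0, "interest": 12.0}),
    Record("af3", ("F", "W"), 40, {"agi": 100.0, "interest": 10.0}),
    Record("af4", ("M", "B"), 70, {"agi": 101.0, "interest": 10.0}),
]


def collapse_race(cell):
    """Hypothetical collapsing rule: keep only the first cell component."""
    return cell[:1]
```

With a generous maximum distance, the closest record inside the cell (`af2`) is chosen; with a tighter maximum, the fallback widens the search and can select a record (`af4`) that the cell and age restrictions had originally excluded.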

Next, Radner (1983:137) describes:

About 6,900 EM records that were considered to have an inconsistent initial match were rematched with the AF because we were not fully satisfied
