Suggested Citation: "File Treatment." National Research Council. 1991. Improving Information for Social Policy Decisions -- The Uses of Microsimulation Modeling: Volume II, Technical Papers. Washington, DC: The National Academies Press. doi: 10.17226/1853.

that contains both groups of variables is often difficult to accomplish, given budget constraints, the interest in reducing respondent burden, and the need to protect the privacy and confidentiality of respondents. Yet information to inform decision making is needed. One technique for addressing this problem, which has been used for over two decades, is statistical matching. This chapter presents a general critique of statistical matching and some possible alternatives for overcoming the identified problems. (For a broad overview of statistical matching, see Radner et al. [1980].)

Definition of Statistical Matching

Mathematically, statistical matching is defined as follows. Call the first data set, with variables Y and X(A), data set A, where both Y and X(A) can denote several variables. The Y variables are the variables of interest in A, and the X(A) variables are used for matching. The second data set, B, contains variables Z and X(B). The Z variables are the variables of interest in B, and the X(B) variables are used for matching with data set A. Statistical matching creates complete records of the form {Y X(A) Z} (or possibly with some combination of X(A) and X(B) in place of X(A)) by joining records when X(A) is “close” to X(B), for some definition of close. The process makes rather strong assumptions about the relationship between the variables Y and Z. This issue is addressed below.
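The definition can be illustrated with a minimal sketch. It assumes that “close” means smallest Euclidean distance between the matching variables and that a record from B may be reused for several records in A; neither choice is specified in the text above. The function name statistical_match, the record layout, and the toy data are hypothetical and serve only as illustration.

```python
import math

def statistical_match(file_a, file_b):
    """Join each A record (y, x_a) with the Z value of the closest B record (z, x_b).

    Assumption: "close" is Euclidean distance between x_a and x_b.
    Assumption: B records may be reused for more than one A record.
    """
    matched = []
    for y, x_a in file_a:
        # Pick the B record whose matching variables are nearest to x_a.
        z, _x_b = min(file_b, key=lambda rec: math.dist(x_a, rec[1]))
        # The completed record has the form {Y, X(A), Z}.
        matched.append({"Y": y, "X": x_a, "Z": z})
    return matched

# Toy data (hypothetical): A holds (Y, X(A)) pairs, B holds (Z, X(B)) pairs.
file_a = [(10.0, (1.0, 2.0)), (20.0, (5.0, 5.0))]
file_b = [("low", (1.1, 2.1)), ("high", (4.8, 5.3)), ("mid", (3.0, 3.0))]

matched = statistical_match(file_a, file_b)
for record in matched:
    print(record)
```

Note that no record in either input file contains both Y and Z, so nothing in the matched output can be checked against an observed (Y, Z) pair.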

Like imputation, statistical matching is a form of nonparametric regression used to fill in missing data values.[1] However, statistical matching is more extreme in two important ways. First, imputation is typically used to fill in a relatively small percentage of the data, whereas statistical matching is typically applied to 100 percent of the records. Second, imputation typically uses complete records to fill in missing values for other records; because no complete records exist in statistical matching, it must instead rely on a conditional independence assumption, namely that Y and Z are independent given the matching variables. This assumption often cannot be tested with the data at hand.
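The conditional independence assumption can be written out explicitly. The chapter does not spell out the formula; what follows is the standard statement, with X standing for the common matching variables and f denoting a density (notation introduced here for illustration):

```latex
f(y, z \mid x) = f(y \mid x)\, f(z \mid x)
```

Under this assumption the joint distribution factors as f(x) f(y | x) f(z | x). The factor f(y | x) can be estimated from data set A alone and f(z | x) from data set B alone, which is what allows the two files to be combined. Any dependence between Y and Z beyond what X explains is not observed in either file.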

File Treatment

Before two files can be statistically matched, the files may require some treatment. First, the variables X(A) and X(B) may not be immediately comparable. For example, a variable representing income may include some components on one file, say, income from interest and dividends, that are not included on the

[1] The term imputation is used here narrowly as a technique for replacing missing values for one or more response categories. Other analysts use the term in a broader sense for the technique of creating all of the values for one or more missing variables that were never asked in a survey (or never collected in an administrative records system). Imputation of the latter type (for example, on the basis of regression equations estimated from another data source) may exhibit some of the same problems as statistical matching.
