Suggested Citation: "Analysis of Categorical Data." National Research Council. 1991. Improving Information for Social Policy Decisions -- The Uses of Microsimulation Modeling: Volume II, Technical Papers. Washington, DC: The National Academies Press. doi: 10.17226/1853.

ignores the structure implicit in the 16 model versions. One way to avoid this is to use robust analysis of variance on the discrepancies, standardized to have roughly the same units, where the analysis of variance would be a 2×2×4 design with 13 replications per cell for the 13 different response variables.
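The layout of such an analysis can be sketched as follows. This is only an illustration under stated assumptions: the discrepancies are synthetic placeholders (not values from the validation experiment), the three factors are left unnamed, each response is scaled by its median absolute discrepancy as one possible way of putting the responses in roughly the same units, and medians are used as a crude stand-in for a full robust analysis of variance.

```python
import random
from itertools import product
from statistics import median

LEVELS = (2, 2, 4)   # the 2x2x4 design; factor identities are not assumed here
N_RESPONSES = 13     # one replication per response variable in each cell


def scale_to_common_units(values):
    """Divide by the median absolute value so responses share a rough scale."""
    scale = median([abs(v) for v in values])
    if scale == 0:
        raise ValueError("median absolute discrepancy is zero; pick another scale")
    return [v / scale for v in values]


# The 16 cells of the design, one per model version.
cells = list(product(*(range(n) for n in LEVELS)))

# Synthetic placeholder discrepancies: one row per response, one entry per cell.
rng = random.Random(0)
raw = [[rng.gauss(0.0, 10.0 ** (r % 3)) for _ in cells] for r in range(N_RESPONSES)]

standardized = [scale_to_common_units(row) for row in raw]

# Each cell receives 13 replications, one from each response variable.
replicates = {
    cell: [standardized[r][k] for r in range(N_RESPONSES)]
    for k, cell in enumerate(cells)
}
cell_medians = {cell: median(vals) for cell, vals in replicates.items()}


def level_median(factor, level):
    """Median of all standardized discrepancies at one level of one factor."""
    return median(
        [v for cell, vals in replicates.items() if cell[factor] == level for v in vals]
    )


# Median-based analogue of a main effect for the first two-level factor.
first_factor_effect = level_median(0, 1) - level_median(0, 0)
```

A fuller treatment would fit the main effects and interactions jointly with a robust criterion; the sketch only shows how the 16 versions and 13 responses map onto cells and replications.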

Another problem with the above analysis is that large discrepancies cannot be distinguished from moderately large ones. This raises the general issue of the modeling goal and the use of omnibus loss functions. A question that must be addressed (possibly repeatedly, every few years) is whether it is desirable to have a model that predicts everything equally well or whether some responses are more crucial than others. The answer, of course, determines which responses play a role in the analysis and how heavily each is weighted. In addition, a metric on the errors must be chosen that compares errors of various magnitudes, so that it can be declared how much more disturbing 10 percent errors are, say, than 5 percent errors for each response. If a useful loss function can be identified, the implied weights could be used in a weighted analysis for most of the methods discussed here.
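One possible form for such a loss function is sketched below. The function name, the response names, the weights, and the power metric are all assumptions made for illustration; nothing here is a loss function proposed in the study. With a power of 2, a 10 percent error counts four times as much as a 5 percent error; other powers encode other judgments.

```python
from typing import Dict


def weighted_loss(
    rel_errors: Dict[str, float],
    weights: Dict[str, float],
    power: float = 2.0,
) -> float:
    """Weighted average of |relative error| ** power over the responses."""
    total_weight = sum(weights[name] for name in rel_errors)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    penalty = sum(
        weights[name] * abs(err) ** power for name, err in rel_errors.items()
    )
    return penalty / total_weight


# Hypothetical relative errors and weights for two responses.
errors = {"gross income of unit": 0.10, "race of head": 0.05}
importance = {"gross income of unit": 3.0, "race of head": 1.0}
loss = weighted_loss(errors, importance)
```

Setting a weight to zero removes a response from the analysis, which corresponds to deciding that the response plays no role in the modeling goal.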

Analysis of Categorical Data

One of the major advantages of microsimulation models is that they provide information on the distributional impacts of changes in social welfare programs, information that is generally unavailable from other forms of modeling. Up to now, we have analyzed the categorical data only in their dichotomized version, which was used to facilitate the analysis of change. There are certainly situations in which a single category, or a collection of related categories, is of primary interest, and in those cases it is appropriate to dichotomize and analyze the percentage of cases in that category (or categories). At other times, however, the full distribution is of interest. As a continuation of the external validation of TRIM2, we therefore examined, for the undichotomized frequency table outputs and for estimates of level, how closely the output frequencies from the various versions of TRIM2 corresponded to those from the 1987 quality control data. Table 7 presents the χ2 test of independence for a 2×r contingency table, where r is the number of categories in the response, in which one row contains the frequencies from one model version and the second row contains those from the quality control data for 1987. Notationally, the test is as follows:

$$Q = \sum_{i=1}^{2} \sum_{j=1}^{r} \frac{\left(n_{ij} - n_{i\cdot}\, n_{\cdot j} / N\right)^{2}}{n_{i\cdot}\, n_{\cdot j} / N}$$

where $n_{ij}$ is the number of persons in category j estimated by run i; $n_{i\cdot}$ is the sum over categories, that is, the total number of people "produced" by run i; $n_{\cdot j}$ is the sum over runs, that is, the total number of people in category j; and N is the total number of people "produced" by the two runs (one of which, in this case, consists of the quality control comparison values). Under the hypothesis that the two rows share the same distribution, the quantity Q has approximately a chi-square distribution with r−1 degrees of freedom. Values of Q are provided in Table 7 for all 16 model versions and for the 1983 quality control data for a variety of model outputs.

TABLE 7 χ2 Goodness-of-Fit Statistics for Distributions, from TRIM2 Validation Experiment (Run Identification 1 to 8)

| Variable | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Run 6 | Run 7 | Run 8 |
|---|---|---|---|---|---|---|---|---|
| Total no. in unit | 26 | 37 | 28 | 28 | 27 | 36 | 28 | 27 |
| No. of adults | 10 | 27 | 39 | 45 | 11 | 27 | 41 | 45 |
| No. of children | 2 | 5 | 6 | 6 | 2 | 4 | 5 | 5 |
| Age of youngest child | 11 | 14 | 12 | 12 | 11 | 16 | 13 | 12 |
| Gross income of unit | 568 | 508 | 461 | 396 | 506 | 456 | 423 | 362 |
| Earnings of adults | 15 | 19 | 18 | 8 | 14 | 26 | 23 | 17 |
| Type of AFDC unit | 15 | 3 | 13 | 15 | 17 | 3 | 12 | 14 |
| Race of head | 4 | 9 | 2 | 2 | 6 | 10 | 2 | 2 |
| Sex of head | 28 | 19 | 1 | 0 | 31 | 21 | 1 | 0 |
| Age of head | 30 | 47 | 45 | 48 | 29 | 49 | 45 | 46 |
| Relationship of unit head to household head | 1.2 | 0.9 | 0.9 | 0.9 | 1.2 | 0.9 | 0.9 | 0.9 |
| Marital status of head | 1 | 1 | 15 | 20 | 1 | 2 | 16 | 19 |
| Size of benefit | 85 | 114 | 101 | 106 | 87 | 116 | 108 | 109 |

ᵃ D.F. indicates degrees of freedom. The χ2 values at the 99 percent confidence limit are as follows (a higher value in the table indicates that a model version differs from the 1987 IQCS by an amount greater than one could expect by chance):

| D.F. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| χ2 | 6.635 | 9.210 | 11.341 | 13.277 | 15.086 | 16.812 | 18.475 | 20.090 |
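The computation of Q for a 2×r table, and its comparison with the 99 percent critical values given in the note to Table 7, can be sketched as follows. The function name and the example frequencies are hypothetical and chosen only for illustration; they are not outputs of TRIM2 or of the quality control data.

```python
from typing import Sequence, Tuple

# 99 percent critical values quoted in the note to Table 7, keyed by D.F.
CRITICAL_99 = {
    1: 6.635, 2: 9.210, 3: 11.341, 4: 13.277,
    5: 15.086, 6: 16.812, 7: 18.475, 8: 20.090,
}


def pearson_q(table: Sequence[Sequence[float]]) -> Tuple[float, int]:
    """Return (Q, degrees of freedom) for a contingency table of counts."""
    n_rows, n_cols = len(table), len(table[0])
    row_totals = [sum(row) for row in table]                 # n_i.
    col_totals = [sum(row[j] for row in table) for j in range(n_cols)]  # n_.j
    grand_total = sum(row_totals)                            # N
    if min(row_totals) <= 0 or min(col_totals) <= 0:
        raise ValueError("every row and column needs a positive total")
    q = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            expected = row_totals[i] * col_totals[j] / grand_total
            q += (table[i][j] - expected) ** 2 / expected
    # For a 2 x r table this reduces to r - 1.
    return q, (n_rows - 1) * (n_cols - 1)


# Hypothetical frequencies: first row a model version, second row the
# quality control comparison values, three categories each.
model_version = [30, 50, 20]
quality_control = [25, 50, 25]

q, df = pearson_q([model_version, quality_control])
differs_beyond_chance = q > CRITICAL_99[df]
```

For these illustrative counts Q is 100/99, or about 1.01, on 2 degrees of freedom, well below the critical value of 9.210.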

In examining Table 7, as in the analysis of change, it is seen that no particular model version has any noticeable advantage over the other versions. Clearly, all model versions perform well in projecting the distribution of the relationship of unit head to household head, number of children, and race of head of unit. On the other hand, all model versions perform poorly for gross income of unit. There are some specific findings that remain to be confirmed. One is that adjustment appears to be useful to match the distribution of the comparison values for age of the youngest child. Full aging appears to be useful for sex of head, but not useful for marital status. This type of detailed analysis is clearly only suggestive and can be overdone since the study is limited. However, it is clear that the variability attributed to the choice of response variable dominates the variability due to model versions.

TABLE 7 (continued) Run Identification 9 to 16, 1983 Quality Control Data (IQCS83), and Degrees of Freedom

| Variable | Run 9 | Run 10 | Run 11 | Run 12 | Run 13 | Run 14 | Run 15 | Run 16 | IQCS83 | D.F.ᵃ |
|---|---|---|---|---|---|---|---|---|---|---|
| Total no. in unit | 48 | 19 | 40 | 40 | 49 | 49 | 42 | 41 | 73 | 4 |
| No. of adults | 33 | 8 | 45 | 48 | 35 | 39 | 48 | 51 | 438 | 2 |
| No. of children | 4 | 6 | 6 | 6 | 6 | 5 | 6 | 6 | 41 | 3 |
| Age of youngest child | 12 | 10 | 5 | 5 | 10 | 7 | 6 | 6 | 77 | 4 |
| Gross income of unit | 436 | 508 | 362 | 310 | 419 | 412 | 338 | 287 | 259 | 8 |
| Earnings of adults | 16 | 38 | 8 | 13 | 20 | 24 | 20 | 29 | 426 | 8 |
| Type of AFDC unit | 9 | 0 | 5 | 6 | 12 | 7 | 5 | 6 | 75 | 2 |
| Race of head | 7 | 16 | 6 | 8 | 5 | 5 | 5 | 6 | 493 | 3 |
| Sex of head | 14 | 3 | 0 | 0 | 20 | 19 | 0 | 0 | 48 | 1 |
| Age of head | 115 | 101 | 40 | 40 | 32 | 38 | 39 | 40 | 272 | 7 |
| Relationship of unit head to household head | 1.3 | 1.1 | 0.9 | 0.9 | 1.2 | 0.9 | 1.0 | 1.0 | 0 | 4 |
| Marital status of head | 1 | 4 | 20 | 23 | 0 | 1 | 21 | 24 | 57 | 1 |
| Size of benefit | 50 | 85 | 120 | 119 | 99 | 127 | 121 | 119 | 2015 | 7 |
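The statement that variability across response variables dominates variability across model versions can be checked roughly with a two-way decomposition of sums of squares. The sketch below uses the Q values for runs 1 to 8 as listed in Table 7; the function and variable names are hypothetical. Because the responses have different degrees of freedom, comparing raw Q values in this way is crude and is meant only to illustrate the orders of magnitude involved.

```python
from typing import Dict, List, Tuple

# Q values for runs 1 to 8, transcribed from Table 7.
Q_RUNS_1_TO_8: Dict[str, List[float]] = {
    "Total no. in unit": [26, 37, 28, 28, 27, 36, 28, 27],
    "No. of adults": [10, 27, 39, 45, 11, 27, 41, 45],
    "No. of children": [2, 5, 6, 6, 2, 4, 5, 5],
    "Age of youngest child": [11, 14, 12, 12, 11, 16, 13, 12],
    "Gross income of unit": [568, 508, 461, 396, 506, 456, 423, 362],
    "Earnings of adults": [15, 19, 18, 8, 14, 26, 23, 17],
    "Type of AFDC unit": [15, 3, 13, 15, 17, 3, 12, 14],
    "Race of head": [4, 9, 2, 2, 6, 10, 2, 2],
    "Sex of head": [28, 19, 1, 0, 31, 21, 1, 0],
    "Age of head": [30, 47, 45, 48, 29, 49, 45, 46],
    "Relationship of unit head to household head":
        [1.2, 0.9, 0.9, 0.9, 1.2, 0.9, 0.9, 0.9],
    "Marital status of head": [1, 1, 15, 20, 1, 2, 16, 19],
    "Size of benefit": [85, 114, 101, 106, 87, 116, 108, 109],
}


def two_way_sums_of_squares(
    rows: Dict[str, List[float]],
) -> Tuple[float, float, float]:
    """Split total variation into row, column, and residual sums of squares."""
    names = list(rows)
    n_cols = len(rows[names[0]])
    grand = sum(sum(v) for v in rows.values()) / (len(names) * n_cols)
    row_means = [sum(rows[name]) / n_cols for name in names]
    col_means = [
        sum(rows[name][j] for name in names) / len(names) for j in range(n_cols)
    ]
    ss_rows = n_cols * sum((m - grand) ** 2 for m in row_means)
    ss_cols = len(names) * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for v in rows.values() for x in v)
    return ss_rows, ss_cols, ss_total - ss_rows - ss_cols


ss_variable, ss_version, ss_residual = two_way_sums_of_squares(Q_RUNS_1_TO_8)
```

Here `ss_variable` measures variation between response variables and `ss_version` variation between model versions; the first is far larger than the second for these eight runs.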

The column of Table 7 labeled IQCS83 displays the results from using the 1983 quality control data, which, as mentioned above, is not a completely fair comparison given the size of the policy change examined. Nevertheless, the 1983 quality control data do not compete well with the 16 versions of TRIM2 in the analysis shown in Table 7. However, they do outperform many TRIM2 versions in estimating gross income of unit and relationship of unit head to household head.

This test can also be used for more than two rows to investigate the similarity of several model versions as part of an analysis of the sensitivity of the distributions to model structure. For a categorical response with four categories, one could form the 16×4 contingency table and evaluate Q. However, this analysis would ignore the special structure that the 16 models have. There is no entirely satisfactory way, currently, to handle what is essentially an analysis of variance of frequency distributions. One can use dichotomization, as done above. Another way of partially circumventing the problem is to separately analyze particular subsets of the 16 model versions by using the test for independence and to look for similarities and differences in the separate analyses.
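The subset approach can be sketched as follows: the same statistic is computed for a k×r table, with (k−1)(r−1) degrees of freedom, once for all versions together and once for each subset of interest. The version labels, the grouping, and the frequencies below are hypothetical, and four versions with three categories stand in for the 16 versions and four categories discussed above.

```python
from typing import Dict, List, Sequence, Tuple


def q_statistic(table: Sequence[Sequence[float]]) -> Tuple[float, int]:
    """Pearson statistic and degrees of freedom for a k x r table of counts."""
    k, r = len(table), len(table[0])
    row_totals = [sum(row) for row in table]
    col_totals = [sum(row[j] for row in table) for j in range(r)]
    grand_total = sum(row_totals)
    q = 0.0
    for i in range(k):
        for j in range(r):
            expected = row_totals[i] * col_totals[j] / grand_total
            q += (table[i][j] - expected) ** 2 / expected
    return q, (k - 1) * (r - 1)


def q_by_subset(
    freqs: Dict[str, Sequence[float]],
    subsets: Dict[str, List[str]],
) -> Dict[str, Tuple[float, int]]:
    """Apply the test of independence separately to each subset of versions."""
    return {
        label: q_statistic([freqs[version] for version in versions])
        for label, versions in subsets.items()
    }


# Hypothetical output frequencies for four model versions.
frequencies = {
    "v1": [40, 35, 25],
    "v2": [42, 33, 25],
    "v3": [30, 40, 30],
    "v4": [28, 42, 30],
}
# Hypothetical grouping by the level of one design factor.
groups = {
    "all versions": ["v1", "v2", "v3", "v4"],
    "factor level A": ["v1", "v2"],
    "factor level B": ["v3", "v4"],
}

results = q_by_subset(frequencies, groups)
```

In this illustration Q is small within each factor level and much larger for all versions together, the kind of pattern that would suggest the factor distinguishing the two groups drives the differences in the distributions.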
