In this chapter, we document our observations and findings about the procedures used to develop the performance levels for the 1992 National Adult Literacy Survey (NALS). The chapter begins with some background information on how performance levels and the associated cut scores are typically determined. We then provide a brief overview of the test development process used for NALS, as it relates to the procedures for determining performance levels, and describe how the performance levels were determined and the cut scores set. The chapter also includes a discussion of the role of response probabilities in setting cut scores and in identifying assessment tasks to exemplify performance levels; the technical note at the end of the chapter provides additional details about this topic.
When the objective of a test is to report results using performance levels, the number of levels and the descriptions of the levels are usually articulated early in the test development process and serve as the foundation for test development. The process of determining the number of levels and their descriptions usually involves consideration of the content and skills evaluated on the test as well as discussions with stakeholders about the inferences to be based on the test results and the ways the test results will be used. When the number of levels and the descriptions of the levels are laid out in advance, development efforts can focus on constructing items that measure the content and skills described by the levels. It is important to develop a sufficient number of items that measure the skills
described by each of the levels. This allows for more reliable estimates of test-takers’ skills and more accurate classification of individuals into the various performance levels.
While determination of the performance-level descriptions is usually completed early in the test development process, determination of the cut scores between the performance levels is usually made after the test has been administered and examinees’ answers are available. Typically, the process of setting cut scores involves convening a group of panelists with expertise in areas relevant to the subject matter covered on the test and familiarity with the test-taking population, who are instructed to make judgments about what test takers need to know and be able to do (e.g., which test items individuals should be expected to answer correctly) in order to be classified into a given performance level. These judgments are used to determine the cut scores that separate the performance levels.
Methods for setting cut scores are used in a wide array of assessment contexts, from the National Assessment of Educational Progress (NAEP) and state-sponsored achievement tests, in which procedures are used to determine the level of performance required to classify students into one of several performance levels (e.g., basic, proficient, or advanced), to licensing and certification tests, in which procedures are used to determine the level of performance required to pass such tests in order to be licensed or certified.
There is a broad literature on procedures for setting cut scores on tests. In 1986, Berk documented 38 methods and variations on these methods, and the literature has grown substantially since. All of the methods rely on panels of judges, but the tasks posed to the panelists and the procedures for arriving at the cut scores differ. The methods can be classified as test-centered, examinee-centered, and standards-centered.
The modified Angoff and bookmark procedures are two examples of test-centered methods. In the modified Angoff procedure, the task posed to the panelists is to imagine a typical minimally competent examinee and to decide on the probability that this hypothetical examinee would answer each item correctly (Kane, 2001). The bookmark method requires placing all of the items in a test in order by difficulty; panelists are asked to place a “bookmark” at the point between the most difficult item borderline test takers would be likely to answer correctly and the easiest item borderline test takers would be likely to answer incorrectly (Zieky, 2001).
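To make the arithmetic of the modified Angoff procedure concrete, the short sketch below computes a cut score from a small set of hypothetical panelist ratings. The ratings, the number of items, and the raw-score metric are illustrative assumptions, not values from any actual standard setting.

```python
# Illustrative sketch (hypothetical data): a minimal modified Angoff calculation.
# Each panelist estimates, for every item, the probability that a borderline
# ("minimally competent") examinee answers it correctly; the cut score on the
# raw-score scale is the average of the panelists' summed probabilities.

panelist_ratings = [  # rows = panelists, columns = items (hypothetical values)
    [0.9, 0.7, 0.6, 0.4, 0.3],
    [0.8, 0.8, 0.5, 0.5, 0.2],
    [0.9, 0.6, 0.6, 0.3, 0.3],
]

panelist_cuts = [sum(ratings) for ratings in panelist_ratings]  # expected raw score per panelist
angoff_cut = sum(panelist_cuts) / len(panelist_cuts)            # average across panelists

print(f"Panelist cut scores: {panelist_cuts}")
print(f"Modified Angoff cut score (raw-score metric): {angoff_cut:.2f}")
```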
The borderline group and contrasting group methods are two examples of examinee-centered procedures. In the borderline group method, the panelists are tasked with identifying examinees who just meet the performance standard; the cut score is set equal to the median score for these examinees (Kane, 2001). In the contrasting group method, the panelists are asked to categorize examinees into two groups—an upper group that has clearly met
the standard and a lower group that has not met the standard. The cut score is the score that best discriminates between the two groups.
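The sketch below uses entirely hypothetical examinee scores and judgments to show how these two examinee-centered methods translate panelists' classifications into a cut score: the borderline-group cut is the median score of the examinees judged to be right at the standard, and the contrasting-groups cut is chosen here by minimizing misclassification between the two judged groups (one of several common ways of separating the distributions).

```python
# Illustrative sketch (hypothetical data): examinee-centered cut-score methods.
from statistics import median

borderline_scores = [261, 248, 270, 255, 266, 259]   # examinees judged "borderline"
cut_borderline = median(borderline_scores)           # borderline-group cut = median score

not_met = [221, 240, 233, 252, 245, 228]             # examinees judged below the standard
met = [258, 275, 249, 290, 268, 281]                 # examinees judged to have met the standard

def misclassified(cut):
    # "met" examinees falling below the cut plus "not met" examinees at or above it
    return sum(s < cut for s in met) + sum(s >= cut for s in not_met)

candidate_cuts = range(min(not_met + met), max(not_met + met) + 1)
cut_contrasting = min(candidate_cuts, key=misclassified)

print(f"Borderline-group cut: {cut_borderline}")
print(f"Contrasting-groups cut: {cut_contrasting}")
```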
The Jaeger-Mills integrated judgment procedure and the body of work procedure are examples of standards-centered methods. With these methods, panelists examine full sets of examinees’ responses and match the full set of responses to a performance level (Jaeger and Mills, 2001; Kingston et al., 2001). Texts such as Jaeger (1989) and Cizek (2001a) provide full descriptions of these and the other available methods.
Although the methods differ in their approaches to setting cut scores, all ultimately rely on judgments. The psychometric literature documents procedures for systematizing the process of obtaining judgments about cut scores (e.g., see Jaeger, 1989; Cizek, 2001a). Use of systematic and careful procedures can increase the likelihood of obtaining fair and reasoned judgments, thus improving the reliability and validity of the results. Nevertheless, the psychometric field acknowledges that there are no “correct” standards, and the ultimate judgments depend on the method used, the way it is carried out, and the panelists themselves (Brennan, 1998; Green, Trimble, and Lewis, 2003; Jaeger, 1989; Zieky, 2001).
The literature on setting cut scores includes critiques of the various methods that document their strengths and weaknesses. As might be expected, methods that have been used widely and for some time, such as the modified Angoff procedure, have been the subject of more scrutiny than recently developed methods like the bookmark procedure. A review of these critiques quickly reveals that there are no perfect or correct methods. Like the cut-score-setting process itself, choice of a specific procedure requires making an informed judgment about the most appropriate method for a given assessment situation. Additional information about methods for setting cut scores appears in Chapter 5, where we describe the procedures we used.
The NALS tasks were drawn from the contexts that adults encounter on a daily basis. As mentioned in Chapter 2, these contexts include work, home and family, health and safety, community and citizenship, consumer economics, and leisure and recreation. Some of the tasks had been used on the earlier adult literacy assessments (the Young Adult Literacy Survey in 1985 and the survey of job seekers in 1990), to allow comparison with the earlier results, and some were newly developed for NALS.
The tasks that were included on NALS were intended to profile and describe performance in each of the specified contexts. However, NALS was not designed to support inferences about the level of literacy adults need in order to function in the various contexts. That is, there was no
attempt to systematically define the critical literacy demands in each of the contexts. The test designers specifically emphasize this, saying: “[The literacy levels] do not reveal the types of literacy demands that are associated with particular contexts…. They do not enable us to say what specific level of prose, document, or quantitative skill is required to obtain, hold, or advance in a particular occupation, to manage a household, or to obtain legal or community services” (Kirsch et al., 1993, p. 9). This is an important point, because it demonstrates that some of the inferences made by policy makers and the media about the 1992 results were clearly not supported by the test development process and the intent of the assessment.
The approach toward test development used for NALS does not reflect typical procedures used when the objective of an assessment is to distinguish individuals with adequate levels of skills from those whose skills are inadequate. We point this out, not to criticize the process, but to clarify the limitations placed on the inferences that can be drawn about the results. To explain, it is useful to contrast the test development procedures used for NALS with procedures used in other assessment contexts, such as licensing and credentialing or state achievement testing.
Licensing and credentialing assessments are generally designed to distinguish performance that demonstrates sufficient competence in the targeted knowledge, skills, and capabilities (judged as passing) from performance that is inadequate (judged as failing); that is, they are intentionally developed to separate adequate from inadequate performance. The test development process involves specification of the skills critical to adequate performance, generally determined by systematically collecting judgments from experts in the specific field (e.g., via surveys) about what a licensed practitioner needs to know and be able to do. The process for setting cut scores relies on expert judgments about just how much of that specific knowledge and skill is needed for a candidate to be placed in the passing category.
The process for test development and determining performance levels for state K-12 achievement tests is similar. Under ideal circumstances, the performance-level categories and their descriptions are determined in advance of or concurrent with item development, and items are developed to measure skills described by the performance levels. The process of setting the cut scores then focuses on determining the level of performance considered to be adequate mastery of the content and skills (often called “proficient”). Categories of performance below and above the proficient level are also often described to characterize the score distribution of the group of test takers.
The process for developing NALS and determining the performance levels was different. This approach toward test development does not—and was not intended to—provide the necessary foundation for setting standards for what adults need in order to function adequately in society, and there is no way to compensate for this after the fact. That is, there is no way to set a specific cut score that would separate adults who have sufficient literacy skills to function in society from those who do not. This does not mean that performance levels should not be used for reporting NALS results or that cut scores should not be set. But it does mean that users need to be careful about which inferences about the test results can be supported and which cannot.
The process of determining performance levels for the 1992 NALS was based partially on analyses conducted on data from the two earlier assessments of adults’ literacy skills. The analyses focused on identifying the features of the assessment tasks and stimulus materials that contributed to the difficulty of the test questions. These analyses had been used to determine performance levels for the Survey of Workplace Literacy, the survey of job seekers conducted in 1990.1 The analyses conducted on the prior surveys were not entirely replicated for NALS. Instead, new analyses were conducted to evaluate the appropriateness of the performance levels and associated cut scores that had been used for the survey of job seekers. Based on these analyses, slight adjustments were made in the existing performance levels before adopting them for NALS. This process is described more fully below.
The first step in the process that ultimately led to the formulation of NALS performance levels was an in-depth examination of the items included on the Young Adult Literacy Survey and the Survey of Workplace Literacy, to identify the features judged to contribute to their complexity.2
For the prose literacy items, four features were judged to contribute to their complexity:
Type of match: whether finding the information needed to answer the question involved simply locating the answer in the text, cycling through the text iteratively, integrating multiple pieces of information, or generating new information based on prior knowledge.

Abstractness of the information requested.

Plausibility of distractors: the extent and location of information related to the question, other than the correct answer, that appears in the stimulus.

Readability, as estimated using Fry’s (1977) readability index.

1 The analyses were conducted on the Young Adult Literacy Survey, but performance levels were not used in reporting its results. The analyses were partly replicated and extended to yield performance levels for the Survey of Workplace Literacy.

2 See Chapter 13 of the NALS Technical Manual for additional details about the process (http://www.nces.ed.gov/pubsearch/pubsinfo.asp?pubid=2001457).
The features judged to contribute to the complexity of document literacy items were the same as for prose, with the exception that an index of the structural complexity of the display was substituted for the readability index. For the quantitative literacy items, the identified features included type of match and plausibility of the distractors, as with the prose items, and structural complexity, as with the document items, along with two other features:
Operation specificity: the process required for identifying the operation to perform and the numbers to manipulate.
Type of calculation: the type and number of arithmetic operations.
A detailed schema was developed for use in “scoring” items according to these features, and the scores were referred to as complexity ratings.
The next step in the process involved determination of the cut scores for the performance levels used for reporting results of the 1990 Survey of Workplace Literacy. The process involved rank-ordering the items according to a statistical estimate of their difficulty, which was calculated using data from the actual survey respondents. The items were listed in order from least to most difficult, and the judgment-based ratings of complexity were displayed on the listing. Tables 3-1 through 3-3, respectively, present the lists of prose, document, and quantitative items rank-ordered by difficulty level.
This display was visually examined for natural groupings or break points. According to Kirsch, Jungeblut, and Mosenthal (2001, p. 332), “visual inspection of the distribution of [the ratings] along each of the literacy scales revealed several major [break] points occurring at roughly 50 point intervals beginning with a difficulty score of 225 on each scale.”
The process of determining the break points was characterized as containing “some noise” and not accounting for all the score variance associated with performance on the literacy scales. It was noted that the shifts in complexity ratings did not necessarily occur at exactly 50 point intervals on the scales, but that assigning the exact range of scores to each level (e.g.,
277-319 for Level 3 of document literacy; and 331-370 for Level 4 of quantitative literacy) would imply a level of precision of measurement that the test designers believed was inappropriate for the methodology adopted. Thus, identical score intervals were adopted for each of the three literacy scales as shown below:

Level 1: 0–225
Level 2: 226–275
Level 3: 276–325
Level 4: 326–375
Level 5: 376–500

Performance-level descriptions were developed by summarizing the features of the items that had difficulty values that fell within each of the score ranges.

TABLE 3-1 List of Prose Literacy Tasks, Along with RP80 Task Difficulty, IRT Item Parameters, and Values of Variables Associated with Task Difficulty: 1990 Survey of the Literacy of Job-Seekers

Level | Identifier | Task Description | RP80 | a | b | c | Readability | Type of Match | Distractor Plausibility | Information Type
1 | A111301 | Toyota, Acura, Nissan | 189 | 0.868 | –2.488 | 0.000 | 8 | 1 | 1 | 1
1 | AB21101 | Swimmer: Underline sentence telling what Ms. Chanin ate | 208 | 1.125 | –1.901 | 0.000 | 8 | 1 | 1 | 1
1 | A120501 | Blood donor pamphlet | 216 | 0.945 | –1.896 | 0.000 | 7 | 1 | 1 | 2
1 | A130601 | Summons for jury service | 237 | 1.213 | –1.295 | 0.000 | 7 | 3 | 2 | 2
2 | A120301 | Blood donor pamphlet | 245 | 0.956 | –1.322 | 0.000 | 7 | 1 | 2 | 3
2 | A100201 | PHP subscriber letter | 249 | 1.005 | –1.195 | 0.000 | 10 | 3 | 1 | 3
2 | A111401 | Toyota, Acura, Nissan | 250 | 1.144 | –1.088 | 0.000 | 8 | 3 | 2 | 4
2 | A121401 | Dr. Spock column: Alterntv to phys punish | 251 | 1.035 | –1.146 | 0.000 | 8 | 2 | 2 | 3
2 | AB21201 | Swimmer: Age Ms. Chanin began to swim competitively | 250 | 1.070 | –1.125 | 0.000 | 8 | 3 | 4 | 2
2 | A131001 | Shadows Columbus saw | 280 | 1.578 | –0.312 | 0.000 | 9 | 3 | 1 | 2
2 | AB80801 | Illegal questions | 265 | 1.141 | –0.788 | 0.000 | 6 | 3 | 2 | 2
2 | AB41001 | Declaration: Describe what poem is about | 263 | 0.622 | –1.433 | 0.000 | 4 | 3 | 1 | 3
2 | AB81101 | New methods for capital gains | 277 | 1.025 | –0.638 | 0.000 | 7 | 4 | 1 | 3
2 | AB71001 | Instruction to return appliance: Indicate best note | 275 | 1.378 | –0.306 | 0.266 | 5 | 3 | 2 | 3
2 | AB90501 | Questions for new jurors | 281 | 1.118 | –0.493 | 0.000 | 6 | 4 | 2 | 1
2 | AB90701 | Financial security tips | 262 | 1.563 | –0.667 | 0.000 | 8 | 3 | 2 | 4
2 | A130901 | Shadows Columbus saw | 282 | 1.633 | –0.255 | 0.000 | 9 | 3 | 4 | 1
3 | AB60201 | Make out check: Write letter explaining bill error | 280 | 1.241 | –0.440 | 0.000 | 7 | 3 | 2 | 4
3 | AB90601 | Financial security tips | 299 | 1.295 | –0.050 | 0.000 | 8 | 2 | 2 | 4
3 | A121201 | Dr. Spock column: Why phys punish accptd | 285 | 1.167 | –0.390 | 0.000 | 8 | 3 | 2 | 4
3 | AB70401 | Almanac vitamins: List correct info from almanac | 289 | 0.706 | –0.765 | 0.000 | 7 | 3 | 4 | 1
3 | A100301 | PHP subscriber letter | 294 | 0.853 | –0.479 | 0.000 | 10 | 4 | 3 | 2
3 | A130701 | Shadows Columbus saw | 298 | 1.070 | –0.203 | 0.000 | 9 | 3 | 2 | 3
3 | A130801 | Shadows Columbus saw | 303 | 0.515 | –0.929 | 0.000 | 9 | 3 | 2 | 2
3 | AB60601 | Economic index: Underline sentence explaining action | 305 | 0.809 | –0.320 | 0.000 | 10 | 3 | 2 | 4
3 | A121301 | Dr. Spock column: 2 cons against phys punish | 312 | 0.836 | –0.139 | 0.000 | 8 | 3 | 3 | 4
3 | AB90401 | Questions for new jurors | 300 | 1.230 | –0.072 | 0.000 | 6 | 4 | 2 | 3
3 | AB80901 | Illegal questions | 316 | 0.905 | –0.003 | 0.000 | 6 | 4 | 3 | 3
3 | A111101 | Toyota, Acura, Nissan | 319 | 0.772 | –0.084 | 0.000 | 8 | 4 | 3 | 2
4 | AB40901 | Korean Jet: Give argument made in article | 329 | 0.826 | 0.166 | 0.000 | 10 | 4 | 4 | 4
4 | A131101 | Shadows Columbus saw | 332 | 0.849 | 0.258 | 0.000 | 9 | 5 | 4 | 1
4 | AB90801 | Financial security tips | 331 | 0.851 | 0.236 | 0.000 | 8 | 5 | 5 | 2
4 | AB30601 | Technology: Orally explain info from article | 333 | 0.915 | 0.347 | 0.000 | 8 | 4 | 4 | 4
4 | AB50201 | Panel: Determine surprising future headline | 343 | 1.161 | 0.861 | 0.196 | 13 | 4 | 4 | 4
4 | A101101 | AmerExp: 2 similarities in handling receipts | 346 | 0.763 | 0.416 | 0.000 | 8 | 4 | 2 | 4
4 | AB71101 | Explain difference between 2 types of benefits | 348 | 0.783 | 0.482 | 0.000 | 9 | 6 | 2 | 5
4 | AB81301 | New methods for capital gains | 355 | 0.803 | 0.652 | 0.000 | 7 | 5 | 5 | 3
4 | A120401 | Blood donor pamphlet | 358 | 0.458 | –0.056 | 0.000 | 7 | 4 | 5 | 2
4 | AB31201 | Dickinson: Describe what is expressed in poem | 363 | 0.725 | 0.691 | 0.000 | 6 | 6 | 2 | 4
4 | AB30501 | Technology: Underline sentence explaining action | 371 | 0.591 | 0.593 | 0.000 | 8 | 6 | 4 | 4
5 | AB81201 | New methods for capital gains | 384 | 0.295 | –0.546 | 0.000 | 7 | 2 | 4 | 2
5 | A111201 | Toyota, Acura, Nissan | 404 | 0.578 | 1.192 | 0.000 | 8 | 8 | 4 | 5
5 | A101201 | AmExp: 2 diffs in handling receipts | 441 | 0.630 | 2.034 | 0.000 | 8 | 7 | 5 | 5
5 | AB50101 | Panel: Find information from article | 469 | 0.466 | 2.112 | 0.000 | 13 | 6 | 5 | 4

TABLE 3-2 List of Document Literacy Tasks, Along with RP80 Task Difficulty Score, IRT Item Parameters, and Values of Variables Associated with Task Difficulty (structural complexity, type of match, plausibility of distractor, type of information): 1990 Survey of the Literacy of Job-Seekers

Level | Identifier | Task Description | RP80 | a | b | c | Complexity | Type of Match | Distractor Plausibility | Information Type
1 | SCOR100 | Social Security card: Sign name on line | 70 | 0.505 | –4.804 | 0.000 | 1 | 1 | 1 | 1
1 | SCOR300 | Driver’s license: Locate expiration date | 152 | 0.918 | –2.525 | 0.000 | 2 | 1 | 2 | 1
1 | SCOR200 | Traffic signs | 176 | 0.566 | –2.567 | 0.000 | 1 | 1 | 1 | 1
1 | AB60803 | Nurses’ convention: What is time of program? | 181 | 1.439 | –1.650 | 0.000 | 1 | 1 | 1 | 1
1 | AB60802 | Nurses’ convention: What is date of program? | 187 | 1.232 | –1.620 | 0.000 | 1 | 1 | 1 | 1
1 | SCOR400 | Medicine dosage | 186 | 0.442 | –2.779 | 0.000 | 2 | 1 | 2 | 2
1 | AB71201 | Mark correct movie from given information | 189 | 0.940 | –1.802 | 0.000 | 8 | 2 | 2 | 1
1 | A110501 | Registration & tuition info | 189 | 0.763 | –1.960 | 0.000 | 3 | 1 | 2 | 2
1 | AB70104 | Job application: Complete personal information | 193 | 0.543 | –2.337 | 0.000 | 1 | 2 | 1 | 2
1 | AB60801 | Nurses’ convention: Write correct day of program | 199 | 1.017 | –1.539 | 0.000 | 1 | 1 | 2 | 1
1 | SCOR500 | Theatre trip information | 197 | 0.671 | –1.952 | 0.000 | 2 | 1 | 2 | 2
1 | AB60301 | Phone message: Write correct name of caller | 200 | 1.454 | –1.283 | 0.000 | 1 | 1 | 2 | 1
1 | AB60302 | Phone message: Write correct number of caller | 202 | 1.069 | –1.434 | 0.000 | 1 | 1 | 1 | 1
1 | AB80301 | How companies share market | 203 | 1.292 | –1.250 | 0.000 | 7 | 2 | 2 | 2
1 | AB60401 | Food coupons | 204 | 0.633 | –1.898 | 0.000 | 3 | 2 | 2 | 1
1 | AB60701 | Nurses’ convention: Who would be asked questions | 206 | 1.179 | –1.296 | 0.000 | 1 | 2 | 2 | 1
1 | A120601 | MasterCard/Visa statement | 211 | 0.997 | –1.296 | 0.000 | 6 | 1 | 2 | 2
1 | AB61001 | Nurses’ convention: Write correct place for tables | 217 | 0.766 | –1.454 | 0.000 | 1 | 1 | 2 | 2
1 | A110301 | Dessert recipes | 216 | 1.029 | –1.173 | 0.000 | 5 | 3 | 2 | 1
1 | AB70903 | Checking deposit: Enter correct amount of check | 223 | 1.266 | –0.922 | 0.000 | 3 | 2 | 2 | 1
1 | AB70901 | Checking deposit: Enter correct date | 224 | 0.990 | –1.089 | 0.000 | 3 | 1 | 1 | 1
1 | AB50801 | Wage & tax statement: What is current net pay? | 224 | 0.734 | –1.366 | 0.000 | 5 | 2 | 2 | 2
1 | A130201 | El Paso Gas & Electric bill | 223 | 1.317 | –0.868 | 0.000 | 8 | 1 | 2 | 2
2 | AB70801 | Classified: Match list with coupons | 229 | 1.143 | –0.881 | 0.000 | 8 | 2 | 3 | 1
2 | AB30101 | Street map: Locate intersection | 232 | 0.954 | –0.956 | 0.000 | 4 | 2 | 2 | 2
2 | AB30201 | Sign out sheet: Respond to call about resident | 232 | 0.615 | –1.408 | 0.000 | 2 | 3 | 2 | 1
2 | AB40101 | School registration: Mark correct age information | 234 | 0.821 | –1.063 | 0.000 | 6 | 2 | 2 | 3
2 | A131201 | Tempra dosage chart | 233 | 1.005 | –0.872 | 0.000 | 5 | 2 | 3 | 3
2 | AB31301 | Facts about fire: Mark information in article | 235 | 0.721 | –1.170 | 0.000 | 1 | 2 | 3 | 2
2 | AB80401 | How companies share market | 236 | 1.014 | –0.815 | 0.000 | 7 | 3 | 2 | 2
2 | AB60306 | Phone message: Write whom message is for | 237 | 0.948 | –0.868 | 0.000 | 1 | 2 | 3 | 1
2 | AB60104 | Make out check: Enter correct amount written out | 238 | 1.538 | –0.525 | 0.000 | 6 | 3 | 2 | 1
2 | AB21301 | Bus schedule | 238 | 0.593 | –1.345 | 0.000 | 2 | 2 | 3 | 2
2 | A110201 | Dessert recipes | 239 | 0.821 | –0.947 | 0.000 | 5 | 3 | 2 | 1
2 | AB30301 | Sign out sheet: Respond to call about resident | 240 | 0.904 | –0.845 | 0.000 | 2 | 2 | 2 | 3
2 | AB30701 | Major medical: Locate eligibility from table | 245 | 0.961 | –0.703 | 0.000 | 4 | 2 | 2 | 2
2 | AB60103 | Make out check: Enter correct amount in numbers | 245 | 0.993 | –0.674 | 0.000 | 6 | 3 | 2 | 1
2 | AB60101 | Make out check: Enter correct date on check | 246 | 1.254 | –0.497 | 0.000 | 6 | 3 | 2 | 1
2 | AB60102 | Make out check: Paid to the correct place | 246 | 1.408 | –0.425 | 0.000 | 6 | 3 | 2 | 1
2 | AB50401 | Catalog order: Order product one | 247 | 0.773 | –0.883 | 0.000 | 8 | 3 | 2 | 1
2 | AB60303 | Phone message: Mark “please call” box | 249 | 0.904 | –0.680 | 0.000 | 1 | 2 | 2 | 2
2 | AB50701 | Almanac football: Explain why an award is given | 254 | 1.182 | –0.373 | 0.000 | 6 | 2 | 2 | 3
2 | AB20101 | Energy graph: Find answer for given conditions (1) | 255 | 1.154 | –0.193 | 0.228 | 4 | 3 | 2 | 1
2 | A120901 | MasterCard/Visa statement | 257 | 0.610 | –0.974 | 0.000 | 6 | 1 | 2 | 2
2 | A130101 | El Paso Gas & Electric bill | 257 | 0.953 | –0.483 | 0.000 | 8 | 2 | 2 | 2
2 | AB91101 | Minimum wage power | 260 | 0.921 | –0.447 | 0.000 | 4 | 3 | 3 | 2
2 | AB81001 | Consumer Reports books | 261 | 1.093 | –0.304 | 0.000 | 4 | 3 | 2 | 1
2 | AB90101 | Pest control warning | 261 | 0.889 | –0.471 | 0.000 | 2 | 3 | 3 | 2
2 | AB21501 | With graph, predict sales for spring 1985 | 261 | 0.799 | –0.572 | 0.000 | 5 | 3 | 2 | 2
2 | AB20601 | Yellow pages: Find place open Saturday | 266 | 1.078 | –0.143 | 0.106 | 7 | 3 | 2 | 1
2 | A130401 | El Paso Gas & Electric bill | 270 | 0.635 | –0.663 | 0.000 | 8 | 3 | 3 | 2
2 | AB70902 | Checking deposit: Enter correct cash amount | 271 | 0.858 | –0.303 | 0.000 | 3 | 3 | 3 | 2
3 | AB50601 | Almanac football: Locate page of info in almanac | 276 | 1.001 | –0.083 | 0.000 | 5 | 3 | 2 | 2
3 | A110701 | Registration & tuition info | 277 | 0.820 | –0.246 | 0.000 | 3 | 2 | 5 | 2
3 | AB20201 | Energy graph: Find answer for given conditions (2) | 278 | 0.936 | –0.023 | 0.097 | 4 | 4 | 2 | 1
3 | AB31101 | Abrasive gd: Can product be used in given case? | 280 | 0.762 | –0.257 | 0.000 | 10 | 5 | 2 | 3
3 | AB80101 | Burning out of control | 281 | 0.550 | –0.656 | 0.000 | 2 | 3 | 2 | 2
3 | AB70701 | Follow directions on map: Give correct location | 284 | 0.799 | –0.126 | 0.000 | 4 | 4 | 2 | 2
3 | A110801 | Washington/Boston schedule | 284 | 0.491 | –0.766 | 0.000 | 9 | 2 | 4 | 2
3 | AB70301 | Almanac vitamins: Locate list of info in almanac | 287 | 0.754 | –0.134 | 0.000 | 5 | 3 | 4 | 2
3 | AB20401 | Yellow pages: Find a list of stores | 289 | 0.479 | –0.468 | 0.144 | 7 | 2 | 5 | 1
3 | AB20501 | Yellow pages: Find phone number of given place | 291 | 0.415 | –0.772 | 0.088 | 7 | 2 | 4 | 2
3 | AB60305 | Phone message: Write who took the message | 293 | 0.640 | –0.221 | 0.000 | 1 | 5 | 2 | 1
3 | AB30401 | Sign out sheet: Respond to call about resident (2) | 297 | 0.666 | –0.089 | 0.000 | 2 | 2 | 1 | 4
3 | AB31001 | Abrasive guide: Type of sandpaper for sealing | 304 | 0.831 | 0.285 | 0.000 | 10 | 4 | 2 | 2
3 | AB20301 | Energy: Yr 2000 source prcnt power larger than 71 | 307 | 1.090 | 0.684 | 0.142 | 4 | 4 | 2 | 1
3 | AB90901 | U.S. Savings Bonds | 308 | 0.932 | 0.479 | 0.000 | 6 | 4 | 4 | 2
3 | AB60304 | Phone message: Write out correct message | 310 | 0.895 | 0.462 | 0.000 | 1 | 5 | 2 | 3
3 | AB81002 | Consumer Reports books | 311 | 0.975 | 0.570 | 0.000 | 4 | 3 | 5 | 2
3 | AB20801 | Bus schd: Take correct bus for given condition (2) | 313 | 1.282 | 0.902 | 0.144 | 10 | 3 | 5 | 2
3 | AB50402 | Catalog order: Order product two | 314 | 1.108 | 0.717 | 0.000 | 8 | 4 | 4 | 3
3 | AB40401 | Almanac: Find page containing chart for given info | 314 | 0.771 | 0.397 | 0.000 | 5 | 4 | 3 | 2
3 | AB21001 | Bus schd: Take correct bus for given condition (4) | 315 | 0.730 | 0.521 | 0.144 | 10 | 3 | 4 | 2
3 | AB60502 | Petroleum graph: Complete graph including axes | 318 | 1.082 | 0.783 | 0.000 | 10 | 6 | 2 | 2
3 | A120701 | MasterCard/Visa statement | 320 | 0.513 | –0.015 | 0.000 | 6 | 2 | 4 | 2
3 | AB20701 | Bus schd: Take correct bus for given condition (1) | 324 | 0.522 | 0.293 | 0.131 | 10 | 3 | 4 | 2
4 | A131301 | Tempra dosage chart | 326 | 0.624 | 0.386 | 0.000 | 5 | 4 | 4 | 2
4 | AB50501 | Telephone bill: Mark information on bill | 330 | 0.360 | –0.512 | 0.000 | 7 | 4 | 4 | 2
4 | AB91401 | Consumer Reports index | 330 | 0.852 | 0.801 | 0.000 | 7 | 3 | 5 | 3
4 | AB30801 | Almanac: Find page containing chart for given info | 347 | 0.704 | 0.929 | 0.000 | 5 | 4 | 5 | 2
4 | AB20901 | Bus schd: After 2:35, how long til Flint&Acad bus | 348 | 1.169 | 1.521 | 0.163 | 10 | 5 | 4 | 2
4 | A130301 | El Paso Gas & Electric bill | 362 | 0.980 | 1.539 | 0.000 | 8 | 5 | 4 | 5
4 | A120801 | MasterCard/Visa statement | 363 | 0.727 | 1.266 | 0.000 | 6 | 5 | 4 | 2
4 | AB91301 | Consumer Reports index | 367 | 0.620 | 1.158 | 0.000 | 7 | 4 | 5 | 3
5 | AB60501 | Petroleum graph: Label axes of graph | 378 | 1.103 | 1.938 | 0.000 | 11 | 7 | 2 | 5
5 | AB30901 | Almanac: Determine pattern in exports across years | 380 | 0.299 | 0.000 | 0.000 | 7 | 5 | 5 | 3
5 | A100701 | Spotlight economy | 381 | 0.746 | 1.636 | 0.000 | 10 | 5 | 5 | 2
5 | A100501 | Spotlight economy | 386 | 0.982 | 1.993 | 0.000 | 10 | 5 | 5 | 5
5 | A100401 | Spotlight economy | 406 | 0.489 | 1.545 | 0.000 | 10 | 5 | 5 | 2
5 | AB51001 | Income tax table | 421 | 0.257 | 0.328 | 0.000 | 9 | 4 | 5 | 2
5 | A100601 | Spotlight economy | 465 | 0.510 | 2.737 | 0.000 | 10 | 7 | 5 | 2

TABLE 3-3 List of Quantitative Literacy Tasks, Along with RP80 Task Difficulty, IRT Item Parameters, and Values of Variables Associated with Task Difficulty (structural complexity, type of match, plausibility of distractors, type of calculation, and specificity of operation): 1990 Survey of the Literacy of Job-Seekers

Level | Identifier | Task Description | RP80 | a | b | c | Complexity | Type of Match | Distractor Plausibility | Calculation Type | Operation Specificity
1 | AB70904 | Enter total amount of both checks being deposited | 221 | 0.869 | –1.970 | 0.000 | 2 | 1 | 1 | 1 | 1
2 | AB50404 | Catalog order: Shipping, handling, and total | 271 | 0.968 | –0.952 | 0.000 | 6 | 3 | 2 | 1 | 3
2 | AB91201 | Tempra coupon | 271 | 0.947 | –0.977 | 0.000 | 1 | 2 | 1 | 5 | 4
2 | AB40701 | Check ledger: Complete ledger (1) | 277 | 1.597 | –0.501 | 0.000 | 3 | 2 | 2 | 1 | 4
2 | A121001 | Insurance protection workform | 275 | 0.936 | –0.898 | 0.000 | 2 | 3 | 2 | 3 | 2
3 | AB90102 | Pest control warning | 279 | 0.883 | –0.881 | 0.000 | 2 | 3 | 3 | 1 | 4
3 | AB40702 | Check ledger: Complete ledger (2) | 281 | 1.936 | –0.345 | 0.000 | 3 | 2 | 2 | 2 | 4
3 | AB40703 | Check ledger: Complete ledger (3) | 282 | 1.874 | –0.332 | 0.000 | 3 | 1 | 2 | 2 | 4
3 | A131601 | Money rates: Thursday vs. one year ago | 281 | 1.073 | –0.679 | 0.000 | 4 | 3 | 2 | 2 | 4
3 | AB40704 | Check ledger: Complete ledger (4) | 283 | 1.970 | –0.295 | 0.000 | 3 | 2 | 2 | 2 | 4
3 | AB80201 | Burning out of control | 286 | 0.848 | –0.790 | 0.000 | 2 | 3 | 2 | 2 | 4
3 | A110101 | Dessert recipes | 289 | 0.813 | –0.775 | 0.000 | 5 | 3 | 2 | 2 | 4
3 | AB90201 | LPGA money leaders | 294 | 0.896 | –0.588 | 0.000 | 5 | 2 | 2 | 2 | 4
3 | A120101 | Businessland printer stand | 300 | 1.022 | –0.369 | 0.000 | 2 | 3 | 3 | 2 | 4
3 | AB81003 | Consumer Reports books | 301 | 0.769 | –0.609 | 0.000 | 7 | 2 | 3 | 1 | 4
3 | AB80601 | Valet airport parking discount | 307 | 0.567 | –0.886 | 0.000 | 2 | 3 | 3 | 2 | 4
3 | AB40301 | Unit price: Mark economical brand | 311 | 0.816 | 0.217 | 0.448 | 2 | 2 | 3 | 4 | 6
3 | A131701 | Money rates: Compare S&L w/mutual funds | 312 | 1.001 | –0.169 | 0.000 | 4 | 3 | 3 | 2 | 2
3 | AB80701 | Valet airport parking discount | 315 | 0.705 | –0.450 | 0.000 | 2 | 2 | 3 | 3 | 4
3 | A100101 | Pizza coupons | 316 | 0.690 | –0.472 | 0.000 | 2 | 3 | 3 | 1 | 4
3 | AB90301 | LPGA money leaders | 320 | 1.044 | 0.017 | 0.000 | 5 | 1 | 2 | 4 | 3
3 | A110401 | Dessert recipes | 323 | 1.180 | 0.157 | 0.000 | 5 | 3 | 2 | 3 | 6
3 | A131401 | Tempra dosage chart | 322 | 1.038 | 0.046 | 0.000 | 5 | 3 | 3 | 2 | 4
4 | AB40501 | Airline schedule: Plan travel arrangements (1) | 326 | 0.910 | 0.006 | 0.000 | 3 | 3 | 3 | 5 | 3
4 | AB70501 | Lunch: Determine correct change using info in menu | 331 | 0.894 | 0.091 | 0.000 | 2 | 2 | 2 | 5 | 4
4 | A120201 | Businessland printer stand | 340 | 0.871 | 0.232 | 0.000 | 2 | 3 | 4 | 3 | 5
4 | A110901 | Washington/Boston train schedule | 340 | 1.038 | 0.371 | 0.000 | 7 | 4 | 4 | 2 | 5
4 | AB60901 | Nurses’ convention: Write number of seats needed | 346 | 0.504 | –0.355 | 0.000 | 3 | 4 | 4 | 1 | 5
4 | AB70601 | Lunch: Determine 10% tip using given info | 349 | 0.873 | 0.384 | 0.000 | 2 | 1 | 2 | 5 | 7
4 | A111001 | Washington/Boston train schedule | 355 | 0.815 | 0.434 | 0.000 | 7 | 4 | 4 | 2 | 5
4 | A130501 | El Paso Gas & Electric bill | 352 | 0.772 | 0.323 | 0.000 | 8 | 3 | 4 | 2 | 2
4 | A100801 | Spotlight economy | 356 | 0.874 | 0.520 | 0.000 | 8 | 5 | 4 | 2 | 2
4 | AB40201 | Unit price: Estimate cost/oz of peanut butter | 356 | 0.818 | 0.455 | 0.000 | 2 | 1 | 2 | 4 | 5
4 | A121101 | Insurance protection workform | 356 | 0.860 | 0.513 | 0.000 | 2 | 1 | 2 | 5 | 4
4 | A100901 | Camp advertisement | 366 | 0.683 | 0.447 | 0.000 | 2 | 2 | 4 | 5 | 4
4 | A101001 | Camp advertisement | 366 | 0.974 | 0.795 | 0.000 | 2 | 3 | 4 | 5 | 4
4 | AB80501 | How companies share market | 371 | 1.163 | 1.027 | 0.000 | 6 | 3 | 2 | 3 | 6
5 | A131501 | Tempra dosage chart | 381 | 0.916 | 1.031 | 0.000 | 5 | 3 | 5 | 3 | 5
5 | AB50403 | Catalog order: Order product three | 382 | 0.609 | 0.601 | 0.000 | 6 | 4 | 5 | 5 | 5
5 | AB91001 | U.S. Savings Bonds | 385 | 0.908 | 1.083 | 0.000 | 6 | 4 | 5 | 2 | 4
5 | A110601 | Registration & tuition info | 407 | 0.624 | 1.078 | 0.000 | 8 | 2 | 5 | 5 | 5
5 | AB50301 | Interest charges: Orally explain computation | 433 | 0.602 | 1.523 | 0.000 | 2 | 5 | 5 | 5 | 7

These procedures were not entirely replicated to determine the performance levels for NALS, in part because NALS used some of the items from the two earlier assessments. Instead, statistical estimates of test question difficulty levels were carried out for the newly developed NALS items (the items that had not been used on the earlier assessments), and the correlation between these difficulty levels and the item complexity ratings was determined. The test designers judged the correlations to be sufficiently similar to those from the earlier assessments and chose to use the same score scale breakpoints for NALS as had been used for the performance levels for the Survey of Workplace Literacy. Minor adjustments were made to the language describing the existing performance levels. The resulting performance-level descriptions appear in Table 3-4.
The available written documentation about the procedures used for determining performance levels for NALS does not specify some of the more important details about the process (see Kirsch, Jungeblut, and Mosenthal, 2001, Chapter 13). For instance, it is not clear who participated in producing the complexity ratings or exactly how this task was handled. Determination of the cut scores involved examination of the listing of items for break points, but the break points are not entirely obvious. It is not clear that other people looking at this list would make the same choices for break points. In addition, it is not always clear whether the procedures described in the technical manual pertain to NALS or to one of the earlier assessments. A more open and public process combined with more explicit, transparent documentation is likely to lead to better understanding of how the levels were determined and what conclusions can be drawn about the results.
The performance levels produced by this approach were score ranges based on the cognitive processes required to respond to the items. While the 1992 score levels were used to inform a variety of programmatic decisions, there is a benefit to developing performance levels through open discussions with stakeholders. Such a process would result in levels that would be more readily understood.
The process for determining the cut scores for the performance levels
used for reporting NALS in 1992 did not involve one of the typical methods documented in the psychometric literature. This is not to criticize the test designers’ choice of procedures, as it appears that they were not asked to set standards for NALS, and hence one would not expect them to use one of these methods. It is our view, however, that there are benefits to using one or more of these documented methods. Use of established procedures for setting cut scores allows one to draw from the existing research and experiential base to gather information about the method, such as prescribed ways to implement the method, variations on the method, research on its advantages and disadvantages, and so on. In addition, use of established procedures facilitates communication with others about the general process. For example, if the technical manual for an assessment program indicates that the body of work method was used to set the cut scores, people can refer to the research literature for further details about what this typically entails.
The difficulty level of test questions can be estimated using a statistical procedure called item response theory (IRT). With IRT, a curve is estimated that gives the probability of a correct response from individuals across the range of proficiency. The curve is described in terms of parameters in a mathematical model. One of the parameter estimates, the difficulty parameter, typically corresponds to the score (or proficiency level) at which an individual has a 50 percent chance of answering the question correctly. Under this approach, it is also possible to designate, for the purposes of interpreting an item’s response curve, the proficiency at which the probability is any particular value that users find helpful. In 1992 the test developers chose to calculate test question difficulty values representing the proficiency level at which an individual had an 80 percent chance of answering an item correctly. The items were rank-ordered according to this estimate of their difficulty levels. Thus, the scaled scores used in determining the score ranges associated with the five performance levels were the scaled scores associated with an 80 percent probability of responding correctly.
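As a rough sketch of that calculation: under a two-parameter logistic model, the proficiency at which the probability of success reaches a chosen criterion has a closed form, and items can be rank-ordered by that value. The item names and parameter values below are hypothetical illustrations, not the NALS estimates.

```python
import math

def rp_score(a, b, p):
    """Proficiency at which a 2PL item with parameters (a, b) is answered
    correctly with probability p: theta = b + ln(p / (1 - p)) / a."""
    return b + math.log(p / (1.0 - p)) / a

# Hypothetical item parameters (discrimination a, difficulty b) on the theta scale.
items = {"item_1": (1.1, -0.8), "item_2": (0.7, 0.2), "item_3": (1.4, 0.5)}

# Compute each item's rp80 value and list the items from easiest to hardest.
rp80 = {name: rp_score(a, b, 0.80) for name, (a, b) in items.items()}
for name, theta in sorted(rp80.items(), key=lambda kv: kv[1]):
    print(f"{name}: rp80 difficulty = {theta:.2f}")
```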
The choice of the specific response probability value (e.g., 50, 65, or 80 percent) does not affect either the estimates of item response curves or distributions of proficiency. It is nevertheless an important decision because it affects users’ interpretations of the value of the scale scores used to separate the performance levels. Furthermore, due to the imprecision of the connection between the mathematical definitions of response probability values and the linguistic descriptions of their implications for performance
that judges use to set standards, the cut scores could be higher or lower simply as a consequence of the response probability selected. As mentioned earlier, the decision to use a response probability of 80 percent for the 1992 NALS has been the subject of subsequent debate, which has centered on whether the use of a response probability of 80 percent may have misrepresented the literacy levels of adults in the United States by producing cut scores that were too high (Baron, 2002; Kirsch, 2002; Kirsch et al., 2001, Ch. 14; Matthews, 2001; Sticht, 2004), to the extent that having a probability lower than 80 percent was misinterpreted as “not being able to do” the task required by an item.
In the final chapter of the technical manual (see Kirsch et al., 2001, Chapter 14), Kolstad demonstrated how the choice of a response probability value affects the value of the cut scores, under the presumption that response probability values might change considerably, while the everyday interpretation of the resulting numbers did not. He conducted a reanalysis of NALS data using a response probability value of 50 percent; that is, he calculated the difficulty of the items based on a 50 percent probability of responding correctly. This reanalysis demonstrated that use of a response probability value of 50 percent rather than 80 percent, with both interpreted by the same everyday language interpretation (e.g., that an individual at that level was likely to get an item correct), would have lowered the cut scores associated with the performance levels in such a way that a much smaller percentage of adults would have been classified at the lowest level. For example, the cut score based on a response probability of 80 placed slightly more than 20 percent of respondents in the lowest performance level; the cut score based on a response probability of 50 classified only 9 percent at this level.
It is important to point out here that the underlying distribution of scores did not change (and clearly could not change) with this reanalysis. There were no differences in the percentages of individuals scoring at each scale score. The only changes were the response probability criteria and interpretation of the cut scores. Using 80 percent as the response probability criterion, we would say that 20 percent of the population could perform the skills described by the first performance level with 80 percent accuracy. If the accuracy level was set at 50 percent and the same everyday language interpretation was applied, a larger share of the population could be said to perform these skills.
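A small simulation can make this point concrete: the score distribution below is held fixed, and only the cut score applied to it changes. The normal distribution and the lower, rp50-style cut of 190 are illustrative assumptions chosen to echo the roughly 20 percent versus 9 percent contrast described above; they are not the actual NALS distribution or the reanalysis cut score.

```python
import random

random.seed(0)
# Simulated scale scores; the mean and standard deviation are hypothetical.
scores = [random.gauss(mu=272, sigma=60) for _ in range(100_000)]

def pct_at_or_below(cut):
    return 100.0 * sum(s <= cut for s in scores) / len(scores)

print(f"Percent at or below 225 (rp80-based Level 1 cut):   {pct_at_or_below(225):.1f}%")
print(f"Percent at or below 190 (hypothetical lower cut):    {pct_at_or_below(190):.1f}%")
# The distribution of scores is identical in both lines; only the cut differs.
```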
Like many decisions made in connection with developing a test, the choice of a specific response probability value requires both technical and nontechnical considerations. For example, a high response probability may
be adopted when the primary objective of the test is to certify, with a high degree of certainty, that test takers have mastered the content and skills. In licensing decisions, one would want to have a high degree of confidence that a potential license recipient has truly mastered the requisite subject matter and skills. When there are no high-stakes decisions associated with test results, a lower response probability value may be more appropriate.
Choice of a response probability value requires making a judgment, and reasonable people may disagree about which of several options is most appropriate. For this reason, it is important to lay out the logic behind the decision. It is not clear from the NALS Technical Manual (Kirsch et al., 2001) that the consequences associated with the choice of a response probability of 80 percent were fully explored or that other options were considered. Furthermore, the technical manual (Kirsch et al., 2001) contains contradictory information—one chapter that specifies the response probability value used and another chapter that demonstrates how alternate choices would have affected the resulting cut scores. Including contradictory information like this in a technical manual is very disconcerting to those who must interpret and use the assessment results.
It is our opinion that the choice of a response probability value to use in setting cut scores should be based on a thorough consideration of technical and nontechnical factors, such as the difficulty level of the test in relation to the proficiency level of the examinees, the objectives of the assessment, the ways the test results are used, and the consequences associated with these uses of test results. The logic and rationale for the choice should be clearly documented. Additional discussion of response probabilities appears in the technical note to this chapter, and we revisit the topic in Chapter 5.
Response probabilities are calculated for purposes other than determining cut scores. One of the most common uses of response probability values is to “map” items to specific score levels in order to more tangibly describe what it means to score at the specific level. For NALS, as described in the preceding section, the scale score associated with an 80 percent probability of responding correctly—abbreviated in the measurement literature as rp80—was calculated for each NALS item. Selected items were then mapped to the performance level whose associated score range encompassed the rp80 difficulty value. The choice of rp80 (as opposed to rp65, or some other value) appears to have been made both to conform to conventional item mapping practices at the time (e.g., NAEP used rp80 at the time, although it has since changed to rp67) and because it represents the concept of “mastery” as it is generally conceptualized in the field of education (Kirsch et al., 2001; personal communication, August 2004).
Item mapping is a useful tool for communicating about test performance. A common misperception accompanies its use, however: that individuals who score at a given level will respond correctly to the items mapped there and that those at lower levels will respond incorrectly. Many of the publicly reported NALS results displayed items mapped to only a single performance level, the level associated with a response probability of 80 percent. This all-or-nothing interpretation ignores the continuous nature of response probabilities. That is, for any given item, individuals at every score point have some probability of responding correctly.
Table 3-5, which originally appeared in Chapter 14 of the technical manual as Figure 14-4 (Kirsch et al., 2001), demonstrates this point using four sample NALS prose tasks. Each task is mapped to four different scale scores according to four different probabilities of a correct response (rp80, rp65, rp50, and rp35). Consider the first mapped prose task, “identify country in short article.” According to the figure, individuals who achieved a scaled score of 149 had an 80 percent chance of responding correctly; those who scored 123 had a 65 percent chance of responding correctly; those with a score of 102 had a 50 percent chance of responding correctly; and those who scored 81 had a 35 percent chance of responding correctly.
Although those who worked on NALS had a rationale for selecting an rp80 criterion for use in mapping exemplary items to the performance levels, other response probability values might have been used and displays such as Table 3-5 might have been prepared. If item mapping procedures are to be used in describing performance on NAAL, we encourage use of displays more like that in Table 3-5. Additional information about item mapping appears in the technical note to this chapter. We also revisit this issue in Chapter 6, where we discuss methods of communicating about NAAL results.
Recommendation 3-1: If the Department of Education decides to use an item mapping procedure to exemplify performance on the National Assessment of Adult Literacy (NAAL), displays should demonstrate that individuals who score at all of the performance levels have some likelihood of responding correctly to the items.
As clearly stated by the test designers, the decision to collapse the NALS score distribution into five categories or ranges of performance was not done with the intent or desire to establish standards reflecting the extent of literacy skills that adults in the United States need or should have. Creating such levels was a means to convey the summary of performance on NALS.
Some of the more important details about the process were not specified in the NALS Technical Manual (Kirsch et al., 2001). Determination of the cut scores involved examination of the listing of items for break points, but the actual break points were not entirely obvious. It is not clear who participated in this process or how decisions were made. In addition, the choice of the response probability value of 80 percent is not fully documented. All of this suggests that one should not automatically accept the five NALS performance categories as the representation of defensible or justified levels of performance expectations.
The performance levels produced by the 1992 approach were groupings based on judgments about the complexity of the thinking processes required to respond to the items. While these levels might be useful for characterizing adults’ literacy skills, the process through which they were determined is not one that would typically be used to derive performance levels expected to inform policy interventions or to identify needed programs. It is the committee’s view that a more open, transparent process that relies on and utilizes stakeholder feedback is more likely to result in performance levels informative for the sorts of decisions expected to be based on the results.
Such a process is more in line with currently accepted practices for setting cut scores. The Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999) specifically call for (1) clear documentation of the rationale and procedures used for establishing cut scores (Standard 4.19), (2) investigation of the relations between test scores and relevant criteria (Standard 4.20), and (3) designing the judgmental process so that judges can bring their knowledge and experience to bear in a reasonable way (Standard 4.21). We relied on this guidance offered by the Standards in designing our approach to developing performance levels and setting cut scores, which is the subject of the remainder of this report.
This technical note provides additional details about item response theory and response probabilities. The section begins with a brief introduction to the two-parameter item response model. This is followed by a discussion of how some of the features of item response models can be exploited to devise ways to map test items to scale score levels and further exemplify the skills associated with specified proficiency levels. The section
concludes with a discussion of factors to consider when selecting response probability values.

TABLE 3-5 Difficulty Values of Selected Tasks Along the Prose Literacy Scale, Mapped at Four Response Probability Criteria: The 1992 National Adult Literacy Survey

Task | RP 80 | RP 65 | RP 50 | RP 35
Identify country in short article | 149 | 123 | 102 | 81
Underline sentence explaining action stated in short article | 224 | 194 | 169 | 145
State in writing an argument made in a long newspaper story | | | | 255
As mentioned above, IRT methodology was used for scaling the 1992 NALS items. While some of the equations and computations required by IRT are complicated, the underlying theoretical concept is actually quite straightforward, and the methodology provides some statistics very useful for interpreting assessment results. The IRT equation (referred to as the two-parameter logistic model, or 2-PL for short) used for scaling the 1992 NALS data appears below:
P_i(θ) = exp[a_i(θ − b_i)] / {1 + exp[a_i(θ − b_i)]}        (3-1)
The left-hand side of the equation symbolizes the probability (P) of responding correctly to an item (e.g., item i) given a specified ability level (referred to as theta or θ). The right-hand side of the equation gives the mechanism for calculating the probability of responding correctly, where a_i and b_i are referred to as “item parameters,”3 and θ is the specified ability level. In IRT, this equation is typically used to estimate the probability that an individual, with a specified ability level θ, will correctly respond to an item. Alternatively, the probability P of a correct response can be specified along with the item parameters (a_i and b_i), and the equation can be solved for the value of theta associated with the specified probability value.
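A minimal sketch of Equation 3-1 in code, showing both uses described above: computing the probability of a correct response at a given θ, and solving for the θ associated with a specified probability. The item parameters in the example call are hypothetical.

```python
import math

def p_correct(theta, a, b):
    """Equation 3-1: probability of a correct response to an item at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def theta_for_probability(p, a, b):
    """Solve Equation 3-1 for theta given a target probability p."""
    return b + math.log(p / (1.0 - p)) / a

a_i, b_i = 1.2, -0.5  # hypothetical item parameters
print(p_correct(0.0, a_i, b_i))               # probability for a person at theta = 0
print(theta_for_probability(0.80, a_i, b_i))  # ability at which the probability is .80
```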
A hallmark of IRT is the way it describes the relation of the probability of an item response to scores on the scale reflecting the level of performance on the construct measured by the test. That description has two parts, as illustrated in Figure 3-1. The first part describes the population density, or distribution of persons over the variable being measured. For the illustration in Figure 3-1, the variable being measured is prose literacy as defined by the 1992 NALS. A hypothetical population distribution is shown in the upper panel of Figure 3-1, simulated as a normal distribution.4
FIGURE 3-1 Upper panel: Distribution of proficiency in the population for the prose literacy scale. Lower panel: The trace line, or item characteristic curve, for a sample prose item.
The second part of an IRT description of item performance is the trace line, or item characteristic curve. A trace line shows the probability of a correct response to an item as a function of proficiency (in this case, prose literacy). Such a curve is shown in the lower panel of Figure 3-1 for an item that is described as requiring “the reader to write a brief letter explaining that an error has been made on a credit card bill” (Kirsch et al., 1993, p. 78). For this item, the trace line in Figure 3-1 shows that people with prose literacy scale scores higher than 300 are nearly certain to respond correctly, while those with scores lower than 200 are nearly certain to fail. The
probability of a correct response rises relatively quickly as scores increase from 200 to 300.
Trace lines can be determined for each item on the assessment. The trace lines are estimated from the assessment data in a process called item calibration. Trace lines for the 39 open-ended items on the prose scale for the 1992 NALS are shown in Figure 3-2. The trace line shown in Figure 3-1 is one of those in the center of Figure 3-2. The variation in the trace lines for the different items in Figure 3-2 shows how the items vary in difficulty. Some trace lines are shifted to the left, indicating that lower scoring individuals have a high probability of responding correctly. Some trace lines are shifted to right, which means the items are more difficult and only very high-scoring individuals are likely to respond correctly.
As Figure 3-2 shows, some trace lines are steeper than others. The steeper the trace line, the more discriminating the item. That is, items with higher discrimination values are better at distinguishing among test takers’ proficiency levels.
FIGURE 3-2 Trace lines for the 39 open-ended items on the prose scale for the 1992 NALS.
FIGURE 3-3 Division of the 1992 NALS prose literacy scale into five levels.
The collection of trace lines is used for several purposes. One purpose is the computation of scores for persons with particular patterns of item responses. Another purpose is to link the scales from repeated assessments. Such trace lines for items repeated between assessments were used to link the scale of the 1992 NALS to the 1985 Young Adult Literacy Survey. A similar linkage was constructed between the 1992 NALS and the 2003 NAAL.
In addition, the trace lines for each item may be used to describe how responses to the items are related to alternate reporting schemes for the literacy scale. For reporting purposes, the prose literacy scale for the 1992 NALS was divided into five levels using cut scores that are shown embedded in the population distribution in Figure 3-3. Under this reporting scheme, individuals scoring 225 or lower were placed in Level 1; Levels 2, 3, and 4 each spanned 50 points (226-275, 276-325, and 326-375, respectively); and Level 5 included scores exceeding 375.
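As a rough illustration of how a population distribution like the one in the upper panel of Figure 3-1 is partitioned by these cut scores, the following sketch computes the share of a normal distribution falling in each level; the mean and standard deviation are hypothetical, chosen only for illustration.

```python
from math import erf, sqrt

def normal_cdf(x, mean, sd):
    """Cumulative probability of a normal distribution, via the error function."""
    return 0.5 * (1.0 + erf((x - mean) / (sd * sqrt(2.0))))

# Hypothetical population mean and standard deviation, for illustration only.
MEAN, SD = 275.0, 55.0
cuts = [225, 275, 325, 375]                      # 1992 NALS level boundaries
bounds = [float("-inf")] + cuts + [float("inf")]

for level, (lo, hi) in enumerate(zip(bounds[:-1], bounds[1:]), start=1):
    share = normal_cdf(hi, MEAN, SD) - normal_cdf(lo, MEAN, SD)
    print(f"Level {level}: {share:.1%} of the population")
```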
With a response probability (rp) criterion specified, it is possible to use the IRT model to “place” the items at some specific level on the scale. Placing an item at a specific level allows one to make statements or predictions about the likelihood that a person who scores at that level will answer the question correctly. For the 1992 NALS, items were placed at a specific level as part of the process used to decide on the cut scores among the five levels and for use in reporting examples of items. The rp value used was .80, meaning that each item was said to be “at” the value of the prose score scale for which the probability of a correct response was .80. For example, for the “write letter” item, it was said “this task is at 280 on the prose scale” (Kirsch et al., 1993, p. 78), as shown by the dotted lines in Figure 3-4.
FIGURE 3-4 Scale scores associated with rp values of .50, .67, and .80 for a sample item from the NALS prose scale.
Using these placements, items were said to be representative of what persons scoring in each level could do. Depending on where the item was placed within the level, it was noted whether an item was one of the easier or more difficult items in the level. For example, the “write letter” item was described as “one of the easier Level 3 tasks” (Kirsch et al., 1993, p. 78). These placements of items were also shown on item maps, such as the one that appeared on page 10 of Kirsch et al. (1993) (see Table 3-6); the purpose of the item maps is to aid in the interpretation of the meaning of scores on the scale and in the levels.
Some procedures, such as the bookmark standard-setting procedure, require the specification of an rp value to place the items on the scale. However, even when it is necessary to place an item at a specific point on the scale, it is important to remember that the location depends on the rp value chosen: with different rp values, the same item can be placed at many different points on the scale. For example, as illustrated in Figure 3-4, the “write letter” item is “at” 280 (and “in” Level 3, because that location is above 275) for an rp value of .80. However, this item is at 246, which places it in the lower middle of Level 2 (between 226 and 275), for an rp value of .50, and it is at 264, which is in the upper middle of Level 2, for an rp value of .67.
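The dependence of an item’s placement on the rp criterion can be reproduced with the same two-parameter logistic form; the parameters below are hypothetical, chosen only so that the placements come out near the values reported for the “write letter” item.

```python
from math import log

def rp_location(rp, a, b):
    """Scale score at which a 2PL trace line reaches a response probability of rp."""
    return b + log(rp / (1.0 - rp)) / a

# Hypothetical item parameters, chosen only so the placements land near the
# values reported for the "write letter" item (about 246, 264, and 280).
a, b = 0.041, 246.5
for rp in (0.50, 0.67, 0.80):
    print(f"rp{int(rp * 100)} placement: {rp_location(rp, a, b):.0f}")
```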
FIGURE 3-5 Percentage expected to answer the sample item correctly within each of the five levels of the 1992 NALS scale.
It should be emphasized that it is not necessary to place items at a single score location. For example, in reporting the results of the assessment, it is not necessary to say that an item is “at” some value (such as 280 for the “write letter” item).
Furthermore, there are more informative alternatives to placing items at a single score location. Saying that an item is “at” some scale value or “in” some level (as the “write letter” item is at 280 and in Level 3) suggests that people scoring lower, or in lower levels, do not respond correctly. That is not the case. The trace line itself, as shown in Figure 3-4, reminds us that many people scoring in Level 2 (more than the upper half of those in Level 2) have a better than 50-50 chance of responding correctly to this item. A more accurate depiction of the likelihood of a correct response was presented in Appendix D of the 1992 technical manual (Kirsch et al., 2001), which includes a representation of the trace line for each item at seven equally spaced scale scores between 150 and 450 (along with the rp80 value). This type of representation allows readers to make inferences about an item much like those suggested by Figure 3-4.
Figure 3-5 shows the percentage expected to answer the “write letter” item correctly in each of the five levels. These values can be computed from the IRT model (represented by equation 3-1) in combination with the population distribution.5 With access to the data, one can alternatively simply tabulate the observed proportion of examinees who responded correctly at each reporting level. The latter has been done often in recent NAEP reports (e.g., The Nation’s Report Card: Reading 2002, http://www.nces.ed.gov/pubsearch/pubsinfo.asp?pubid=2003521, Chapter 4, pp. 102ff).

5 They are the weighted average of the probabilities of a correct response given by the trace line for each score within the level, weighted by the population density of persons at that score (in the upper panel of Figure 3-1). Because a Gaussian population distribution is used, those values are not extremely accurate for the 1992 NALS; however, they are used here for illustrative purposes.
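A numerical version of the calculation described in footnote 5 might look like the following sketch; the item and population parameters are hypothetical, so the resulting percentages only roughly resemble those in Figure 3-5.

```python
import numpy as np

def expected_percent_correct(lo, hi, a, b, mean=275.0, sd=55.0, n=2000):
    """Average P(correct) across a score band, weighted by a (hypothetical)
    Gaussian population density -- a numerical version of footnote 5."""
    theta = np.linspace(lo, hi, n)
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))       # trace line values
    w = np.exp(-0.5 * ((theta - mean) / sd) ** 2)    # unnormalized density
    return float(np.sum(p * w) / np.sum(w))

# Level bounds; the outer limits (100 and 500) are arbitrary end points.
levels = {"Level 1": (100, 225), "Level 2": (226, 275), "Level 3": (276, 325),
          "Level 4": (326, 375), "Level 5": (376, 500)}
for name, (lo, hi) in levels.items():
    print(name, f"{expected_percent_correct(lo, hi, a=0.041, b=246.5):.0%}")
```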
The values in Figure 3-5 show clearly how misconceptions can arise from statements such as “this item is ‘in’ Level 3” (using an rp value of .80). While the item may be “in” Level 3, 55 percent of people in Level 2 are expected to respond correctly. So statements such as “because the item is in Level 3, people scoring in Level 2 would respond incorrectly” are wrong. For reporting results using sets of levels, a graphical or numerical summary of the probability of a correct response at multiple points on the score scale, such as that shown in Figure 3-5, is likely to be more informative and lead to more accurate interpretations.
As previously mentioned, for some purposes, such as the bookmark method of standard setting, it is essential that items be placed at a single location on the score scale, and an rp value must be selected to accomplish that. The bookmark method requires an “ordered item booklet” in which the items are placed in increasing order of difficulty. With the kinds of IRT models used for NALS and NAAL, different rp values place the items in different orders. For example, Figure 3-2 includes dotted lines that denote three rp values: rp80, rp67, and rp50. The item trace lines cross the dotted line for rp80 in one sequence, the line for rp67 in another sequence, and the line for rp50 in yet another. There are a number of factors to consider in selecting an rp criterion.
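The reordering can be illustrated with two hypothetical items that differ in discrimination; under this assumption the less discriminating item looks easier at rp50 but harder at the higher rp criteria.

```python
from math import log

def rp_location(rp, a, b):
    """Scale score at which a 2PL trace line reaches a response probability of rp."""
    return b + log(rp / (1.0 - rp)) / a

# Two hypothetical items: A is less discriminating (flatter trace line) than B.
items = {"item A": (0.02, 248.0), "item B": (0.08, 255.0)}

for rp in (0.50, 0.67, 0.80):
    order = sorted(items, key=lambda name: rp_location(rp, *items[name]))
    print(f"rp{int(rp * 100)} ordering:", " < ".join(order))
# The flatter item is "easier" at rp50 but "harder" at the higher rp criteria.
```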
One source of information on which to base the selection of an rp value involves empirical studies of the effects of different rp values on the standard-setting process (e.g., Williams and Schultz, 2005). Another source of information relevant to the selection of an rp value is purely statistical in nature, having to do with the relative precision of estimates of the scale scores associated with various rp values. To illustrate, Figure 3-6 shows the trace line for the “write letter” item as it passes through the middle of the prose score scale. The trace line is enclosed in dashed lines that represent the boundaries of a 95 percent confidence envelope for the curve. The confidence envelope for a curve is a region that includes the curves corresponding to the central 95 percent confidence interval for the (item) parameters that produce the curve. That is, the confidence envelope translates statistical uncertainty (due to random sampling) in the estimation of the item parameters into a graphical display of the consequent uncertainty in the location of the trace line itself.6
FIGURE 3-6 A 95 percent confidence envelope for the trace line for the sample item on the NALS prose scale.
A striking feature of the confidence envelope in Figure 3-6 is that it is relatively narrow, because the standard errors for the item parameters (reported in Appendix A of the 1992 NALS technical manual) are very small. Indeed, the envelope is so narrow that it is difficult to see in Figure 3-6 that it is actually narrower (either vertically or horizontally) around rp50 than around rp80. This means that there is less uncertainty in the scale score corresponding to rp50 than in the scale score corresponding to rp80. While this difference is not evident in the visual display (Figure 3-6), it has been documented previously (see Thissen and Wainer, 1990, for illustrations of confidence envelopes that are not so narrow and show their characteristic asymmetries more clearly).
Nonetheless, the confidence envelope may be used to translate the uncertainty in the item parameter estimates into descriptions of the uncertainty of the scale scores corresponding to particular rp values. Using the “write letter” NALS item as an illustration, at rp50 the confidence envelope encloses trace lines that would place the corresponding scale score anywhere between 245 and 248 (as shown by the solid lines connected to the dotted line for 0.50 in Figure 3-6). That range of three points is smaller than the four-point range for rp67 (from 262 to 266), which is, in turn, smaller than the range for the rp80 scale score (278-283).7

6 For a more detailed description of confidence envelopes in the context of IRT, see Thissen and Wainer (1990), who use results obtained by Thissen and Wainer (1982) and an algorithm described by Hauck (1983) to produce confidence envelopes like the dashed lines in Figure 3-6.
The rp80 values, as used for reporting the 1992 NALS results, have statistical uncertainty that is almost twice as large (5 points, from 278 to 283, around the reported value of 280 for the “write letter” item) as the rp50 values (3 points, from 245 to 248, for this item). The rp50 values are always most precisely estimated. So a purely statistical answer to the question, “What rp value is most precisely estimated, given the data?” would be rp50 for the item response model used for the binary-scored open-ended items in NALS and NAAL. The statistical uncertainty in the scale scores associated with rp values simply increases as the rp value increases above 0.50. It actually becomes very large for rp values of 90, 95, or 99 percent (which is no doubt the reason such rp values are never considered in practice).
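The same pattern can be illustrated by propagating uncertainty in the item parameters directly; the parameter estimates and standard errors below are hypothetical, so the particular intervals differ from those quoted above, but the widening at higher rp values is the point.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical parameter estimates and standard errors for a single item.
a_hat, b_hat = 0.041, 246.5
se_a, se_b = 0.002, 1.0

a = rng.normal(a_hat, se_a, 10_000)
b = rng.normal(b_hat, se_b, 10_000)

for rp in (0.50, 0.67, 0.80):
    loc = b + np.log(rp / (1.0 - rp)) / a           # scale score at this rp value
    lo, hi = np.percentile(loc, [2.5, 97.5])
    print(f"rp{int(rp * 100)}: roughly {lo:.0f} to {hi:.0f}")
# The interval widens as the rp criterion moves above .50.
```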
Nevertheless, the use of rp50 has been reported to be very difficult for judges in standard-setting processes, as well as other consumers, to interpret usefully (Williams and Schulz, 2004). What does it mean to say “the score at which the person has a 50-50 chance of responding correctly”? While that value may be useful (and interpretable) for a data analyst developing models for item response data, it is not so useful for consumers of test results who are more interested in ideas like “mastery.” An rp value of 67 percent, now commonly used in bookmark procedures (Mitzel et al., 2001), represents a useful compromise for some purposes. That is, the idea that there is a 2 in 3 chance that the examinee will respond correctly is readily interpretable as “more likely than not.” Furthermore, the statistical uncertainty of the estimate of the scale score associated with rp67 is larger than for rp50 but not as large as for rp80.
Figure 3-4 illustrates another statistical property of the trace lines used for NALS and NAAL that provides motivation for choosing an rp value closer to 50 percent. Note in Figure 3-2 that not only are the trace lines in a different (horizontal) order for rp values of 50, 67, and 80 percent, but they are also considerably more variable (more widely spread) at rp80 than they are at rp50. These greater variations at rp80, and the previously described wider confidence envelope, are simply due to the inherent shape of the trace line. As the trace line approaches a value of 1.0, it must flatten out, so it develops a “shoulder” whose location (in the left-right direction) is very uncertain for any particular value of the probability of a correct response (in the vertical direction). Figure 3-2 shows that variation in the discrimination of the items greatly accentuates the variation in the scale score location of high and low rp values.

7 Some explanation is needed. First, the rp50 interval is actually symmetrical. Earlier (Figure 3-4), the rp50 value was given as 246; the actual value, before rounding, is very close to 246.5, so the interval from 245 to 248 (which involves very little rounding) is both correct and symmetrical. The intervals for the higher rp values are expected to be asymmetrical.
Again, these kinds of purely statistical considerations would lead to a choice of rp50. Considerations of mastery in presenting and describing the results to many audiences suggest higher rp values. We suggest a compromise value of rp67, combined with a reminder that rp values are arbitrary values used in the standard-setting process and that reporting of the results can describe the likelihood of a correct response at any level or scale score.