In this chapter, we detail the processes we used for developing descriptions of the performance levels as well as the methods we used to determine the cut scores to be associated with each of the performance levels. The performance-level descriptions were developed through an iterative process in which the descriptions evolved as we drafted wording, solicited feedback, reviewed the assessment frameworks and tasks, and made revisions. The process of determining the cut scores involved using procedures referred to as “standard setting,” which were introduced in Chapter 3.
As we noted in Chapter 3, standard setting is intrinsically judgmental. Science enters the process only as a way of ensuring the internal and external validity of informed judgments (e.g., that the instructions are clear and understood by the panelists; that the standards are statistically reliable and reasonably consistent with external data, such as levels of completed schooling). Given the judgmental nature of the task, it is not easy to develop methods and procedures that are scientifically defensible; indeed, standard-setting procedures have provoked considerable controversy (e.g., National Research Council [NRC], 1998; Hambleton et al., 2001). In developing our procedures, we have familiarized ourselves with these controversies and have relied on the substantial research base on standard setting1 and, in
particular, on the research on setting achievement levels for the National Assessment of Educational Progress (NAEP).
NAEP’s standard-setting procedures are perhaps the most intensely scrutinized procedures in existence today, having been designed, guided, and evaluated by some of the most prominent measurement experts in the country. The discussions about NAEP’s procedures, both the favorable comments and the criticisms, provide guidance for those designing a standard-setting procedure. We attempted to implement procedures that reflected the best of what NAEP does and that addressed the criticisms that have been leveled against NAEP’s procedures. Below we highlight the major criticisms and describe how we addressed them. We raise these issues not to take sides on the various controversies but to explain how we used this information to design our standard-setting methods.
NAEP has for some time used the modified Angoff method for setting cut scores, a procedure that some consider to yield defensible standards (Hambleton and Bourque, 1991; Hambleton et al., 2000; Cizek, 1993, 2001a; Kane, 1993, 1995; Mehrens, 1995; Mullins and Green, 1994) and some believe to pose an overly complex cognitive task for judges (National Research Council, 1999; Shepard, Glaser, and Linn, 1993). While the modified Angoff method is still widely used, especially for licensing and certification tests, many other methods are available. In fact, although the method is still used for setting the cut scores for NAEP’s achievement levels, other methods are being explored with the assessment (Williams and Schulz, 2005). Given the unresolved controversies about the modified Angoff method, we chose not to use it. Instead, we selected a relatively new method, the bookmark standard-setting method, which appears to be growing in popularity. The bookmark method was designed specifically to reduce the cognitive complexity of the task posed to panelists (Mitzel et al., 2001). The procedure was endorsed as a promising method for use on NAEP (National Research Council, 1999) and, based on recent estimates, is used by more than half of the states in their K-12 achievement tests (Egan, 2001).
Another issue that has been raised in relation to NAEP’s standard-setting procedures is that different standard-setting methods were required for NAEP’s multiple-choice and open-ended items. The use of different methods led to widely disparate cut scores, and there has been disagreement about how to resolve these differences (Hambleton et al., 2000; National Research Council, 1999; Shepard, Glaser, and Linn, 1993). An advantage of the bookmark procedure is that it is appropriate for both item types. While neither the National Adult Literacy Survey (NALS) nor the National Assessment of Adult Literacy (NAAL) uses multiple-choice items, both include open-ended items, some of which were scored as right or wrong and some of which were scored according to a partial credit scoring scheme (e.g., wrong, partially correct, fully correct). The bookmark procedure is suitable for both types of scoring schemes.

1Summaries of the various standard-setting methods and their advantages and disadvantages are available, such as Jaeger’s article in Educational Measurement (1989) and the collection of writings in Cizek’s (2001b) Setting Performance Standards. We frequently refer readers to these writings because they provide a convenient and concise means for learning more about standard setting; however, we do not intend to imply that these were the only documents consulted.
Another issue discussed in relation to NAEP’s achievement-level setting was the collection of evidence used to evaluate the reasonableness of the cut scores. Concerns were expressed about the discordance between cut scores that resulted from different standard-setting methods (e.g., the modified Angoff method and the contrasting groups method yielded different cut scores for the assessment) and the effect of these differences on the percentages of students categorized into each of the achievement levels. Concerns were also expressed about whether the percentages of students in each achievement level were reasonable given other indicators of students’ academic achievement in the United States (e.g., performance on the SAT, percentage of students enrolled in Advanced Placement programs), although there was considerable disagreement about the appropriateness of such comparisons. While we do not consider that our charge required us to resolve these disagreements about NAEP’s cut scores, we did try to address the criticisms.
As a first step to address these concerns, we used the background data available from the assessment as a means for evaluating the reasonableness of the bookmark cut scores. To accomplish this, we developed an adapted version of the contrasting groups method, which utilizes information about examinees apart from their actual test scores. This quasi-contrasting groups (QCG) approach was not used as a strict standard-setting technique but as a means for considering adjustments to the bookmark cut scores. While validation of the recommended cut scores should be the subject of a thorough research endeavor that would be beyond the scope of the committee’s charge, comparison of the cut scores to pertinent background data provides initial evidence.
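To make the general logic concrete, the sketch below shows one minimal way a contrasting-groups-style comparison can be carried out; it is an illustration under assumed inputs, not the committee’s quasi-contrasting groups procedure, and the group labels, scores, and function name are hypothetical.

```python
# Illustrative sketch only; not the committee's QCG procedure. A generic
# contrasting-groups analysis locates the cut score that best separates
# examinees classified into adjacent groups on the basis of background
# information (e.g., self-reported educational attainment) rather than
# their test performance.
import numpy as np

def contrasting_groups_cut(lower_group_scores, higher_group_scores):
    """Return the candidate cut score minimizing total misclassification:
    lower-group members at or above the cut plus higher-group members below it."""
    candidates = np.unique(np.concatenate([lower_group_scores, higher_group_scores]))
    errors = [
        np.sum(lower_group_scores >= c) + np.sum(higher_group_scores < c)
        for c in candidates
    ]
    return candidates[int(np.argmin(errors))]

# Hypothetical scale scores for two adjacent background-defined groups
rng = np.random.default_rng(0)
group_without_credential = rng.normal(200, 25, size=500)
group_with_credential = rng.normal(250, 25, size=500)

print(contrasting_groups_cut(group_without_credential, group_with_credential))
```

A comparison of this kind asks whether a proposed cut score falls in a sensible place relative to where background-defined groups actually score, which is how we used the background data to evaluate the bookmark results.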
We begin our discussion with an overview of the bookmark standard-setting method and the way we implemented it. Participants in the standard settings provided feedback on the performance-level descriptions, and we present the different versions of the descriptions and explain why they were revised. The results of the standard settings appear at the end of this chapter, where we also provide a description of the adapted version of the contrasting groups procedure that we used and make our recommendations for cut scores. The material in this chapter provides an overview of the
bookmark procedures and highlights the most crucial results from the standard setting; additional details about the standard setting are presented in Appendixes C and D.
Relatively new, the bookmark procedure was designed to simplify the judgmental task by asking panelists to directly set the cut scores, rather than asking them to make judgments about test questions in isolation, as in the modified Angoff method (Mitzel et al., 2001). The method has the advantage of allowing participants to focus on the content and skills assessed by the test questions rather than just on the difficulty of the questions, as panelists are given “item maps” that detail item content (Zieky, 2001). The method also provides an opportunity to revise performance-level descriptions at the completion of the standard-setting process so they are better aligned with the cut scores.
In a bookmark standard-setting procedure, test questions are presented in a booklet arranged in order from easiest to hardest according to their estimated level of difficulty, which is derived from examinees’ answers to the test questions. Panelists receive a set of performance-level descriptions to use while making their judgments. They review the test questions in these booklets, called “ordered item booklets,” and place a “bookmark” to demarcate the set of questions that examinees who have the skills described by a given performance level would be expected to answer correctly with a specified level of accuracy. To explain, using the committee’s performance-level categories, panelists would consider the description of skills associated with the basic literacy category and, for each test question, make a judgment about whether an examinee with these skills would be likely to answer the question correctly or incorrectly. Once the bookmark is placed for the first performance-level category, the panelists would proceed to consider the skills associated with the second performance-level category (intermediate) and place a second bookmark to denote the set of items that individuals who score in this category would be expected to answer correctly with a specified level of accuracy. The procedure is repeated for each of the performance-level categories.
The bookmark method requires specification of what it means to be “likely” to answer a question correctly. The designers of the method suggest that “likely” be defined as “67 percent of the time” (Mitzel et al., 2001, p. 260). This concept of “likely” is important because it is the response probability value used in calculating the difficulty of each test question (that is, the scale score associated with the item). Although a response probability of 67 percent (referred to as rp67) is common with the bookmark procedure, other values could be used, and we address this issue in more detail later in this chapter.
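For readers who want to see how a response probability value translates into an item’s mapped scale location, here is a minimal sketch under an assumed two-parameter logistic (2PL) item response model; the operational NALS/NAAL scaling is more involved, so the formulas are illustrative rather than a description of the actual analysis. If an item’s probability of a correct response is

$$P_i(\theta) = \frac{1}{1 + \exp[-a_i(\theta - b_i)]},$$

then the scale location mapped to that item at response probability rp is the value of \(\theta\) at which \(P_i(\theta) = rp\):

$$\theta_{rp} = b_i + \frac{1}{a_i}\ln\!\left(\frac{rp}{1 - rp}\right).$$

Because \(\ln(.67/.33) \approx 0.71\), \(\ln(.80/.20) \approx 1.39\), and \(\ln(.50/.50) = 0\), the same item is placed at a higher scale score under rp80 than under rp67, and higher under rp67 than under rp50.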
To demonstrate how the response probability value is used in making bookmark judgments, we rely on the performance levels that we recommended in Chapter 4. Panelists first consider the description of the basic literacy performance level and the content and skills assessed by the first question in the ordered item booklet, the easiest question in the booklet. Each panelist considers whether an individual with the skills described in the basic category would have a 67 percent chance of answering this question correctly (or stated another way, if an individual with the skills described in the basic category would be likely to correctly answer a question measuring these specific skills two out of three times). If a panelist judges this to be true, he or she proceeds to the next question in the booklet. This continues until the panelist comes to a question that he or she judges a basic-level examinee does not have a 67 percent chance of answering correctly (or would not be likely to answer correctly two out of three times). The panelist places his or her bookmark for the basic level on this question. The panelist then moves to the description of the intermediate level and proceeds through the ordered item booklet until reaching an item that he or she judges an individual with intermediate-level skills would not be likely to answer correctly 67 percent of the time. The intermediate-level bookmark would be placed on this item. Determination of the placement of the bookmark for the advanced level proceeds in a similar fashion.
Panelists sit at a table with four or five other individuals who are all working with the same set of items, and the bookmark standard-setting procedure is implemented in an iterative fashion. There are three opportunities, or rounds, for panelists to decide where to place their bookmarks. Panelists make their individual decisions about bookmark placements during Round 1, with no input from other panelists. Afterward, panelists seated at the same table compare and discuss their ratings and then make a second set of judgments as part of Round 2. As part of the bookmark process, panelists discuss their bookmark placements, and agreement about the placements is encouraged. Panelists are not required to come to consensus about the placement of bookmarks, however.
After Round 2, bookmark placements are transformed to test scale scores, and the median scale score is determined for each performance level. At this stage, the medians are calculated by considering the bookmark placements for all panelists who are working on a given test booklet (e.g., all panelists at all tables who are working on the prose ordered item booklet).
Panelists are usually provided with information about the percentage of test takers whose scores would fall into each performance-level category based on these medians. This feedback is referred to as “impact data” and
serves as a reality check to allow panelists to adjust and fine-tune their judgments. Usually, all the panelists working on a given ordered item booklet assemble and review the bookmark placements, the resulting median scale scores, and the impact data together. Panelists then make a final set of judgments during Round 3, working individually at their respective tables.
The median scale scores are recalculated after the Round 3 judgments are made. Usually, mean scale scores are also calculated, and the variability in panelists’ judgments is examined to evaluate the extent to which they disagree about bookmark placements. At the conclusion of the standard setting, it is customary to allot time for panelists to discuss and write performance-level descriptions for the items reviewed during the standard setting.
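For concreteness, the sketch below walks through this bookkeeping with hypothetical numbers; it is not the operational NALS/NAAL analysis, the item locations and examinee scores are invented, and the convention of taking the cut score at the bookmarked entry’s scale location is only one of several in use.

```python
# A minimal sketch, with hypothetical numbers, of how bookmark placements are
# turned into cut scores and impact data.
import numpy as np

# rp67 scale locations of the entries in an ordered item booklet (easiest first)
item_locations = np.array([210., 218., 225., 231., 240., 252., 263., 275., 288., 301.])

def cut_score(bookmark_entry):
    """Cut score taken at the scale location of the entry on which the
    bookmark is placed (0-based index into the ordered item booklet)."""
    return item_locations[bookmark_entry]

# Round 3 'basic' bookmark placements from eight hypothetical panelists
basic_bookmarks = [3, 4, 3, 5, 4, 4, 3, 5]
basic_cut = float(np.median([cut_score(k) for k in basic_bookmarks]))

# Impact data: percentage of a (hypothetical) examinee score distribution
# falling below the provisional cut score
examinee_scores = np.random.default_rng(1).normal(270, 50, size=10_000)
pct_below = 100 * float(np.mean(examinee_scores < basic_cut))
print(f"median cut score: {basic_cut}, percent below: {pct_below:.1f}")
```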
The committee conducted two bookmark standard-setting sessions, one in July 2004 with data from the 1992 NALS and one in September 2004 with data from the 2003 NAAL. This allowed us to use two different groups of panelists, to try out our procedures with the 1992 data and then make corrections (as needed) before the standard setting with the 2003 data was conducted, and to develop performance-level descriptions that would generalize to both versions of the assessment. Richard Patz, one of the developers of the bookmark method, served as consultant to the committee and led the standard-setting sessions. Three additional consultants and National Research Council project staff assisted with the sessions, and several committee members observed the sessions. The agendas for the two standard-setting sessions appear in Appendixes C and D.
Because the issue of response probability had received so much attention in relation to NALS results (see Chapter 3), we arranged to collect data from panelists about the impact of using different instructions about response probabilities. This data collection was conducted during the July standard setting with the 1992 data and is described in the section of this chapter called “Bookmark Standard Setting with 1992 Data.”
The standard-setting sessions were organized to provide opportunity to obtain feedback on the performance-level descriptions. During the July session, time was provided for the panelists to suggest changes in the descriptions based on the placement of their bookmarks after the Round 3 judgments had been made. The committee reviewed their feedback, refined the descriptions, and in August invited several of the July panelists to review the revised descriptions. The descriptions were again refined, and a revised version was prepared for the September standard setting. An extended feedback session was held at the conclusion of the September standard setting to finalize the descriptions.
The July and September bookmark procedures were implemented in relation to the top four performance levels only—below basic, basic, intermediate, and advanced. This was a consequence of a decision made by the Department of Education during the development of NAAL. As mentioned in Chapter 2, in 1992, a significant number of people were unable to complete any of the NALS items and therefore produced test results that were clearly low but essentially unscorable. Rather than expanding the coverage of NAAL into low levels of literacy at the letter, word, and simple sentence level, the National Center for Education Statistics (NCES) chose to develop a separate low-level assessment, the Adult Literacy Supplemental Assessment (ALSA). ALSA items were not put on the same scale as the NAAL items or classified into the three literacy areas. As a result, we could not use the ALSA questions in the bookmark procedure. This created a de facto cut score between the nonliterate in English and below basic performance levels. Consequently, all test takers who performed poorly on the initial screening questions (the core questions) and were administered ALSA are classified into the nonliterate in English category.2
As a result, the performance-level descriptions used for the bookmark procedures included only the top four levels, and the skills evaluated on ALSA were incorporated into the below basic description. After the standard settings, the performance-level descriptions for the below basic category were revised, and the nonliterate in English category was formulated. The below basic description was split to separate the skills that individuals who took ALSA would be likely to have from the skills that individuals who were administered NAAL, but who were not able to answer enough questions correctly to reach the basic level, would be likely to have.
Initially, the committee hoped to consolidate prose, document, and quantitative items into a single ordered item booklet for the bookmark standard setting, which would have produced cut scores for an overall, combined literacy scale. This was not possible, however, because of an operational decision made by NCES and its contractors to scale the test
items separately by literacy area. That is, the difficulty level of each item was determined separately for prose, document, and quantitative items. This means that it was impossible to determine, for example, if a given prose item was harder or easier than a given document item. This decision appears to have been based on the assumption that the three scales measure different dimensions of literacy and that it would be inappropriate to combine them into a single scale. Regardless of the rationale for the decision, it precluded our setting an overall cut score.
Research and experience suggest that the background and expertise the panelists bring to the standard-setting activity are factors that influence the cut score decisions (Cizek, 2001a; Hambleton, 2001; Jaeger, 1989, 1991; Raymond and Reid, 2001). Furthermore, the Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999) specify that panelists should be highly knowledgeable about the domain in which judgments are required and familiar with the population of test takers. We therefore set up a procedure to solicit recommendations for potential panelists for both standard-setting sessions, review their credentials, and invite those with appropriate expertise to participate. Our goal was to assemble a group of panelists who were knowledgeable about acquisition of literacy skills, had an understanding of the literacy demands placed on adults in this country and the strategies adults use when presented with a literacy task, had some background in standardized testing, and would be expected to understand and correctly implement the standard-setting tasks.
Solicitations for panelists were sent to a variety of individuals: stakeholders who participated in the committee’s public forum, state directors of adult education programs, directors of boards of adult education organizations, directors of boards of professional organizations for curriculum and instruction of adult education programs, and officials with the Council for Applied Linguistics, the National Council of Teachers of English, and the National Council of Teachers of Mathematics. The committee also solicited recommendations from state and federal correctional institutions as well as from the university community for researchers in the areas of workplace, family, and health literacy. Careful attention was paid to including representatives from as many states as possible, including representatives from the six states that subsidized additional testing of adults in 2003 (Kentucky, Maryland, Massachusetts, Missouri, New York, and Oklahoma).
This extensive networking process produced a panel of professionals who represented adult education programs in urban, suburban, and rural geographic areas and a mix of practitioners, including teachers, tutors, coordinators, and directors. Almost all of the panelists had participated at some point in a range-finding or standard-setting activity, which helped them understand the connection between the performance-level descriptions and the task of determining an appropriate cut score.
Because NALS and NAAL are assessments of adult literacy, we first selected panelists with expertise in the fields of adult education and adult literacy. Adult educators may specialize in curriculum and instruction of adult basic education (ABE) skills, preparation of students for the general educational development (GED) certificate, or English for speakers of other languages. In addition, adult education and adult literacy professionals put forth significant curricular, instructional, and research efforts in the areas of workplace literacy, family literacy, and health literacy. Expertise in all of these areas was represented among the panelists.3
For the July standard setting, only individuals working in adult education and adult literacy were selected to participate. Based on panelist feedback following this standard setting, we decided to broaden the areas of expertise for the September standard setting. Specifically, panelists indicated they would have valued additional perspectives from individuals in areas affected by adult education services, such as human resource management, as well as from teachers who work with middle school and high school students. Therefore, for the second session, we selected panelists from two additional fields: (1) middle or high school language arts teachers and (2) industrial and organizational psychologists who specialize in skill profiling or employee assessment for job placement.
The language arts classroom teachers broadened the standard-setting discussions by providing input on literacy instruction for adolescents who were progressing through the grades in a relatively typical manner, whereas teachers of ABE or GED had experience working with adults who, for
whatever reason, did not acquire the literacy skills attained by most students who complete the U.S. school system. The industrial and organizational psychologists who participated came from academia and corporate environments and brought a research focus and a practitioner perspective to the discussion that complemented those of the other panelists, who were primarily immersed in the adult education field. Table 5-1 gives a profile of the panelists who participated in the two standard-setting sessions.
The first standard-setting session was held to obtain panelists’ judgments about cut scores for the 1992 NALS and to collect their feedback about the performance-level descriptions. A total of 42 panelists participated in the session. Panelists were assigned to groups, and each group was randomly assigned to two of the three literacy areas (prose, document, or quantitative). Group 1 worked with the prose and document items; Group 2 worked with the prose and quantitative items; and Group 3 worked with the document and quantitative items. The sequence in which they worked on the different literacy scales was alternated in an attempt to balance any potential order effects.
For each literacy area, an ordered item booklet was prepared that rank-ordered the test questions from least to most difficult according to NALS examinees’ responses. The ordered item booklets consisted of all the available NALS tasks for a given literacy area, even though with the balanced incomplete block spiraling (see Chapter 2), no individual actually responded to all test questions. The number of items in each NALS ordered item booklet was 39 for prose literacy, 71 for document literacy, and 42 for quantitative literacy.
Two training sessions were held, one for the “table leaders,” the individuals assigned to be discussion facilitators for the tables of panelists, and one for all panelists. The role of the table leader was to serve as a discussion facilitator but not to dominate the discussion or to try to bring the tablemates to consensus about cut scores.
The bookmark process began by having each panelist respond to all the questions in the NALS test booklet for their assigned literacy scale. For this task, the test booklets contained the full complement of NALS items for each literacy scale, arranged in the order test takers would see them but not rank-ordered as in the ordered item booklets. Afterward, the table leader facilitated discussion of differences among items with respect to knowledge, skills, and competencies required and what was measured by the scoring rubrics.
Panelists then received the ordered item booklets. They discussed each item and noted characteristics they thought made one item more difficult
than another. Each table member then individually placed their Round 1 bookmarks representing cut points for basic, intermediate, and advanced literacy.

TABLE 5-1 Profile of Panelists Involved in the Committee’s Standard Settings

| Participant Characteristics | July Standard Setting (N = 42) | September Standard Setting (N = 30) |
|---|---|---|
| Gender | | |
| Female | 83^a | 77 |
| Male | 17 | 23 |
| Ethnicity | | |
| Black | 2 | 7 |
| Caucasian | 69 | 83 |
| Hispanic | 0 | 3 |
| Native American | 2 | 0 |
| Not reported | 26 | 7 |
| Geographic Region^b | | |
| Midwest | 26 | 37 |
| Northeast | 33 | 23 |
| South | 7 | 13 |
| Southeast | 19 | 7 |
| West | 14 | 20 |
| Occupation^c | | |
| University instructors | 7 | 10 |
| Middle school, high school, or adult education instructors | 19 | 30 |
| Program coordinators or directors | 38 | 40 |
| Researchers | 12 | 7 |
| State office of adult education representative | 24 | 13 |
| Area of Expertise | | |
| Adult education | 100 | 70 |
| Classroom teacher | 0 | 17 |
| Human resources or industrial and organizational psychology | 0 | 13 |
| Work Setting | NA^d | |
| Rural | | 3 |
| Suburban | | 33 |
| Urban | | 43 |
| Combination of all three settings | | 10 |
| Other or not reported | | 10 |

^a Percentage.
^b The geographic regions were grouped in the following way: Midwest (IA, IL, IN, KY, MI, MN, MO, ND, OH, WI), Northeast (CT, DE, MA, ME, MD, NH, NJ, NY, PA, VT), South (AL, LA, MS, OK, TN, TX), Southeast (FL, GA, NC, SC, VA), and West (AZ, CA, CO, MT, NM, NV, OR, UT, WA, WY).
^c Many panelists reported working in a variety of adult education settings where their work entailed aspects of instruction, curriculum development, program management, and research. For the purposes of constructing this table, the primary duties and/or job title of each panelist, as specified on the panelist’s resume, was used to determine which of the five categories of occupation were appropriate for each panelist.
^d Data not collected in July.
In preparation for Round 2, each table received a summary of the Round 1 bookmark placements made by each table member and was provided the medians of the bookmark placements (calculated for each table). Table leaders facilitated discussion among table members about their respective bookmark placements, and panelists were then asked to independently make their Round 2 judgments.
In preparation for Round 3, each table received a summary of the Round 2 bookmark placements made by each table member as well as the medians for the table. In addition, each table received information about the proportion of the 1992 population who would have been categorized as having below basic, basic, intermediate, or advanced literacy based on the
table’s median cut points. After discussion, each panelist made his or her final, Round 3, judgments about bookmark placements for the basic, intermediate, and advanced literacy levels. At the conclusion of Round 3, panelists were asked to provide feedback about the performance-level descriptions by reviewing the items that fell between each of their bookmarks and editing the descriptions accordingly.
The processes described above were repeated for the second literacy area. The bookmark session concluded with a group session to obtain feedback from the panelists, both orally and through a written survey.
In conjunction with the July standard setting, the committee collected information about the impact of varying the instructions given to panelists
with regard to the criteria used to judge the probability that an examinee would answer a question correctly (the response probability). The developers of the bookmark method recommend that a response probability of 67 (or two out of three times) be used and have offered both technical and nontechnical reasons for their recommendation. Their technical rationale stems from an analysis by Huynh (2000) in which the author demonstrated mathematically that the item information provided by a correct response to an open-ended item is maximized at the score point associated with a response probability of 67. From a less technical standpoint, the developers of the bookmark method argue that a response probability of 67 percent is easier for panelists to conceive of than less familiar probabilities, such as 57.3 percent (Mitzel et al., 2001). They do not entirely rule out use of other response probabilities, such as 65 or 80, but argue that a response probability of 50 would seem to be conceptually difficult for panelists. They note, however, that research is needed to further understand the ways in which panelists apply response probability instructions and pose three questions that they believe remain to be answered: (1) Do panelists understand, internalize, and use the response probability criterion? (2) Are panelists sensitive to the response probability criterion such that scaling with different levels will systematically affect cut score placements? (3) Do panelists have a native or baseline conception of mastery that corresponds to a response probability?
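As we understand it, the core of Huynh’s technical argument can be sketched for a dichotomous Rasch-type item as follows (his treatment is more general and also covers partial credit items). With item information \(I(\theta) = P(\theta)[1 - P(\theta)]\), the information attributable to a correct response can be written as

$$P(\theta)\,I(\theta) = P^2(1 - P), \qquad \frac{d}{dP}\left[P^2(1 - P)\right] = 2P - 3P^2 = P(2 - 3P) = 0 \;\Rightarrow\; P = \tfrac{2}{3},$$

which is the value underlying the rp67 recommendation.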
Given these questions about the ways in which panelists apply response probability instructions, and the controversies surrounding the use of a response probability of 80 in 1992, the committee chose to investigate this issue further. We wanted to find out more about (1) the extent to which panelists understand and can make sense of the concept of response probability level when making judgments about cut scores and (2) the extent to which panelists make different choices when faced with different response probability levels. The committee decided to explore panelists’ use and understanding of three response probability values—67, since it is commonly used with the bookmark procedures, as well as 80 and 50, since these values were discussed in relation to NALS in 1992.
The panelists were grouped into nine tables of five panelists each. Each group was given different instructions and worked with different ordered item booklets. Three tables (approximately 15 panelists) worked with booklets in which the items were ordered with a response probability of 80 percent and received instructions to use 80 percent as the likelihood that the examinee would answer an item correctly. Similarly, three tables used ordered item booklets and instructions consistent with a response probability of 67 percent, and three tables used ordered item booklets and instructions consistent with a response probability of 50 percent.
Panelists received training in small groups about their assigned response probability instructions (see Appendix C for the exact wording). Each group was asked not to discuss the instructions about response probability level with anyone other than their tablemates so as not to cause confusion among panelists working with different response probability levels. Each table of panelists used the same response probability level for the second content area as they did for the first.
The performance-level descriptions used at the July standard setting consisted of overall and subject-specific descriptors for the top four performance levels (see Table 5-2). Panelists’ written comments about and edits of the performance levels were reviewed. This feedback was invaluable in helping the committee rethink and reword the level descriptions in ways that better addressed the prose, document, and quantitative literacy demands suggested by the assessment items. Four panelists who had participated in the July standard-setting session were invited to review the revised performance-level descriptions prior to the September standard setting, and their feedback was used to further refine the descriptions. The performance-level descriptions used in the September standard setting are shown in Table 5-3.
A total of 30 panelists from the fields of adult education, middle and high school English language arts, industrial and organizational psychology, and state offices of adult education participated in the second standard setting. Similar procedures were followed as in July with the exception that all panelists used the 67 percent response probability instructions.
Panelists were assigned to groups and the groups were then randomly assigned to literacy area with the subject area assignments balanced as they had been in July. Two tables worked on prose literacy first; one of these tables then worked on document literacy and the other on quantitative literacy. Two tables worked on document literacy first; one of these tables was assigned to work on quantitative literacy and the other to work on prose literacy. The remaining two tables that worked on quantitative literacy first were similarly divided for the second content area: one table was assigned to work on prose literacy while the other was assigned to work on document literacy.
TABLE 5-2 Performance-Level Descriptions Used During the July 2004 NALS Standard Setting

A. Overall Descriptions (an individual who scores at this level:)

I. Below Basic Literacy: May be able to recognize some letters, common sight words, or digits in English; has difficulty reading and understanding simple words, phrases, numbers, or quantities.

II. Basic Literacy: Can read and understand simple words, phrases, numbers, and quantities in English and locate information in short texts about commonplace events and situations; has some difficulty with drawing inferences and making use of quantitative information in such texts.

III. Intermediate Literacy: Can read and understand written material in English sufficiently well to locate information in denser, less commonplace texts, construct straightforward summaries, and draw simple inferences; has difficulty with drawing inferences from complex, multipart written material and with making use of quantitative information when multiple operations are involved.

IV. Advanced Literacy: Can read and understand complex written material in English sufficiently well to locate and integrate multiple pieces of information, perform sophisticated analytical tasks such as making systematic comparisons, draw sophisticated inferences from that material, and can make use of quantitative information when multiple operations are involved.

NOTE: The National Adult Literacy Survey measures competence across a broad range of literacy development. Nonetheless, there exist meaningful distinctions in literacy outside of this range, including degrees of competence well above those described as required for “Advanced Literacy” and below what is required for “Basic Literacy.” The “Below Basic Literacy” and “Advanced Literacy” levels by definition encompass all degrees of literacy below or above, respectively, those levels described in the above performance-level descriptors.

B. Subject-Area Descriptions (an individual who scores at this level:)

I. Below Basic Literacy
Prose: May be able to recognize letters but not able to consistently match sounds with letters; may be able to recognize a few common sight words.
Document: May be able to recognize letters, numbers, and/or common sight words in familiar contexts such as on labels or signs; is not able to follow written instructions on simple documents.
Quantitative: May be able to recognize numbers and/or locate numbers in brief familiar contexts; is not able to perform simple arithmetic operations.

II. Basic Literacy
Prose: Is able to read and locate information in brief, commonplace text, but has difficulty drawing appropriate conclusions from the text, distinguishing fact from opinion or identifying an implied theme or idea in a selection.
Document: Is able to understand or follow instructions on simple documents; able to locate and/or enter information based on a literal match of information in the question to information called for in the document itself.
Quantitative: Is able to locate easily identified numeric information in simple texts, graphs, tables, or charts; able to perform simple arithmetic operations or solve simple word problems when the operation is specified or easily inferred.

III. Intermediate Literacy
Prose: Is able to read and understand moderately dense, less commonplace text that contains long paragraphs; able to summarize, make inferences, determine cause and effect, and recognize author’s purpose.
Document: Is able to locate information in dense, complex documents in which repeated reviewing of the document is involved.
Quantitative: Is able to locate numeric information that is not easily identified in texts, graphs, tables, or charts; able to perform routine arithmetic operations when the operation is not specified or easily inferred.

IV. Advanced Literacy
Prose: Is able to read lengthy, complex, abstract texts; able to handle conditional text; able to synthesize information and perform complex inferences.
Document: Is able to integrate multiple pieces of information in documents that contain complex displays; able to compare and contrast information; able to analyze and synthesize information from multiple sources.
Quantitative: Is able to locate and integrate numeric information in complex texts, graphs, tables, or charts; able to perform multiple and/or fairly complex arithmetic operations when the operation(s) is not specified or easily inferred.
TABLE 5-3 Performance-Level Descriptions Used During September 2004 NAAL Standard Setting

A. Overall Descriptions (an individual who scores at this level independently and in English:)

I. Below Basic Literacy: May independently be able to recognize some letters, common sight words, or digits in English; may sometimes be able to locate and make use of simple words, phrases, numbers, and quantities in short texts or displays (e.g., charts, figures, or forms) in English that are based on commonplace contexts and situations; may sometimes be able to perform simple one-step arithmetic operations; has some difficulty with reading and understanding information in sentences and short texts.

II. Basic Literacy: Is independently able to read and understand simple words, phrases, numbers, and quantities in English; able to locate information in short texts based on commonplace contexts and situations and enter such information into simple forms; is able to solve simple one-step problems in which the operation is stated or easily inferred; has some difficulty with drawing inferences from texts and making use of more complicated quantitative information.

III. Intermediate Literacy: Is independently able to read, understand, and use written material in English sufficiently well to locate information in denser, less commonplace texts, construct straightforward summaries, and draw simple inferences; able to make use of quantitative information when the arithmetic operation or mathematical relationship is not specified or easily inferred; able to generate written responses that demonstrate these skills; has difficulty with drawing inferences from more complex, multipart written material and with making use of quantitative information when multiple operations or complex relationships are involved.

IV. Advanced Literacy: Is independently able to read, understand, and use more complex written material in English sufficiently well to locate and integrate multiple pieces of information, perform sophisticated analytical tasks such as making systematic comparisons, draw more sophisticated inferences from that material, and can make use of quantitative information when multiple operations or more complex relationships are involved; able to generate written responses that demonstrate these skills.

NOTE: The National Assessment of Adult Literacy measures competence across a broad range of literacy development. Nonetheless, there exist meaningful distinctions in literacy outside of this range, including degrees of competence well above those described as required for “Advanced Literacy” and below what is required for “Basic Literacy.” The “Below Basic Literacy” and “Advanced Literacy” levels by definition encompass all degrees of literacy below or above, respectively, those levels described in the above performance-level descriptors.

B. Subject-Area Descriptions (an individual who scores at this level independently and in English:)

I. Below Basic Literacy
Prose: May be able to recognize letters but not able to consistently match sounds with letters; may be able to recognize a few common sight words; may sometimes be able to locate information in short texts when the information is easily identifiable; has difficulty reading and understanding sentences.
Document: May be able to recognize letters, numbers, and/or common sight words in frequently encountered contexts such as on labels or signs; may sometimes be able to follow written instructions on simple displays (e.g., charts, figures, or forms); may sometimes be able to locate easily identified information or to enter basic personal information in simple forms.
Quantitative: May be able to recognize numbers and/or locate numbers in frequently encountered contexts; may sometimes be able to perform simple arithmetic operations in commonly used formats or in simple problems when the mathematical information is very concrete and mathematical relationships are primarily additive.

II. Basic Literacy
Prose: Is able to read, understand, and locate information in short, commonplace texts when the information is easily identifiable; has difficulty using text to draw appropriate conclusions, distinguish fact from opinion or identify an implied theme or idea in a selection.
Document: Is able to read, understand, and follow instructions on simple displays; able to locate and/or enter easily identifiable information that primarily involves making a literal match of information in the question to information in the display.
Quantitative: Is able to locate and use easily identified numeric information in simple texts or displays; able to solve simple one-step problems when the arithmetic operation is specified or easily inferred, the mathematical information is familiar and relatively easy to manipulate, and mathematical relationships are primarily additive.

III. Intermediate Literacy
Prose: Is able to read and understand moderately dense, less commonplace text that may contain long paragraphs; able to summarize, make simple inferences, determine cause and effect, and recognize author’s purpose; able to generate written responses that demonstrate these skills.
Document: Is able to locate information in dense, complex displays in which repeated cycling through the display is involved; able to make simple inferences about the information in the display; able to generate written responses that demonstrate these skills.
Quantitative: Is able to locate and use numeric information that is not easily identified in texts or displays; able to solve problems when the arithmetic operation is not specified or easily inferred, and mathematical information is less familiar and more difficult to manipulate.

IV. Advanced Literacy
Prose: Is able to read lengthy, complex, abstract texts that are less commonplace and may include figurative language, to synthesize information and make complex inferences; able to generate written responses that demonstrate these skills.
Document: Is able to integrate multiple pieces of information located in complex displays; able to compare and contrast information, and to analyze and synthesize information from multiple sources; able to generate written responses that demonstrate these skills.
Quantitative: Is able to locate and use numeric information in complex texts and displays; able to solve problems that involve multiple steps and multiple comparisons of displays when the operation(s) is not specified or easily inferred, the mathematical relationships are more complex, and the mathematical information is more abstract and requires more complex manipulations.
The ordered item booklets used for the second standard setting were organized in the same way as for the first standard setting, with the exception that some of the NAAL test questions were scored according to a partial credit scheme. When a partial credit scoring scheme is used, a difficulty value is estimated for both the partially correct score and the fully correct score. As a result, the test questions have to appear multiple times in the ordered item booklet, once for the difficulty value associated with partially correct and a second time for the difficulty value associated with fully correct. The ordered item booklets included the scoring rubric for determining partial credit and full credit scores.
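The sketch below illustrates this layout with hypothetical item identifiers and rp67 locations; it is not drawn from the actual NAAL booklets.

```python
# Illustrative only: hypothetical items showing how a partial-credit item
# contributes one ordered-item-booklet entry per score point, each with its
# own rp67 scale location.
from dataclasses import dataclass

@dataclass
class BookletEntry:
    item_id: str
    score_point: str      # "correct" for right/wrong items; "partial" or "full" otherwise
    rp67_location: float  # scale score at which this score point is likely (67%) to be reached

entries = [
    BookletEntry("Q01", "correct", 214.0),
    BookletEntry("Q02", "partial", 228.0),
    BookletEntry("Q02", "full", 266.0),   # same item appears again at a higher location
    BookletEntry("Q03", "correct", 241.0),
]

# The ordered item booklet presents entries from easiest to hardest.
for position, entry in enumerate(sorted(entries, key=lambda e: e.rp67_location), start=1):
    print(position, entry.item_id, entry.score_point, entry.rp67_location)
```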
Training procedures in September were similar to those used in July. Table leader training was held the day before the standard setting, and panelist training was held on the first day of the standard setting.
The procedures used in September were similar to those used in July, with the exception that the committee decided that all panelists in September should use the instructions for a response probability of 67 (the rationale for this decision is documented in the results section of this chapter). This meant that more typical bookmark procedures could be used for the Round 3 discussions. That is, groups of panelists usually work on the same ordered item booklet at different tables during Rounds 1 and 2 but join each other for Round 3 discussions. Therefore, in September, both tables working on the same literacy scale were merged for the Round 3 discussion.
During Round 3, panelists received data summarizing bookmark placements for the two tables combined. This included a listing of each panelist’s bookmark placements and the median bookmark placements by table. In addition, the combined median scale score (based on the data from both tables) was calculated for each level, and impact data were provided about the percentages of adults who would fall into the below basic, basic, intermediate, and advanced categories if the combined median values were used as cut scores.5 Panelists from both tables discussed their reasons for choosing different bookmark placements, after which each panelist independently made a final judgment about the bookmark placements separating the basic, intermediate, and advanced literacy levels.
At the conclusion of the September standard setting, 12 of the panelists were asked to stay for an extended session to write performance-level descriptions for the NAAL items. At least one member from each of the six tables participated in the extended session, and there was representation from each of the three areas of expertise (adult education, middle and high school English language arts, and industrial and organizational psychology). The 12 participants were split into 3 groups, each focusing on one of the three NAAL content areas. Panelists were instructed to review the test items that would fall into each performance level (based on the Round 3 median cut scores) and prepare more detailed versions of the performance-level descriptions, including specific examples from the stimuli and associated tasks. The revised descriptions are shown in Table 5-4.
The purpose of using the different instructions in the July session was to evaluate the extent to which the different response probability criteria influenced panelists’ judgments about bookmark placements. It would be expected that panelists using the higher probability criteria would place their bookmarks earlier in the ordered item booklets, and as the probability criteria decrease, the bookmarks would be placed later in the booklet. For example, panelists working with rp50 instructions were asked to select the items that individuals at a given performance level would be expected to get right 50 percent of the time. This is a relatively low criterion for success on a test question, and, as a result, the panelist should require the test taker to get more items correct than if a higher criterion for success were used (e.g., rp67 or rp80). Therefore, for a given performance level, the bookmark placement should be in reverse order of the values of the response probability criteria: the rp80 bookmark placement should come first in the booklet, the rp67 bookmark should come next, and the rp50 bookmark should be furthest into the booklet.
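Continuing the illustrative 2PL sketch given earlier in the chapter (again, a simplification of the operational scaling), this expectation can be stated more formally. A panelist holding a fixed intended cut score \(\theta^*\) should judge an entry with parameters \((a_i, b_i)\) to be one that such an examinee is “likely” to answer correctly whenever

$$P_i(\theta^*) \ge rp, \quad \text{that is, whenever} \quad b_i \le \theta^* - \frac{1}{a_i}\ln\!\left(\frac{rp}{1 - rp}\right).$$

Because \(\ln[rp/(1 - rp)]\) increases with rp, fewer entries satisfy this condition as rp rises, so the rp80 bookmark should fall earliest in the booklet and the rp50 bookmark deepest, which is the ordering described above.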
Tables 5-5a, 5-5b, and 5-5c present the results from the July standard setting, respectively, for the prose, document, and quantitative areas. The first row of each table shows the median bookmark placements for basic, intermediate, and advanced based on the different response probability instructions. For example, Table 5-5a shows that the median bookmark placements for the basic performance level in prose were on item 6 under the rp80 and rp67 instructions and on item 8 under the rp50 instructions.
Ideally, panelists would compensate for the different response criteria by placing their bookmarks earlier or later in the ordered item booklet, depending on the response probability instructions. When panelists respond to the bookmark instructions by conceptualizing a person whose skills
match the performance-level descriptions, different response probability instructions should shift their bookmark placements in a way that compensates exactly for the differences in how bookmark placements are translated into cut scores. When panelists inform their judgments in this way, the cut score associated with the bookmark placement would be identical under the three response probability instructions, even though the bookmark locations would differ. As the tables show, however, this does not appear to be the case. For example, the second row of Table 5-5a shows that the median cut scores for basic were different: 226, 211, and 205.5, respectively, for rp80, rp67, and rp50.
It is not surprising that panelists fail to place bookmarks in this ideal way, for the ideal assumes prior knowledge of the likelihood that persons at each level of literacy will answer each item correctly. A more relevant issue is whether judges have a sufficient subjective understanding of probability to change bookmark placements in response to different instructions about response probabilities. Our analysis yields weak evidence in favor of the latter hypothesis.6
We conducted tests to evaluate the statistical significance of the differences in bookmark placements and in cut scores. The results indicated that, for a given literacy area and performance level, the bookmark placements were tending in the right direction but were generally not statistically significantly different under the three response probability instructions. In contrast, for a given literacy area and performance level, the differences among the cut scores were generally statistically significant. Additional details about the analyses we conducted appear in Appendix C.
Tables 5-5a, 5-5b, and 5-5c also present the mean and standard deviations of the cut scores under the different response probability instructions. The standard deviations provide an estimate of the extent of variability among the panelists’ judgments. Although the bookmark method does not strive for consensus among panelists, the judgments should not be widely disparate. Comparison of the standard deviations across the different response probability instructions reveals no clear pattern; that is, there is no indication that certain response probability instructions were superior to the others in terms of the variability among panelists’ judgments.
A more practical way to evaluate these differences is by looking at the impact data. The final row of Tables 5-5a, 5-5b, and 5-5c compares the percentage of the population scoring below each of the cut scores when the different response probability instructions were used. Comparison of the impact data reveals that the effects of the different response probability instructions were larger for the cut scores for the document and quantitative areas than for prose.

TABLE 5-4 Performance-Level Descriptions and Subject-Area Descriptions with Exemplar NAAL Items

A. Overall Description

| Level | Description | Sample Tasks Associated with Level |
|---|---|---|
| Nonliterate in English | May independently recognize some letters, numbers, and/or common sight words in English in frequently encountered contexts. | |
| Below Basic | May independently be able to locate and make use of simple words, phrases, numbers, and quantities in short texts or displays (e.g., charts, figures, forms) in English that are based on commonplace contexts and situations; may sometimes be able to perform simple one-step arithmetic operations. | |
| Basic | Is independently able to read and understand simple words, phrases, numbers, and quantities in English when the information is easily identifiable with a minimal amount of distracting information; able to locate information in short texts based on commonplace contexts and situations and enter such information into simple forms; is able to solve simple one-step problems in which the operation is stated or easily inferred. | |
| Intermediate | Is independently able to read, understand, and use written material in English sufficiently well to locate information in denser, less commonplace texts that may contain a greater number of distractors; able to construct straightforward summaries, and draw simple inferences; able to make use of quantitative information when the arithmetic operation or mathematical relationship is not specified or easily inferred; able to generate written responses that demonstrate these skills. | an employee is eligible for medical insurance) (document) |
| Advanced | Is independently able to read, understand, and use more complex written material in English sufficiently well to locate and integrate multiple pieces of information, perform more sophisticated analytical tasks such as making systematic comparisons, draw more sophisticated inferences from that material, and can make use of quantitative information when multiple operations or more complex relationships are involved; able to generate written responses that demonstrate these skills. | |

The NAAL measures competence across a broad range of literacy development. Nonetheless, there exist meaningful distinctions in literacy outside of this range, including degrees of competence well above those described as required for “Advanced Literacy” and below what is required for “Basic Literacy.” The “Below Basic Literacy” and “Advanced Literacy” levels, by definition, encompass all degrees of literacy below or above, respectively, those levels described in the above performance-level descriptors.

B. Prose Literacy Content Area

| Level | An individual who scores at this level independently and in English: | Sample of NAAL tasks associated with the level |
|---|---|---|
| Below Basic | May sometimes be able to locate information in short texts when the information is easily identifiable. | |
| Basic | Is able to read, understand, follow directions, copy, and locate information in short, commonplace texts (e.g., simple newspaper articles, advertisements, short stories, government forms) when the information is easily identifiable with a minimal number of distractors in the main text. May be able to work with somewhat complex texts to complete a literal match of information in the question and text.* | |
| Intermediate | Is able to read and understand moderately dense, less commonplace text that may contain long paragraphs, a greater number of distractors, a higher level vocabulary, longer sentences, more complex sentence structure; able to summarize, make simple inferences, determine cause and effect, and recognize author’s purpose; able to generate written responses (e.g., words, phrases, lists, sentences, short paragraphs) that demonstrate these skills.* | |
| Advanced | Is able to read lengthy, complex, abstract texts that are less commonplace and may include figurative language and/or unfamiliar vocabulary; able to synthesize information and make complex inferences; compare and contrast viewpoints; able to generate written responses that demonstrate these skills.* | |

*When presented with a task that measures these skills, the individual would be likely to respond correctly 2 out of 3 times.

C. Document Literacy Content Area

| Level | An individual who scores at this level independently and in English: | Sample of NAAL tasks associated with the level |
|---|---|---|
| Below Basic | May sometimes be able to follow written instructions on simple displays (e.g., charts, figures, or forms); may sometimes be able to locate easily identified information or to enter basic personal information on simple forms; may be able to sign name in right place on form. | |
| Basic | Is able to read, understand, and follow one-step instructions on simple displays (e.g., government, banking, and employment application forms, short newspaper articles or advertisements, television or public transportation schedules, bar charts or circle graphs of a single variable); able to locate and/or enter easily identifiable information that primarily involves making a literal match between the question and the display.* | |
| Intermediate | Is able to locate information in dense, complex displays (e.g., almanacs or other reference materials, maps and legends, government forms and instruction sheets, supply catalogues and product charts, more complex graphs and figures that contain trends and multiple variables) when repeated cycling or re-reading is involved; able to make simple inferences about the information displayed; able to generate written responses that demonstrate these skills.* | |
| Advanced | Is able to integrate multiple pieces of information located in complex displays; able to compare and contrast information, and to analyze and synthesize information from multiple sources; able to generate written responses that demonstrate these skills.* | |

*When presented with a task that measures these skills, the individual would be likely to respond correctly 2 out of 3 times.

D. Quantitative Literacy Content Area

| Level | An individual who scores at this level independently and in English: | Sample of NAAL tasks associated with the level |
|---|---|---|
| Below Basic | May sometimes be able to perform simple arithmetic operations in commonly used formats or in simple problems when the mathematical information is very concrete and mathematical relationships are primarily additive. | |
| Basic | Is able to locate and use easily identified numeric information in simple texts or displays; able to solve simple one-step problems when the arithmetic operation is specified or easily inferred, the mathematical information is familiar and relatively easy to manipulate, and mathematical relationships are primarily additive.* | |
| Intermediate | Is able to locate numeric information that is embedded in texts or in complex displays and use that information to solve problems; is able to infer the arithmetic operation or mathematical relationship when it is not specified; is able to use fractions, decimals, or percents and to apply concepts of area and perimeter in real-life contexts.* | |
| Advanced | Is able to locate and use numeric information in complex texts and displays; able to solve problems that involve multiple steps and multiple comparisons of displays when the operation(s) is/are not specified or easily inferred, the mathematical relationships are more complex, and the mathematical information is more abstract and requires more complex manipulations.* | |

*When presented with a task that measures these skills, the individual would be likely to respond correctly 2 out of 3 times.
These findings raise several questions. First, the findings might lead one to question the credibility of the cut scores produced by the bookmark method. However, there is ample evidence that people have difficulty interpreting probabilistic information (Tversky and Kahneman, 1983). The fact that bookmark panelists have difficulties with this aspect of the procedure is not particularly surprising. In fact, the developers of the procedure appear to have anticipated this, saying “it is not reasonable to suggest that lack of understanding of the response probability criterion invalidates a cut score judgment any more than a lack of understanding of [item response theory] methods invalidates the interpretation of a test score” (Mitzel et al., 2001, p. 262).
In our opinion, the bookmark procedure had been implemented very carefully, with strict attention to key factors that can affect the results (Cizek, Bunch, and Koons, 2004; Hambleton, 2001; Kane, 2001; Plake, Melican, and Mills, 1992; Raymond and Reid, 2001). The standard-setting panelists had been carefully selected and had appropriate background qualifications. The instructions to panelists were very clear, and there was ample time for clarification. Committee members and staff observing the process were impressed with how it was carried out, and the feedback from the standard-setting panelists was very positive. Kane (2001) speaks of this as “procedural evidence” in support of the appropriateness of performance standards, noting that “procedural evidence is a widely accepted basis for evaluating policy decisions” (p. 63). Thus, while the findings indicated that panelists had difficulty implementing the response probability instructions exactly as intended, we judged that this was not sufficient justification for discrediting the bookmark method entirely.
The second issue presented by the findings was that if the different response probability instructions had produced identical cut scores, it would not have mattered which response probability the committee decided to use for the bookmark procedure. However, the findings indicated that different cut scores were produced by the different instructions; hence, the committee had to select among the options for response probability values.
As discussed in Chapter 3, the choice of a response probability value involves weighing both technical and nontechnical information to make a judgment about the most appropriate value given the specific assessment context. We had hoped that the comparison of different response probability instructions would provide evidence to assist in this choice. However, none of the data suggested that one response probability value was “better” than another.

TABLE 5-5a Median Bookmark Placements and Cut Scores for the Three Response Probability (RP) Instructions in the July 2004 Standard Setting with NALS Prose Items (n = 39 items)

TABLE 5-5b Median Bookmark Placements and Cut Scores for the Three Response Probability (RP) Instructions in the July 2004 Standard Setting with NALS Document Items (n = 71 items)

TABLE 5-5c Median Bookmark Placements and Cut Scores for the Three Response Probability (RP) Instructions in the July 2004 Standard Setting with NALS Quantitative Items (n = 42 items)
In follow-up debriefing sessions, panelists commented that the rp50 instructions were difficult to apply, in that it was hard to determine bookmark placement when thinking about a 50 percent chance of responding correctly. This concurs with findings from a recent study conducted in connection with standard setting on the NAEP (Williams and Schulz, 2005). As stated earlier, the developers of the bookmark method also believe this value to be conceptually difficult for panelists.
A response probability of 80 percent had been used in 1992, in part to reflect what is often considered to be mastery level in the education field. The committee debated about the appropriateness of this criterion versus the 67 percent criterion, given the purposes and uses of the assessment results. The stakes associated with the assessment are low; that is, no scores are reported for individuals, and no decisions affecting an individual are based on the results. A stringent criterion, like 80 percent, would be called for when it is important to have a high degree of certainty that the individual has truly mastered the specific content or skills, such as in licensing examinations.
A response probability of 67 percent is recommended in the literature by the developers of the bookmark procedure (Mitzel et al., 2001) and is the value generally used in practice. Since there was no evidence from our comparison of response probabilities to suggest that we should use a value other than the developer’s recommendation, the committee decided to use a response probability of 67 percent for the bookmark procedure for NALS and NAAL. Therefore, all panelists in the September standard setting used this criterion. In determining the final cut scores from the bookmark procedure, we used all of the judgments from September but only the judgments from July based on the rp67 criterion.
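To make concrete why the choice of response probability value matters, the following minimal sketch shows how an RP criterion maps items onto scale locations and how a bookmark placement then implies a cut score. It is not the actual NAAL/NALS procedure: the two-parameter logistic (2PL) item response model, the invented item parameters, and the convention of taking the location of the first item beyond the bookmark as the cut are all simplifying assumptions made only for illustration.

```python
import math

D = 1.7  # conventional logistic scaling constant

def rp_location(a, b, rp):
    """Theta at which a 2PL item (discrimination a, difficulty b) is answered
    correctly with probability rp."""
    return b + math.log(rp / (1.0 - rp)) / (D * a)

# Hypothetical item parameters on a theta scale; not actual NAAL/NALS items.
items = [("item1", 1.0, -0.8), ("item2", 0.9, -0.2), ("item3", 1.2, 0.1),
         ("item4", 0.8, 0.6), ("item5", 1.1, 1.0)]

for rp in (0.50, 0.67, 0.80):
    # Order items by the scale location implied by this RP criterion.
    ordered = sorted(items, key=lambda it: rp_location(it[1], it[2], rp))
    # Suppose a panelist places the bookmark after the third item; use the
    # RP location of the first item beyond the bookmark as the implied cut.
    name, a, b = ordered[3]
    cut = rp_location(a, b, rp)
    print(f"rp={rp:.2f}: implied cut at theta = {cut:.2f} ({name})")
```

Running the sketch shows that the same bookmark placement yields a higher implied cut score as the RP value increases, which is the mechanism behind the differences observed across the rp50, rp67, and rp80 conditions.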
We are aware that many in the adult education, adult literacy, and health literacy fields have grown accustomed to using the rp80 criterion in relation to NALS results, and that some may at first believe that use of a response probability of 67 percent constitutes “lowering the standards.” We want to emphasize that this represents a fundamental, albeit not surprising, misunderstanding. Changing the response probability level does not alter the test in any way; the same content and skills are evaluated. Changing the response probability level does not alter the distributions of scores. Distributions of skills are what they are estimated to be, regardless of response probability levels. The choice of response probability levels should not in principle affect proportions of people in regions of the distribution, although some differences were apparent in our comparisons. Choice of response probability levels does affect a user’s attention in terms of condensed, everyday-language conceptions of what it means to be at a level (e.g., what it means to be “proficient”).
It does appear that some segments of the literacy community prefer the higher response probability value of 80 percent as a reporting and interpretive device, if for nothing other than continuity with previous literacy assessments. The response probability level of 80 percent is robust to the fact that a response probability level is mapped to a verbal expression, such as “can consistently” or “can usually” do items of a given difficulty (or worse, more simplistic interpretations, such as “can” as opposed to “cannot” do items of a given difficulty level). It is misapplying this ambiguous mapping from precise and invariant quantitative descriptions to imprecise, everyday verbal descriptions that gives the impression of lowering standards. Changing the response probability criterion in the report may be justified by the reasons discussed above, but we acknowledge that disadvantages to this recommendation include the potential for misinterpretations and a less preferable interpretation in the eyes of some segments of the user community.
In addition, use of a response probability of 67 percent for the bookmark standard-setting procedure does not preclude using a value of 80 percent in determining exemplary items for the performance levels. That is, for each of the performance levels, it is still possible to select exemplar items that demonstrate the types of questions individuals have an 80 percent chance of answering correctly. Furthermore, it is possible to select exemplary items that demonstrate other probabilities of success (67 percent, 50 percent, 35 percent, etc.). We discussed this issue in Chapter 3 and return to it in Chapter 6.
Table 5-6 presents the median cut scores that resulted from the rp67 instructions for the July standard setting (column 1) along with the median cut scores that resulted from the September standard setting (column 2). Column 3 shows the overall median cut scores that resulted when the July and September judgments were combined, and column 5 shows the overall mean cut score. To provide a sense of the spread of panelists’ judgments about the placement of the bookmarks, two measures of variability are shown. The “interquartile range” of the cut scores is shown in column 4. Whereas the median cut score represents the cut score at the 50th percentile in the distribution of panelists’ judgments, the interquartile range shows the range of cut score values from the 25th percentile to the 75th percentile.
Column 6 presents the standard deviation, and column 7 shows the range bounded by the mean plus and minus one standard deviation.

TABLE 5-6 Summary Statistics from the Committee’s Standard Settings for Adult Literacy

| | (1) July Median Cut Score^a | (2) September Median Cut Score^b | (3) Overall Median Cut Score^c | (4) Interquartile Range^d | (5) Overall Mean Cut Score | (6) Standard Deviation | (7) Mean ± One Standard Deviation |
|---|---|---|---|---|---|---|---|
| Prose Literacy | | | | | | | |
| (1) Basic | 211 | 219 | 211 | 206-221 | 214.2 | 11.0 | 199.6-221.6 |
| (2) Intermediate | 270 | 281 | 270 | 264-293 | 275.9 | 16.2 | 254.2-286.7 |
| (3) Advanced | 336 | 345 | 345 | 336-366 | 355.6 | 33.5 | 311.9-378.8 |
| Document Literacy | | | | | | | |
| (4) Basic | 189 | 210 | 203 | 192-210 | 200.1 | 13.4 | 189.8-216.6 |
| (5) Intermediate | 255 | 254 | 254 | 247-259 | 254.0 | 9.1 | 244.7-262.8 |
| (6) Advanced | 344 | 345 | 345 | 324-371 | 343.0 | 30.8 | 314.2-375.9 |
| Quantitative Literacy | | | | | | | |
| (7) Basic | 244 | 244 | 244 | 230-245 | 241.3 | 19.7 | 223.8-263.3 |
| (8) Intermediate | 307 | 295 | 296 | 288-307 | 293.8 | 17.1 | 279.4-313.5 |
| (9) Advanced | 352 | 356 | 356 | 343-398 | 368.6 | 41.8 | 313.9-397.6 |

^a The July standard setting used the items from the 1992 NALS. The cut scores are based on the bookmark placements set by panelists using the rp67 guidelines.
^b The September standard setting used items from the 2003 NAAL. All panelists used rp67 guidelines.
^c The overall median is the median cut score when both the July rp67 and September data were combined.
^d Range of cut scores from the first quartile (first value in range) to the third quartile (second value in range).

Comparison of the medians from the July and September standard-setting sessions reveals that the September cut scores tended to be slightly higher than the July cut scores, although overall the cut scores were quite similar. The differences in median cut scores ranged from 0 to 21, with the largest difference occurring for the basic cut score for document literacy. Examination of the spread in cut scores based on the standard deviation reveals more variability in the advanced cut score than for the other performance levels. Comparison of the variability in cut scores in each literacy area shows that, for all literacy areas, the standard deviation for the advanced cut score was at least twice as large as the standard deviation for the intermediate or basic cut scores. Comparison of the variability in cut scores across literacy areas shows that, for all of the performance levels, the standard deviations for the quantitative literacy cut scores were slightly higher than for the other two sections. There was considerable discussion (and some disagreement) among the panelists about the difficulty level of the quantitative section, which probably contributed to the larger variability in these cut scores. We address this issue in more detail later in this chapter. Appendixes C and D include additional results from the bookmark standard setting.
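As a brief illustration of how summary statistics like those in Table 5-6 are obtained from a pooled set of panelist judgments, the sketch below computes a median, interquartile range, mean, standard deviation, and mean ± 1 SD band. The cut score values are placeholders, not the committee’s data.

```python
import statistics

# Hypothetical pooled cut score judgments (July rp67 plus September panelists)
# for one performance level in one literacy area; placeholders, not real data.
cuts = sorted([203, 206, 208, 211, 211, 214, 219, 221, 224, 232])

q1, _, q3 = statistics.quantiles(cuts, n=4)   # 25th and 75th percentiles
med = statistics.median(cuts)
mean = statistics.mean(cuts)
sd = statistics.stdev(cuts)

print(f"median = {med}, interquartile range = {q1:.0f}-{q3:.0f}")
print(f"mean = {mean:.1f}, SD = {sd:.1f}, mean ± 1 SD = {mean - sd:.1f}-{mean + sd:.1f}")
```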
The Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, and National Council on Measurement in Education, 1999) recommend reporting information about the amount of variation in cut scores that might be expected if the standard-setting procedure were replicated. The design of our bookmark sessions provided a means for estimating the extent to which the cut scores would be likely to vary if another standard setting were held on a different occasion with a different set of judges.
As described earlier, participants in the July and September standard-setting sessions were divided into groups, each of which focused on two of the three literacy areas. At each session, panelists worked on their first assigned literacy area during the first half of the session (which can be referred to as “Occasion 1”) and their second assigned literacy area during the second half of the session (referred to as “Occasion 2”). This design for the standard setting allowed for cut score judgments to be obtained on four occasions that were essentially replications of each other: two occasions from July and two occasions from September. Thus, the four occasions can be viewed as four replications of the standard-setting procedures.
The median cut score for each occasion was determined based on the panelists’ Round 3 bookmark placements; these medians are shown in Table 5-7. The average of these occasion medians was calculated by weighting each median by the number of panelists. The 95 percent confidence intervals for the weighted averages were computed, which indicate the range in which the cut scores would be expected to fall if the standard-setting session were repeated. For example, a replication of the standard-setting session would be likely to yield a cut score for the basic level of prose literacy in the range of 200.5 to 225.5. We revisit these confidence intervals later in the chapter when we make recommendations for the cut scores.

TABLE 5-7 Confidence Intervals for the Bookmark Cut Scores

| | Weighted Average of the Medians^a | Standard Deviation | Standard Error^b | 95% Confidence Interval for the Weighted Average^c |
|---|---|---|---|---|
| Prose Basic | 213.0 | 12.7 | 6.4 | 200.5 to 225.5 |
| Prose Intermediate | 275.7 | 14.6 | 7.3 | 261.3 to 290.0 |
| Prose Advanced | 355.7 | 22.6 | 11.3 | 333.5 to 377.8 |
| Document Basic | 201.5 | 9.8 | 4.9 | 191.9 to 211.1 |
| Document Intermediate | 255.2 | 9.9 | 5.0 | 245.5 to 264.8 |
| Document Advanced | 346.8 | 26.6 | 13.3 | 320.7 to 372.9 |
| Quantitative Basic | 244.2 | 18.7 | 9.4 | 225.9 to 262.5 |
| Quantitative Intermediate | 294.3 | 11.7 | 5.9 | 282.8 to 305.8 |
| Quantitative Advanced | 369.7 | 27.2 | 13.6 | 343.0 to 396.4 |

^c The confidence interval is the weighted average plus or minus the bound, where the bound was calculated as the standard score at the .05 confidence level multiplied by the standard error.
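The arithmetic behind Table 5-7 can be sketched as follows, under some assumptions about the exact computation: the occasion medians and panel sizes below are placeholders, and treating the standard error as the standard deviation of the four occasion medians divided by the square root of the number of occasions, together with a 1.96 critical value, are assumptions that are consistent with the table footnote and the reported values rather than a documented formula.

```python
import math

# Hypothetical medians from the four standard-setting occasions and the
# number of panelists contributing to each (placeholders, not actual data).
occasion_medians = [206.0, 211.0, 216.0, 219.0]
n_panelists = [7, 7, 8, 8]

# Weighted average of the occasion medians (weights = panel sizes).
w_total = sum(n_panelists)
w_avg = sum(m * n for m, n in zip(occasion_medians, n_panelists)) / w_total

# Spread of the occasion medians and an approximate standard error,
# treating the four occasions as replications of the procedure.
mean_m = sum(occasion_medians) / len(occasion_medians)
sd = math.sqrt(sum((m - mean_m) ** 2 for m in occasion_medians) / (len(occasion_medians) - 1))
se = sd / math.sqrt(len(occasion_medians))

# 95 percent confidence interval using the normal critical value 1.96.
bound = 1.96 * se
print(f"weighted average = {w_avg:.1f}, SD = {sd:.1f}, SE = {se:.1f}")
print(f"95% CI = {w_avg - bound:.1f} to {w_avg + bound:.1f}")
```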
In a typical contrasting groups procedure, the standard-setting panelists are individuals who know the examinees firsthand in teaching, learning, or work environments. Using the performance-level descriptions, the panelists are asked to place examinees into the performance categories to which they judge the examinees belong, without reference to their actual performance on the test. Cut scores are then determined from the actual test scores attained by the examinees placed in the distinct categories. The goal is to set the cut score such that the number of misclassifications is roughly the same in both directions (Kane, 1995); that is, the cut score that minimizes the number of individuals who correctly belong in an upper group but are placed into a lower group (false negative classification errors) and likewise minimizes the number of individuals who correctly belong in a lower group but are placed into an upper group (false positive classification errors).
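A toy version of this balancing idea is sketched below. The score distributions are invented, and the rule of choosing the cut that makes the two error counts as equal and as small as possible is one reasonable reading of the goal just described, not a reproduction of any operational procedure.

```python
# Invented test scores for examinees judged (by hypothetical panelists) to
# belong to a lower and an upper performance category.
lower_group = [182, 190, 197, 203, 208, 214, 221, 230]
upper_group = [201, 209, 216, 224, 231, 238, 246, 255]

def misclassifications(cut):
    false_neg = sum(1 for s in upper_group if s < cut)   # upper-group members placed below the cut
    false_pos = sum(1 for s in lower_group if s >= cut)  # lower-group members placed at or above the cut
    return false_neg, false_pos

candidates = range(min(lower_group), max(upper_group) + 1)
# Prefer cuts that balance the two error counts, then cuts with fewer total errors.
best = min(candidates,
           key=lambda c: (abs(misclassifications(c)[0] - misclassifications(c)[1]),
                          sum(misclassifications(c))))
fn, fp = misclassifications(best)
print(f"cut = {best}, false negatives = {fn}, false positives = {fp}")
```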
Because data collection procedures for NALS and NAAL guarantee the anonymity of test takers, there was no way to implement the contrasting groups method as it is typically conceived. Instead, the committee designed a variation of this procedure that utilized the information collected via the background questionnaire to form groups of test takers. For example, test takers can be separated into two distinct groups based on their responses about the amount of help they need with reading: those who report they need a lot of help with reading and those who report they do not need a lot of help. Comparison of the distribution of literacy scores for these two groups provides information that can be used in determining cut scores.
This approach, while not a true application of the contrasting groups method, seemed promising as a viable technique for generating a second set of cut scores with which to judge the reasonableness of the bookmark cut scores. This QCG method differs from a true contrasting groups approach
in two key ways. First, because it was impossible to identify and contact respondents after the fact, no panel of judges was assembled to classify individuals into the performance categories. Second, due to the nature of the background questions, the groups were not distinguished on the basis of characteristics described by the performance-level descriptions. Instead, we used background questions as proxies for the functional consequences of the literacy levels, and, as described in the next section, aligned the information with the performance levels in ways that seemed plausible. We note that implementation of this procedure was limited by the available background information. In particular, there is little information on the background questionnaire that can serve as functional consequences of advanced literacy. As discussed in Chapter 4, additional background information about advanced literacy habits (e.g., number and character of books read in the past year, types of newspapers read, daily or weekly writing habits) would have helped refine the distinction between intermediate and advanced literacy skills.
From the set of questions available in both the NALS and NAAL background questionnaires, we identified the following variables to include in the QCG analyses: education level, occupation, two income-related variables (receiving federal assistance, receiving interest or dividend income), self-rating of reading skills, level of assistance needed with reading, and participation in reading activities (reading the newspaper, using reading at work). We examined the distribution of literacy scores for specific response options to the background questions.
The below basic and basic levels originated partly from policy distinctions about the provision of supplemental adult education services; thus, we expected the cut score between below basic and basic to be related to a recognized need for adult literacy services. Therefore, for each literacy area, the bookmark cut score between below basic and basic was compared with the QCG cut score that separated individuals with 0-8 years of formal education (i.e., no high school) and those with some high school education. To determine this QCG cut score, we examined the distributions of literacy scores for the two groups to identify the point below which most of those with 0-8 years of education scored and above which most of those with some high school scored. To accomplish this, we determined the median score (50th percentile) in each literacy area for those with no high school education and the median score (50th percentile) for those with some high school education. We then found the midpoint between these two medians
(which is simply the average of the two medians).7 Table 5-8 presents this information. For example, the table shows that in 1992 the median prose score for those with no high school was 182; the corresponding median for those with some high school was 236. The midpoint between these two medians is 209. Likewise, for 2003, the median prose score for those with no high school was 159 and for those with some high school was 229. The midpoint between these two medians is 194.
We also judged that self-rating of reading skills should be related to the distinction between below basic and basic, and the key relevant contrast would be between those who say they do not read well and those who say they do read well. Following the procedures described above, for each literacy area, we determined the median score for those who reported that they do not read well (e.g., in 1992, the value for prose was 140) and those who reported that they read well (e.g., in 1992, the value for prose was 285). The midpoint between these two values is 212.5. The corresponding median prose scores for the 2003 participants were 144 for those who report they do not read well and 282 for those who report that they read well, which results in a midpoint of 213.
We then combined the cut scores suggested by these two contrasts (no high school versus some high school; do not read well versus read well) by averaging the four midpoints for the 1992 and 2003 results (209, 194, 212.5, and 213). We refer to this value as the QCG cut score. Combining the information across multiple background variables enhances the stability of the cut score estimates. Table 5-8 presents the QCG cut scores for the basic performance level for prose (207.1), document (205.1), and quantitative (209.9) literacy.
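Because the computation just described is simple arithmetic, it can be stated directly in a few lines. The sketch below reproduces the prose QCG cut score for the basic level from the group medians reported in the text; the small helper function is ours, introduced only for readability.

```python
# Group medians reported in the text for prose literacy: no high school vs.
# some high school, and "do not read well" vs. "read well", for 1992 and 2003.
def midpoint(median_lower, median_upper):
    return (median_lower + median_upper) / 2.0

midpoints = [
    midpoint(182, 236),  # education contrast, 1992 -> 209.0
    midpoint(159, 229),  # education contrast, 2003 -> 194.0
    midpoint(140, 285),  # self-rated reading, 1992 -> 212.5
    midpoint(144, 282),  # self-rated reading, 2003 -> 213.0
]

qcg_basic_prose = sum(midpoints) / len(midpoints)
print(f"QCG cut score for basic prose literacy: {qcg_basic_prose:.1f}")  # 207.1
```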
TABLE 5-8 Comparison of Weighted Median Scaled Scores for Groups Contrasted to Determine the QCG Cut Scores for Basic Literacy

| Groups Contrasted | Weighted Median Score,^a 1992 | Weighted Median Score,^a 2003 |
|---|---|---|
| Prose Literacy | | |
| Education: | | |
| No high school | 182 | 159 |
| Some high school | 236 | 229 |
| Average of medians | 209.0 | 194.0 |
| Self-perception of reading skills: | | |
| Do not read well | 140 | 144 |
| Read well | 285 | 282 |
| Average of medians | 212.5 | 213.0 |
| Contrasting groups cut score for prose: 207.1^b | | |
| Document Literacy | | |
| Education: | | |
| No high school | 173 | 160 |
| Some high school | 232 | 231 |
| Average of medians | 202.5 | 195.5 |
| Self-perception of reading skills: | | |
| Do not read well | 138 | 152 |
| Read well | 279 | 276 |
| Average of medians | 208.5 | 214.0 |
| Contrasting groups cut score for document: 205.1 | | |
| Quantitative Literacy | | |
| Education: | | |
| No high school | 173 | 165 |
| Some high school | 233 | 231 |
| Average of medians | 203.0 | 198.0 |
| Self-perception of reading skills: | | |
| Do not read well | 138 | 166 |
| Read well | 285 | 288 |
| Average of medians | 211.5 | 227.0 |
| Contrasting groups cut score for quantitative: 209.9 | | |

^a For 1992, the median scores are calculated on a sample representing the entire adult population. For 2003, the median scores are calculated on a sample that excludes respondents with no responses to literacy tasks due to various “literacy-related reasons,” as determined by the interviewer. These excluded respondents correspond to roughly 2 percent of the adult population. Assuming that these respondents are at the lower end of the literacy scale (since they do not have answers for literacy-related reasons), their exclusion causes an upward bias in the calculated medians as an estimate of the true median of the full adult population. The impact of this bias on the standard setting is likely to be small for two reasons. First, a comparison of the medians for 1992 and 2003 suggests that the medians are relatively close and that the bias is probably not large. Second, the averaging procedure in the QCG calculation dilutes the effect of the biased 2003 results by averaging them with the unbiased 1992 results.
^b The cut score is the overall average of the weighted medians for the groups contrasted.

The contrast between the basic and intermediate levels was developed to reflect a recognized need for GED preparation services. Therefore, the bookmark cut score between these two performance levels was compared with the contrast between individuals without a high school diploma or GED certificate and those with a high school diploma or GED. Furthermore, because of a general policy expectation that most individuals can and should achieve a high school level education but not necessarily more, we expected the contrast between the basic and intermediate levels to be associated with a number of other indicators of unsuccessful versus successful functioning in society available on the background questionnaire, specifically the contrast between:

- Needing a lot of help with reading versus not needing a lot of help with reading.
- Never reading the newspaper versus sometimes reading the newspaper.
- Working in a job in which reading is never used versus working in a job in which reading is used.
- Receiving Aid to Families with Dependent Children or food stamps versus receiving interest or dividend income.

Following the procedures described above for the basic performance level, we determined the cut score for the contrasted groups in the above list, and Table 5-9 presents these medians for the three types of literacy. For example, the median prose score in 1992 for those with some high school was 236; the corresponding median for those with a high school diploma was 274; and the midpoint between these medians was 255. We determined the corresponding medians from the 2003 results (which were 229 for those with some high school and 262 for those with a high school diploma, yielding a midpoint of 245.5). We then averaged the midpoints resulting from the contrasts on these five variables to yield the QCG cut score. These QCG cut scores for prose (243.5), document (241.6), and quantitative (245.4) literacy areas appear in Table 5-9.
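Using the prose group medians reported in Table 5-9 (shown below), the same midpoint-and-average computation reproduces the intermediate QCG cut score. The dictionary layout is simply one convenient way to organize the published medians; the printed value of 243.45 corresponds to the 243.5 reported in the table after rounding.

```python
# (1992, 2003) group medians for prose literacy from Table 5-9, organized as
# (lower-group median, upper-group median) pairs for each contrast and year.
contrasts = {
    "education":         [(236, 274), (229, 262)],
    "help with reading": [(135, 281), (153, 277)],
    "read newspaper":    [(161, 283), (173, 280)],
    "read at work":      [(237, 294), (222, 287)],
    "financial status":  [(246, 302), (241, 296)],
}

# Midpoint of each contrast, then the average across all ten midpoints.
midpoints = [(lo + hi) / 2.0 for pairs in contrasts.values() for lo, hi in pairs]
qcg_intermediate_prose = sum(midpoints) / len(midpoints)
print(f"QCG cut score for intermediate prose literacy: {qcg_intermediate_prose:.2f}")  # 243.45
```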
TABLE 5-9 Comparison of Weighted Median Scaled Scores for Groups Contrasted to Determine the QCG Cut Scores for Intermediate Literacy

| Groups Contrasted | Weighted Median Score,^a 1992 | Weighted Median Score,^a 2003 |
|---|---|---|
| Prose Literacy | | |
| Education: | | |
| Some high school | 236 | 229 |
| High school diploma | 274 | 262 |
| Average of medians | 255.0 | 245.5 |
| Extent of help needed with reading: | | |
| A lot | 135 | 153 |
| Not a lot | 281 | 277 |
| Average of medians | 208.0 | 215.0 |
| Read the newspaper: | | |
| Never | 161 | 173 |
| Sometimes, or more | 283 | 280 |
| Average of medians | 222.0 | 226.5 |
| Read at work: | | |
| Never | 237 | 222 |
| Sometimes, or more | 294 | 287 |
| Average of medians | 265.5 | 254.5 |
| Financial status: | | |
| Receive federal assistance | 246 | 241 |
| Receive interest, dividend income | 302 | 296 |
| Average of medians | 274.0 | 268.5 |
| Contrasting groups cut score for prose: 243.5^b | | |
| Document Literacy | | |
| Education: | | |
| Some high school | 232 | 231 |
| High school diploma | 267 | 259 |
| Average of medians | 249.5 | 245.0 |
| Extent of help needed with reading: | | |
| A lot | 128 | 170 |
| Not a lot | 275 | 273 |
| Average of medians | 201.5 | 221.5 |
| Read the newspaper: | | |
| Never | 154 | 188 |
| Sometimes, or more | 278 | 275 |
| Average of medians | 216.0 | 231.5 |
| Read at work: | | |
| Never | 237 | 228 |
| Sometimes, or more | 289 | 282 |
| Average of medians | 263.0 | 255.0 |
| Financial status: | | |
| Receive federal assistance | 242 | 240 |
| Have interest/dividend income | 295 | 288 |
| Average of medians | 268.5 | 264.0 |
| Contrasting groups cut score for document: 241.6 | | |
| Quantitative Literacy | | |
| Education: | | |
| Some high school | 233 | 231 |
| High school diploma | 275 | 270 |
| Average of medians | 254.0 | 250.5 |
| Extent of help needed with reading: | | |
| A lot | 114 | 162 |
| Not a lot | 282 | 285 |
| Average of medians | 198.0 | 223.5 |
| Read the newspaper: | | |
| Never | 145 | 197 |
| Sometimes, or more | 284 | 287 |
| Average of medians | 214.5 | 242.0 |
| Read at work: | | |
| Never | 236 | 233 |
| Sometimes, or more | 294 | 294 |
| Average of medians | 265.0 | 263.5 |
| Financial status: | | |
| Receive federal assistance | 240 | 237 |
| Have interest/dividend income | 303 | 305 |
| Average of medians | 271.5 | 271.0 |
| Contrasting groups cut score for quantitative: 245.4 | | |

^a For 1992, the median scores are calculated on a sample representing the entire adult population. For 2003, the median scores are calculated on a sample that excludes respondents with no responses to literacy tasks due to various “literacy-related reasons,” as determined by the interviewer. These excluded respondents correspond to roughly 2 percent of the adult population. Assuming that these respondents are at the lower end of the literacy scale (since they do not have answers for literacy-related reasons), their exclusion causes an upward bias in the calculated medians as an estimate of the true median of the full adult population. The impact of this bias on the standard setting is likely to be small for two reasons. First, a comparison of the medians for 1992 and 2003 suggests that the medians are relatively close and that the bias is probably not large. Second, the averaging procedure in the QCG calculation dilutes the effect of the biased 2003 results by averaging them with the unbiased 1992 results.
^b The cut score is the overall average of the weighted medians for the groups contrasted.

The contrast between the intermediate and advanced levels was intended to relate to pursuit of postsecondary education or entry into professional, managerial, or technical occupations. Therefore, the bookmark cut score between intermediate and advanced literacy was compared with the contrast between those who have a high school diploma (or GED) and those who graduated from college. We expected that completing postsecondary education would be related to occupation. Thus, for each type of literacy, we determined the median score for occupations with minimal formal training requirements (e.g., laborer, assembler, fishing, farming) and those occupations that require formal training or education (e.g., manager, professional, technician). These QCG cut scores for prose (292.1), document (285.6), and quantitative (296.1) literacy appear in Table 5-10.

TABLE 5-10 Comparison of Weighted Median Scaled Scores for Groups Contrasted to Determine the QCG Cut Scores for Advanced Literacy

| Groups Contrasted | Median Score,^a 1992 | Median Score,^a 2003 |
|---|---|---|
| Prose Literacy | | |
| Education: | | |
| High school diploma | 274 | 262 |
| College degree | 327 | 316 |
| Average of medians | 300.5 | 289.0 |
| Occupational status: | | |
| Low formal training requirements | 267 | 261 |
| High formal training requirements | 324 | 306 |
| Average of medians | 295.5 | 283.5 |
| Contrasting groups cut score for prose: 292.1^b | | |
| Document Literacy | | |
| Education: | | |
| High school diploma | 267 | 259 |
| College degree | 319 | 304.5 |
| Average of medians | 293.0 | 281.8 |
| Occupational status: | | |
| Low formal training requirements | 264 | 258 |
| High formal training requirements | 315 | 298 |
| Average of medians | 289.5 | 278.0 |
| Contrasting groups cut score for document: 285.6 | | |
| Quantitative Literacy | | |
| Education: | | |
| High school diploma | 275 | 270 |
| College degree | 326 | 324 |
| Average of medians | 300.5 | 297.0 |
| Occupational status: | | |
| Low formal training requirements | 269 | 267 |
| High formal training requirements | 323 | 315 |
| Average of medians | 296.0 | 291.0 |
| Contrasting groups cut score for quantitative: 296.1 | | |

^a For 1992, the median scores are calculated on a sample representing the entire adult population. For 2003, the median scores are calculated on a sample that excludes respondents with no responses to literacy tasks due to various “literacy-related reasons,” as determined by the interviewer. These excluded respondents correspond to roughly 2 percent of the adult population. Assuming that these respondents are at the lower end of the literacy scale (since they do not have answers for literacy-related reasons), their exclusion causes an upward bias in the calculated medians as an estimate of the true median of the full adult population. The impact of this bias on the standard setting is likely to be small for two reasons. First, a comparison of the medians for 1992 and 2003 suggests that the medians are relatively close and that the bias is probably not large. Second, the averaging procedure in the QCG calculation dilutes the effect of the biased 2003 results by averaging them with the unbiased 1992 results.
^b The cut score is the overall average of the weighted medians for the groups contrasted.

In examining the relationships described above, it is important to note that for those who speak little English, the relationship between literacy levels in English and educational attainment in the home country may be skewed, since it is possible to have high levels of education from one’s home country yet not be literate in English. To see if inclusion of non-English speakers would skew the results in any way, we examined the medians for all test takers and just for English speakers. There were no meaningful differences among the resulting medians; thus we decided to report medians for the full aggregated dataset.
Most authorities on standard setting (e.g., Green, Trimble, and Lewis, 2003; Hambleton, 1980; Jaeger, 1989; Shepard, 1980; Zieky, 2001) suggest that, when setting cut scores, it is prudent to use and compare the
results from different standard-setting methods. At the same time, they acknowledge that different methods, or even the same method replicated with different panelists, are likely to produce different cut scores. This presents a dilemma to those who must make decisions about cut scores. Geisinger (1991, p. 17) captured this idea when he noted that “running a standard-setting panel is only the beginning of the standard-setting process.” At the conclusion of the standard setting, one has only proposed cut scores that must be accepted, rejected, or adjusted.
The standard-setting literature contains discussions about how to proceed with making decisions about proposed cut scores, but there do not appear to be any hard and fast rules. Several quantitative approaches have been explored. For example, in the early 1980s, two quantitative techniques were devised for “merging” results from different standard-setting procedures (Beuk, 1984; Hofstee, 1983). These methods involve obtaining additional judgments from the panelists, beyond the typical standard-setting judgments, to derive the cut scores. In the Beuk technique, panelists are asked to make judgments about the optimal pass rate on the test. In the Hofstee approach, panelists are asked their opinions about the highest and lowest possible cut scores and the highest and lowest possible failing rate.8
Another quantitative approach is to set reasonable ranges for the cut scores and to make adjustments within this range. One way to establish a range is by using estimates of the standard errors of the proposed cut scores (Zieky, 2001). Also, Huff (2001) described a method of triangulating results from three standard-setting procedures in which a reasonable range was determined from the results of one of the standard-setting methods. The cut scores from the two other methods fell within this range and were therefore averaged to determine the final set of cut scores.
While these techniques use quantitative information in determining final cut scores, they are not devoid of judgments (e.g., someone must decide whether a quantitative procedure should be used, which one to use and how to implement it, and so on). Like the standard-setting procedure itself, determination of final cut scores is ultimately a judgment-based task that authorities on standard setting maintain should be based on both quantitative and qualitative information.
For example, The Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, and the National Council on Measurement in Education, 1999, p. 54) note that determining cut scores cannot be a “purely technical matter,” indicating that they should “embody value judgments as well as technical and empirical considerations.” In his landmark article on certifying students’ competence, Jaeger (1989, p. 500) recommended considering all of the results from the standard setting together with “extra-statistical factors” to determine the final cut scores. Geisinger (1991) suggests that a panel composed of informed members of involved groups should be empowered to make decisions about final cut scores. Green et al. (2003) proposed convening a separate judgment-based procedure wherein a set of judges synthesizes the various results to determine a final set of cut scores or submitting the different sets of cut scores to a policy board (e.g., a board of education) for final determination.
As should be obvious from this discussion, there is no consensus in the measurement field about ways to determine final cut scores and no absolute guidance in the literature that the committee could rely on in making final decisions about cut scores. Using the advice that can be gleaned from the literature and guidance from the Standards that the process should be clearly documented and defensible, we developed an approach for utilizing the information from the two bookmark standard-setting sessions and the QCG procedure to develop our recommendations for final cut scores.
We judged that the cut scores resulting from the two bookmark sessions were sufficiently similar to warrant combining them, and we formed median cut scores based on the two sets of panelist judgments. Since we decided to use the cut scores from the QCG procedure solely to complement the information from the bookmark procedure, we did not want to combine these two sets of cut scores in such a way that they were accorded equal weight. There were two reasons for this. One reason, as described above, was that the background questions used for the QCG procedure were correlates of the constructs evaluated on the assessment and were not intended as direct measures of these constructs. Furthermore, as explained earlier in this chapter, the available information was not ideal and did not include questions that would be most useful in distinguishing between certain levels of literacy.
The other reason related to our judgment that the bookmark procedure had been implemented appropriately according to the guidelines documented in the literature (Hambleton, 2001; Kane, 2001; Plake, Melican, and Mills, 1992; Raymond and Reid, 2001) and that key factors had received close attention. We therefore chose to use a method for combining the results that accorded more weight to the bookmark cut scores than to the QCG cut scores.
The cut scores produced by the bookmark and QCG approaches are summarized in the first two rows of Table 5-11 for each type of literacy. Comparison of these cut scores reveals that the QCG cut scores are always lower than the bookmark cut scores. The differences between the two sets of cut scores are smaller for the basic and intermediate performance levels for prose and document literacy, with differences ranging from 2 to 26 points. Differences among the cut scores are somewhat larger for all performance levels in the quantitative literacy area and for the advanced performance level for all three types of literacy, with differences ranging from 34 to 60 points. Overall, this comparison suggests that the bookmark cut scores should be lowered slightly.

TABLE 5-11 Summary of Cut Scores Resulting from Different Procedures

| | Basic | Intermediate | Advanced |
|---|---|---|---|
| Prose | | | |
| QCG cut score | 207.1 | 243.5 | 292.1 |
| Bookmark cut score | 211 | 270 | 345 |
| Interquartile range of bookmark cut score | 206-221 | 264-293 | 336-366 |
| Adjusted cut scores | 211.0 | 267.0 | 340.5 |
| Average of cut scores | 209.1 | 256.8 | 318.6 |
| Confidence interval for cut scores | 200.5-225.5 | 261.3-290.0 | 333.5-377.8 |
| Document | | | |
| QCG cut score | 205.1 | 241.6 | 285.6 |
| Bookmark cut score | 203 | 254 | 345 |
| Interquartile range of bookmark cut score | 192-210 | 247-259 | 324-371 |
| Adjusted cut scores | 203.0 | 250.5 | 334.5 |
| Average of cut scores | 204.1 | 247.8 | 315.3 |
| Confidence interval for cut scores | 191.9-211.1 | 245.5-264.8 | 320.7-372.9 |
| Quantitative | | | |
| QCG cut score | 209.9 | 245.4 | 296.1 |
| Bookmark cut score | 244 | 296 | 356 |
| Interquartile range of bookmark cut score | 230-245 | 288-307 | 343-398 |
| Adjusted cut scores | 237.0 | 292.0 | 349.5 |
| Average of cut scores | 227.0 | 275.2 | 326.1 |
| Confidence interval for cut scores | 225.9-262.5 | 282.8-305.8 | 343.0-396.4 |
We designed a procedure for combining the two sets of cut scores that was intended to make only minor adjustments to the bookmark cut scores, and we examined its effects on the resulting impact data. The adjustment procedure is described below, and the resulting cut scores are also presented in Table 5-11. The table also includes the cut scores that would result from averaging the bookmark and QCG cut scores; although we did not consider this averaging a viable alternative, we provide it for comparison with the cut scores that resulted from the adjustment.
We devised a procedure for adjusting the bookmark cut scores that involved specifying a reasonable range for the cut scores and making adjustments within this range. We decided that the adjustment should keep the cut scores within the interquartile range of the bookmark cut scores (that is, the range encompassed by the 25th and 75th percentile scaled scores produced by the bookmark judgments) and used the QCG cut scores to determine the direction of the adjustment within this range. Specifically, we compared each QCG cut score to the respective interquartile range from the bookmark procedure. If the cut score lay within the interquartile range, no adjustment was made. If the cut score lay outside the interquartile range, the bookmark cut score was adjusted using the following rules:
- If the QCG cut score is lower than the lower bound of the interquartile range (i.e., lower than the 25th percentile), determine the difference between the bookmark cut score and the lower bound of the interquartile range. Reduce the bookmark cut score by half of this difference (essentially, the midpoint between the 25th and 50th percentiles of the bookmark cut scores).
- If the QCG cut score is higher than the upper bound of the interquartile range (i.e., higher than the 75th percentile), determine the difference between the bookmark cut score and the upper bound of the interquartile range. Increase the bookmark cut score by half of this difference (essentially, the midpoint between the 50th and 75th percentiles of the bookmark cut scores).
To demonstrate this procedure, the QCG cut score for the basic performance level in prose is 207.1, and the bookmark cut score is 211 (see Table 5-11). The corresponding interquartile range based on the bookmark procedure is 206 to 221. Since 207.1 falls within the interquartile range, no adjustment is made. The QCG cut score for intermediate is 243.5. Since 243.5 is lower than the 25th percentile score (interquartile range of 264 to 293), the bookmark cut score of 270 needs to be reduced. The amount of the reduction is half the difference between the bookmark cut score of 270 and the lower bound of the interquartile range (264), which is 3 points. Therefore, the bookmark cut score would be reduced from 270 to 267.
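The adjustment rule can be stated compactly in code. The function below is our restatement of the rule as described above; it is checked against the worked example for prose (intermediate: bookmark 270, interquartile range 264-293, QCG 243.5, adjusted 267; basic: QCG 207.1 inside the range 206-221, so no change).

```python
def adjust(bookmark_cut, iqr_low, iqr_high, qcg_cut):
    """Adjust a bookmark cut score toward the QCG cut score while staying
    within the interquartile range of the panelists' bookmark judgments."""
    if iqr_low <= qcg_cut <= iqr_high:
        return bookmark_cut                                   # no adjustment needed
    if qcg_cut < iqr_low:
        return bookmark_cut - (bookmark_cut - iqr_low) / 2.0  # move halfway toward the 25th percentile
    return bookmark_cut + (iqr_high - bookmark_cut) / 2.0     # move halfway toward the 75th percentile

# Worked examples from the text (prose literacy).
print(adjust(270, 264, 293, 243.5))  # intermediate: 267.0
print(adjust(211, 206, 221, 207.1))  # basic: 211 (QCG lies within the interquartile range)
```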
Application of these rules to the remaining cut scores indicates that all of the bookmark cut scores should be adjusted except the basic cut scores for prose and document literacy. The adjusted cut scores produced by this adjustment are presented in Table 5-11.
In 1992, the test designers noted that the break points determined by the analyses that produced the performance levels did not necessarily occur at exact 50-point intervals on the scales. As we described in Chapter 3, the test designers judged that assigning the exact range of scores to each level would imply a level of precision of measurement that was inappropriate for the methodology adopted, and they therefore rounded the cut scores. In essence, this rounding procedure reflected the notion that there is a level of uncertainty associated with the specification of cut scores.
The procedures we used for the bookmark standard setting allowed determination of confidence intervals for the cut scores, which also reflect the level of uncertainty in the cut scores. Like the test designers in 1992, we judged that the cut scores should be rounded and suggest that they be rounded to multiples of five. Tables 5-12a, 5-12b, and 5-12c show, for prose, document, and quantitative literacy, respectively, the original cut scores from the bookmark procedure and the adjusted cut scores after rounding to the nearest multiple of five. For comparison, the tables also present the confidence intervals for the cut scores to indicate the level of uncertainty associated with the specific cut scores.
Another consideration when making use of cut scores from different standard-setting methods is the resulting impact data; that is, the percentages of examinees who would be placed into each performance category based on the cut scores. Tables 5-12a, 5-12b, and 5-12c show the percentage of the population who scored below the rounded cut scores. Again for comparison purposes, the tables also present impact data for the confidence intervals.
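As a simplified illustration of how impact data of this kind are tabulated, the sketch below computes the weighted percentage of respondents scoring below each rounded cut score. The scores and weights are invented; an operational computation would use the survey’s sampling weights and the assessment’s estimated score distributions rather than individual observed scores.

```python
# Invented respondent scale scores and sampling weights (placeholders only).
scores  = [188, 204, 215, 242, 259, 271, 296, 310, 342, 358]
weights = [1.2, 0.8, 1.0, 1.1, 0.9, 1.0, 1.3, 0.7, 1.0, 1.0]

# Rounded prose cut scores from the adjusted procedure (see Table 5-12a).
cut_scores = {"basic": 210, "intermediate": 265, "advanced": 340}

total_weight = sum(weights)
for level, cut in cut_scores.items():
    below = sum(w for s, w in zip(scores, weights) if s < cut)
    print(f"percent below the {level} cut score ({cut}): {100 * below / total_weight:.1f}")
```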
Impact data were examined for both the original cut scores that resulted from the bookmark procedure and for the adjusted values of the cut scores. Comparison of the impact results based on the original and adjusted cut scores shows that the primary effect of the adjustment was to slightly lower the cut scores, more so for quantitative literacy than the other sections. A visual depiction of the differences in the percentages of adults classified into each performance level based on the two sets of cut scores is presented in Figures 5-1 through 5-6, respectively, for the prose, document, and quantitative sections. The top bar shows the percentages of adults that would be placed into each performance level based on the adjusted cut scores, and the bottom bar shows the distribution based on the original bookmark cut scores.
Overall, the adjustment procedure tended to produce a distribution of participants across the performance levels that resembled the distribution produced by the original bookmark cut scores. The largest changes were in
the quantitative section, in which the adjustment slightly lowered the cut scores. The result of the adjustment is a slight increase in the percentages of individuals in the basic, intermediate, and advanced categories.
In our view, the procedures used to determine the adjustment were sensible and served to align the bookmark cut scores more closely with the relevant background measures. The adjustments were relatively small and made only slight differences in the impact data. The adjusted values remained within the confidence intervals. We therefore recommend the cut scores produced by the adjustment.
RECOMMENDATION 5-1: The scale score intervals associated with each of the levels should be as shown below for prose, document, and quantitative literacy.
| | Nonliterate in English | Below Basic | Basic | Intermediate | Advanced |
|---|---|---|---|---|---|
| Prose | Took ALSA | 0-209 | 210-264 | 265-339 | 340-500 |
| Document | Took ALSA | 0-204 | 205-249 | 250-334 | 335-500 |
| Quantitative | Took ALSA | 0-234 | 235-289 | 290-349 | 350-500 |
We remind the reader that the nonliterate in English category was intended to comprise the individuals who were not able to answer the core questions in 2003 and were given the ALSA instead of NAAL. Below basic is the lowest performance level for 1992, since the ALSA did not exist at that time.9
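For readers who wish to apply Recommendation 5-1 programmatically, a small helper like the following maps a scale score to a performance level. The function and its name are ours, not part of any NAAL product, and it deliberately does not handle the nonliterate in English category, which is defined by ALSA participation rather than by a score range.

```python
# Recommended scale score cut points (Recommendation 5-1), by literacy area:
# (basic, intermediate, advanced) lower bounds.
CUTS = {
    "prose":        (210, 265, 340),
    "document":     (205, 250, 335),
    "quantitative": (235, 290, 350),
}

def performance_level(area, score):
    """Classify a scale score (0-500) into a performance level for one area."""
    basic, intermediate, advanced = CUTS[area]
    if score < basic:
        return "below basic"
    if score < intermediate:
        return "basic"
    if score < advanced:
        return "intermediate"
    return "advanced"

print(performance_level("prose", 272))         # intermediate
print(performance_level("quantitative", 233))  # below basic
```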
With respect to setting achievement levels on the NAAL, we found that there were significant problems at both the lower and upper ends of the literacy scale. The problems with the lower end relate to decisions about the nature of the ALSA component. ALSA was implemented as a separate low-level assessment. ALSA and NAAL items were not analyzed or calibrated together and hence were not placed on the same scale. We were therefore not able to use the ALSA items in our procedures for setting the cut scores. These decisions about the ways to process ALSA data created a de facto cut score between the nonliterate in English and below basic categories. Consequently, all test takers in 2003 who performed poorly on the initial screening questions (the core questions) and were administered ALSA are classified into the nonliterate in English category (see footnote 9).

TABLE 5-12a Comparison of Impact Data for Prose Literacy Based on Rounded Bookmark Cut Scores, Rounded Adjusted Cut Scores, and Rounded Confidence Interval for Cut Scores

| | Basic | Intermediate | Advanced |
|---|---|---|---|
| Rounded^a bookmark cut score | 210 | 270 | 345 |
| Percent below cut score, 1992 | | 46.8 | 87.4 |
| Percent below cut score, 2003 | | 46.8 | 88.8 |
| Rounded^a adjusted cut score | 210 | 265 | 340 |
| Percent below cut score, 1992 | | 43.7 | 85.7 |
| Percent below cut score, 2003 | | 43.6 | 87.1 |
| Rounded^e confidence interval | 201-226 | 261-290 | 334-378 |
| Percent below cut scores, 1992 | | 41.2-59.4 | 83.2-95.6 |
| Percent below cut scores, 2003 | | 40.9-60.1 | 84.6-96.5 |

^a Rounded to nearest multiple of five.
^b Includes those who took NALS and scored below the cut score as well as those who were not able to participate in the assessment for literacy-related reasons (having difficulty with reading or writing or unable to communicate in English or Spanish); nonparticipants for literacy-related reasons comprised 3 percent of the sample in 1992.
^c This is an underestimate because it does not include the 1 percent of individuals who could not participate due to a mental disability such as retardation, a learning disability, or other mental/emotional conditions. An upper bound on the percent below basic could be obtained by including this percentage.
^d Includes those who took NAAL and scored below the basic cut score, those who took ALSA, and those who were not able to participate in the assessment for literacy-related reasons (having difficulty with reading or writing or unable to communicate in English or Spanish); nonparticipants for literacy-related reasons comprised 2 percent of the sample in 2003.
^e Rounded to nearest whole number.
TABLE 5-12b Comparison of Impact Data for Document Literacy Based on Rounded Bookmark Cut Scores, Rounded Adjusted Cut Scores, and Rounded Confidence Interval for Cut Scores

| | Basic | Intermediate | Advanced |
|---|---|---|---|
| Rounded^a bookmark cut score | 205 | 255 | 345 |
| Percent below cut score, 1992 | | 40.8 | 89.2 |
| Percent below cut score, 2003 | | 39.4 | 91.1 |
| Rounded^a adjusted cut score | 205 | 250 | 335 |
| Percent below cut score, 1992 | 16.8 | 37.8 | 85.8 |
| Percent below cut score, 2003 | 14.2 | 36.1 | 87.7 |
| Rounded^e confidence interval | 192-211 | 246-265 | 321-373 |
| Percent below cut scores, 1992 | 12.9-18.9 | 35.5-47.0 | 79.9-95.6 |
| Percent below cut scores, 2003 | 10.5-16.3 | 33.7-46.0 | 81.6-96.9 |

See footnotes to Table 5-12a.
TABLE 5-12c Comparison of Impact Data for Quantitative Literacy Based on Rounded Bookmark Cut Scores, Rounded Adjusted Cut Scores, and Rounded Confidence Interval for Cut Scores
                                     Basic        Intermediate   Advanced

Rounded^a bookmark cut score         245          300            355
Percent below cut score:
  1992                                            65.1           89.3
  2003                                            61.3           88.6

Rounded^a adjusted cut score         235          290            350
Percent below cut score:
  1992                               28.5         59.1           87.9
  2003                               23.1         55.1           87.0

Rounded^e confidence interval        226-263      283-306        343-396
Percent below cut scores:
  1992                               24.7-42.9    55.0-68.5      85.6-97.1
  2003                               19.2-37.9    50.5-64.9      84.1-97.2
NOTE: See footnotes to Table 5-12a.
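The impact data in Tables 5-12a through 5-12c are cumulative figures: each entry is the weighted percentage of adults whose scale score falls below the corresponding cut score, and the ranges in the confidence-interval rows come from evaluating that same percentage at the interval's endpoints. The sketch below conveys only the logic of the calculation; the simulated scores and weights are hypothetical stand-ins for the NALS/NAAL respondent files, and the operational estimates depend on the surveys' sampling weights and scaling procedures.

```python
import numpy as np

def pct_below(scores: np.ndarray, weights: np.ndarray, cut: float) -> float:
    """Weighted percentage of respondents scoring below a cut score."""
    below = scores < cut
    return 100.0 * weights[below].sum() / weights.sum()

# Simulated stand-in for a respondent file (illustrative only).
rng = np.random.default_rng(42)
scores = rng.normal(loc=275, scale=60, size=10_000)
weights = rng.uniform(0.5, 1.5, size=10_000)

# Rounded adjusted cut scores for quantitative literacy (Table 5-12c).
for label, cut in [("basic", 235), ("intermediate", 290), ("advanced", 350)]:
    print(f"percent below {label} cut: {pct_below(scores, weights, cut):.1f}")

# Impact range implied by a rounded confidence interval: evaluate the
# cumulative percentage at the interval's lower and upper endpoints.
low, high = 283, 306  # rounded CI for the intermediate cut in Table 5-12c
print(f"intermediate impact range: "
      f"{pct_below(scores, weights, low):.1f}-{pct_below(scores, weights, high):.1f}")
```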
FIGURE 5-1 Comparison of the percentages of adults in each performance level based on the bookmark cut scores and adjusted cut scores for 1992 prose literacy.
FIGURE 5-2 Comparison of the percentages of adults in each performance level based on the bookmark cut scores and adjusted cut scores for 2003 prose literacy.
*The nonliterate in English category comprises 4.7% of the 2003 population. This percentage plus those in the below basic category would be equivalent to the 1992 below basic category.
FIGURE 5-3 Comparison of the percentages of adults in each performance level based on the bookmark cut scores and adjusted cut scores for 1992 document literacy.
FIGURE 5-4 Comparison of the percentages of adults in each performance level based on the bookmark cut scores and adjusted cut scores for 2003 document literacy.
*The nonliterate in English category comprises 4.7% of the 2003 population. This percentage plus those in the below basic category would be equivalent to the 1992 below basic category.
FIGURE 5-5 Comparison of the percentages of adults in each performance level based on the bookmark cut scores and adjusted cut scores for 1992 quantitative literacy.
FIGURE 5-6 Comparison of the percentages of adults in each performance level based on the bookmark cut scores and adjusted cut scores for 2003 quantitative literacy.
*The nonliterate in English category comprises 4.7% of the 2003 population. This percentage plus those in the below basic category would be equivalent to the 1992 below basic category.
This creates problems for comparisons between the 1992 and 2003 data. Because ALSA was not part of NALS in 1992, there is no way to identify the 1992 test takers who would have been classified into the nonliterate in English category. As a result, the below basic and nonliterate in English categories must be combined to examine trends between 1992 and 2003.
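A minimal sketch of the recoding this implies is shown below: before comparing the 2003 distribution with 1992, the nonliterate in English and below basic percentages are collapsed into a single category. The 4.7 percent figure comes from the notes to Figures 5-2, 5-4, and 5-6; the remaining values are hypothetical placeholders, not published estimates.

```python
# Illustrative only: collapse the 2003 "nonliterate in English" category
# into "below basic" so the 2003 distribution can be compared with 1992,
# which has no separate nonliterate-in-English category.

def collapse_for_trend(dist_2003: dict) -> dict:
    merged = dict(dist_2003)
    merged["below basic"] = round(
        merged.pop("nonliterate in English", 0.0) + merged.get("below basic", 0.0), 1)
    return merged

# The 4.7% figure is from the figure notes above; the other values are
# hypothetical placeholders rather than published estimates.
dist_2003 = {"nonliterate in English": 4.7, "below basic": 10.0,
             "basic": 30.0, "intermediate": 40.0, "advanced": 15.3}

print(collapse_for_trend(dist_2003)["below basic"])  # -> 14.7
```

The combined figure is the only 2003 quantity directly comparable to the 1992 below basic percentage.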
With regard to the upper end of the scale, feedback from the bookmark panelists, combined with our own review of the items, suggests that the assessment does not adequately cover the upper end of the distribution of literacy proficiency. We developed the description of this level based on what we judged to be the natural progression of skills beyond the intermediate level. In devising the wording of the description, we reviewed samples of NALS items and considered the 1992 descriptions of NALS Levels 4 and 5. A number of panelists in the bookmark procedure, however, commented on the lack of difficulty represented by the items, particularly the quantitative items. A few judged that an individual at the advanced level should be able to answer all of the items correctly, which essentially means that these panelists did not set a cut score for the advanced category. We therefore conclude that the assessment is very weak at the upper end of the scale. Although there are growing concerns about readiness for college-level work and preparedness for entry into professional and technical occupations, we think that NAAL, as currently designed, will not allow for detection of problems at these levels of proficiency. It is therefore with some reservations that we include the advanced category in our recommendation for performance levels, and we leave it to NCES to decide ultimately on the utility and meaning of this category.
With regard to the lower and upper ends of the score scale, we make the following recommendation:
RECOMMENDATION 5-2: Future development of NAAL should include more comprehensive coverage at the lower end of the continuum of literacy skills, including assessment of the extent to which individuals are able to recognize letters and numbers and to read words and simple sentences, to allow determination of which individuals have basic foundation skills in literacy and which do not. This assessment should be part of NAAL and should yield information used in calculating scores for each of the three types of literacy. At the upper end of the continuum of literacy skills, future development of NAAL should also include the assessment items necessary to identify the extent to which policy interventions are needed at the postsecondary level and above.