Evolution of Evaluation
While evaluation as
a profession is new, evaluation activity began long ago, perhaps as early as
Adam and Eve. Evaluation is a method used to determine the value or worth of
something. It is a process humans use to make decisions. It is also an
imperfect process. As humans, we evaluate with the information available to us,
which is often incomplete, and nearly always without a clear picture of
implication and consequence. Eve made the decision to eat from the forbidden tree,
evaluating the information that she had and obviously weighing one source more
than another. Her information was conflicting and she did not foresee the
consequences of her decision, but it was evaluative nonetheless. Some
researchers look back further and place the roots of evaluation with
evolutionary biology (Shadish, Cook, & Leviton, 1990). It is reasonable to
consider that evaluation is at play when species mutate to adopt new
characteristics as a survival adaptation, as with evolutionary developmental
biology (Evo-Devo). Evo-Devo, no relation to the 1970s rock band Devo, is the
study of when, how, and to what extent genes are turned on to maximize
survivability through natural selection (Public Broadcasting Service, 2009).
However, evaluation as an activity to improve processes, programs, and policies
has more modest roots.
FIVE DECADES OF
TEST-BASED EDUCATIONAL REFORM
Test-based reforms can readily be traced back to the middle
of the 19th century, when the Massachusetts state superintendent of instruction
used written examinations as a means of holding public schools accountable for
results (Resnick, 1982). In the early part of the 20th century, Joseph Rice
(1914) administered spelling and mathematics tests to thousands of school
children in a series of studies that raised questions about the efficiency of
the use of instructional time. Numerous achievement test batteries published
following World War I made use of the multiple-choice technology that came into
widespread use during the war. Although there was substantial growth in the use
of tests between World Wars I and II, the expansion accelerated after World War
II, particularly from the 1960s to the present.
In recognition of the large disparities in educational
opportunities and in student performance, considerable attention was focused on
compensatory education in the mid-1960s. The Elementary and Secondary Education
Act (ESEA) of 1965 put in place the largest and most enduring of these federal
efforts, commonly known as Title I. Although it was reshaped in
some important ways in both the 1994 and 2001 reauthorizations (the Improving
America's School Act of 1994 and the NCLB Act of 2001), Title I continues to be
the largest federal program of assistance to elementary and secondary education.
The congressional demands for evaluation and accountability for the funds
distributed under Title I of ESEA, as well as several other programs of the
1960s, proved to be a boon to test publishers. The testing requirements of the
Title I Evaluation and Reporting System (TIERS) contributed to a substantial
expansion in the use of published standardized tests. Rather than administering
tests once a year in selected grades, TIERS encouraged the administration of
tests in both the fall and the spring for Title I students, to evaluate the
progress of students participating in the program. Although little use was made
of the aggregate test results, these TIERS requirements relieved for a time the
pressure from demands for accountability for this major federal program.
However, dissatisfaction with the progress made in student achievement,
especially for the students that Title I is intended to serve, contributed to
the substantial increases in testing and accountability provisions in the NCLB
Act of 2001 that were briefly described earlier.
Minimum-Competency
Testing
Perceived shortcomings of the skills of high school
graduates led to the rapid introduction of additional testing requirements in
the 1970s and early 1980s. Minimum-competency testing (MCT) reforms swiftly
spread from state to state. In a single decade (1973-1983), the number of
states with some form of MCT requirement went from 2 to 34. As the name
suggests, MCT programs focused on basic skills that were considered to be the
minimal essentials for the next grade level or a high school diploma. Minimal
basic skills, although not easy to define or defend, were widely accepted as a
reasonable requirement for high school graduation. However, in a landmark case
in Florida (Debra P. v. Turlington, 1979), the court ruled that students had to
be given adequate notice of the testing (2 years) and that the state had to
demonstrate that the students had an opportunity to learn the material tested.
Although several of the MCT high school graduation requirements instituted by
states in the 1970s and 1980s are still in place, the recognition of the need
to consider more than minimum levels of performance soon led to other testing
and assessment demands.
A Nation at Risk
Another wave of test-based reforms followed closely on the
heels of the MCT movement. This round of reform efforts stressed school-level
accountability and attempted to push beyond minimums. The test-based reforms of
the middle and late 1980s were encouraged by a number of reports on the status
of education that were completed in 1983. A Nation at Risk: The Imperative for
Educational Reform, issued by the National Commission on Excellence in
Education (1983), was probably the best known and most influential of these.
That report featured tests in two ways: (1) to document shortcomings in student
achievement, and (2) as a recommended mechanism of reform.
All 50 states introduced some type of educational reform in
the wake of A Nation at Risk. Consistent with the emphasis of the report,
testing was central in the majority of state legislated reform efforts. Indeed,
in many cases, externally mandated tests were relied on as the major instrument
of reform. Many of the reforms involved an expansion of the use of test results
for accountability purposes. Accountability programs took a variety of forms
but shared the common characteristic that they increased real or perceived
stakes of results for teachers and educational administrators.
Building and district "report cards" showing
student test performance were used to make educators more accountable for
student achievement. As intended, test-based comparisons of schools and
districts placed considerable pressure on school superintendents, principals,
and teachers to "get the scores up." Test preparation became a major
component in the instructional programs of many schools. Teachers reported in
surveys that, as the result of the pressure, they focused their instruction on
the skills tested, taught test-taking skills, and used the format of the
externally mandated test in their own tests. The focus sometimes narrowed to
the specific topics known to be on the mandated test, and practice was provided
on items similar to those in the test. Under high-stakes testing conditions,
topics corresponding to important instructional objectives not included on the
test often were found to fall by the wayside as the testing date approached
(see, for example, Nolan, Haladyna, &amp; Hass, 1992; Shepard, 2000; Smith
& Rottenberg, 1991).
Although some states and districts contracted for or
developed their own tests, the accountability systems of the 1980s relied
heavily on published standardized tests. Upward trends in student achievement
were reported by an overwhelming majority of states and districts during the
first few years of accountability testing programs. A physician, John Cannell
(1987), forcefully brought to public attention what came to be known as the
"Lake Wobegon effect," that is, the incredible finding that
essentially all states and most districts were reporting that their students
were scoring above the national norm. The Lake Wobegon effect received
considerable publicity. This finding that almost all states using standardized
tests in the elementary grades were reporting that the majority of their
students were above the national average has generally been attributed to a
combination of placing great pressure on getting scores up and the reuse of the
same test with old norms year after year (Linn, Graue, & Sanders, 1990).
The Lake Wobegon effect raised serious questions about the credibility of test
results and about the possible negative side effects of high-stakes
accountability uses of standardized test results.
Standards-Based
Reform
The wave of reform in the 1990s continued to emphasize
accountability but added some significant new features. Perhaps the four most
notable of the new features are the emphasis on (a) adopting ambitious
"world-class" standards that both shape the assessments and define levels
of acceptable performance; (b) using forms of assessment that require students
to perform more substantial tasks (e.g., construct extended essay responses and
conduct experiments) rather than only select answers on multiple-choice items;
(c) the attachment of high-stakes accountability mechanisms for schools,
teachers, and sometimes students; and (d) the inclusion of all students.
Content and
Performance Standards.
Educational improvement must begin with a clear idea of what
students are expected to learn. This premise underlies the standards-based
efforts to improve American education. Standards are statements that specify
what should be taught and what students should learn. Standards specify goals
or expectations for students, but they do not mandate a particular curriculum,
textbook, or instructional approach. There may be many ways of achieving the
ends identified in the standards. The key to effective standards, however, is
that they be specific enough to identify what students need to learn and to
determine when the standards have been met.
These two purposes, identifying what students need to learn
and determining when the standards are achieved by students, correspond to the
two types of standards that are commonly distinguished: content standards and
performance standards. Content standards specify the "what," whereas
performance standards specify the "how well." That is, content standards
are public statements that specify what students should know and be able to do
in specific content or subject-matter areas at identified points of their
education (e.g., grade 4 reading or grade 8 mathematics). Performance standards
are dependent on content standards but add the specification of the level of
performance that students are expected to achieve in relationship to the
content standards. In other words, they answer the question, How good is good
enough? Ideally, "they indicate both the nature of the evidence (such as an
essay, mathematical proof, scientific experiment, project, exam, or combination
of these) required to demonstrate that content standards have been met and the
quality of student performance that will be deemed acceptable (what merits a
passing or an 'A' grade)" (National Educational Goals Panel, 1991, p. 22).
We will have more to say about standards in Chapter 3. For
present purposes, however, it is sufficient to note that with encouragement
from content-specific teachers' associations (e.g., the National Council of
Teachers of Mathematics and the National Council of Teachers of English) and
the federal government through the Goals 2000 legislation, almost every state
developed and adopted some form of content standards during the 1990s. In many
states, the content standards have served as the basis for developing
assessments that are intended to be "aligned" with the standards.
Performance-Based
Assessment
Coinciding with and reinforced by the movement to develop
content and performance standards was the substantial press throughout the
1990s for the development and use of "new" approaches to assessment,
variously referred to as alternative assessment, authentic assessment, direct
assessment, or performance-based assessment. Each qualifier stresses a different
aspect of the assessments: authentic stresses an emphasis on
"real-world" tasks relevant outside the classroom, alternative
stresses something other than the familiar multiple-choice test, and
performance stresses the actual doing of a task (e.g., writing an essay or
doing a hands-on experiment) rather than merely recognizing or knowing a right
answer. Whatever the qualifier, assessment is intended to suggest a shift from
fixed-response, machine-scored tests to the use of tasks requiring students to construct
responses.
Calls for the increased reliance on performance-based
assessment generally rest on three premises that were articulated by Resnick
and Resnick (1992). The first premise is characterized by the acronym WYTIWYG
(What You Test Is What You Get). The second premise is the converse of this:
"You do not get what you do not assess." The third premise is a logical
conclusion that follows from acceptance of the first two: "Make
assessments worth teaching to" (Resnick & Resnick, 1992, p. 59).
These premises are coupled with an acceptance of the
argument that high-stakes testing and assessment shapes instruction and student
learning. Rather than trying to change that connection, proponents of
performance-based assessment argue that it is assessments that need to be
modified not only to eliminate the negative effects of teaching to the
assessment but also to make that activity have the desired result of enhanced
student learning.
High-Stakes
Accountability Mechanisms.
Attaching high stakes to the results of assessments,
although not new, has become increasingly popular with policymakers in states
and districts throughout the country. The high-stakes accountability provisions
of the NCLB Act of 2001 differ from the past in that they come from the federal
level and now apply to all states, but are in keeping with the trend toward
ratcheting up the stakes attached to test results for schools apparent in many
states for most of the 1990s. More often than not, the stakes have applied
primarily to educators by using the results of tests to determine rewards and
sanctions for schools. For example, some programs identified schools that
received not only special recognition but also monetary rewards for improved
performance of students on the mandated state or district assessments. In some
instances, those monetary rewards could be shared by teachers in the school.
For schools where performance on assessments did not improve or even declined,
various types of sanctions have been imposed by states. Examples of sanctions
imposed through assessment-based accountability systems include bringing in an
external team to oversee the school, reassigning teachers to other schools, and
removing principals. Under the provisions of the NCLB Act, tutoring, expanded
time for instruction either after school or during the summer, and public
school choice may be provided to students in low-performing schools. Schools
where students score below established target levels may be restructured and
teachers and administrators may be replaced.
The stakes for individual students have also been increased
in a number of states in recent years. Because of phase-in schedules, the
requirements for students that may affect high school graduation, the type of
diploma a student receives, or grade-to-grade promotion have not been fully
implemented in all cases, but the movement toward increased requirements is
widespread. Tougher grade-to-grade promotion and graduation requirements
appear at first blush like a replay of the minimum-competency testing movement
of the late 1970s and early 1980s. They differ, however, in that the new
requirements that are envisioned are intended to set more ambitious
"world-class" performance standards.
Inclusion of All
Students
A prominent feature of standards-based educational efforts
is the emphasis on high expectations for all students. Past practices of
excluding a relatively large percentage of students from state and district
standardized test programs because of limited English proficiency, because they
recently moved to a state or district, or because of student disabilities are
incompatible with the push to include all students. The goal of including all
students in the assessment requires the use of multiple strategies. First, many
students who would have been excluded in the past can in fact participate in
assessments without any special considerations or adaptations of procedures.
For those students, only a commitment to include is needed rather than allowing
students to be excluded when convenient. Inclusion is a prominent part of the
requirements of the NCLB Act. Schools that test fewer than 95% of their eligible
students will be placed in the needs-improvement category regardless of how
well the students who are tested perform on the tests.
Many students who would likely have been excluded in the
past can be included with minor accommodations of the assessment. Some
accommodations, such as extended time to complete the assessment, are ones that
may lead to changes in assessment conditions that improve the validity of the
assessment for all students. For example, when speed is not the issue, untimed
assessments or ones with generous time limits may increase validity and
fairness for all students and at the same time reduce the need to offer
extended time to complete the assessment for some students.
More extensive modifications are clearly needed for some
students to meaningfully participate in the assessment. English-language
learners who are proficient in another language, for example, may have
knowledge and skills in a content area other than reading or writing in English
that can be assessed in the student's first language but cannot be reasonably
assessed in English. Accommodations are also needed for some students with
disabilities. The nature and extent of accommodation needed clearly depends on
the kind and severity of a student's disability. Large-print and Braille
versions of an assessment are obvious adaptations for students with visual
impairments. Students with some types of physical handicaps may require someone
to record their responses for them. However, by far the largest fraction of
students who were excluded from assessments in the past because of disabilities
that require individual education plans (IEPs) are students with learning
disabilities.
Many students with IEPs for learning disabilities are likely
to be able to take parts or all of standards-based assessments without special
accommodations. No single type of accommodation will be appropriate for all
those students with learning disabilities that require some kind of
accommodation. Rather, several different approaches are likely to be needed.
Among the more common suggested accommodations for students with learning
disabilities are shorter assessments, more time for completing assessment
tasks, oral reading of directions, assessment in small groups or individually,
and oral responses to tasks. Classroom teachers face issues of making
appropriate accommodations in both their instruction and classroom assessment
in working with students with IEPs. Although the IEP is the essential source
for guiding the decisions that teachers must make in these regards, it is clear
that considerable professional judgment on the part of teachers is also
required. The guiding principle for accommodations on an assessment, whether it
is externally mandated or one developed by the teacher for classroom use, is
that the accommodations be comparable to the ones required by the student's IEP
for instruction.
No Child Left Behind
The fifth consecutive decade of test-based educational
reform was ushered in by the NCLB Act of 2001. Although this act is a
reauthorization of the ESEA of 1965, its extensive testing requirements apply
to all public schools, not just to those receiving Title I funds. The NCLB Act
reinforces the role of content and performance standards. Specifically, the law
requires each state to "demonstrate that the State has adopted challenging
content standards and challenging student performance standards that will be
used by the State" (P.L. 107-110, Section 1111[b][1][A]).
As noted previously, almost every state had adopted content
standards, and most states had some tests in place for selected grades and
subjects that were intended to measure their content standards. Most states had
set performance standards on their tests. There is great variability among the
states, however, in the breadth, depth, and specificity of their content
standards. There is also considerable variability in the stringency of the
performance standards that have been set by states (for a discussion of
variability in state performance standards, see Linn, 2003). To satisfy the
requirements of the NCLB Act, however, states were required to submit plans
justifying the claim that their content standards are challenging and that they
have in place challenging performance standards, referred to in the law as
student academic achievement standards. The student academic achievement
standards must be "aligned with the State's academic content standards...
describe two levels of high achievement (proficient and advanced) that
determine how well children are mastering the State academic content standards;
and describe a third level of achievement (basic) to provide complete
information about the progress of lower-achieving children toward mastering the
proficient and advanced levels of achievement" (P.L. 107-110, Section
1111[b][1][D][i]).
The performance standards set by states for use under the
provisions of the NCLB Act are of consequence because they are used to set
intermediate annual achievement targets for student achievement such that all
students will be at the "proficient" level or higher by the 2013-2014
school year. The end goal of 100% proficient in 2014, together with state-established
starting points in 2002, is used to define "adequate yearly progress"
(AYP) targets. The comparison of student achievement for a school in reading
or language arts and in mathematics to the AYP targets in those subjects
determines whether a school will be identified as "needs improvement" and be
subject to the sanctions that apply to schools so designated.
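The arithmetic behind such targets can be made concrete with a short sketch. This is a hypothetical illustration only: it interpolates linearly from a made-up 2002 starting point of 40% proficient to the 100% goal in 2014, whereas actual state plans under NCLB used a variety of schedules (including stair-step patterns), and the starting percentages were set state by state.

```python
# Hypothetical sketch: deriving intermediate AYP-style annual targets by
# linear interpolation from a state's 2002 starting point to 100% proficient
# in 2014. Real state plans varied; the 40% starting point is invented.

def annual_targets(start_pct, start_year=2002, end_year=2014):
    """Return a dict mapping each year to its percent-proficient target."""
    span = end_year - start_year
    return {year: start_pct + (100.0 - start_pct) * (year - start_year) / span
            for year in range(start_year, end_year + 1)}

targets = annual_targets(40.0)
print(targets[2008])  # midpoint of the schedule
```

A school whose percent proficient in a subject falls below the target for the current year would, under this simplified scheme, miss AYP for that subject.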
The Question of Impact
The degree to which the increased pressure helped or hurt
education is controversial. Proponents of high-stakes testing argue that the
tests measure objectives that are important for students to learn and that it
is desirable for teachers to focus their attention on those objectives. They
point with pride to the increases in test scores that were observed in state and
district testing programs since the 1990s.
Critics of the increased emphasis on test results argue that
an overreliance on test results distorts education. They argue that important
objectives are ignored when they are not included on the tests that count.
Moreover, they claim that the increased scores paint a misleading picture
because teachers teach the specifics of the tests rather than more general
content domains (Koretz, 2005). The proper role of externally mandated tests
and assessments in directing instruction is an issue with which you will
struggle, especially if the current emphasis on holding educators accountable
for results on these instruments continues. What sort of preparation for taking
tests or assessments should students have? How much time should be spent in
test preparation activities, such as taking practice tests and learning
test-taking strategies? To what degree should tested objectives be given
emphasis at the expense of objectives that are not tested? These are important
educational questions that have no simple answers. They require thought and
reflection on the part of individual teachers and principals.
Consider, for example, the seemingly simple issues of
teaching to the test and “teaching the test itself.” We are almost always
interested in making inferences that go beyond the specific test that is used.
We would like, for example, to be able to say something about the degree of a
student’s understanding of mathematical concepts based on the score that is
obtained on a math concepts test. Because the items on a test only sample the
domain of interest, the test score and the inference about the degree of
understanding are not the same. A generalization is required, and it is the
generalization, not the test score per se, that is important. When the specific
items on the test are taught, the validity of the inference about the student’s
level of achievement is threatened. Teaching the specific test items is apt to
result in an exaggerated view of student achievement in the overall domain of
interest.
Teaching to the test (that is, emphasizing the objectives
that are on the test without teaching the specific test items) has both
advantages and disadvantages. Many schools and districts have made teaching to
the test more systematic by the introduction of “benchmark tests.” These are
tests that are similar to and keyed to the same content standards as the state
mandated test, but are administered several times during the year prior to the
administration of the state test. The benchmark test results are used to
identify students who are likely to have difficulty on the state test, and
teachers are expected to devote special attention to those students to help
them perform at the proficient level or above on the state test. Inasmuch as the
objectives on the test are important, emphasizing those objectives with or
without the use of benchmark tests provides a desirable focus. On the other
hand, multiple-choice standardized tests do not cover all the important
learning objectives. Hence, narrowing to only those objectives that are covered
would be detrimental for education as a whole.
TECHNOLOGICAL ADVANCES IN TESTING AND ASSESSMENT
With the rapid growth in the availability and power of
relatively low-cost microcomputers, it is not surprising that the use of
computers to administer tests is becoming increasingly common. Indeed, some
readers of this text may have taken or be planning to take the
computer-administered Graduate Record Examination, the Academic Skills
Assessments of the Praxis Series: Professional Assessments for Beginning Teachers, or
one of the other computer-based tests offered by the Educational Testing
Service. A special issue of Education Week titled "Pencils Down:
Technology's Answer to Testing," published in May 2003, was devoted to the
use of computer technology to administer and score tests. The editors'
introduction to the special issue argues that there has been a convergence of
education and technology that has vaulted "computer-based testing into the
headlines, raising important questions about whether this new mode of
assessment is more useful than traditional paper-and-pencil exams" (The
Editors, Education Week, May 8, 2003, p. 8). The editors note that the NCLB Act
has created a new stimulus for schools to find more efficient ways of testing
with faster turnaround than is possible with paper-and-pencil testing
technology. According to Education Week, 12 states and the District of Columbia
had already launched some form of computer-based testing or pilot program when
the special issue appeared. The use of computer-based testing has continued to
grow since the publication of the 2003 special issue of Education Week. During
the 3 years from May 2003 to May 2006, the number of states with some form of
computer-based assessment of students increased from 12 to 22 (Swanson, 2006,
p. 55).
Using a computer to administer items from a paper-and-pencil
test can have several advantages. Rather than waiting several weeks to receive
test results, scores can be obtained immediately. Computer-based testing also
provides the means of tailoring the test to individual students by using
performance on previously administered items to select the next item to
administer. Although the start-up cost can be substantial, online,
computer-based testing can also result in substantial reductions in printing,
distribution, and scoring expenses. The catch, however, is the need to have
more computers available for use in the schools than are currently available in
many schools. Judging from a recent special issue of Education Week on
technology in the schools, which indicated that the national average was 3.8
students per instructional computer (Swanson, 2006, p. 51), it is reasonable
to believe that the typical school now has enough computers to make it feasible
to administer computer-based tests.
Using computers to administer tests that count is one use and, in the short
term, may not be the most common use of computer-based testing. There is an
almost insatiable demand for the use of practice tests and other forms of test
preparation. The flexibility and immediate score reporting offered by on-line
testing make the technology especially appealing for purposes of test
preparation. As discussed in later chapters, computer-based testing also has substantial
potential for teachers for their own classroom assessments.
One change that has already been widely implemented is the
use of the computer to administer adaptive tests, that is, tests in which the
choice of the next item to administer is based on previous responses of the
test taker. Adaptive tests can enhance both the efficiency with which
information is handled and the quality of that information. The design of an
adaptive test usually starts with the administration of an item that is expected
to be of middle difficulty. The second and subsequent items to be administered
are determined by the responses of the test taker. In general, if a test taker
answers an item correctly, then the computer will select a somewhat more
difficult item to administer next. Conversely, a somewhat easier item is
administered following an incorrect response. Testing is stopped when the
estimates of the individual’s performance reach some predetermined level of
precision or when some pragmatic maximum number of items has been administered.
A variety of computer software is available for administering tests (see the
box “Illustrative On-Line and Adaptive Testing Software”).
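The item-selection loop just described can be sketched in a few lines of Python. This is a deliberately simplified illustration: the item bank, the 1-9 difficulty scale, and the fixed-length stopping rule are invented for the example, and a real adaptive test would estimate ability with item response theory rather than simply stepping difficulty up and down.

```python
import random

# Hypothetical item bank: items grouped by difficulty, 1 (easiest) to 9 (hardest).
ITEM_BANK = {d: [f"item_{d}_{i}" for i in range(20)] for d in range(1, 10)}

def adaptive_test(answer_item, max_items=10):
    """Administer a simplified adaptive test.

    answer_item(item) -> True if the test taker answers correctly.
    Returns the list of (item, correct) pairs administered.
    """
    difficulty = 5  # start with an item of middle difficulty
    administered = []
    for _ in range(max_items):  # pragmatic maximum number of items
        item = random.choice(ITEM_BANK[difficulty])
        correct = answer_item(item)
        administered.append((item, correct))
        if correct:
            difficulty = min(difficulty + 1, 9)  # harder item after a correct answer
        else:
            difficulty = max(difficulty - 1, 1)  # easier item after an incorrect answer
    return administered
```

A production system would also maintain a running estimate of the test taker's ability and stop when the standard error of that estimate falls below a predetermined level of precision, rather than always administering the maximum number of items.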
It has been demonstrated that adaptive testing can enhance
the efficiency and the precision with which certain types of knowledge,
skills, and abilities are measured. In some cases, adaptive tests can obtain
the same level of reliability of measurement in just over half the time
required for a conventional paper-and-pencil test. If adaptive tests only administer items of the type already in use in a better way, however, the full
potential of the use of computers for the administration of tests will not be
realized. The attraction of computers as testing devices is not limited to
doing better what we already do. Their potential to measure proficiencies that
are not measured well by conventional paper-and-pencil tests is even more
appealing.
In the long run, the potentially more significant changes in
testing as the result of computer-based testing depend on using the computer to
do things that cannot be reasonably accomplished with paper-and-pencil tests.
The technology opens the door to the use of video simulations or problem
settings where students access information from the Web or a CD in ways similar
to instructional use of that technology.
Simulations can be used to present test takers with problems that have
greater realism and more apparent relevance than problems that are commonly found on paper-and-pencil tests. Computer-based examinations that have been used for some time in medical education and in certification testing for physicians provide an illustration of the type of simulation tests that are apt to be seen in the future as computerized test administration becomes more common.
Computer problems simulate aspects of the job of a physician. The test taker is initially presented with a limited set of information about a patient, such as a verbal description of symptoms of the type that a patient might provide at the start of a visit. The test taker then has a variety of options, such as getting a patient history, ordering laboratory tests, or deciding on a course of treatment. Requested information is provided,
and new options can be followed by the test taker until a diagnosis is made and
a course of treatment prescribed.
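A branching case of this kind can be represented as a small data structure plus a loop that dispenses information on request. The patient case, the option names, and the scoring below are entirely invented for illustration; real medical simulation examinations are far richer.

```python
# Minimal sketch of a branching problem simulation.
# The case data, option names, and diagnosis are invented for illustration.
CASE = {
    "presenting": "Patient reports chest pain and shortness of breath.",
    "options": {
        "history": "History: pain began during exercise; patient smokes.",
        "ecg": "ECG: ST-segment elevation in leads II, III, aVF.",
        "cbc": "Complete blood count: within normal limits.",
    },
    "diagnosis": "myocardial infarction",
}

def run_simulation(choose_action):
    """Present information until the test taker commits to a diagnosis.

    choose_action(findings) -> an option key requesting more information,
    or ("diagnose", answer) to end. Returns (answer, actions_taken) so a
    scorer can judge both the product and the process of the solution.
    """
    findings = [CASE["presenting"]]
    actions = []
    while True:
        action = choose_action(findings)
        if isinstance(action, tuple) and action[0] == "diagnose":
            return action[1], actions
        actions.append(action)
        findings.append(CASE["options"][action])  # provide requested information

def score(answer, actions):
    """Score the product (correct diagnosis) and the process (efficiency)."""
    return {"correct": answer == CASE["diagnosis"], "steps": len(actions)}
```

Because the simulation records every action, the scorer can weigh not only whether the diagnosis was correct but also how efficiently the test taker reached it.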
Computer-administered problem simulations along these lines have
potential advantages over current paper-and-pencil tests in many content areas. They provide a means of going beyond the sort of factual recall that is sometimes overemphasized on paper-and-pencil tests. They focus attention on the
use of information to solve realistic problems. They can help assess not only
the product of a student's thinking but also the process that the student uses
to solve a problem, including the way in which the problem is attacked, the
efficiency of the solution, and the number of hints that may be needed to solve
the problem.
PUBLIC CONCERN ABOUT TESTING AND ASSESSMENT
Decisions about the selection, administration, and use of
educational tests and assessments are no longer left to the educator alone. The public has become an active and vocal partner. As discussed previously, externally mandated testing has been imposed on the schools by states or districts as a result of the public demand for evidence of school programs' effectiveness. In some states, the public at large has participated, through
selected groups, in determining the objectives and standards of the statewide
assessment programs. In states where testing and assessment has been made the
responsibility of the local school district, parent groups often help shape the
programs. It is interesting to note that the concern of state legislators and
the general public with the quality of school programs has created a demand for
more testing and assessment in the schools, not less.
During the expansion of testing programs, the concern has
been that there may be too much testing in the schools. In addition to taking the
tests in the local school program, high school students, for example, may also
have to take one or more state competency tests and several college admissions
tests. It is feared that the heavy demand on their time and energy might
detract from their schoolwork and that the external testing programs may cause
undesirable shifts in the school’s curriculum. When teachers and schools are
judged by how well students perform on state tests and assessments and by how
many students are accepted by leading colleges, direct preparation for the
tests and assessments is likely to enter into classroom activities and thereby
distort the curriculum.
Probably the greatest public concern has been with the
social consequences of testing, especially perceived threats to the rights and
opportunities of individuals and groups. This concern has shown up in the form
of attacks on standardized tests and the testing industry: new legislation affecting testing, calls for a moratorium on standardized testing, and charges that tests are biased and discriminatory. There are certainly some good reasons for the public's concern with the social consequences of testing. It is
important, however, to distinguish between negative consequences for
individuals or groups that are due to faults in the tests or assessments and
ones that are caused by the misinterpretation and misuse of test scores.
Discussed next are four areas of concern resulting in
controversy over testing and assessment: (1) the nature and quality of tests,
(2) the effects of testing on students, (3) fairness to minorities, and (4)
gender fairness. Some of these issues, particularly the effects of testing and
questions of fairness and bias, are also considered in subsequent chapters, in
part because they are fundamental issues for all testing and assessment and in
part because they can be dealt with more adequately after developing some
fundamental concepts of testing and assessment, such as validity.
Nature and Quality of Tests
A long-standing criticism of standardized tests is directed
primarily at the use of multiple-choice items. In the early 1960s, critics such as Hoffman (1962) contended that the multiple-choice item penalized the more
intelligent original thinkers. He supported his claims by reviewing items from
standardized tests and showing how some highly able and creative students were
likely to see implications in the items not thought of by the test author(s)
and thus question the correctness of the keyed answers. Although Hoffman
obviously was able to discover some defective items that appeared in
standardized tests, his criticisms seemed to go well beyond the evidence
presented. He did, however, encourage test publishers to supplement statistical
item analysis with a more careful, logical analysis of test items.
Multiple-choice questions continue to bear the brunt of
criticisms made by both specialists in educational measurement who seek ways
of improving educational tests and critics who would like to eliminate
standardized testing. For example, Frederiksen (1984), a major contributor to
the field of measurement, has argued that multiple-choice items place too much
emphasis on “well structured problems” when problems of greatest interest both
in and out of school are often “ill structured” and where skills such as
problem identification and hypothesis generation are often as important as
problem solution. Such criticisms have led to increased emphasis on open-ended
questions and the design of computer simulation tests.
Another criticism, that tests measure only limited aspects of an individual, has also received considerable attention. This
criticism is well founded. Tests do measure specific and limited samples of
behavior. Aptitude tests typically measure samples of verbal and quantitative
skills useful in predicting school success, and achievement tests measure
samples of student performance on particular learning tasks useful in assessing
educational progress. Both fulfill their limited functions well, but the
difficulty arises when we expect more of them than was intended. For example,
both the advocates and the critics of college admissions testing sometimes
assume that the tests measure all that is needed for success in college and
beyond. This tendency to read into test scores more than they really tell has
been called the "whole person fallacy" by W. W. Turnbull, the former president
of the Educational Testing Service.
Effects of Testing on Students
Critics of testing argue that testing is likely to have
certain undesirable effects on students. Some of the most commonly mentioned
charges directed toward the use of aptitude and achievement tests are listed
here with brief comments.
Criticism 1: Tests Create Anxiety. There is no doubt that
anxiety increases during testing. For most students, it motivates them to
perform better. For a few, test anxiety may be so great that it interferes with
test performance. These typically are students who are generally anxious, and
the test simply adds to their already high level of anxiety. A number of steps
can be taken to reduce test anxiety, such as thoroughly preparing for the test,
taking practice exercises, and using liberal time limits. Fortunately, many
test publishers in recent years have provided practice tests and shifted from
speed tests to power tests. This should help, but it is still necessary to
observe students carefully during testing and to discount the scores of overly
anxious students.
Criticism 2: Tests Categorize and Label Students. Categorizing
and labeling individuals can be a serious problem, particularly when those
labels are used as an excuse for poor student achievement rather than a means
of providing the extra services and help to ensure better achievement. It is
all too easy to place individuals in pigeonholes and apply labels that determine,
at least in part, how they are viewed and treated. Classifying students in
terms of levels of mental ability has probably caused the greatest concern in
education. When students are classified as mentally retarded, for example, it
influences how teachers and peers view them, how they view themselves, and the
kind of school programs they receive. When students are mislabeled as
mentally retarded, as has been the case with some racial and ethnic minorities,
the problem is compounded. At least some of the support for mainstreaming
handicapped students has come from the desire to avoid the categorizing and
labeling that accompanies special education classes.
Classifying students into various types of learning groups can make more efficient use of the teacher's time and the school's resources. However, when grouping, teachers must take into account that tests measure only a limited sample of a student's abilities and that students are
continuously changing and developing. By keeping the groupings tentative and flexible
and regrouping for different subjects (e.g., reading and math), teachers can
avoid most of the undesirable features of grouping. It is when the categories
are viewed as rigid and permanent that labeling becomes a serious problem. In
such cases, it is not the test that should be blamed but the user of the test.
Criticism 3: Tests Damage Students’ Self-Concepts. This is a
concern that requires the attention of teachers, counselors, and other users of
tests. The improper use of tests may indeed contribute to distorted
self-concepts. The stereotyping of students is one misuse of tests that is
likely to have an undesirable influence on a student’s self-concept. Another is
the inadequate interpretation of test scores that may cause students to make
unwarranted generalizations from the results. It is certainly discouraging to
receive low scores on tests, and it is easy to see how students might develop a
general sense of failure unless the results are properly interpreted.
Low-scoring students need to be made aware that aptitude and achievement tests
are limited measures and that the results can change. In addition, the
possibility of overgeneralizing from low test scores will be lessened if the
student’s positive accomplishments and characteristics are mentioned during the
interpretation. When properly interpreted and used, tests can help students
develop a realistic understanding of their strengths and weaknesses and thereby
contribute to improved learning and a positive self-image.
Criticism 4: Tests Create Self-Fulfilling Prophecies. This
criticism has been directed primarily toward intelligence or scholastic
aptitude tests. The argument is that test scores create teacher expectations
concerning the achievement of individual students; the teacher then teaches in
accordance with those expectations, and the students respond by achieving to
their expected level, a self-fulfilling prophecy. Thus, those who are expected
to achieve more do achieve more, and those who are expected to achieve less do
achieve less. This so-called Pygmalion effect received strong support from a
widely heralded study by Rosenthal and Jacobsen (1968), even though the study
was later challenged by other researchers (Elashoff & Snow, 1971; West
& Anderson, 1976). The belief that teacher expectations enhance or hinder a
student's achievement is widely held, and the role of testing in creating these
expectations is certainly worthy of further research.
In summary, there is some merit in the
various criticisms concerning the possible undesirable effects of tests on
students; but more often than not, these criticisms should be directed at the
users of the tests rather than the tests themselves. The same persons who
misuse test results are likely to misuse alternative types of information that
are even less accurate and objective. Thus, the solution is not to stop using
tests but to start using tests and other sources of information more effectively. When tests are used in a positive manner, that is, to help students improve their learning and development, the consequences are likely to be desirable rather than undesirable.
Fairness of Tests to Minorities
The issue of test fairness to racial and ethnic minorities
is a critical issue for any assessment program. Fairness has received increasing
attention over the years in the literature on testing and assessment. Concern
with the fairness of tests has paralleled the general public concern with
providing equal rights and opportunities to all U.S. citizens. An analysis of the discussions of fairness in testing and assessment makes it evident that the concept of fairness is used in many different ways. Testing and assessment
professionals have often equated fairness with an absence of bias, considering, for example, a test use as fair if predictions of nontest performance are
comparable for different racial or ethnic groups. Fairness may also focus on
the equity in treatment of persons from different groups in the assessment
process. This second notion of fairness is sometimes referred to as procedural
fairness and involves questions such as the following: Do test takers have
an equal opportunity to show what they
know and can do on a test? Are responses to essay questions graded comparably
by raters without regard to the test taker's group membership? A third meaning
of fairness would require that students be provided with an adequate or equal
opportunity to learn the material that is assessed. A fourth meaning of
fairness that is common in some popular uses of the term is the equality of
results. From this perspective, a test would be considered fair only if the
average performance of different groups (e.g., African Americans, Latinos, and
whites) was the same. These four
conceptions of fairness (absence of bias, procedural fairness, opportunity to learn, and equality of results) are consistent with the discussion of the topic
in the latest revision of the Standards for Educational and Psychological
Testing (AERA, APA, & NCME, 1999). The different perspectives can lead to
quite different conclusions about the fairness of any test or assessment. The
fourth conception, equality of results, is incompatible with other tenets of
testing and assessment, such as the goal of getting a valid and reliable
measure of what students know and can do regardless of their background or
group membership. Inasmuch as different groups of students differ in the
instruction they have received, their experiences both in and out of school,
and their interest and effort, one cannot expect a valid measure to show no
difference between the groups. In other words, a test or assessment that shows
average differences in scores for minority and majority group students may
fairly reflect the consequences of unfair treatment of minorities by the
society. An absence of bias and procedural
fairness is essential for an assessment to have a high level of validity in
measuring the knowledge, skills, and understandings that it is intended to
measure. In other words, those characteristics are essential to avoid unfairness
due to faulty measurement. Whether
professionals in testing and assessment would consider adequacy or equality of
opportunity to learn as essential for fairness generally depends not only on
the instrument but also on the uses and interpretations that are made of the
results. Thus, it would be considered fair to use results of an achievement
test to document inequalities in education where there were substantial
differences in opportunity to learn. But it would not be fair to use the test
results as the basis for rewards or sanctions for students under those
conditions. Nor would it be fair to infer that the group that was not provided
with an adequate opportunity to learn was intellectually inferior or incapable
of learning the material on the assessment. It is often useful to distinguish
between (a) the possible presence of bias in the test content and (b) the possible unfair use of
test results. These factors are undoubtedly related, but we discuss them here
separately.
In evaluating the possible presence of bias in test content,
it is important to distinguish between the performance the test is intended to
measure and factors that may distort the scores unfairly. In testing
mathematics skills with story problems, for example, it is important to keep
the reading level low so that the test scores are not contaminated by reading
ability. If the reading level is too difficult, poor readers will obtain lower
scores than warranted, and the test will be biased against them. Because a
particular minority group may have a disproportionately large number of poor
readers, the test may be biased against that minority group. But if the test of
mathematics skills is not contaminated by reading or other factors, low scores
will simply indicate lack of mathematics skills. Such a test is fair to
everyone even if the test scores indicate group differences in the mastery of
mathematics.
In the past, standardized tests typically emphasized content
and values that were more familiar to white middle-class students than to
racial or ethnic minorities and students of lower socioeconomic status. Thus,
the content of some scholastic aptitude tests contained vocabulary, pictures,
and objects that minorities had less opportunity to learn. Similarly, some
reading tests contained stories and situations that were unrelated to their
life experiences. Racial and ethnic minorities were seldom represented in
pictures, stories, and other test content, and when they were, it was sometimes
in an offensive manner. How much these types of bias might have lowered the
scores of individual students is impossible to say, but most persons familiar
with testing would acknowledge some adverse effect. Fortunately, test
publishers have taken steps to correct the situation. Test publishers now
employ staff members representing various racial and cultural minorities, and
new tests are routinely reviewed for content that might be biased or offensive
to minority groups. Statistical analysis is also being used to detect and remove
test items that function
differently for different groups of test takers (Cole &
Moss, 1989). The most controversial problems concerning the fair use of tests
with minority groups are encountered when aptitude tests are used as a basis
for educational and vocational selection. Much of the difficulty here is with
the definition of fair test use. One view is that a test is fair or unbiased if
it predicts as accurately for minority groups as it does for the majority
group. This traditional view, which favors a common cutoff score for selection,
has been challenged as being unfair to minority groups because they often earn
lower test scores, and thus a smaller proportion of qualified individuals tends
to be selected. Alternative definitions of test fairness favor some type of
adjustment, such as separate cutoff scores or bonus points, for some
minorities.
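The traditional prediction-based definition can be checked empirically by fitting one prediction line for everyone and then comparing how it errs for each group, a simplified version of what is known as differential prediction analysis. The helper functions below are an illustrative sketch, not a standard API, and any data fed to them would come from a real validation study.

```python
# Sketch of a check on the "equal prediction accuracy" notion of fairness:
# fit a single regression of outcome on test score, then compare the mean
# prediction error (residual) for each group. If one group's mean residual
# is clearly nonzero, the common line systematically mispredicts that group.
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def mean_residual_by_group(scores, outcomes, groups):
    """Mean (actual - predicted) outcome for each group under a common line."""
    slope, intercept = fit_line(scores, outcomes)
    totals, counts = {}, {}
    for x, y, g in zip(scores, outcomes, groups):
        resid = y - (slope * x + intercept)
        totals[g] = totals.get(g, 0.0) + resid
        counts[g] = counts.get(g, 0) + 1
    return {g: totals[g] / counts[g] for g in totals}
```

A clearly positive mean residual for a group means the common line underpredicts that group's actual performance; under this definition of fairness, that would be evidence of biased use of the test for selection.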
Ultimately, the decision as to whether minority group
membership is to be ignored or given special consideration in selection will
not be made by educators or psychologists. The fair use of tests in selection
is part of a larger issue that must be settled by society through court
rulings. Stated in simplified form, the issue is how equal educational and
occupational opportunities can best be provided for members of minority groups
without infringing on the rights of other.
Gender Fairness
The issues of fairness regarding testing and assessment have
also attracted considerable attention in the use and interpretation of tests
and assessments for males and females. For example, the use of scores on tests
such as the Preliminary SAT (PSAT) as the basis for identifying National Merit
Scholars has focused attention on the issue of gender bias in tests in recent
years. On the mathematics section of the PSAT, the average score for males has
been higher than the average score for females, and there have been more males
than females with very high scores for many years. In recent years, the average
score for males is also slightly higher than the average for females on the
verbal section of these tests. As a consequence, a substantially larger percentage of male National Merit Scholars have been identified than female
scholars. As a partial response to this difference, writing scores, on which
females tend to score higher than males, have been added to the mix for
identifying National Merit Scholars, thereby reducing the gender difference in
recognition of scholars. As in the case of differences in scores of racial or
ethnic groups, the existence of a difference in average scores for males and
females does not necessarily imply that the test is biased. There are, for
example, differences in the number of mathematics courses taken by females and
males in high school, and these differences may lead to differences in the
scores on the mathematics tests (Willingham & Cole, 1997). Whatever the
cause of the differences, however, it does not change the fact that use of the
tests alone results in a larger proportion of scholarships being awarded to
males than to females despite the fact that females earn higher grades in
school than males on average. Judgments about the proper use of test scores
must rest on more than technical evaluations of the tests or the degree to
which the scores reflect real differences in knowledge, skills, and developed
abilities rather than un
Intentional biases. Such judgments also involve questions of
social values and social policy. Although not a minority group, women have had
some of the same problems as minorities have in attempting to obtain equal
educational and occupational opportunities. Thus, in the use of test results in
career planning, care needs to be taken so that test scores are not unfairly
used to direct females away from certain occupations. For example, females tend
to score lower than males on mechanical comprehension tests and to have lower
mechanical interest scores. Although these differences probably reflect
cultural influences rather than sex bias in the tests, it would be unfortunate
if such results were used to limit the occupations females might consider as
possible careers.
Willingham and Cole (1997) have reported a comprehensive set
of analyses examining differences in the performance of males and females on a
variety of types of tests and assessments. They also have investigated factors
related to the differences found and comparability in the validity of the
measures when they are used for various purposes, such as predicting subsequent
performance of males and females.
Formative Assessments
Several distinctions can be made between formative
assessments and mandated tests. First, classroom teachers control the timing
and use of formative assessments, whereas mandated tests are controlled from
afar and must be administered on a schedule dictated by the state or district.
Second, formative assessments are flexible and may be adapted to the
instructional needs of individual students, whereas external tests are
standardized and uniform for all students. Moreover, as noted, external tests
provide evaluations of student learning whereas formative assessments provide
measures for learning; that is, the main purpose of formative assessments is to
provide information that can be used to guide instruction and enhance student
learning.
To be effective tools of teaching and learning, formative
assessments must be consistent with important student learning goals (see, for
example, Shepard, 2001). Teachers must be able to control the time that
formative assessments are administered and the choice of tasks that students
are asked to perform. The assessments must provide feedback to students and
teachers (Black & Wiliam, 1998). Rubrics that teachers use to score
constructed responses to assessment tasks must be transparent and made part of
the feedback to students. Clear instructional goals and specifications of
assessment tasks, together with transparent scoring rubrics, can contribute to
improved learning by helping students use self-assessments to monitor their own
progress.
SUMMARY
You are likely to encounter a number of externally mandated testing and assessment programs in schools. As a teacher, you may be directly involved in some of the programs. Others you may simply need to know about so that you can serve as an informed professional in dealing with students, parents, and the public. These programs have generally contributed to expanded testing and assessment in the schools, which in turn have created new trends and raised concerns. You will also have opportunities to construct and use formative assessments. When properly used, the latter assessments can be effective tools for enhancing instruction and student learning. Subsequent chapters provide elaborations of these ideas.
Demands for accountability have led to substantial increases in the amount of testing and assessment in the schools and in the importance that is attached to the scores. Minimum-competency testing programs that required students to pass tests to receive high school diplomas or to be promoted to the next grade grew rapidly throughout the country in the late 1970s. Demands for comparisons of schools, school districts, and states in terms of student achievement test scores led to the introduction of still more testing and assessment requirements and increased the stakes for teachers and school administrators.
In recent years, emphasis has been placed on establishing demanding content and performance standards for all students. The demanding performance standards have increased the stakes associated with assessment-based accountability systems in a number of states. At the same time, the pressure to include all students has presented new challenges in developing assessments and administration procedures that provide adequate accommodations for students with special needs or students who are English-language learners.
The NCLB Act of 2001 requires increases in the amount of testing beyond what states had mandated before the act became law. It reinforces the role of content and student performance standards. Most important, the act requires states to use student test results to hold schools and districts accountable for student achievement in reading or language arts and mathematics and, beginning with the 2007-2008 school year, in science.
Several changes in educational measurement have been broad enough and persistent enough to be identified as trends. In addition to the expanded uses of tests that have just been summarized, developments are changing the nature of testing and assessment. Most notable of these developments is the increased emphasis on performance-based assessment. Concerns that tests and assessments strongly influence what gets taught have led to increased emphasis on making assessments that correspond as closely as possible to complex learning objectives that are not readily measured by conventional tests. Demands for new forms of assessment have led to changes in the approaches being used to measure teacher performance as well as student performance.
Technological developments also promise to change the nature of testing and assessment. Computers provide a means of making tests adaptive to the individual test taker and constructing realistic simulations to test problem-solving skills. On-line, computer-based testing has certain potential advantages over traditional paper-and-pencil tests. In addition to making adaptive testing possible, on-line testing provides immediate access to results and may be especially useful as part of an early warning system and in preparing for high-stakes tests. On-line testing may also provide cost savings. The biggest hurdle to wider implementation of computer-based, on-line testing is the demand for access to computers in the schools that exceeds current availability in many places.
Critics of testing have raised issues concerning the possible consequences of testing in the schools. Most of the criticism has been directed toward standardized tests, including such issues as the nature and quality of the tests, the possible harmful effects of testing on students, and the fairness of tests to minorities and women. It is important to recognize the multiple perspectives on what constitutes fairness in testing and assessment, including the absence of bias, procedural fairness, adequacy of opportunity to learn, and equality of outcomes. Although the nature of the test or assessment itself requires close attention in considering issues of fairness, in many instances it is the use or misuse of the results of tests and assessments that is most critical to achieving fairness.
Formative assessments constructed by classroom teachers differ from external tests in a number of ways. They are less formal and may be administered at different times throughout the school year as deemed appropriate by the needs of individual students. They can contribute to improvements in instruction and student learning by providing timely feedback and by clarifying expectations in ways that students can use for purposes of ongoing self-assessment.