Evolution of Evaluation
While evaluation as
a profession is new, evaluation activity began long ago, perhaps as early as
Adam and Eve. Evaluation is a method used to determine the value or worth of
something. It is a process humans use to make decisions. It is also an
imperfect process. As humans, we evaluate with the information available to us,
which is often incomplete, and nearly always without a clear picture of
implication and consequence. Eve made the decision to eat from the forbidden tree,
evaluating the information that she had and obviously weighing one source more
than another. Her information was conflicting and she did not foresee the
consequences of her decision, but it was evaluative nonetheless. Some
researchers look back further and place the roots of evaluation with
evolutionary biology (Shadish, Cook, & Leviton, 1990). It is reasonable to
consider that evaluation is at play when species mutate to adopt new
characteristics as a survival adaptation, as with evolutionary developmental
biology (Evo-Devo). Evo-Devo, no relation to the 1970s rock band Devo, is the
study of when, how, and to what extent genes are turned on to maximize
survivability through natural selection (Public Broadcasting Service, 2009).
However, evaluation as an activity to improve processes, programs, and policies
has more modest roots.
FIVE DECADES OF
TEST-BASED EDUCATIONAL REFORM
Test-based reforms can readily be traced back to the middle
of the 19th century, when the Massachusetts state superintendent of instruction
used written examinations as a means of holding public schools accountable for
results (Resnick, 1982). In the early part of the 20th century, Joseph Rice
(1914) administered spelling and mathematics tests to thousands of school
children in a series of studies that raised questions about the efficiency of
the use of instructional time. Numerous achievement test batteries published
following World War I made use of the multiple-choice technology that came into
widespread use during the war. Although there was substantial growth in the use
of tests between World Wars I and II, the expansion accelerated after World War
II, particularly from the 1960s to the present.
In recognition of the large disparities in educational
opportunities and in student performance, considerable attention was focused on
compensatory education in the mid-1960s. The Elementary and Secondary Education
Act (ESEA) of 1965 put in place the largest and most enduring of these federal
efforts, commonly known as Title I. Although it was reshaped in
some important ways in both the 1994 and 2001 reauthorizations (the Improving
America's School Act of 1994 and the NCLB Act of 2001), Title I continues to be
the largest federal program of assistance to elementary and secondary education.
The congressional demands for evaluation and accountability for the funds
distributed under Title I of ESEA, as well as several other programs of the
1960s, proved to be a boon to test publishers. The testing requirements of the
Title I Evaluation and Reporting System (TIERS) contributed to a substantial
expansion in the use of published standardized tests. Rather than administering
tests once a year in selected grades, TIERS encouraged the administration of
tests in both the fall and the spring for Title I students, to evaluate the
progress of students participating in the program. Although little use was made
of the aggregate test results, these TIERS requirements relieved for a time the
pressure from demands for accountability for this major federal program.
However, dissatisfaction with the progress made in student achievement,
especially for the students that Title I is intended to serve, contributed to
the substantial increases in testing and accountability provisions in the NCLB
Act of 2001 that were briefly described earlier.
Minimum-Competency
Testing
Perceived shortcomings of the skills of high school
graduates led to the rapid introduction of additional testing requirements in
the 1970s and early 1980s. Minimum-competency testing (MCT) reforms swiftly
spread from state to state. In a single decade (1973-1983), the number of
states with some form of MCT requirement went from 2 to 34. As the name
suggests, MCT programs focused on basic skills that were considered to be the
minimal essentials for the next grade level or a high school diploma. Minimal
basic skills, although not easy to define or defend, were widely accepted as a
reasonable requirement for high school graduation. However, in a landmark case
in Florida (Debra P. v. Turlington, 1979), the court ruled that students had to
be given adequate notice of the testing (2 years) and that the state had to
demonstrate that the students had an opportunity to learn the material tested.
Although several of the MCT high school graduation requirements instituted by
states in the 1970s and 1980s are still in place, the recognition of the need
to consider more than minimum levels of performance soon led to other testing
and assessment demands.
A Nation at Risk
Another wave of test-based reforms followed closely on the
heels of the MCT movement. This round of reform efforts stressed school-level
accountability and attempted to push beyond minimums. The test-based reforms of
the middle and late 1980s were encouraged by a number of reports on the status
of education that were completed in 1983. A Nation at Risk: The Imperative for
Educational Reform, issued by the National Commission on Excellence in
Education (1983), was probably the best known and most influential of these.
That report featured tests in two ways: (1) to document shortcomings in student
achievement, and (2) as a recommended mechanism of reform.
All 50 states introduced some type of educational reform in
the wake of A Nation at Risk. Consistent with the emphasis of the report,
testing was central in the majority of state legislated reform efforts. Indeed,
in many cases, externally mandated tests were relied on as the major instrument
of reform. Many of the reforms involved an expansion of the use of test results
for accountability purposes. Accountability programs took a variety of forms
but shared the common characteristic that they increased real or perceived
stakes of results for teachers and educational administrators.
Building and district "report cards" showing
student test performance were used to make educators more accountable for
student achievement. As intended, test-based comparisons of schools and
districts placed considerable pressure on school superintendents, principals,
and teachers to "get the scores up." Test preparation became a major
component in the instructional programs of many schools. Teachers reported in
surveys that, as the result of the pressure, they focused their instruction on
the skills tested, taught test-taking skills, and used the format of the
externally mandated test in their own tests. The focus sometimes narrowed to
the specific topics known to be on the mandated test, and practice was provided
on items similar to those in the test. Under high-stakes testing conditions,
topics corresponding to important instructional objectives not included on the
test often were found to fall by the wayside as the testing date approached
(see, for example, Nolan, Haladyna, &amp; Hass, 1992; Shepard, 2000; Smith
& Rottenberg, 1991).
Although some states and districts contracted for or
developed their own tests, the accountability systems of the 1980s relied
heavily on published standardized tests. Upward trends in student achievement
were reported by an overwhelming majority of states and districts during the
first few years of accountability testing programs. A physician, John Cannell
(1987), forcefully brought to public attention what came to be known as the
"Lake Wobegon effect," that is, the incredible finding that
essentially all states and most districts were reporting that their students
were scoring above the national norm. The Lake Wobegon effect received
considerable publicity. This finding that almost all states using standardized
tests in the elementary grades were reporting that the majority of their
students were above the national average has generally been attributed to a
combination of placing great pressure on getting scores up and the reuse of the
same test with old norms year after year (Linn, Graue, & Sanders, 1990).
The Lake Wobegon effect raised serious questions about the credibility of test
results and about the possible negative side effects of high-stakes
accountability uses of standardized test results.
Standards-Based
Reform
The wave of reform in the 1990s continued to emphasize
accountability but added some significant new features. Perhaps the four most
notable of the new features are the emphasis on (a) adopting ambitious
"world-class" standards that both shape the assessments and define levels
of acceptable performance; (b) using forms of assessment that require students
to perform more substantial tasks (e.g., construct extended essay responses and
conduct experiments) rather than only select answers on multiple-choice items;
(c) the attachment of high-stakes accountability mechanisms for schools,
teachers, and sometimes students; and (d) the inclusion of all students.
Content and
Performance Standards.
Educational improvement must begin with a clear idea of what
students are expected to learn. This premise underlies the standards-based
efforts to improve American education. Standards are statements that specify
what should be taught and what students should learn. Standards specify goals
or expectations for students, but they do not mandate a particular curriculum,
textbook, or instructional approach. There may be many ways of achieving the
ends identified in the standards. The key to effective standards, however, is
that they be specific enough to identify what students need to learn and to
determine when the standards have been met.
These two purposes, identifying what students need to learn
and determining when the standards are achieved by students, correspond to the
two types of standards that are commonly distinguished: content standards and
performance standards. Content standards specify the "what," whereas
performance standards specify the "how well." That is, content standards
are public statements that specify what students should know and be able to do
in specific content or subject-matter areas at identified points of their
education (e.g., grade 4 reading or grade 8 mathematics). Performance standards
are dependent on content standards but add the specification of the level of
performance that students are expected to achieve in relationship to the
content standards. In other words, they answer the question, How good is good
enough? Ideally, "they indicate both the nature of the evidence (such as an
essay, mathematical proof, scientific experiment, project, exam, or combination
of these) required to demonstrate that content standards have been met and the
quality of student performance that will be deemed acceptable (what merits a
passing or an 'A' grade)" (National Educational Goals Panel, 1991, p. 22).
We will have more to say about standards in Chapter 3. For
present purposes, however, it is sufficient to note that with encouragement
from content-specific teachers' associations (e.g., the National Council of
Teachers of Mathematics and the National Council of Teachers of English) and
the federal government through the Goals 2000 legislation, almost every state
developed and adopted some form of content standards during the 1990s. In many
states, the content standards have served as the basis for developing
assessments that are intended to be "aligned" with the standards.
Performance-Based
Assessment
Coinciding with and reinforced by the movement to develop
content and performance standards was the substantial press throughout the
1990s for the development and use of "new" approaches to assessment,
variously referred to as alternative assessment, authentic assessment, direct
assessment, or performance-based assessment. Each qualifier stresses a different
aspect of the assessments: authentic stresses an emphasis on
"real-world" tasks relevant outside the classroom, alternative
stresses something other than the familiar multiple-choice test, and
performance stresses the actual doing of a task (e.g., writing an essay or
doing a hands-on experiment) rather than merely recognizing or knowing a right
answer. Whatever the qualifier, assessment is intended to suggest a shift from
fixed-response, machine-scored tests to the use of tasks requiring students to construct
responses.
Calls for the increased reliance on performance-based
assessment generally rest on three premises that were articulated by Resnick
and Resnick (1992). The first premise is characterized by the acronym WYTIWYG
(What You Test Is What You Get). The second premise is the converse of this:
"You do not get what you do not assess." The third premise is a logical
conclusion that follows from acceptance of the first two: "Make
assessments worth teaching to" (Resnick & Resnick, 1992, p. 59).
These premises are coupled with an acceptance of the
argument that high-stakes testing and assessment shapes instruction and student
learning. Rather than trying to change that connection, proponents of
performance-based assessment argue that it is assessments that need to be
modified not only to eliminate the negative effects of teaching to the
assessment but also to make that activity have the desired result of enhanced
student learning.
High-Stakes
Accountability Mechanisms.
Attaching high stakes to the results of assessments,
although not new, has become increasingly popular with policymakers in states
and districts throughout the country. The high-stakes accountability provisions
of the NCLB Act of 2001 differ from the past in that they come from the federal
level and now apply to all states, but are in keeping with the trend toward
ratcheting up the stakes attached to test results for schools apparent in many
states for most of the 1990s. More often than not, the stakes have applied
primarily to educators by using the results of tests to determine rewards and
sanctions for schools. For example, some programs identified schools that
received not only special recognition but also monetary rewards for improved
performance of students on the mandated state or district assessments. In some
instances, those monetary rewards could be shared by teachers in the school.
For schools where performance on assessments did not improve or even declined,
various types of sanctions have been imposed by states. Examples of sanctions
imposed through assessment-based accountability systems include bringing in an
external team to oversee the school, reassigning teachers to other schools, and
removing principals. Under the provisions of the NCLB Act, tutoring, expanded
time for instruction either after school or during the summer, and public
school choice may be provided to students in low-performing schools. Schools
where students score below established target levels may be restructured and
teachers and administrators may be replaced.
The stakes for individual students have also been increased
in a number of states in recent years. Because of phase-in schedules, the
requirements for students that may affect high school graduation, the type of
diploma a student receives, or grade-to-grade promotion have not been fully
implemented in all cases, but the movement toward increased requirements is
widespread. Tougher grade-to-grade promotion and graduation requirements
appear at first blush like a replay of the minimum-competency testing movement
of the late 1970s and early 1980s. They differ, however, in that the new
requirements that are envisioned are intended to set more ambitious
"world-class" performance standards.
Inclusion of All
Students
A prominent feature of standards-based educational efforts
is the emphasis on high expectations for all students. Past practices of
excluding a relatively large percentage of students from state and district
standardized test programs because of limited English proficiency, because they
recently moved to a state or district, or because of student disabilities are
incompatible with the push to include all students. The goal of including all
students in the assessment requires the use of multiple strategies. First, many
students who would have been excluded in the past can in fact participate in
assessments without any special considerations or adaptations of procedures.
For those students, only a commitment to include is needed rather than allowing
students to be excluded when convenient. Inclusion is a prominent part of the
requirements of the NCLB Act. Schools that test fewer than 95% of their eligible
students will be placed in the needs-improvement category regardless of how
well the students who are tested perform on the tests.
Many students who would likely have been excluded in the
past can be included with minor accommodations of the assessment. Some
accommodations, such as extended time to complete the assessment, are ones that
may lead to changes in assessment conditions that improve the validity of the
assessment for all students. For example, when speed is not the issue, untimed
assessments or ones with generous time limits may increase validity and
fairness for all students and at the same time reduce the need to offer
extended time to complete the assessment for some students.
More extensive modifications are clearly needed for some
students to meaningfully participate in the assessment. English-language
learners who are proficient in another language, for example, may have
knowledge and skills in a content area other than reading or writing in English
that can be assessed in the student's first language but cannot be reasonably
assessed in English. Accommodations are also needed for some students with
disabilities. The nature and extent of accommodation needed clearly depends on
the kind and severity of a student's disability. Large-print and Braille
versions of an assessment are obvious adaptations for students with visual
impairments. Students with some types of physical handicaps may require someone
to record their responses for them. However, by far the largest fraction of
students who were excluded from assessments in the past because of disabilities
that require individual education plans (IEPs) are students with learning
disabilities.
Many students with IEPs for learning disabilities are likely
to be able to take parts or all of standards-based assessments without special
accommodations. No single type of accommodation will be appropriate for all
those students with learning disabilities that require some kind of
accommodation. Rather, several different approaches are likely to be needed.
Among the more common suggested accommodations for students with learning
disabilities are shorter assessments, more time for completing assessment
tasks, oral reading of directions, assessment in small groups or individually,
and oral responses to tasks. Classroom teachers face issues of making
appropriate accommodations in both their instruction and classroom assessment
in working with students with IEPs. Although the IEP is the essential source
for guiding the decisions that teachers must make in these regards, it is clear
that considerable professional judgment on the part of teachers is also
required. The guiding principle for accommodations on an assessment, whether it
is externally mandated or one developed by the teacher for classroom use, is
that the accommodations be comparable to the ones required by the student's IEP
for instruction.
No Child Left Behind
The fifth consecutive decade of test-based educational
reform was ushered in by the NCLB Act of 2001. Although this act is a
reauthorization of the ESEA of 1965, its extensive testing requirements apply
to all public schools, not just to those receiving Title I funds. The NCLB Act
reinforces the role of content and performance standards. Specifically, the law
requires each state to "demonstrate that the State has adopted challenging
content standards and challenging student performance standards that will be
used by the State" (P.L. 107-110, Section 1111[b][1][A]).
As noted previously, almost every state had adopted content
standards, and most states had some tests in place for selected grades and
subjects that were intended to measure their content standards. Most states had
set performance standards on their tests. There is great variability among the
states, however, in the breadth, depth, and specificity of their content
standards. There is also considerable variability in the stringency of the
performance standards that have been set by states (for a discussion of
variability in state performance standards, see Linn, 2003). To satisfy the
requirements of the NCLB Act, however, states were required to submit plans
justifying the claim that their content standards are challenging and that they
have in place challenging performance standards, referred to in the law as
student academic achievement standards. The student academic achievement
standards must be "aligned with the State's academic content standards...
describe two levels of high achievement (proficient and advanced) that
determine how well children are mastering the State academic content standards;
and describe a third level of achievement (basic) to provide complete
information about the progress of lower-achieving children toward mastering the
proficient and advanced levels of achievement" (P.L. 107-110, Section
1111[b][1][D][i]).
The performance standards set by states for use under the
provisions of the NCLB Act are of consequence because they are used to set
intermediate annual achievement targets for student achievement such that all
students will be at the "proficient" level or higher by the 2013-2014
school year. The end goal of 100% proficient in 2014, together with state-established
starting points in 2002, is used to define "adequate yearly progress"
(AYP) targets. The comparison of student achievement for a school in reading
or language arts and in mathematics to the AYP targets in those subjects
determines whether a school will be identified as "needs improvement" and be
subject to the sanctions that apply to schools so designated.
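The arithmetic behind such targets can be made concrete with a short sketch. This is a hypothetical illustration only: it interpolates linearly from a made-up 2002 starting point of 40% proficient to the 100% goal in 2014, whereas actual state plans under NCLB used a variety of schedules (including stair-step patterns), and the starting percentages were set state by state.

```python
# Hypothetical sketch: deriving intermediate AYP-style annual targets by
# linear interpolation from a state's 2002 starting point to 100% proficient
# in 2014. Real state plans varied; the 40% starting point is invented.

def annual_targets(start_pct, start_year=2002, end_year=2014):
    """Return a dict mapping each year to its percent-proficient target."""
    span = end_year - start_year
    return {year: start_pct + (100.0 - start_pct) * (year - start_year) / span
            for year in range(start_year, end_year + 1)}

targets = annual_targets(40.0)
print(targets[2008])  # midpoint of the schedule
```

A school whose percent proficient in a subject falls below the target for the current year would, under this simplified scheme, miss AYP for that subject.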
The Question of Impact
The degree to which the increased pressure helped or hurt
education is controversial. Proponents of high-stakes testing argue that the
tests measure objectives that are important for students to learn and that it
is desirable for teachers to focus their attention on those objectives. They
point with pride to the increases in test scores that were observed in state and
district testing programs since the 1990s.
Critics of the increased emphasis on test results argue that
an overreliance on test results distorts education. They argue that important
objectives are ignored when they are not included on the tests that count.
Moreover, they claim that the increased scores paint a misleading picture
because teachers teach the specifics of the tests rather than more general
content domains (Koretz, 2005). The proper role of externally mandated tests
and assessments in directing instruction is an issue with which you will
struggle, especially if the current emphasis on holding educators accountable
for results on these instruments continues. What sort of preparation for taking
tests or assessments should students have? How much time should be spent in
test preparation activities, such as taking practice tests and learning
test-taking strategies? To what degree should tested objectives be given
emphasis at the expense of objectives that are not tested? These are important
educational questions that have no simple answers. They require thought and
reflection on the part of individual teachers and principals.
Consider, for example, the seemingly simple issues of
teaching to the test and “teaching the test itself.” We are almost always
interested in making inferences that go beyond the specific test that is used.
We would like, for example, to be able to say something about the degree of a
student’s understanding of mathematical concepts based on the score that is
obtained on a math concepts test. Because the items on a test only sample the
domain of interest, the test score and the inference about the degree of
understanding are not the same. A generalization is required, and it is the
generalization, not the test score per se, that is important. When the specific
items on the test are taught, the validity of the inference about the student’s
level of achievement is threatened. Teaching the specific test items is apt to
result in an exaggerated view of student achievement in the overall domain of
interest.
Teaching to the test (that is, emphasizing the objectives
that are on the test without teaching the specific test items) has both
advantages and disadvantages. Many schools and districts have made teaching to
the test more systematic by the introduction of “benchmark tests.” These are
tests that are similar to and keyed to the same content standards as the state
mandated test, but are administered several times during the year prior to the
administration of the state test. The benchmark test results are used to
identify students who are likely to have difficulty on the state test, and
teachers are expected to devote special attention to those students to help
them perform at the proficient level or above on the state test. Inasmuch as the
objectives on the test are important, emphasizing those objectives with or
without the use of benchmark tests provides a desirable focus. On the other
hand, multiple-choice standardized tests do not cover all the important
learning objectives. Hence, narrowing to only those objectives that are covered
would be detrimental for education as a whole.
TECHNOLOGICAL ADVANCES IN TESTING AND ASSESSMENT
With the rapid growth in the availability and power of
relatively low-cost microcomputers, it is not surprising that the use of
computers to administer tests is becoming increasingly common. Indeed, some
readers of this text may have taken or be planning to take the
computer-administered Graduate Record Examination, the Academic Skills
Assessments of the Praxis Series: Professional Assessments for Beginning Teachers, or
one of the other computer-based tests offered by the Educational Testing
Service. A special issue of Education Week titled "Pencils Down:
Technology's Answer to Testing," published in May 2003, was devoted to the
use of computer technology to administer and score tests. The editors'
introduction to the special issue argues that there has been a convergence of
education and technology that has vaulted "computer-based testing into the
headlines, raising important questions about whether this new mode of
assessment is more useful than traditional paper-and-pencil exams" (The
Editors, Education Week, May 8, 2003, p. 8). The editors note that the NCLB Act
has created a new stimulus for schools to find more efficient ways of testing
with faster turnaround than is possible with paper-and-pencil testing
technology. According to Education Week, 12 states and the District of Columbia
had already launched some form of computer-based testing or pilot program when
the special issue appeared. The use of computer-based testing has continued to
grow since the publication of the 2003 special issue of Education Week. During
the 3 years from May 2003 to May 2006, the number of states with some form of
computer-based assessment of students increased from 12 to 22 (Swanson, 2006,
p. 55).
Using a computer to administer items from a paper-and-pencil
test can have several advantages. Rather than waiting several weeks to receive
test results, scores can be obtained immediately. Computer-based testing also
provides the means of tailoring the test to individual students by using
performance on previously administered items to select the next item to
administer. Although the start-up cost can be substantial, online,
computer-based testing can also result in substantial reductions in printing,
distribution, and scoring expenses. The catch, however, is the need to have
more computers available for use in the schools than are currently available in
many schools. Judging from a recent special issue of Education Week on
technology in the schools, which indicated that the national average was 3.8
students per instructional computer (Swanson, 2006, p. 51), it is reasonable
to believe that the typical school now has enough computers to make it feasible
to administer computer-based tests.
Using computers to administer tests that count is one use and, in the short
term, may not be the most common use of computer-based testing. There is an
almost insatiable demand for the use of practice tests and other forms of test
preparation. The flexibility and immediate score reporting offered by on-line
testing make the technology especially appealing for purposes of test
preparation. As discussed in later chapters, computer-based testing also has substantial
potential for teachers for their own classroom assessments.
One change that has already been widely implemented is the
use of the computer to administer adaptive tests, that is, tests in which the
choice of the next item to administer is based on previous responses of the
test taker. Adaptive tests can enhance both the efficiency with which
information is handled and the quality of that information. The design of an
adaptive test usually starts with the administration of an item that is expected
to be of middle difficulty. The second and subsequent items to be administered
are determined by the responses of the test taker. In general, if a test taker
answers an item correctly, then the computer will select a somewhat more
difficult item to administer next. Conversely, a somewhat easier item is
administered following an incorrect response. Testing is stopped when the
estimates of the individual’s performance reach some predetermined level of
precision or when some pragmatic maximum number of items has been administered.
A variety of computer software is available for administering tests (see the
box “Illustrative On-Line and Adaptive Testing Software”).
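The item-selection loop just described can be sketched in a few lines of Python. This is a deliberately simplified illustration: the item bank, the 1-9 difficulty scale, and the fixed-length stopping rule are invented for the example, and a real adaptive test would estimate ability with item response theory rather than simply stepping difficulty up and down.

```python
import random

# Hypothetical item bank: items grouped by difficulty, 1 (easiest) to 9 (hardest).
ITEM_BANK = {d: [f"item_{d}_{i}" for i in range(20)] for d in range(1, 10)}

def adaptive_test(answer_item, max_items=10):
    """Administer a simplified adaptive test.

    answer_item(item) -> True if the test taker answers correctly.
    Returns the list of (item, correct) pairs administered.
    """
    difficulty = 5  # start with an item of middle difficulty
    administered = []
    for _ in range(max_items):  # pragmatic maximum number of items
        item = random.choice(ITEM_BANK[difficulty])
        correct = answer_item(item)
        administered.append((item, correct))
        if correct:
            difficulty = min(difficulty + 1, 9)  # harder item after a correct answer
        else:
            difficulty = max(difficulty - 1, 1)  # easier item after an incorrect answer
    return administered
```

A production system would also maintain a running estimate of the test taker's ability and stop when the standard error of that estimate falls below a predetermined level of precision, rather than always administering the maximum number of items.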
It has been demonstrated that adaptive testing can enhance
the efficiency and the precision with which certain types of knowledge,
skills, and abilities are measured. In some cases, adaptive tests can obtain
the same level of reliability of measurement in just over half the time
required for a conventional paper-and-pencil test. If adaptive tests only administer items of the type already in use in a better way, however, the full
potential of the use of computers for the administration of tests will not be
realized. The attraction of computers as testing devices is not limited to
doing better what we already do. Their potential to measure proficiencies that
are not measured well by conventional paper-and-pencil tests is even more
appealing.
In the long run, the potentially more significant changes in
testing as the result of computer-based testing depend on using the computer to
do things that cannot be reasonably accomplished with paper-and-pencil tests.
The technology opens the door to the use of video simulations or problem
settings where students access information from the Web or a CD in ways similar
to instructional use of that technology.
Simulations can be used to present test takers with problems that have
greater realism and more apparent relevance than problems that are commonly found on paper-and-pencil tests. Computer-based examinations that have been used for some time in medical education and in certification testing for physicians provide an illustration of the type of simulation tests that are apt to be seen in the future as computerized test administration becomes more common.
Computer problems simulate aspects of the job of a physician. The test taker is initially presented with a limited set of information about a patient, such as a verbal description of symptoms of the type that a patient might provide at the start of a visit. The test taker then has a variety of options, such as getting a patient history, ordering laboratory tests, or deciding on a course of treatment. Requested information is provided,
and new options can be followed by the test taker until a diagnosis is made and
a course of treatment prescribed.
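A branching case of this kind can be represented as a small data structure plus a loop that dispenses information on request. The patient case, the option names, and the scoring below are entirely invented for illustration; real medical simulation examinations are far richer.

```python
# Minimal sketch of a branching problem simulation.
# The case data, option names, and diagnosis are invented for illustration.
CASE = {
    "presenting": "Patient reports chest pain and shortness of breath.",
    "options": {
        "history": "History: pain began during exercise; patient smokes.",
        "ecg": "ECG: ST-segment elevation in leads II, III, aVF.",
        "cbc": "Complete blood count: within normal limits.",
    },
    "diagnosis": "myocardial infarction",
}

def run_simulation(choose_action):
    """Present information until the test taker commits to a diagnosis.

    choose_action(findings) -> an option key requesting more information,
    or ("diagnose", answer) to end. Returns (answer, actions_taken) so a
    scorer can judge both the product and the process of the solution.
    """
    findings = [CASE["presenting"]]
    actions = []
    while True:
        action = choose_action(findings)
        if isinstance(action, tuple) and action[0] == "diagnose":
            return action[1], actions
        actions.append(action)
        findings.append(CASE["options"][action])  # provide requested information

def score(answer, actions):
    """Score the product (correct diagnosis) and the process (efficiency)."""
    return {"correct": answer == CASE["diagnosis"], "steps": len(actions)}
```

Because the simulation records every action, the scorer can weigh not only whether the diagnosis was correct but also how efficiently the test taker reached it.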
Computer-administered problem simulations along these lines have
potential advantages over current paper-and-pencil tests in many content areas. They provide a means of going beyond the sort of factual recall that is sometimes overemphasized on paper-and-pencil tests. They focus attention on the
use of information to solve realistic problems. They can help assess not only
the product of a student's thinking but also the process that the student uses
to solve a problem, including the way in which the problem is attacked, the
efficiency of the solution, and the number of hints that may be needed to solve
the problem.
PUBLIC CONCERN ABOUT TESTING AND ASSESSMENT
Decisions about the selection, administration, and use of
educational tests and assessments are no longer left to the educator alone. The public has become an active and vocal partner. As discussed previously, externally mandated testing has been imposed on the schools by states or districts as a result of the public demand for evidence of school programs' effectiveness. In some states, the public at large has participated, through
selected groups, in determining the objectives and standards of the statewide
assessment programs. In states where testing and assessment has been made the
responsibility of the local school district, parent groups often help shape the
programs. It is interesting to note that the concern of state legislators and
the general public with the quality of school programs has created a demand for
more testing and assessment in the schools, not less.
During the expansion of testing programs, the concern has
been that there may be too much testing in the schools. In addition to taking the
tests in the local school program, high school students, for example, may also
have to take one or more state competency tests and several college admissions
tests. It is feared that the heavy demand on their time and energy might
detract from their schoolwork and that the external testing programs may cause
undesirable shifts in the school’s curriculum. When teachers and schools are
judged by how well students perform on state tests and assessments and by how
many students are accepted by leading colleges, direct preparation for the
tests and assessments is likely to enter into classroom activities and thereby
distort the curriculum.
Probably the greatest public concern has been with the
social consequences of testing, especially perceived threats to the rights and
opportunities of individuals and groups. This concern has shown up in the form
of attacks on standardized tests and the testing industry: new legislation affecting testing, calls for a moratorium on standardized testing, and charges that tests are biased and discriminatory. There are certainly some good reasons for the public's concern with the social consequences of testing. It is
important, however, to distinguish between negative consequences for
individuals or groups that are due to faults in the tests or assessments and
ones that are caused by the misinterpretation and misuse of test scores.
Discussed next are four areas of concern resulting in
controversy over testing and assessment: (1) the nature and quality of tests,
(2) the effects of testing on students, (3) fairness to minorities, and (4)
gender fairness. Some of these issues, particularly the effects of testing and
questions of fairness and bias, are also considered in subsequent chapters, in
part because they are fundamental issues for all testing and assessment and in
part because they can be dealt with more adequately after developing some
fundamental concepts of testing and assessment, such as validity.
Nature and Quality of Tests
A long-standing criticism of standardized tests is directed
primarily at the use of multiple-choice items. In the early 1960s, critics such as Hoffman (1962) contended that the multiple-choice item penalized the more
intelligent original thinkers. He supported his claims by reviewing items from
standardized tests and showing how some highly able and creative students were
likely to see implications in the items not thought of by the test author(s)
and thus question the correctness of the keyed answers. Although Hoffman
obviously was able to discover some defective items that appeared in
standardized tests, his criticisms seemed to go well beyond the evidence
presented. He did, however, encourage test publishers to supplement statistical
item analysis with a more careful, logical analysis of test items.
Multiple-choice questions continue to bear the brunt of
criticisms made by both specialists in educational measurement who seek ways
of improving educational tests and critics who would like to eliminate
standardized testing. For example, Frederiksen (1984), a major contributor to
the field of measurement, has argued that multiple-choice items place too much
emphasis on “well structured problems” when problems of greatest interest both
in and out of school are often “ill structured” and where skills such as
problem identification and hypothesis generation are often as important as
problem solution. Such criticisms have led to increased emphasis on open-ended
questions and the design of computer simulation tests.
Another criticism, that tests measure only limited aspects of an individual, has also received considerable attention. This
criticism is well founded. Tests do measure specific and limited samples of
behavior. Aptitude tests typically measure samples of verbal and quantitative
skills useful in predicting school success, and achievement tests measure
samples of student performance on particular learning tasks useful in assessing
educational progress. Both fulfill their limited functions well, but the
difficulty arises when we expect more of them than was intended. For example,
both the advocates and the critics of college admissions testing sometimes
assume that the tests measure all that is needed for success in college and
beyond. This tendency to read into test scores more than they really tell has
been called the "whole person fallacy" by W. W. Turnbull, the former president
of the Educational Testing Service.
Effects of Testing on Students
Critics of testing argue that testing is likely to have
certain undesirable effects on students. Some of the most commonly mentioned
charges directed toward the use of aptitude and achievement tests are listed
here with brief comments.
Criticism 1: Tests Create Anxiety. There is no doubt that
anxiety increases during testing. For most students, it motivates them to
perform better. For a few, test anxiety may be so great that it interferes with
test performance. These typically are students who are generally anxious, and
the test simply adds to their already high level of anxiety. A number of steps
can be taken to reduce test anxiety, such as thoroughly preparing for the test,
taking practice exercises, and using liberal time limits. Fortunately, many
test publishers in recent years have provided practice tests and shifted from
speed tests to power tests. This should help, but it is still necessary to
observe students carefully during testing and to discount the scores of overly
anxious students.
Criticism 2: Tests Categorize and Label Students. Categorizing
and labeling individuals can be a serious problem, particularly when those
labels are used as an excuse for poor student achievement rather than a means
of providing the extra services and help to ensure better achievement. It is
all too easy to place individuals in pigeonholes and apply labels that determine,
at least in part, how they are viewed and treated. Classifying students in
terms of levels of mental ability has probably caused the greatest concern in
education. When students are classified as mentally retarded, for example, it
influences how teachers and peers view them, how they view themselves, and the
kind of school programs they receive. When students are mislabeled as
mentally retarded, as has been the case with some racial and ethnic minorities,
the problem is compounded. At least some of the support for mainstreaming
handicapped students has come from the desire to avoid the categorizing and
labeling that accompanies special education classes.
Classifying students into various types of learning groups can make more efficient use of the teacher's time and the school's resources. However, when grouping, teachers must take into account that tests measure only a limited sample of a student's abilities and that students are
continuously changing and developing. By keeping the groupings tentative and flexible
and regrouping for different subjects (e.g., reading and math), teachers can
avoid most of the undesirable features of grouping. It is when the categories
are viewed as rigid and permanent that labeling becomes a serious problem. In
such cases, it is not the test that should be blamed but the user of the test.
Criticism 3: Tests Damage Students’ Self-Concepts. This is a
concern that requires the attention of teachers, counselors, and other users of
tests. The improper use of tests may indeed contribute to distorted
self-concepts. The stereotyping of students is one misuse of tests that is
likely to have an undesirable influence on a student’s self-concept. Another is
the inadequate interpretation of test scores that may cause students to make
unwarranted generalizations from the results. It is certainly discouraging to
receive low scores on tests, and it is easy to see how students might develop a
general sense of failure unless the results are properly interpreted.
Low-scoring students need to be made aware that aptitude and achievement tests
are limited measures and that the results can change. In addition, the
possibility of overgeneralizing from low test scores will be lessened if the
student’s positive accomplishments and characteristics are mentioned during the
interpretation. When properly interpreted and used, tests can help students
develop a realistic understanding of their strengths and weaknesses and thereby
contribute to improved learning and a positive self-image.
Criticism 4: Tests Create Self-Fulfilling Prophecies. This
criticism has been directed primarily toward intelligence or scholastic
aptitude tests. The argument is that test scores create teacher expectations
concerning the achievement of individual students; the teacher then teaches in
accordance with those expectations, and the students respond by achieving to
their expected level, a self-fulfilling prophecy. Thus, those who are expected
to achieve more do achieve more, and those who are expected to achieve less do
achieve less. This so-called Pygmalion effect received strong support from a
widely heralded study by Rosenthal and Jacobsen (1968), even though the study
was later challenged by other researchers (Elashoff & Snow, 1971; West
& Anderson, 1976). The belief that teacher expectations enhance or hinder a
student's achievement is widely held, and the role of testing in creating these
expectations is certainly worthy of further research.
In summary, there is some merit in the
various criticisms concerning the possible undesirable effects of tests on
students; but more often than not, these criticisms should be directed at the
users of the tests rather than the tests themselves. The same persons who
misuse test results are likely to misuse alternative types of information that
are even less accurate and objective. Thus, the solution is not to stop using
tests but to start using tests and other sources of information more effectively. When tests are used in a positive manner, that is, to help students improve their learning and development, the consequences are likely to be desirable rather than undesirable.
Fairness of Tests to Minorities
The issue of test fairness to racial and ethnic minorities
is a critical issue for any assessment program. Fairness has received increasing
attention over the years in the literature on testing and assessment. Concern
with the fairness of tests has paralleled the general public concern with
providing equal rights and opportunities to all U.S. citizens. An analysis of the discussions of fairness in testing and assessment makes it evident that the concept of fairness is used in many different ways. Testing and assessment
professionals have often equated fairness with an absence of bias, considering, for example, a test use as fair if predictions of nontest performance are
comparable for different racial or ethnic groups. Fairness may also focus on
the equity in treatment of persons from different groups in the assessment
process. This second notion of fairness is sometimes referred to as procedural
fairness and involves questions such as the following: Do test takers have
an equal opportunity to show what they
know and can do on a test? Are responses to essay questions graded comparably
by raters without regard to the test taker's group membership? A third meaning
of fairness would require that students be provided with an adequate or equal
opportunity to learn the material that is assessed. A fourth meaning of
fairness that is common in some popular uses of the term is the equality of
results. From this perspective, a test would be considered fair only if the
average performance of different groups (e.g., African Americans, Latinos, and
whites) was the same. These four
conceptions of fairness (absence of bias, procedural fairness, opportunity to learn, and equality of results) are consistent with the discussion of the topic
in the latest revision of the Standards for Educational and Psychological
Testing (AERA, APA, & NCME, 1999). The different perspectives can lead to
quite different conclusions about the fairness of any test or assessment. The
fourth conception, equality of results, is incompatible with other tenets of
testing and assessment, such as the goal of getting a valid and reliable
measure of what students know and can do regardless of their background or
group membership. Inasmuch as different groups of students differ in the
instruction they have received, their experiences both in and out of school,
and their interest and effort, one cannot expect a valid measure to show no
difference between the groups. In other words, a test or assessment that shows
average differences in scores for minority and majority group students may
fairly reflect the consequences of unfair treatment of minorities by the
society. An absence of bias and procedural
fairness is essential for an assessment to have a high level of validity in
measuring the knowledge, skills, and understandings that it is intended to
measure. In other words, those characteristics are essential to avoid unfairness
due to faulty measurement. Whether
professionals in testing and assessment would consider adequacy or equality of
opportunity to learn as essential for fairness generally depends not only on
the instrument but also on the uses and interpretations that are made of the
results. Thus, it would be considered fair to use results of an achievement
test to document inequalities in education where there were substantial
differences in opportunity to learn. But it would not be fair to use the test
results as the basis for rewards or sanctions for students under those
conditions. Nor would it be fair to infer that the group that was not provided
with an adequate opportunity to learn was intellectually inferior or incapable
of learning the material on the assessment. It is often useful to distinguish
between (a) the possible presence of bias in the test content and (b) the possible unfair use of
test results. These factors are undoubtedly related, but we discuss them here
separately.
In evaluating the possible presence of bias in test content,
it is important to distinguish between the performance the test is intended to
measure and factors that may distort the scores unfairly. In testing
mathematics skills with story problems, for example, it is important to keep
the reading level low so that the test scores are not contaminated by reading
ability. If the reading level is too difficult, poor readers will obtain lower
scores than warranted, and the test will be biased against them. Because a
particular minority group may have a disproportionately large number of poor
readers, the test may be biased against that minority group. But if the test of
mathematics skills is not contaminated by reading or other factors, low scores
will simply indicate lack of mathematics skills. Such a test is fair to
everyone even if the test scores indicate group differences in the mastery of
mathematics.
In the past, standardized tests typically emphasized content
and values that were more familiar to white middle-class students than to
racial or ethnic minorities and students of lower socioeconomic status. Thus,
the content of some scholastic aptitude tests contained vocabulary, pictures,
and objects that minorities had less opportunity to learn. Similarly, some
reading tests contained stories and situations that were unrelated to their
life experiences. Racial and ethnic minorities were seldom represented in
pictures, stories, and other test content, and when they were, it was sometimes
in an offensive manner. How much these types of bias might have lowered the
scores of individual students is impossible to say, but most persons familiar
with testing would acknowledge some adverse effect. Fortunately, test
publishers have taken steps to correct the situation. Test publishers now
employ staff members representing various racial and cultural minorities, and
new tests are routinely reviewed for content that might be biased or offensive
to minority groups. Statistical analysis is also being used to detect and remove
test items that function
differently for different groups of test takers (Cole &
Moss, 1989). The most controversial problems concerning the fair use of tests
with minority groups are encountered when aptitude tests are used as a basis
for educational and vocational selection. Much of the difficulty here is with
the definition of fair test use. One view is that a test is fair or unbiased if
it predicts as accurately for minority groups as it does for the majority
group. This traditional view, which favors a common cutoff score for selection,
has been challenged as being unfair to minority groups because they often earn
lower test scores, and thus a smaller proportion of qualified individuals tends
to be selected. Alternative definitions of test fairness favor some type of
adjustment, such as separate cutoff scores or bonus points, for some
minorities.
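The traditional prediction-based definition can be checked empirically by fitting one prediction line for everyone and then comparing how it errs for each group, a simplified version of what is known as differential prediction analysis. The helper functions below are an illustrative sketch, not a standard API, and any data fed to them would come from a real validation study.

```python
# Sketch of a check on the "equal prediction accuracy" notion of fairness:
# fit a single regression of outcome on test score, then compare the mean
# prediction error (residual) for each group. If one group's mean residual
# is clearly nonzero, the common line systematically mispredicts that group.
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def mean_residual_by_group(scores, outcomes, groups):
    """Mean (actual - predicted) outcome for each group under a common line."""
    slope, intercept = fit_line(scores, outcomes)
    totals, counts = {}, {}
    for x, y, g in zip(scores, outcomes, groups):
        resid = y - (slope * x + intercept)
        totals[g] = totals.get(g, 0.0) + resid
        counts[g] = counts.get(g, 0) + 1
    return {g: totals[g] / counts[g] for g in totals}
```

A clearly positive mean residual for a group means the common line underpredicts that group's actual performance; under this definition of fairness, that would be evidence of biased use of the test for selection.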
Ultimately, the decision as to whether minority group
membership is to be ignored or given special consideration in selection will
not be made by educators or psychologists. The fair use of tests in selection
is part of a larger issue that must be settled by society through court
rulings. Stated in simplified form, the issue is how equal educational and
occupational opportunities can best be provided for members of minority groups
without infringing on the rights of other.
Gender Fairness
The issues of fairness regarding testing and assessment have
also attracted considerable attention in the use and interpretation of tests
and assessments for males and females. For example, the use of scores on tests
such as the Preliminary SAT (PSAT) as the basis for identifying National Merit
Scholars has focused attention on the issue of gender bias in tests in recent
years. On the mathematics section of the PSAT, the average score for males has
been higher than the average score for females, and there have been more males
than females with very high scores for many years. In recent years, the average
score for males is also slightly higher than the average for females on the
verbal section of these tests. As a consequence, a substantially larger percentage of male National Merit Scholars have been identified than female
scholars. As a partial response to this difference, writing scores, on which
females tend to score higher than males, have been added to the mix for
identifying National Merit Scholars, thereby reducing the gender difference in
recognition of scholars. As in the case of differences in scores of racial or
ethnic groups, the existence of a difference in average scores for males and
females does not necessarily imply that the test is biased. There are, for
example, differences in the number of mathematics courses taken by females and
males in high school, and these differences may lead to differences in the
scores on the mathematics tests (Willingham & Cole, 1997). Whatever the
cause of the differences, however, it does not change the fact that use of the
tests alone results in a larger proportion of scholarships being awarded to
males than to females despite the fact that females earn higher grades in
school than males on average. Judgments about the proper use of test scores
must rest on more than technical evaluations of the tests or the degree to
which the scores reflect real differences in knowledge, skills, and developed
abilities rather than un
Intentional biases. Such judgments also involve questions of
social values and social policy. Although not a minority group, women have had
some of the same problems as minorities have in attempting to obtain equal
educational and occupational opportunities. Thus, in the use of test results in
career planning, care needs to be taken so that test scores are not unfairly
used to direct females away from certain occupations. For example, females tend
to score lower than males on mechanical comprehension tests and to have lower
mechanical interest scores. Although these differences probably reflect
cultural influences rather than sex bias in the tests, it would be unfortunate
if such results were used to limit the occupations females might consider as
possible careers.
Willingham and Cole (1997) have reported a comprehensive set
of analyses examining differences in the performance of males and females on a
variety of types of tests and assessments. They also have investigated factors
related to the differences found and comparability in the validity of the
measures when they are used for various purposes, such as predicting subsequent
performance of males and females.
Formative Assessments
Several distinctions can be made between formative
assessments and mandated tests. First, classroom teachers control the timing
and use of formative assessments, whereas mandated tests are controlled from
afar and must be administered on a schedule dictated by the state or district.
Second, formative assessments are flexible and may be adapted to the
instructional needs of individual students, whereas external tests are
standardized and uniform for all students. Moreover, as noted, external tests
provide evaluations of student learning whereas formative assessments provide
measures for learning; that is, the main purpose of formative assessments is to
provide information that can be used to guide instruction and enhance student
learning.
To be effective tools of teaching and learning, formative
assessments must be consistent with important student learning goals (see, for
example, Shepard, 2001). Teachers must be able to control the time that
formative assessments are administered and the choice of tasks that students
are asked to perform. The assessments must provide feedback to students and
teachers (Black & Wiliam, 1998). Rubrics that teachers use to score
constructed responses to assessment tasks must be transparent and made part of
the feedback to students. Clear instructional goals and specifications of
assessment tasks, together with transparent scoring rubrics, can contribute to
improved learning by helping students use self-assessments to monitor their own
progress.
SUMMARY
You are likely to encounter a number of externally mandated testing and assessment programs in schools. As a teacher, you may be directly involved in some of the programs. Others you may simply need to know about so that you can serve as an informed professional in dealing with students, parents, and the public. These programs have generally contributed to expanded testing and assessment in the schools, which in turn have created new trends and raised concerns. You will also have opportunities to construct and use formative assessments. When properly used, the latter assessments can be effective tools for enhancing instruction and student learning. Subsequent chapters provide elaborations of these ideas.
Demands for accountability have led to substantial increases in the amount of testing and assessment in the schools and in the importance that is attached to the scores. Minimum-competency testing programs that required students to pass tests to receive high school diplomas or to be promoted to the next grade grew rapidly throughout the country in the late 1970s. Demands for comparisons of schools, school districts, and states in terms of student achievement test scores led to the introduction of still more testing and assessment requirements and increased the stakes for teachers and school administrators.
In recent years, emphasis has been placed on establishing demanding content and performance standards for all students. The demanding performance standards have increased the stakes associated with assessment-based accountability systems in a number of states. At the same time, the pressure to include all students has presented new challenges in developing assessments and administration procedures that provide adequate accommodations for students with special needs or students who are English-language learners.
The NCLB Act of 2001 requires increases in the amount of testing beyond what states had mandated before the act became law. It reinforces the role of content and student performance standards. Most important, the act requires states to use student test results to hold schools and districts accountable for student achievement in reading or language arts and mathematics and, beginning with the 2007-2008 school year, in science.
Several changes in educational measurement have been broad enough and persistent enough to be identified as trends. In addition to the expanded uses of tests that have just been summarized, developments are changing the nature of testing and assessment. Most notable of these developments is the increased emphasis on performance-based assessment. Concerns that tests and assessments strongly influence what gets taught have led to increased emphasis on making assessments that correspond as closely as possible to complex learning objectives that are not readily measured by conventional tests. Demands for new forms of assessment have led to changes in the approaches being used to measure teacher performance as well as student performance.
Technological developments also promise to change the nature of testing and assessment. Computers provide a means of making tests adaptive to the individual test taker and constructing realistic simulations to test problem-solving skills. On-line, computer-based testing has certain potential advantages over traditional paper-and-pencil tests. In addition to making adaptive testing possible, on-line testing provides immediate access to results and may be especially useful as part of an early warning system and in preparing for high-stakes tests. On-line testing may also provide cost savings. The biggest hurdle to wider implementation of computer-based, on-line testing is the demand for access to computers in the schools that exceeds current availability in many places.
Critics of testing have raised issues concerning the possible consequences of testing in the schools. Most of the criticism has been directed toward standardized tests, including such issues as the nature and quality of the tests, the possible harmful effects of testing on students, and the fairness of tests to minorities and women. It is important to recognize the multiple perspectives on what constitutes fairness in testing and assessment, including the absence of bias, procedural fairness, adequacy of opportunity to learn, and equality of outcomes. Although the nature of the test or assessment itself requires close attention in considering issues of fairness, in many instances it is the use or misuse of the results of tests and assessments that is most critical to achieving fairness.
Formative assessments constructed by classroom teachers differ from external tests in a number of ways. They are less formal and may be administered at different times throughout the school year as deemed appropriate by the needs of individual students. They can contribute to improvements in instruction and student learning by providing timely feedback and by clarifying expectations in ways that students can use for purposes of ongoing self-assessment.