Why Teacher Evaluation Reforms Haven't Worked--And How They Can

Tying evaluations to growth in reading test scores can be counter-productive and unfair.

Dec 21, 2021

We’ve spent billions trying to improve teacher quality, without much to show for it. Instead of tying teachers’ evaluations to test scores, let’s help them switch to methods grounded in the science of learning.

Back in the 1890s, 99.5% of teachers in New York City were rated “good,” based on observations by their principals. A hundred years later, things hadn’t changed much: virtually all teachers in the U.S. were rated “satisfactory” or better. And yet, average student performance was anything but satisfactory.

In the early 2000s, education reformers seized on the idea that the way to improve education was to boost teacher quality through more rigorous evaluation. They argued that teachers should be rated for, among other things, their ability to boost student test scores in reading and math. And they urged school districts to use those evaluations to reward stronger teachers and provide weaker ones with the help they need—or, if that didn’t work, get rid of them.

It sounded plausible—so plausible that in 2009 the Obama administration used $4.35 billion in federal funds to incentivize states to adopt that approach, among other education reforms. A few years later, the administration offered waivers from the virtually impossible requirements of No Child Left Behind if states overhauled their evaluation systems. These inducements had their effect: the number of states requiring test-score data to be factored into teacher evaluations rose from 15 in 2009 to 43 in 2015. Between 2012 and 2018, by one “conservative estimate,” government expenditures on teacher evaluation reform amounted to $15 billion to $20 billion.

But a recent comprehensive study finds that on balance, none of this worked. Looking at 44 states and Washington, D.C., the researchers analyzed whether “high-stakes” teacher evaluation reforms boosted student test scores, high school graduation rates, or college enrollment. Six districts saw gains, but many others didn’t, and six saw losses. Overall, the study concluded, the reforms had no discernible effect. And over 99% of teachers were still rated satisfactory or better, with few being dismissed for low performance.

That’s pretty much what happened back in the 1890s when a new superintendent tried to reform New York City’s evaluation system, as recounted in Dana Goldstein’s The Teacher Wars. He instituted an A to D scale, but the vast majority of teachers still routinely got a B+. The New York Times declared the reform “a joke.”

The problem with the most recent round of reforms, some experts say, is that they simply weren’t rigorous enough. There was too much pressure from unions not to fire teachers, they argue, and districts were unwilling to spend the money that would have made evaluation effective. Or perhaps, as one of the researchers told an interviewer, most districts don’t have access to “a ready supply” of more effective teachers to replace low-performers.

The researchers point to Washington, D.C., as a genuine “bright spot,” relying largely on a previous study of D.C.’s unusually rigorous IMPACT evaluation system. But their conclusion that IMPACT has boosted student achievement rests, ultimately, on a dissertation finding that scores increased in math but not reading.

That result echoes another study, which—though hailed as a vindication of D.C.’s general approach to education reform—found no improvement in eighth-grade reading, after controlling for changes in student demographics. On national tests, only 26% of eighth-graders in the D.C. public school system score proficient or above in reading; when it comes to Black students, the figure is 14%. So even the “bright spot” in teacher evaluation reform doesn’t look all that bright.

Why hasn’t what seemed like a plausible theory about teacher evaluation worked in practice? One part of the problem is that linking evaluations to test scores ends up being counter-productive, at least in reading. Schools and teachers increase the time kids spend practicing comprehension “skills,” like “making inferences”—the kind of skills that standardized reading tests purport to measure—using easy texts on random topics. As a result, they spend less time on social studies and science. But those subjects are the most likely to build the academic knowledge and vocabulary that kids actually need to understand passages on reading tests.

Standardized tests have their value: they can reveal inequities that might otherwise remain hidden. But when it comes to evaluating teachers and schools, we should put less emphasis on reading scores—not more—because they can mislead educators about what kids need. Passages on reading tests aren’t connected to any particular content, so schools prioritize comprehension skills, especially in the elementary grades.

That can produce short-term gains, but it’s likely to lead to long-term failure. As grade levels go up, tests assume increasing amounts of academic vocabulary. If students haven’t been able to acquire that vocabulary in prior years because of the focus on comprehension “skills,” they’ll be at a disadvantage that no individual teacher can do much to compensate for in the course of a school year. Building knowledge and vocabulary is a gradual, cumulative process that should start as early as possible.

More generally, education orthodoxy—as enshrined in teacher training, textbooks, and curricula—often conflicts with what scientists have discovered about how learning works. For example, teachers are advised not to waste time having kids memorize names and dates but focus instead on “higher-order” skills like critical thinking. Scientists who study the learning process, however, have found that having factual information in long-term memory is what enables people to think critically. If teachers are using methods that science has found are unlikely to work, as many are—and if they’re also being evaluated by people trained to believe in those methods—an evaluation system won’t be able to boost student achievement, no matter how rigorous it is.

Teacher training programs are unlikely to change anytime soon, at least on a large scale. But states and school districts can switch to curricula grounded in approaches that are supported by science—and, crucially, help teachers learn to implement those curricula well. Evidence indicates that approach can have about the same positive effect on student test scores as replacing an average teacher with one who is highly effective.

That suggests that the best way to improve teacher quality is to give teachers instructional materials that can work and ongoing coaching in how to use them—delivered by someone intimately familiar with those materials, whose job is to support teachers rather than evaluate them. Some of this is beginning to happen in school districts across the country, with promising results.

We do need to evaluate teachers’ performance in the classroom, but it’s unfair—and ineffective—to do that when we’re also asking them to use methods that don’t work. And tying evaluations to tests that have nothing to do with the substance of what’s been taught is both counter-productive and unfair to teachers. If we were to spend even a fraction of the billions poured into test-based teacher evaluation on curriculum and professional support aligned with science, we might at last begin to turn the millions of “satisfactory” teachers into highly effective ones.

This post originally appeared on Forbes.com.

Lois Letchford

Dec 22, 2021

Agree with all you have written! I’ve specialized in teaching the most vulnerable kids, and in order to teach them, we as teachers have to get everything right, doing exactly as you suggest here: build background knowledge and provide geographical info, dates, and timelines so the students can make connections to think critically.
