To Evaluate Writing, Go With Your Gut--And Those of Many Others
Writing quality is hard to assess reliably, but a new approach that aggregates many individual comparative judgments can make it easier.
Writing skill is hard to measure on standardized tests, and as a result many teachers give writing short shrift. But a system of gut-level judgments, done on a mass scale, could change that.
With the advent of state-mandated reading and math tests, the curriculum in many schools has narrowed to those two subjects. Lip service is paid to the importance of teaching writing, but it’s been difficult to test it reliably. While the technology is improving, computers may give high marks for long words and sentences even if the writing makes no sense. And human judgments about writing are highly subjective. Attempts to counteract that subjectivity with checklists or “rubrics” lead to the same problems that occur with computers. If an evaluator is told to give points for the use of adverbs, she might give a high mark to a sentence like “Colorless green ideas sleep furiously.”
Without high-stakes writing tests, many teachers haven’t placed much weight on teaching writing—and, to make matters worse, most have never been trained to teach it. The results are evident: only about a quarter of students score at or above the proficient level on national assessments, and surveys have found that employers feel writing ability is one of the biggest gaps in workplace readiness. “My students can’t write a clear sentence to save their lives,” one college English professor has complained.
But now there’s evidence that a surprising new method of evaluating writing is more efficient and more reliable than anything tried before. Instead of a computer or a rubric, it uses gut-level judgments by multiple readers, who are shown pairs of samples written in response to the same prompt. They click on the one they think is better, spending as little as twenty seconds on the decision. Then they’re shown another pair—and so on.
This approach, dubbed “comparative judgment,” is being pioneered by a London-based organization called No More Marking (“marking” being the British term for grading), founded by testing expert Dr. Chris Wheadon. The underlying principle is that when people compare two things, their judgments are more accurate than when they’re trying to evaluate a single item. (To test that out, try playing a color-judging game on the No More Marking website.)
The phrase “no more marking” is music to the ears of many teachers in England, who are required to evaluate enormous amounts of student writing. Government-mandated exams avoid the American multiple-choice format in favor of essay questions, and teachers of seven- and eleven-year-olds must submit several writing samples for each child across the school year. In addition, government inspectors regularly examine students’ composition books and “have traditionally expected to see lots of marking,” on the basis of a rubric, according to Daisy Christodoulou, No More Marking’s director of education. The crushing workload has led to high levels of teacher turnover, and there’s little evidence the system has had a positive impact on student achievement.
I tested out comparative judgment through one of the webinars No More Marking periodically offers. There were 45 participants judging a total of 15 writing samples by students in “Year 3,” the equivalent of second grade. Each student had been shown a picture of a perilous-looking bridge leading to a castle and asked to imagine themselves in the picture, needing to get to the castle. I found it sometimes easy to decide which of two samples was better and sometimes a bit tricky. But, as Christodoulou urged, I went with my gut. The result? There was almost total agreement among the participants on the relative quality of the samples. (The judgments are combined using an algorithm that creates a consistent scaled score for each piece of writing.)
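Under the hood, that combining step is typically a statistical model for paired comparisons. The sketch below is a minimal illustration of the general idea in Python, using a Bradley-Terry model fit by gradient ascent; this is an assumption for illustration only, not No More Marking's actual algorithm, and the sample IDs, learning rate, and regularization value are placeholders.

```python
import math
from collections import defaultdict

def fit_scores(judgments, n_iters=500, lr=0.1, reg=0.01):
    """judgments: list of (winner_id, loser_id) pairs, one per individual comparison."""
    scores = defaultdict(float)  # latent quality score per writing sample, starts at 0
    for _ in range(n_iters):
        grads = defaultdict(float)
        for winner, loser in judgments:
            # Bradley-Terry: probability the winner beats the loser depends on the score gap
            p_win = 1.0 / (1.0 + math.exp(scores[loser] - scores[winner]))
            grads[winner] += 1.0 - p_win   # nudge the winner's score up
            grads[loser] -= 1.0 - p_win    # nudge the loser's score down
        for sample in scores:
            # small shrinkage term keeps scores finite for samples that never lose
            scores[sample] += lr * (grads[sample] - reg * scores[sample])
    return dict(scores)

# Example: three samples ("A", "B", "C") and five judgments from different readers
judgments = [("A", "B"), ("A", "C"), ("B", "C"), ("A", "B"), ("C", "B")]
print(fit_scores(judgments))
```

Because every sample ends up with a score on a common scale, even two samples that were never directly compared can be ranked against each other.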
Teachers and schools can sign up with No More Marking at no cost to judge how well their own students are doing in relation to each other; some say it’s helped them recognize aspects of good writing that aren’t captured by a rubric and provide better feedback to students. For a fee, No More Marking also allows schools to participate in mass judging events. In November 2018, for example, 421 schools in England and abroad participated in a writing “moderation” involving over 5,000 teachers. They evaluated writing samples from over 20,000 Year 3 students, based on the same prompt used for samples judged in the webinar—with very similar results.
A recent government report in England found that comparative judgment is as reliable as having two teachers grade the same writing sample, and “significantly quicker.” But Wheadon has pointed out that the report understated the power of the approach. It was based on only five judgments per sample, whereas No More Marking recommends a minimum of ten—and the report found that reliability continues to rise as the number of judgments grows. While it’s still not clear that comparative judgment could work on a national scale, the report recommended further research.
Still, there are caveats. One is that students may find it easier to compose fictional narratives than expository or persuasive essays that require knowledge of a specific topic—especially if they lack that knowledge. No More Marking recently held a “judging window” that asked over 20,000 Year 4 students to write a magazine article persuading other children to take up a hobby. As with an earlier nonfiction prompt, Christodoulou observed, some students didn’t have much to say “and ended up repeating themselves.” One alternative would be to have schools use their own prompts, based on the content they’re teaching, but it’s hard to compare samples on different topics. And although England, unlike the United States, has a national curriculum, it’s not detailed enough to provide a basis for prompts that would be fair to students at different schools.
Comparative judgment may also become less reliable as the quality of writing improves. Two beautifully written essays from students at upper grade levels may be hard to rank. But given the challenges so many students currently face in writing, that’s a problem we can only hope to confront at some point in the future.
This year, a New York City public school is participating in No More Marking’s judging sessions—the first American public school to do so. Still, it’s not clear when, if ever, education authorities in the United States will begin testing writing in a meaningful way, using comparative judgment or something else. In any event, comparative judgment may be able to determine whether students are writing well, but it can’t tell teachers how to turn struggling writers into competent ones.
What it could do, however, is enable comparisons between schools using different approaches, revealing which ones work better. That kind of evidence is sadly lacking, leaving teachers who want to focus on writing desperate for guidance. They may be susceptible to the claims of writing gurus who, relying on research they’ve conducted themselves, assert that their approaches get impressive results. Some teachers come up with their own methods, asserting that the key to success is to have kids edit each other’s work, or that students should simply write voluminously with little feedback. But the fact is, those teachers are simply going with their gut. And apparently, that only works when numerous guts are involved.
This post originally appeared on Forbes.com.