A growing number of companies that sell standardized tests are using natural language processing to assess writing skills. Critics contend that these language models don’t make the grade.
What happened: An investigation by Motherboard found that several programs designed to grade English-language essays show bias against minorities and some students who speak English as a second language. Some models gave high scores to computer-generated essays that contained big words but little meaning.
What they found: Models trained on human-graded papers learn to correlate surface patterns such as vocabulary, spelling, sentence length, and subject-verb agreement with higher or lower scores (a toy sketch of this feature-based approach follows the list below). Some experts say the models amplify the biases of human graders.
- In 2018, the publishers of the E-Rater — software used by the GRE, TOEFL, and many states — found that their model gave students from mainland China a 1.3 point lift (on a scale of 0 to 6). It seems that Chinese students, while scoring low on grammar and mechanics, tend to write long sentences and use sophisticated vocabulary.
- The same study found that E-Rater docked African-American students by 0.81 points, on average, due to biased assessments of grammar, writing style, and organization.
- Motherboard used BABEL to generate two essays of magniloquent gibberish. Each received two scores of 4 out of 6 from the GRE’s online ScoreItNow! practice tool.
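To make the mechanism concrete, here’s a minimal, hypothetical sketch of the feature-based approach described above: shallow surface features (word count, sentence length, word length, lexical variety) fed to a linear regressor fit on human-assigned scores. This is not E-Rater’s actual code or any vendor’s system; the features, essays, and scores are made up for illustration, and the sketch assumes numpy and scikit-learn are installed.

```python
# Toy sketch of a feature-based essay scorer (illustrative only, not any
# vendor's real system). It extracts shallow surface features and fits a
# linear model to human-assigned scores, showing why verbose, big-word
# essays can score well regardless of meaning. All data below is invented.
import numpy as np
from sklearn.linear_model import Ridge

def surface_features(essay: str) -> list[float]:
    """Shallow features of the kind automated graders reportedly rely on."""
    words = essay.split()
    sentences = [s for s in essay.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    n_words = len(words)
    avg_sentence_len = n_words / max(len(sentences), 1)          # longer sentences
    avg_word_len = sum(len(w) for w in words) / max(n_words, 1)  # "sophisticated" vocabulary
    type_token_ratio = len({w.lower() for w in words}) / max(n_words, 1)  # lexical variety
    return [n_words, avg_sentence_len, avg_word_len, type_token_ratio]

# Hypothetical training data: essays with human-assigned scores on a 0-6 scale.
train_essays = [
    "The cat sat. It was nice.",
    "Industrialization precipitated profound socioeconomic transformations across continents.",
    "I like school because school is fun and school is good.",
    "Multifaceted paradigms invariably engender consequential epistemological ramifications.",
]
train_scores = [2.0, 5.0, 2.5, 5.5]

X = np.array([surface_features(e) for e in train_essays])
model = Ridge(alpha=1.0).fit(X, train_scores)

# Because the model rewards long words and long sentences regardless of
# meaning, polysyllabic gibberish can come back with a respectable score.
gibberish = ("Quintessential obfuscation notwithstanding, perspicacious interlocutors "
             "habitually promulgate grandiloquent yet vacuous disquisitions.")
print(round(float(model.predict([surface_features(gibberish)])[0]), 2))
```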
Behind the news: At least 21 U.S. states use NLP to grade essays on standardized tests for public schools. Of those, 18 also employ human graders to spot-check a small percentage of papers at random.
Why it matters: Standardized tests help determine access to education and jobs for millions of Americans every year. Inappropriate use of NLP could be robbing them of life-changing opportunities.
We’re thinking: The company behind E-Rater is the only one that publishes studies on its grading model’s shortcomings and what it’s doing to fix them. Colleges and school boards should lead the charge in demanding that other test providers do the same.