Researchers found serious flaws in an influential language dataset, highlighting the need for better documentation of data used in machine learning.
What’s new: Northwestern University researchers Jack Bandy and Nicholas Vincent investigated BookCorpus, which has been used to train at least 30 large language models. They found several ways it could impart social biases.
What they found: The researchers highlighted shortcomings that undermine the dataset’s usefulness.
- BookCorpus purportedly contains the text of 11,038 ebooks made available for free by online publisher Smashwords. But the study found that only 7,185 of the files were unique; some were duplicated as many as five times, and nearly 100 contained no text at all. (A rough audit along these lines is sketched after this list.)
- By analyzing words related to various religions, the researchers found that the corpus focuses on Islam and Christianity while largely ignoring Judaism, Hinduism, Buddhism, Sikhism, and atheism. This could bias trained models with respect to religious topics.
- The collection is almost entirely fiction and skews heavily toward certain genres. Romance novels, the biggest genre, account for 26.1 percent of the dataset. The authors suggest that text from those books could carry gender-related biases.
- The dataset’s compilers did not obtain consent from the people who wrote the books, several hundred of which include statements that forbid making copies.
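The kinds of checks the researchers ran, counting duplicates and tallying topic-related words, are straightforward to reproduce in spirit. Below is a minimal sketch that assumes the ebooks are plain-text files in a local directory; the directory name and keyword lists are illustrative assumptions, not the authors' actual code or lexicon.

```python
import hashlib
from collections import Counter
from pathlib import Path

# Hypothetical location of the downloaded ebook files.
BOOKS_DIR = Path("bookcorpus_files")

# Illustrative keyword lists; the study used its own set of religion-related terms.
RELIGION_TERMS = {
    "christianity": ["church", "jesus", "christ", "bible"],
    "islam": ["islam", "muslim", "quran", "mosque"],
    "judaism": ["jewish", "torah", "synagogue", "rabbi"],
    "hinduism": ["hindu", "vedas", "temple", "karma"],
    "buddhism": ["buddha", "buddhist", "dharma", "monastery"],
}

content_hashes = Counter()   # how many files share each exact text
empty_files = 0
term_counts = Counter()      # total keyword mentions per religion

for path in BOOKS_DIR.glob("*.txt"):
    text = path.read_text(encoding="utf-8", errors="ignore")
    if not text.strip():
        empty_files += 1
        continue
    # Hash the full text to detect exact duplicates.
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    content_hashes[digest] += 1
    # Crude, case-insensitive keyword counting.
    word_freq = Counter(text.lower().split())
    for religion, terms in RELIGION_TERMS.items():
        term_counts[religion] += sum(word_freq[t] for t in terms)

total_files = sum(content_hashes.values()) + empty_files
print(f"{total_files} files, {len(content_hashes)} unique texts, {empty_files} empty")
print("Most duplicated text appears", max(content_hashes.values(), default=0), "times")
for religion, count in term_counts.most_common():
    print(f"{religion}: {count} keyword mentions")
```

Hashing full texts catches only verbatim copies; fuzzier matching would likely surface even more overlap between files.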
Behind the news: The study’s authors were inspired by previous work by researchers Emily Bender and Timnit Gebru, who proposed a standardized method for reporting how and why datasets are designed. In a later paper, the pair described how a lack of information about what goes into datasets can create “documentation debt”: costs incurred when undocumented data issues cause problems in a model’s output.
Why it matters: Skewed training data can have substantial effects on a model’s output. Thorough documentation can warn engineers of a dataset’s limitations, nudge researchers to build better datasets, and perhaps even prevent inadvertent copyright violations.
We’re thinking: If you train an AI model on a library full of books and find it biased, you have only your shelf to blame.