Datasets

53 Posts

Benchmark Tests Are Meaningless: The problem with training data contamination in machine learning

The web is full of correct answers to the very questions used to test large language models. How can we evaluate new models if they’ve studied the answers before we give them the test?
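
One common safeguard is to screen benchmark items for verbatim overlap with the training corpus before reporting scores. Below is a minimal sketch of an n-gram overlap check, not any lab’s actual pipeline; the 8-word window, the toy data, and the choice to drop flagged items are illustrative assumptions.

```python
def ngrams(text, n=8):
    """Return the set of word-level n-grams in a string."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def is_contaminated(test_item, training_docs, n=8):
    """Flag a test item if any of its n-grams appears verbatim in training data."""
    test_grams = ngrams(test_item, n)
    return any(test_grams & ngrams(doc, n) for doc in training_docs)

# Toy example: one benchmark question leaks from the training corpus.
training_docs = [
    "mount everest is the tallest mountain on earth at 8,849 meters above sea level"
]
benchmark = [
    "which mountain is tallest? mount everest is the tallest mountain on earth at 8,849 meters above sea level",
    "who wrote the odyssey?",
]

# Drop flagged items so the model isn't graded on memorized answers.
clean = [q for q in benchmark if not is_contaminated(q, training_docs)]
print(clean)  # only the second, uncontaminated question survives
```
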
(Image: hierarchical k-means clustering data across multiple layers.)

Balancing Web Data Distributions: Automated method organizes large datasets for better model performance

Datasets that were scraped from the web tend to be unbalanced, meaning examples of some classes (say, cats) are plentiful while examples of others (say, caterpillars) are scarce.
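
The approach pictured above clusters the data and then samples evenly across clusters, so overrepresented concepts stop drowning out rare ones. Here is a minimal single-level sketch of that idea; the published method applies k-means hierarchically, and the embeddings, cluster count, and per-cluster quota below are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10_000, 64))  # stand-in for learned data embeddings

# Group the corpus into rough "concepts".
labels = KMeans(n_clusters=100, n_init=10, random_state=0).fit_predict(embeddings)

# Rebalance: take (up to) the same number of examples from every cluster,
# so plentiful classes (cats) no longer swamp scarce ones (caterpillars).
per_cluster = 20
keep = []
for c in range(100):
    members = np.flatnonzero(labels == c)
    take = min(per_cluster, len(members))
    keep.extend(rng.choice(members, size=take, replace=False))

balanced = embeddings[np.array(keep)]
```
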

Data Disappears: Creative workers don't want AI developers to train models on their work

The latest advances in AI are built on freely available training data. What will happen if it becomes off-limits? Creative workers don’t want AI developers to train models on their works without permission or compensation, or at all. Data is vanishing as they scramble to lock it down. 
(Image: fused swarm-box-violin plot of HCR metrics.)

More Scraped Data, Greater Bias: Research shows that training on larger datasets can increase social bias.

How can we build large-scale language and vision models that don’t inherit social biases? Conventional wisdom suggests training on larger datasets, but research challenges this assumption.

News Outlet Challenges AI Developers: The New York Times forbids the use of its work in training datasets.

The New York Times launched a multi-pronged attack on the use of its work in training datasets. The company updated its terms of service to forbid use of its web content and other data for training AI systems.

Sample-Efficient Training for Robots: Reinforcement learning from human feedback to train robots

Training an agent that controls a robot arm to perform a task — say, opening a door — that involves a sequence of motions (reach, grasp, turn, pull, release) can take from tens of thousands to millions of examples...
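
Reinforcement learning from human feedback reduces that burden by learning a reward model from human choices between pairs of behavior clips, rather than hand-labeling every step. Here is a minimal sketch of the standard Bradley-Terry preference loss; the network shape, feature size, and random stand-in data are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps a (state, action) feature vector to a scalar reward."""
    def __init__(self, dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)

model = RewardModel()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Each training pair: a human preferred trajectory segment a over segment b.
preferred = torch.randn(256, 32)  # features of preferred segments
rejected = torch.randn(256, 32)   # features of rejected segments

# Bradley-Terry loss: maximize P(preferred > rejected) = sigmoid(r_a - r_b).
loss = -torch.nn.functional.logsigmoid(model(preferred) - model(rejected)).mean()
opt.zero_grad()
loss.backward()
opt.step()
```

The learned scalar reward then plugs into an ordinary RL loop, so human effort scales with the number of comparisons rather than with environment steps.
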

Stable Biases: Stable Diffusion may amplify biases in its training data.

Stable Diffusion may amplify biases in its training data in ways that promote deeply ingrained social stereotypes.

Finer Tuning: Surgical fine-tuning modifies layers based on data differences.

Fine-tuning a neural network typically involves retraining every layer on new data. But research shows that networks may perform better when fine-tuning modifies only a subset of layers.
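
Here is a minimal PyTorch sketch of the recipe: freeze everything, then unfreeze just the block you want to adapt. The model and the choice of block are illustrative; the research suggests the best subset depends on how the new data differs from the old (roughly, earlier layers for input-level shifts such as image corruptions, later layers for label-level shifts).

```python
import torch
from torchvision.models import resnet18

model = resnet18(weights=None)  # stand-in for a pretrained network

# Freeze every parameter...
for p in model.parameters():
    p.requires_grad = False

# ...then unfreeze a single block to fine-tune "surgically".
for p in model.layer1.parameters():
    p.requires_grad = True

# Optimize only the unfrozen subset.
opt = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3
)
```
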

Training Data Free-For-All: Japan's AI data laws, explained

Amid rising questions about the fairness and legality of using publicly available information to train AI models, Japan affirmed that machine learning engineers can use any data they find.

LAION Roars: The story of LAION, the dataset behind Stable Diffusion

The largest dataset for training text-to-image generators was assembled by volunteers for roughly $10,000. Now it’s implicated in fights over whether copyrighted works can be used for training.

Data Does Not Want to Be Free: Reddit and Stack Overflow ask AI devs to pay for data.

Developers of language models will have to pay for access to troves of text data that they previously got for free. The discussion platform Reddit and question-and-answer site Stack Overflow announced plans to protect their data from being used to train large language models.
(Image: PCA applied to color populations.)

PCA Raises Red Flags: Principal component analysis can negatively impact science.

Principal component analysis is a key machine learning technique for reducing the number of dimensions in a dataset, but new research shows that its output can be inconsistent and unreliable.
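
One way to see the fragility for yourself: fit PCA on two random subsamples of the same dataset and compare the leading axes. This is a minimal sketch on synthetic data, not the paper’s experiments; a low |cosine| between the two components means downstream conclusions can hinge on which sample you happened to analyze.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10)) @ rng.normal(size=(10, 10))  # correlated features

def top_component(sample_idx):
    """Leading principal component of a subsample."""
    return PCA(n_components=1).fit(X[sample_idx]).components_[0]

a = top_component(rng.choice(1000, size=500, replace=False))
b = top_component(rng.choice(1000, size=500, replace=False))

# Compare the two leading axes up to sign; |cosine| near 1 means stable.
cos = abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"|cosine| between leading components: {cos:.3f}")
```
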
(Image: difference in test error when keeping hard versus easy examples.)

Unsupervised Data Pruning: New method removes useless machine learning data.

Large datasets often contain overly similar examples that consume training cycles without contributing to learning. New work shows how to identify and prune these redundant examples even when they’re unlabeled.
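
A minimal sketch in the spirit of that approach: cluster self-supervised embeddings, score each example by its distance to its cluster centroid, and prune the most prototypical (closest) examples as redundant. The cluster count and pruning fraction are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(5000, 64))  # stand-in for self-supervised embeddings

km = KMeans(n_clusters=50, n_init=10, random_state=0).fit(embeddings)

# Score each example by distance to its cluster centroid:
# small distance = prototypical/"easy", large distance = "hard".
dists = np.linalg.norm(embeddings - km.cluster_centers_[km.labels_], axis=1)

# Keep the hardest 70%; prune the 30% most redundant examples.
keep = np.argsort(dists)[int(0.3 * len(dists)):]
pruned_dataset = embeddings[keep]
```

Which examples to keep is reported to flip with scale: when data is plentiful, hard examples carry more information; when it’s scarce, easy examples are the safer bet.
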
(Image: FOLIO dataset example based on the Wild Turkey Wikipedia page.)

Language Models Defy Logic: Large NLP models struggle with logical reasoning.

Who would disagree that, if all people are mortal and Socrates is a person, Socrates must be mortal? GPT-3, for one. Recent work shows that bigger language models are not necessarily better when it comes to logical reasoning.
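
In first-order logic, that syllogism is a single universal instantiation followed by modus ponens:

∀x (Person(x) → Mortal(x)), Person(socrates) ⊢ Mortal(socrates)

Benchmarks like FOLIO (pictured above) pose many such entailment judgments in everyday language to test whether models can follow the same steps.
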
(Image: three projections of data usage, each showing two extrapolations.)

Will We Have Enough Data?

The world’s supply of data soon may fail to meet the demands of increasingly hungry machine learning models. Researchers at Epoch AI found that a shortage of text data could cause trouble as early as this year. Vision data may fall short within a decade.