Building a foundation model takes tremendous amounts of data. In the coming year, I hope we’ll enable models to learn more from less data.
The AI community has achieved remarkable success by scaling up transformers and datasets. But this approach may be reaching a point of diminishing returns, a belief that has become increasingly widespread in the pretraining community as it works to train next-generation models. In any case, the current approach poses practical problems. Training huge models on huge datasets consumes enormous amounts of time and energy, and we're running out of new sources of data for training large models.
The fact is, current models consume far more data than humans require to learn. We've known this for a while, but we've ignored it because scaling has been so amazingly effective. It takes trillions of tokens to train a model, yet a human becomes a reasonably intelligent being on orders of magnitude less data. So there's an enormous gap in sample efficiency between our best models and humans. Human learning demonstrates that some learning algorithm, objective function, architecture, or combination thereof can learn far more sample-efficiently than current models do.
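As a rough illustration of the gap, consider a back-of-envelope comparison. The figures below are order-of-magnitude assumptions for illustration, not measurements of any particular model or person:

```python
# Rough, order-of-magnitude comparison of the training data a frontier LLM sees
# versus the language input a person receives while growing up.
# Both figures are loose assumptions for illustration, not measurements.

MODEL_TRAINING_TOKENS = 15e12   # assumed: ~15 trillion tokens, in line with recent frontier models
HUMAN_WORDS_BY_ADULTHOOD = 1e8  # assumed: ~100 million words heard or read by early adulthood

ratio = MODEL_TRAINING_TOKENS / HUMAN_WORDS_BY_ADULTHOOD
print(f"ratio of model tokens to human words: ~{ratio:,.0f}x")
# ~150,000x, i.e., roughly five orders of magnitude less sample-efficient
```

Even if these assumptions are off by an order of magnitude in either direction, the gap remains enormous.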
One of the keys to solving this problem is enabling models to produce higher-level abstractions and filter out noise. I believe this concept, and thus the general problem of data efficiency, is related to several other current problems in AI:
- Data curation: We know that the specific data we use to train our models is extremely important. It's an open secret that most of the work that goes into training foundation models these days is about the data, not the architecture. Why is this? I think it's because our models don't learn efficiently. We have to do the work ahead of time to prepare the data for a model, which may undercut the core potential of AI as an automatic process for learning from data.
- Feature engineering: In deep learning, we have consistently moved toward more general approaches. Since the start of the deep learning revolution, we've progressively removed handcrafted features such as edge detectors in computer vision and n-grams in natural language processing. But that engineering has simply moved to other parts of the pipeline. Tokenization, for instance, engineers features implicitly (see the sketch after this list). This suggests there's still plenty of room to build architectures that are more data-efficient and better able to handle raw modalities and data streams.
- Multimodality: The key to training a model to understand a variety of data types together is figuring out the core abstractions they share and relating them to each other. This should enable models to learn from less data by leveraging all the modalities jointly, which is a core goal of multimodal learning.
- Interpretability and robustness: For us to determine why a model produced the output it did, the model needs to form higher-level abstractions, and we need to be able to track how it captures them. The better a model does this, the more interpretable it should be, the more robust it should be to noise, and likely the less data it should need for learning.
- Reasoning: Extracting higher-level patterns and abstractions should allow models to reason over them more effectively. In turn, better reasoning should mean a model needs less training data.
- Democratization: State-of-the-art models are expensive to build, and that includes the cost of collecting and preparing enormous amounts of data. Few players can afford to do it. This makes developments in the field less applicable to domains that lack sufficient data or wealth. Thus more data-efficient models would be more accessible and useful.
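To illustrate the tokenization point in the feature-engineering item above, here is a minimal, hypothetical sketch. The toy vocabulary and greedy longest-match rule are my own assumptions rather than any particular tokenizer's method, but the effect is the same: someone decides ahead of time which character sequences the model gets to see as single units.

```python
# A minimal sketch of how tokenization acts as implicit feature engineering.
# The vocabulary below is hand-chosen for illustration; real tokenizers (e.g., BPE)
# learn merges from a corpus, but either way the units are fixed before training.

text = "unbelievable"

# Raw view: the model would have to learn all structure from bytes alone.
raw_bytes = list(text.encode("utf-8"))

# Tokenized view: an assumed subword vocabulary, applied longest-match-first.
vocab = ["un", "believ", "able", "e", "b", "l", "i", "v", "a", "u", "n"]

def greedy_tokenize(s, vocab):
    """Greedily split s into the longest matching vocabulary pieces."""
    pieces, i = [], 0
    while i < len(s):
        match = max((v for v in vocab if s.startswith(v, i)), key=len, default=s[i])
        pieces.append(match)
        i += len(match)
    return pieces

print(raw_bytes)                     # 12 bytes with no built-in structure
print(greedy_tokenize(text, vocab))  # ['un', 'believ', 'able'] -- morphology baked in
```

A model fed the tokenized view gets morphological structure for free; a model fed raw bytes has to discover that structure itself. That is exactly the kind of feature decision that has quietly moved from the architecture into the data pipeline.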
Considering data efficiency in light of these other problems, I believe they're all related, though it's not clear which is the cause and which are the effects. If we solve interpretability, the mechanisms we engineer may enable models to extract better features, yielding more data-efficient models. Or we may find that greater data efficiency leads to more interpretable models.
Either way, data efficiency is fundamentally important, and progress in that area will be an indicator of broader progress in AI. I hope to see major strides in the coming year.
Albert Gu is an Assistant Professor of Machine Learning at Carnegie Mellon University and Chief Scientist of Cartesia AI. He appears on Time’s list of the most influential people in AI in 2024.