Smaller Is Beautiful
Compact AI models redefine efficiency, bringing advanced capabilities to everyday devices

For years, the best AI models got bigger and bigger. But in 2024, some popular large language models were small enough to run on a smartphone.

What happened: Instead of putting all their resources into building big models, top AI companies promoted families of large language models that offer a choice of small, medium, and large. Model families such as Microsoft Phi-3 (in versions of roughly 3.8 billion, 7 billion, and 14 billion parameters), Google Gemma 2 (2 billion, 9 billion, and 27 billion), and Hugging Face SmolLM (135 million, 360 million, and 1.7 billion) specialize in small.

Driving the story: Smaller models have become more capable thanks to techniques like knowledge distillation (in which a larger teacher model is used to train a smaller student model to match its output), parameter pruning (which removes less-influential parameters), quantization (which reduces neural network sizes by representing each parameter with fewer bits), and greater attention to curating training sets for data quality. Beyond performance, speed, and price, the ability to run on relatively low-powered hardware is a competitive advantage for a variety of uses.
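To make the distillation recipe concrete, here is a minimal sketch in PyTorch. It is an illustration, not any lab's actual training code: the toy teacher and student networks, temperature, and loss weighting are all assumptions chosen for brevity.

```python
# Minimal knowledge-distillation sketch (PyTorch). The architectures,
# temperature, and loss weighting are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend ordinary cross-entropy with a KL term that pushes the
    student's softened output distribution toward the teacher's."""
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)  # scaling used in Hinton et al. (2015)
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Toy usage: a larger frozen "teacher" guides a much smaller "student."
teacher = nn.Sequential(nn.Linear(128, 1024), nn.ReLU(), nn.Linear(1024, 10)).eval()
student = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

x = torch.randn(32, 128)              # stand-in batch of inputs
labels = torch.randint(0, 10, (32,))  # stand-in labels
with torch.no_grad():
    teacher_logits = teacher(x)       # teacher weights stay fixed

optimizer.zero_grad()
loss = distillation_loss(student(x), teacher_logits, labels)
loss.backward()
optimizer.step()
```

The temperature softens both distributions so the student learns from the teacher's relative confidence across all classes rather than only its top prediction.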

  • Model builders have offered model families that include members of various sizes since at least 2019, when Google introduced the T5 family (five models between roughly 77 million parameters and 11 billion parameters). The success of OpenAI’s GPT series, which over time grew from 117 million parameters to a hypothesized 1.76 trillion parameters, demonstrated the power of bigger models. OpenAI researchers formulated scaling laws that appeared to guarantee that bigger models, training sets, and compute budgets would lead to predictable improvements in performance. This finding spurred rivals to build larger and larger models.
  • The tide started to turn in 2023, when Meta released Llama 2 with open weights in parameter counts of roughly 7 billion, 13 billion, and 70 billion.
  • In December 2023, Google launched the Gemini family, including Gemini Nano (1.8 billion parameters). In February 2024, it released the small, open-weights Gemma family (2 billion and 7 billion parameters), followed by Gemma 2 (9 billion and 27 billion, with a 2 billion-parameter version added later).
  • Microsoft introduced Phi-2 (2.7 billion parameters) in December 2023 and Phi-3 (3.8 billion, 7 billion, and 14 billion) in April 2024.
  • In August 2024, Nvidia released its Minitron models, using a combination of distillation and pruning to shrink Llama 3.1 from 8 billion to 4 billion parameters and Mistral NeMo from 12 billion to 8 billion, boosting speed and lowering computing costs while maintaining nearly the same accuracy.
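The pruning side of a recipe like Minitron’s can be reduced, for illustration, to simple magnitude pruning: zero out the weights with the smallest absolute values. Real pipelines use structured, importance-based pruning followed by distillation and retraining, so the sketch below (with an arbitrary layer size and sparsity level) only shows the basic idea.

```python
# Toy magnitude-pruning sketch (PyTorch): zero the smallest-magnitude weights.
# Production pruning is structured and importance-based; this is only a sketch.
import torch
import torch.nn as nn

def magnitude_prune_(layer: nn.Linear, sparsity: float = 0.5) -> None:
    """In place, zero the `sparsity` fraction of weights closest to zero."""
    with torch.no_grad():
        w = layer.weight
        k = int(w.numel() * sparsity)
        if k == 0:
            return
        # Threshold = k-th smallest absolute value in the weight matrix.
        threshold = w.abs().flatten().kthvalue(k).values
        mask = (w.abs() > threshold).to(w.dtype)
        w.mul_(mask)  # keep large weights, zero the rest

layer = nn.Linear(512, 512)
magnitude_prune_(layer, sparsity=0.5)
density = layer.weight.ne(0).float().mean().item()
print(f"fraction of nonzero weights after pruning: {density:.2f}")  # about 0.50
```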

Behind the news: Distillation, pruning, quantization, and data curation are longstanding practices. But they have never before yielded models with such a favorable ratio of capability to size, arguably because the large models being distilled, pruned, or quantized have never been so capable.

  • In 1989, Yann LeCun and colleagues at Bell Labs published “Optimal Brain Damage,” which showed that deleting weights selectively could reduce a model’s size and, in some cases, improve its ability to generalize.
  • Quantization dates to 1990, when E. Fiesler and colleagues at the University of Alabama demonstrated various ways to represent the parameters of a neural network in “A Weight Discretization Paradigm for Optical Neural Networks.” It made a resurgence in the 2010s as neural networks grew in popularity and size, spurring refinements such as quantization-aware training and post-training quantization (a toy example of the latter appears after this list).
  • In 2006, Rich Caruana and colleagues at Cornell published “Model Compression,” showing how to train a single model to mimic the performance of multiple models. Geoffrey Hinton and colleagues at Google Brain followed in 2015 with “Distilling the Knowledge in a Neural Network,” which improved the work of Caruana et al. and introduced the term distillation to describe a more general way to compress models.
  • Most of the current crop of smaller models were trained on datasets that were carefully curated and cleaned. Higher-quality data makes it possible to get more performance out of fewer parameters. This is an example of data-centric AI, the practice of improving model performance by improving the quality of training data.
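As noted above, post-training quantization can be sketched in a few lines: map each float32 weight onto one of 256 integer levels and keep a single scale factor to map back. Real systems typically quantize per channel, calibrate on sample data, and handle outliers; this toy version exists only to show why fewer bits per parameter shrink a model.

```python
# Toy post-training quantization sketch: symmetric per-tensor int8 weights.
# Real pipelines quantize per channel and calibrate activations as well.
import torch
import torch.nn as nn

def quantize_int8(w: torch.Tensor):
    """Round float weights to int8 levels; return the ints plus a scale."""
    scale = w.abs().max() / 127.0                      # one scale per tensor
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale

layer = nn.Linear(1024, 1024)
q, scale = quantize_int8(layer.weight.data)
error = (dequantize(q, scale) - layer.weight.data).abs().mean().item()
print(f"storage: {q.numel()} bytes (int8) vs. {layer.weight.numel() * 4} bytes (fp32)")
print(f"mean absolute reconstruction error: {error:.6f}")
```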

Where things stand: Smaller models dramatically widen the options for cost, speed, and deployment. As researchers find ways to shrink models without sacrificing performance, developers are gaining new ways to build profitable applications, deliver timely services, and distribute processing to the edges of the internet.
