Think D̶i̶f̶f̶e̶r̶e̶n̶t̶ Small: Apple releases OpenELM, a family of smaller large language models.

Published May 1, 2024

Apple is thinking small — very small — with a new family of open large language models.

What's new: Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, and colleagues at Apple released Open Source Efficient LLM (OpenELM), a family of smaller large language models. OpenELM ranges from 270 million parameters — plenty small enough to fit on a phone — to 3 billion parameters. 

How it works: OpenELM comes in pretrained and instruction-tuned versions with parameter counts of 270 million, 450 million, 1.1 billion, and 3 billion. They can process 2,048 tokens of context. The release includes weights, code for training and inference, and code for running the models on Apple chips. 
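
If you want to poke at the models yourself, the sketch below shows one way to load an instruction-tuned checkpoint with Hugging Face's transformers library. The repository name, tokenizer choice, and generation settings are our assumptions based on common release conventions, not Apple's official inference code.

```python
# Rough sketch of running an instruction-tuned OpenELM checkpoint with Hugging Face
# transformers. The checkpoint name and tokenizer below are assumptions, not Apple's
# official inference code; check the model card for the exact details.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "apple/OpenELM-270M-Instruct"  # assumed Hugging Face repository name
# OpenELM reportedly reuses a Llama-style tokenizer; substitute whatever the model card specifies.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "Summarize layer-wise scaling in one sentence."
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```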

  • The authors pretrained OpenELM on 1.8 trillion tokens drawn from subsets of publicly available text datasets.
  • They fine-tuned the instruction-tuned models on the UltraFeedback dataset of 60 thousand prompts.
  • OpenELM follows most of the architecture choices of current state-of-the-art transformer models, with one major exception: the number of attention heads and the size of the fully connected layers increase with depth, following the idea that later layers learn more complex representations of the input than earlier ones (sketched in code below this list). This contrasts with current common practice, in which a transformer’s number of attention heads and fully connected layer sizes remain consistent throughout the network.
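
To make that scaling concrete, here is a toy sketch (ours, not Apple's code) of per-layer configurations in which attention-head counts and feed-forward widths grow linearly with depth. The bounds and dimensions are purely illustrative.

```python
# Toy illustration of layer-wise scaling: later layers get more attention heads and
# wider feed-forward blocks. The bounds below are illustrative, not OpenELM's actual config.
def layerwise_configs(num_layers, min_heads, max_heads, min_ffn_mult, max_ffn_mult, d_model):
    configs = []
    for i in range(num_layers):
        t = i / (num_layers - 1)  # 0.0 at the first layer, 1.0 at the last
        heads = round(min_heads + t * (max_heads - min_heads))
        ffn_dim = int(round((min_ffn_mult + t * (max_ffn_mult - min_ffn_mult)) * d_model))
        configs.append({"layer": i, "num_heads": heads, "ffn_dim": ffn_dim})
    return configs

# Example: a 16-layer toy model. A standard transformer would use the same
# head count and feed-forward width at every layer.
for cfg in layerwise_configs(16, min_heads=4, max_heads=16,
                             min_ffn_mult=1.0, max_ffn_mult=4.0, d_model=1024):
    print(cfg)
```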

Results: OpenELM beat a number of other open-source models trained solely on publicly available data.

  • For example, on average across five tasks on the Open LLM Leaderboard, a 1.08 billion-parameter OpenELM beat a 1.18 billion-parameter OLMo, 45.93 percent to 43.57 percent, even though OLMo was trained on twice as much data. The 270 million-parameter OpenELM achieved 38.72 percent.
  • Comparing speeds among OpenELM models running on consumer-grade computers, the 270 million-parameter model was more than twice as fast as the 3 billion-parameter version. Apple did not present results obtained on phones.
  • OpenELM fell short on MMLU (multiple-choice questions ranging from mathematics to microeconomics), scoring within 2.05 percentage points of random chance (25 percent) at every model size. To be fair, the other models chosen for comparison didn’t do much better. It’s possible that publicly available data isn’t sufficient for learning to solve MMLU. By comparison, Microsoft’s Phi-3-mini (3.8 billion parameters trained on web data filtered according to “educational level” plus generated data) achieved 68.8 percent accuracy.

Why it matters: After years of growing ever larger, neural networks lately have also been getting smaller. The smallest OpenELMs are tiny compared to, say, Microsoft’s Phi-3-mini. Apple has an extra incentive to build models capable of running on edge devices like phones: The company makes a major selling point of user privacy, and models that run entirely on a smartphone (as opposed to in the cloud) keep the user’s activity under wraps.

We're thinking: DeLighT introduced this layer-scaling approach in 2020. Sometimes it takes a while for good ideas to catch on!
