A new model from Hangzhou upstart DeepSeek delivers outstanding performance and may change the equation for training costs.
What’s new: DeepSeek-V3 is an open large language model that outperforms Llama 3.1 405B and GPT-4o on key benchmarks and achieves exceptional scores in coding and math. The weights are freely available except for prohibited applications such as military uses, harming minors, and generating false information. You can download them here.
How it works: DeepSeek-V3 is a mixture-of-experts (MoE) transformer that comprises 671 billion parameters, of which 37 billion are active for any given token. The team trained the model in 2.79 million GPU hours, less than one-tenth the time required to train Llama 3.1 405B (which DeepSeek-V3 substantially outperforms), at an extraordinarily low cost of $5.6 million.
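A back-of-the-envelope check shows how a figure like $5.6 million follows from the GPU-hour count. The snippet below assumes a rental rate of roughly $2 per GPU hour for the H800 chips DeepSeek used; the rate is an assumption for illustration, not a number reported here.

```python
# Back-of-the-envelope estimate of the training bill (illustrative only).
# The $2-per-GPU-hour rental rate is an assumption, not a figure from this article.
gpu_hours = 2.79e6          # reported total GPU hours for training
rate_per_gpu_hour = 2.00    # assumed rental price in dollars per GPU hour

total_cost = gpu_hours * rate_per_gpu_hour
print(f"Estimated training cost: ${total_cost / 1e6:.2f} million")  # roughly $5.6 million
```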
- The developers trained it on roughly 15 trillion tokens, including a higher proportion of coding and math data than DeepSeek-V2. They fine-tuned it on a wide variety of tasks using output generated by DeepSeek-R1 and DeepSeek-V2.5. They further sharpened its performance across diverse domains using the reinforcement learning algorithm known as group relative policy optimization (sketched after this list).
- Earlier work showed that training to predict the next two tokens improves performance over learning to predict just one. Following that finding, the model learned to predict the first token as usual and used an additional set of layers to predict the second. The additional layers aren’t used at inference (a simplified sketch appears after this list).
- Following DeepSeek-V2, DeepSeek-V3 uses multi-head latent attention, which saves memory at inference relative to other variants of attention by compressing the keys and values it caches (see the attention sketch after this list).
- Also like DeepSeek-V2, the new model combines dedicated (routed) and shared experts. For each token, it chooses eight of 256 routed experts, and a shared expert processes all tokens as well (see the routing sketch below).
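Group relative policy optimization dispenses with PPO’s learned value function: for each prompt, it samples a group of outputs and uses the group’s own reward statistics as the baseline. The PyTorch sketch below illustrates that core idea at the sequence level; it omits the KL penalty to a reference model, and the function and variable names are illustrative rather than DeepSeek’s code.

```python
import torch

def grpo_loss(logp_new, logp_old, rewards, clip_eps=0.2):
    """Simplified group-relative policy loss for one prompt (illustrative sketch).

    logp_new, logp_old: sequence log-probabilities of each sampled output under
    the current and old policies, shape (group_size,).
    rewards: scalar reward for each sampled output, shape (group_size,).
    Omits the KL penalty to a reference model used in the full algorithm.
    """
    # Group-relative advantage: normalize each reward against the other samples
    # drawn for the same prompt instead of using a learned value function.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)

    # PPO-style clipped surrogate objective on the probability ratios.
    ratios = torch.exp(logp_new - logp_old)
    unclipped = ratios * advantages
    clipped = torch.clamp(ratios, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```

Because the baseline comes from the sampled group itself, there is no separate value network to train, which keeps the reinforcement learning stage relatively cheap.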
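The two-token objective can be sketched as a standard next-token head plus an extra block and head trained with an auxiliary loss and dropped at inference. The PyTorch below is a toy illustration under those assumptions, not DeepSeek’s exact multi-token-prediction module; the module names and the loss weight lam are made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTokenPredictor(nn.Module):
    """Toy two-token-prediction wrapper (illustrative, not DeepSeek's module)."""
    def __init__(self, backbone, d_model, vocab_size):
        super().__init__()
        self.backbone = backbone                           # any causal LM trunk -> (B, T, d_model)
        self.head_next = nn.Linear(d_model, vocab_size)    # predicts token t+1 (kept at inference)
        self.extra_block = nn.Sequential(                  # training-only layers for token t+2
            nn.Linear(d_model, d_model), nn.GELU(), nn.Linear(d_model, d_model))
        self.head_next2 = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        h = self.backbone(tokens)                          # hidden states
        return self.head_next(h), self.head_next2(self.extra_block(h))

def two_token_loss(logits1, logits2, tokens, lam=0.3):
    """Cross-entropy on the next token plus a weighted loss on the token after it."""
    loss1 = F.cross_entropy(logits1[:, :-1].flatten(0, 1), tokens[:, 1:].flatten())
    loss2 = F.cross_entropy(logits2[:, :-2].flatten(0, 1), tokens[:, 2:].flatten())
    return loss1 + lam * loss2                             # lam is an arbitrary auxiliary weight
```

At inference, only the backbone and head_next run, which matches the article’s note that the additional layers aren’t used.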
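Multi-head latent attention reduces the memory needed during generation by caching one small latent vector per token and reconstructing keys and values from it, rather than caching full keys and values for every head. The sketch below captures that mechanism in simplified form; it omits DeepSeek’s decoupled rotary position embeddings, query compression, and causal masking, and the dimensions are arbitrary.

```python
import torch
import torch.nn as nn

class LatentKVAttention(nn.Module):
    """Simplified multi-head latent attention (illustrative; not DeepSeek's exact layer)."""
    def __init__(self, d_model=256, n_heads=8, d_latent=32):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        self.kv_down = nn.Linear(d_model, d_latent)   # compress each token to a small latent
        self.k_up = nn.Linear(d_latent, d_model)      # reconstruct keys from the latent
        self.v_up = nn.Linear(d_latent, d_model)      # reconstruct values from the latent
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, latent_cache=None):          # x: (B, T, d_model)
        B, T, _ = x.shape
        latent = self.kv_down(x)                      # only this (B, T, d_latent) tensor is cached
        if latent_cache is not None:                  # during decoding, append to cached latents
            latent = torch.cat([latent_cache, latent], dim=1)
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(latent).view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(latent).view(B, -1, self.n_heads, self.d_head).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head ** 0.5, dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return self.out_proj(y), latent               # return the latent as the new cache
```

Caching d_latent numbers per token instead of per-head keys and values is what shrinks memory during execution.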
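In code, routed-plus-shared experts amount to a learned router that scores the routed experts, keeps the top eight per token, and adds their weighted outputs to the output of the always-on shared expert. The PyTorch sketch below shows the idea with a naive per-token loop and small arbitrary dimensions; it leaves out DeepSeek-V3’s load-balancing strategy and efficient expert-parallel dispatch.

```python
import torch
import torch.nn as nn

class RoutedPlusSharedMoE(nn.Module):
    """Sketch of an MoE layer with 256 routed experts (top-8 per token) and one
    shared expert (illustrative; omits load balancing and efficient dispatch)."""
    def __init__(self, d_model=64, d_hidden=128, n_routed=256, top_k=8):
        super().__init__()
        self.top_k = top_k

        def make_expert():
            return nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                                 nn.Linear(d_hidden, d_model))

        self.routed = nn.ModuleList([make_expert() for _ in range(n_routed)])
        self.shared = make_expert()                    # one shared expert processes every token
        self.router = nn.Linear(d_model, n_routed)     # scores the routed experts

    def forward(self, x):                              # x: (n_tokens, d_model)
        scores = torch.softmax(self.router(x), dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)           # choose 8 of 256 experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)    # renormalize the gate weights
        routed_out = []
        for t in range(x.shape[0]):                    # naive loop; real systems batch tokens by expert
            routed_out.append(sum(weights[t, j] * self.routed[int(idx[t, j])](x[t])
                                  for j in range(self.top_k)))
        return self.shared(x) + torch.stack(routed_out)
```

Only the eight selected experts plus the shared one run for a given token, which is how a model with 671 billion total parameters can activate only 37 billion of them at a time.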
Results: In DeepSeek’s tests, DeepSeek-V3 outperformed Llama 3.1 405B and Qwen 2.5 72B across the board, and its performance compared favorably with that of GPT-4o.
- DeepSeek-V3 showed exceptional performance in coding and math tasks. In coding, it led in five of the seven benchmarks tested, though a public leaderboard shows it trailing o1 on one of those five: on Polyglot, which tests a model’s ability to generate code in response to difficult requests in multiple programming languages, DeepSeek-V3 (48.5 percent accuracy) beat Claude 3.5 Sonnet (45.3 percent accuracy) but lost to o1 (61.7 percent accuracy).
- In language tasks, it performed neck-and-neck with Claude 3.5 Sonnet, achieving higher scores in some tasks and lower in others.
Behind the news: OpenAI’s o1 models excel thanks to agentic workflows in which they reflect on their own outputs, use tools, and so on. DeepSeek swims against the tide and achieves superior results without relying on agentic workflows.
Why it matters: Open models continue to challenge closed models, giving developers high-quality options that they can modify and deploy at will. But the larger story is DeepSeek-V3’s shockingly low training cost. The team doesn’t explain precisely how the model achieves outstanding performance with such a low processing budget. (The paper credits “meticulous engineering optimizations.”) But it’s likely that DeepSeek’s steady refinement of MoE is a key factor. DeepSeek-V2, also an MoE model, cut training costs by more than 40 percent relative to the earlier DeepSeek 67B, which didn’t employ MoE. In 2022, Microsoft found that an MoE model cost one-fifth as much to train as a dense model of equal performance, and Google and Meta reported that MoE models outperformed dense models trained on the same number of tokens.
We’re thinking: If they can be replicated, DeepSeek’s results have significant implications for the economics of training foundation models. If indeed it now costs around $5 million to build a GPT-4o-level model, more teams will be able to train such models, and the cost of competing with the AI giants could fall dramatically.