The parade of ever more capable LLMs continues with Qwen 2.5.
What's new: Alibaba released Qwen 2.5 in several sizes, the API variants Qwen Plus and Qwen Turbo, and the specialized models Qwen 2.5-Coder, Qwen 2.5-Coder-Instruct, Qwen 2.5-Math, and Qwen 2.5-Math-Instruct. Many are freely available for commercial use under the Apache 2.0 license here. The 3B and 72B models are also free, but their license requires special arrangements for commercial use.
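For readers who want to try the open-weights checkpoints, the sketch below shows one plausible way to load and query an instruct variant with Hugging Face's transformers library. The repository ID and generation settings are our assumptions for illustration, not details from the release.

```python
# Minimal sketch of loading an Apache 2.0-licensed Qwen 2.5 instruct model with
# Hugging Face transformers. The repository ID below is an assumption; check the
# Qwen collection on Hugging Face for the exact names.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"  # assumed repository name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain the difference between precision and recall."},
]

# Format the conversation with the model's chat template and generate a reply.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```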
How it works: The Qwen 2.5 family ranges from 500 million to 72 billion parameters.
- Qwen 2.5 models were pretrained on 18 trillion tokens. Sizes up to 3 billion parameters can process up to 32,000 input tokens; the larger models can process up to 128,000 input tokens. All versions can generate up to 8,000 output tokens.
- Qwen 2.5-Coder was further pretrained on 5.5 trillion tokens of code. It can process up to 128,000 input tokens and generate up to 2,000 output tokens. It comes in 1.5B and 7B versions.
- Qwen 2.5-Math was further pretrained on 1 trillion tokens of math problems, including Chinese math problems scraped from the web and problems generated by the earlier Qwen 2-Math-72B-Instruct. Qwen 2.5-Math can process up to 4,000 input tokens and generate up to 2,000 output tokens. It comes in 1.5B, 7B, and 72B versions. In addition to solving math problems directly, Qwen 2.5-Math can generate code to help solve a given math problem (a rough usage sketch follows this list).
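The code-assisted solving mentioned above suggests a tool-integrated style of use. Below is a hedged sketch of prompting a Qwen 2.5-Math instruct checkpoint to reason with Python code; the repository ID and prompt wording are assumptions for illustration, not specifics from the release.

```python
# Hedged sketch: asking a Qwen 2.5-Math instruct checkpoint to write Python code
# that works toward a math problem's answer. The repository ID and the
# "reason with code" system prompt are assumptions for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Math-7B-Instruct"  # assumed repository name

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

problem = "What is the sum of all even integers between 1 and 101?"
messages = [
    {"role": "system", "content": "Solve the problem by writing and reasoning through Python code."},
    {"role": "user", "content": problem},
]

input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
# Keep generation within the model's roughly 2,000-token output budget.
output_ids = model.generate(input_ids, max_new_tokens=1024)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

In a full tool-integrated loop, the generated code would be executed in a sandbox and its output fed back to the model before it states a final answer; that step is omitted here.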
Results: Compared to other models with open weights, Qwen 2.5-72B-Instruct beats Llama 3.1 405B Instruct and Mistral Large 2 Instruct (123 billion parameters) on seven of 14 benchmarks including LiveCodeBench (generating code), MATH (solving math word problems), and MMLU (answering questions on a variety of topics). Compared to other models accessed via API, Qwen Plus beats Llama 3.1 405B, Claude 3.5 Sonnet, and GPT-4o on MATH, LiveCodeBench, and ArenaHard. Smaller versions also deliver outstanding performance. For instance, Qwen 2.5-14B-Instruct outperforms Gemma 2 27B Instruct and GPT-4o mini on seven benchmarks.
Behind the news: Qwen 2.5 extends a parade of ever more capable LLMs that includes Claude 3.5 Sonnet, GPT-4o, and Llama 3.1 as well as the earlier Qwen 2 family.
Why it matters: The new models raise the bar for open weights models of similar sizes. They also rival some proprietary models, offering options to users who seek to balance performance and cost.
We’re thinking: Some companies encourage developers to use their paid APIs by locking their LLMs behind non-commercial licenses or blocking commercial applications beyond a certain threshold of revenue. We applaud Qwen’s approach, which keeps most models in the family open.