o3-mini tops the AIME 2025 math leaderboard, AlphaGeometry2 solves even more Olympiad-level problems

Published
Feb 10, 2025
Reading time
3 min read

Twice a week, Data Points brings you the latest AI news, tools, models, and research in brief. In today’s edition, you’ll find:

  • Replit’s agentic app for building more apps
  • ASAP experiments with training techniques for agile robots
  • Hugging Face’s smaller, open language model beats competitors
  • IBM uses RL to add reasoning to its open Granite models

But first:

MathArena tests AI models’ math skills with recent benchmarks

A new website from researchers at ETH Zurich's SRI Lab and INSAIT tests large language models on recent math competitions to assess their reasoning and generalization capabilities. The site exclusively uses competitions held after a model's release (including the new AIME 2025) to ensure uncontaminated evaluation, and publishes leaderboards showing model performance on individual problems and across all competitions. This rigorous approach aims to provide standardized, comparable assessments of AI models' mathematical problem-solving abilities, including each model's cost to complete the test. Currently o3-mini-high leads the pack, solving 80 percent of the AIME 2025 problems at a cost of $3.19, followed by o1 and DeepSeek-R1, which both achieved lower accuracy at higher costs. (MathArena)

Updated AI system matches top geometry competitors

Google DeepMind’s AlphaGeometry2 made significant progress on International Mathematical Olympiad geometry problems, solving 84% of those posed in IMO competitions between 2000 and 2024, a level comparable to top human contestants. Key improvements to the system include an expanded domain language covering locus theorems and linear equations, a faster symbolic engine, and a novel algorithm combining multiple search trees. While AlphaGeometry2 excels at many problems, some of the most challenging IMO questions remain unsolved, indicating areas for future development. (arXiv and TechCrunch)

Replit launches agent-powered app creation tool for mobile devices

Replit updated its iOS and Android apps to include Agent, an AI-powered software creation tool. The company also expanded access to its existing Agent desktop tool and added a free tier for all users. Agent allows users to build and deploy apps through natural language conversations, handling coding, databases, integrations, and hosting without requiring a laptop. A new platform allows users to share their apps with others. This development could introduce software-development tools to a less technical audience, lowering the barriers to entry for app creation and sharing across devices. (Replit)

Two-stage framework boosts humanoid robot agility

Carnegie Mellon and Nvidia researchers developed ASAP, a two-stage framework that addresses the mismatch between simulated and real-world robot dynamics. The method pretrains motion tracking policies using human motion data, then collects real-world data to train a model that compensates for dynamics differences, significantly improving agility and coordination across various motions. This breakthrough could accelerate the development of robots capable of performing complex, expressive, human-like tasks in multiple environments. (Human2Humanoid and arXiv)
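The delta-action idea behind ASAP can be illustrated with a toy example. Everything below (the 1-D dynamics, the linear delta model, the function names) is an invented sketch of the concept, not the paper's implementation: a learned correction to the policy's actions makes the simulator's transitions match real-world transitions, closing the sim-to-real gap.

```python
import numpy as np

def sim_step(x, a):
    # Simulated dynamics (illustrative 1-D system).
    return x + 0.1 * a

def real_step(x, a):
    # "Real" dynamics: same as simulation plus an unmodeled drift.
    return x + 0.1 * a + 0.05

def policy(x):
    # Stage 1 stand-in: a "pretrained" proportional policy tracking 0.
    return -2.0 * x

# Stage 2: roll out the policy in the real world, record the residual
# between real and simulated next states, and fit a linear delta-action
# model that explains it.
X, Y = [], []
x = 1.0
for _ in range(200):
    a = policy(x)
    x_next = real_step(x, a)
    # The action correction that would make sim_step reproduce x_next.
    Y.append((x_next - sim_step(x, a)) / 0.1)
    X.append([x, a, 1.0])
    x = x_next

coef, *_ = np.linalg.lstsq(np.array(X), np.array(Y), rcond=None)

def delta_action(x, a):
    return float(np.array([x, a, 1.0]) @ coef)

# The corrected simulator now matches reality, so the policy can be
# fine-tuned in simulation without the dynamics mismatch.
x, a = 0.3, policy(0.3)
assert abs(sim_step(x, a + delta_action(x, a)) - real_step(x, a)) < 1e-3
```

In the actual framework the dynamics are a full humanoid, the delta model is a neural network, and the policy is fine-tuned with reinforcement learning inside the corrected simulator; the sketch above only captures the fit-a-residual-correction structure.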

Hugging Face updates its small model with big data

Hugging Face researchers developed SmolLM2, a 1.7 billion parameter language model that achieves strong performance by training on 11 trillion tokens of carefully curated data. They used a multi-stage training process mixing web text with specialized math, code, and instruction-following datasets, including new datasets they created to address limitations in existing ones. The resulting model outperforms other recent smaller language models like Qwen2.5-1.5B and Llama3.2-1B on various benchmarks, including MMLU and TriviaQA. SmolLM2 also comes in 360 million and 135 million parameter versions, all available under an Apache 2.0 license. (Hugging Face and arXiv)

IBM adds reasoning capabilities to its open 8B model

IBM released a preview of new reasoning capabilities for its upcoming Granite 3.2 language model. The preview, available under an Apache 2.0 license on Hugging Face and for free at watsonx.ai, applies reinforcement learning to Granite’s existing 8 billion parameter model, improving its scores on multiple reasoning benchmarks while preserving Granite’s safety features. Unlike DeepSeek’s smaller models, IBM’s approach adds reasoning abilities without relying on model distillation, a choice that appears to offer more balanced performance across diverse AI tasks. (IBM)


Still want to know more about what matters in AI right now?

Read last week’s issue of The Batch for in-depth analysis of news and research.

Last week, Andrew Ng explored how AI is enabling a new generation of ‘10x professionals’ across various industries, not just in engineering, by transforming workflows and amplifying impact within and across teams.

“For many jobs that primarily involve applying knowledge or processing information, AI will be transformative. In a few roles, I’m starting to see tech-savvy individuals coordinate a suite of technology tools to do things differently and start to have, if not yet 10x impact, then easily 2x impact. I expect this gap to grow.”

Read Andrew’s full letter here.

Other top AI news and research stories we covered in depth: OpenAI launched o3-mini, a faster and more cost-effective reasoning model excelling in coding, math, and science; UI-TARS demonstrated strong performance in computer use benchmarks, showing its ability to interact with desktop and mobile interfaces; Google’s update to Gemini 2.0 Flash Thinking outperformed DeepSeek-R1 on key benchmarks; and Moshi, an open-source alternative to OpenAI’s Realtime API, showcased its always-on speech-to-speech interactions.


Subscribe to Data Points


Your accelerated guide to AI news and research