Twice a week, Data Points brings you the latest AI news, tools, models, and research in brief. In today’s edition, you’ll find:
- Figure’s Helix vision-language-action robotics model
- Google fine-tunes its own family of open vision-language models
- SuperGPQA may be the most challenging general knowledge test yet
- Meta creates new framework to evaluate agentic LLMs
But first:
Claude 3.7 Sonnet offers multiple thinking modes
Anthropic’s new Claude 3.7 Sonnet model can operate in both standard and extended thinking modes. In standard mode, the model provides quick responses similar to previous versions, while the extended thinking mode enables visible step-by-step reasoning to improve performance on complex tasks. API users can further control the model’s “thinking budget,” allowing them to balance response speed, cost, and quality by specifying how many tokens Claude can use for reasoning. The company also introduced Claude Code, a command-line tool that enables developers to delegate substantial engineering tasks to Claude directly from their terminal. Claude 3.7 Sonnet shows significant improvements in coding and front-end web development, achieving state-of-the-art performance on software engineering benchmarks like SWE-bench Verified and TAU-bench. (Anthropic)
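The thinking budget is set per request through the API. Below is a minimal sketch of how that might look with Anthropic’s Python SDK; the model ID, token budget, and prompt are illustrative assumptions, so check the official documentation for exact parameter names and limits.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Request extended thinking with an explicit reasoning-token budget.
# Model ID and budget value here are illustrative.
response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=4096,  # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 2048},
    messages=[{"role": "user", "content": "How many primes are there between 100 and 150?"}],
)

# The response can interleave "thinking" blocks with the final "text" answer.
for block in response.content:
    if block.type == "thinking":
        print("[thinking]", block.thinking)
    elif block.type == "text":
        print(block.text)
```

Raising the budget gives the model more room to reason on hard problems at the cost of latency and tokens billed; lowering it keeps responses closer to standard-mode speed.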
DeepSeek AI to open source five repositories over five days
DeepSeek AI announced plans to open source five repositories over five consecutive days starting February 24, 2025. The first “OpenInfra” release, FlashMLA, is an efficient multi-head latent attention (MLA) decoding kernel for Hopper GPUs; it is optimized for variable-length sequences, has been tested in production, and is published under an MIT license. DeepSeek says this initiative aims to share practical, working code with the AI development community, fostering collaboration and accelerating progress in the field. (GitHub)
Helix model offers more adaptability to humanoid robots
Figure AI introduced Helix, a generalist vision-language-action model trained to control humanoid robots’ entire upper bodies using natural language commands. Helix enables robots to manipulate novel objects, allows multiple units to collaborate, and runs on low-power GPUs, making it more practical for commercial deployment. The model shows promise in helping robots generalize to new skills specified through language and adapt to unstructured environments like homes. (Figure AI)
Google’s new optimized PaliGemma 2 mix vision-language models
Google released PaliGemma 2 mix, a set of open, fine-tuned vision-language models based on the previously released PaliGemma 2 family. The new variants come in three sizes (3, 10, and 28 billion parameters) and three image resolutions (224x224, 448x448, and 896x896), offering capabilities in tasks like visual question answering, document understanding, text recognition, and object localization. The release gives AI developers powerful, versatile models that can be further customized for specific downstream vision-language applications. (Hugging Face)
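For a sense of how developers might try one of these checkpoints, here is a rough sketch using the Hugging Face transformers library; the repository ID, prompt format, and image file are assumptions for illustration rather than confirmed details of this release.

```python
# Sketch: run visual question answering with a PaliGemma 2 mix checkpoint.
# The repo ID below is an assumption; see the Hugging Face release for actual names.
from PIL import Image
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor

model_id = "google/paligemma2-3b-mix-448"  # assumed: 3B parameters, 448x448 input
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
processor = PaliGemmaProcessor.from_pretrained(model_id)

image = Image.open("receipt.png")  # any local image
prompt = "answer en What is the total amount?"  # task-prefix prompt style from earlier PaliGemma releases

inputs = processor(text=prompt, images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)

# Decode only the newly generated tokens, skipping the prompt.
generated = output_ids[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(generated, skip_special_tokens=True))
```

Swapping the task prefix (for example to detection or captioning prompts) is how the earlier PaliGemma checkpoints switched between the tasks they were tuned on.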
New benchmark challenges AI models with multidisciplinary questions
SuperGPQA is a new benchmark for evaluating large language models on graduate-level knowledge across 285 specialized subjects, containing over 26,000 challenging multiple-choice questions. Created through a rigorous process involving hundreds of experts and multiple quality checks, its taxonomy organizes questions into 13 broad disciplines and 72 fields, labeled by difficulty. Even top-performing models like DeepSeek-R1 achieved only around 60 percent accuracy, revealing strengths and weaknesses across different model types and domains. SuperGPQA aims to provide a more comprehensive and fine-grained evaluation of language models’ capabilities than existing benchmarks, probing the boundaries of their knowledge and reasoning abilities. (GitHub and arXiv)
Meta unveils MLGym to test AI agents’ research capabilities
Meta researchers introduced MLGym, a new open source benchmark for evaluating and developing large language model agents on research tasks. MLGym-Bench consists of 13 diverse open-ended AI research tasks across domains like computer vision, NLP, reinforcement learning, and game theory, testing agents’ ability to generate ideas, implement methods, run experiments, and improve on baselines. Experiments evaluating several frontier LLMs on MLGym-Bench found that current models can improve on given baselines but do not yet generate novel hypotheses or substantial improvements. Of the models tested, OpenAI’s o1-preview performed the best overall on the MLGym-Bench tasks, followed closely by Gemini 1.5 Pro and Claude 3.5 Sonnet. (arXiv)
Still want to know more about what matters in AI right now?
Read last week’s issue of The Batch for in-depth analysis of news and research.
Last week, Andrew Ng shared a powerful story about how AI saved a police officer’s life, highlighting the impact of Skyfire AI’s drone technology in emergency response.
“Skyfire AI’s drones supported search-and-rescue operations under the direction of the North Carolina Office of Emergency Management and were credited with saving 13 lives.”
Read Andrew’s full letter here.
Other top AI news and research stories we covered in depth: xAI unveiled Grok 3, a new model family trained at scales beyond its predecessors; Replit updated its mobile app to enable full app development using its AI agent; Elon Musk’s $97.4 billion bid for OpenAI was rejected, intensifying the power struggle between Musk and OpenAI; and global leaders at the latest AI summit revealed deep divisions over regulation and governance.