Grok 3 Scales Up

Grok 3, xAI's new model family, improves on its predecessors, adds reasoning

AI model comparison on reasoning and test-time compute across math, science, and coding benchmarks.

xAI’s new model family suggests that devoting more computation to training remains a viable path to building more capable AI.

What’s new: Elon Musk’s xAI published a video demonstration of Grok 3, a family of four large language models that includes reasoning and non-reasoning versions as well as full- and reduced-size models. Grok 3 is available to subscribers of X’s Premium+ tier ($40 monthly in the United States; the price varies by country) and will be part of a new subscription service called SuperGrok. The models currently take text input and produce text output, but the company plans to integrate audio input and output in coming weeks.

How it works: xAI has not yet disclosed details about Grok 3’s architecture, parameter counts, training datasets, or training methods. Here’s what we know so far:

  • Grok 3’s processing budget for pretraining was at least 10 times that of its predecessor Grok 2. The processing infrastructure included 200,000 Nvidia H100 GPUs, double the number Meta used to train Llama 4.
  • The team further trained Grok 3 to generate a chain of thought via reinforcement learning mainly on math and coding problems. The models show some reasoning tokens but obscure others, a strategy to stymie efforts to distill Grok 3’s knowledge.
  • Similar to other reasoning models that generate a chain of thought, Grok 3 can spend more processing power at inference to get better results (a rough sketch of this idea follows this list).
  • Three modes enable Grok 3 to spend more processing power: (i) Think, which generates in-depth lines of reasoning; (ii) Big Brain, which is like Think, but with additional computation; and (iii) DeepSearch, an agent that can search the web and compile detailed reports, similar to Google’s Deep Research and OpenAI’s similarly named service.
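The following is a minimal, hypothetical sketch of how extra inference-time compute might be allocated: a larger budget of reasoning tokens plus best-of-n sampling over candidate answers. The mode names echo the article, but the data structures, helper functions, and numbers are illustrative assumptions, not xAI’s actual API or implementation.

```python
import random
from dataclasses import dataclass

@dataclass
class Mode:
    reasoning_token_budget: int  # tokens allowed for the chain of thought
    num_samples: int             # candidate answers generated and ranked

# Hypothetical settings; not xAI's actual configuration.
MODES = {
    "standard":  Mode(reasoning_token_budget=0, num_samples=1),
    "think":     Mode(reasoning_token_budget=8_192, num_samples=1),
    "big_brain": Mode(reasoning_token_budget=32_768, num_samples=8),
}

def generate_answer(prompt: str, budget: int) -> str:
    """Placeholder for a single model call; a real system would return model text."""
    return f"candidate answer to {prompt!r} using up to {budget} reasoning tokens"

def score(answer: str) -> float:
    """Placeholder verifier or reward model; here it just returns a random score."""
    return random.random()

def solve(prompt: str, mode_name: str = "think") -> str:
    mode = MODES[mode_name]
    # More samples and a larger reasoning budget cost more compute but tend
    # to produce better answers on hard math and coding problems.
    candidates = [
        generate_answer(prompt, mode.reasoning_token_budget)
        for _ in range(mode.num_samples)
    ]
    return max(candidates, key=score)

print(solve("What is the sum of the first 100 primes?", "big_brain"))
```

The trade-off is straightforward: each extra sample or reasoning token costs compute, but on hard math and coding problems the additional search tends to pay off.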

Results: The Grok 3 family outperformed leading models in math (AIME 2024), science (GPQA), and coding (LiveCodeBench).

  • Non-reasoning models: Grok 3 and Grok 3 mini outperformed Google Gemini 2 Pro, DeepSeek-V3, Anthropic Claude 3.5 Sonnet, and OpenAI GPT-4o on all three datasets. On AIME 2024, Grok 3 achieved 52 percent accuracy, Grok 3 mini achieved 40 percent accuracy, and the next best model, DeepSeek-V3, achieved 39 percent accuracy.
  • Reasoning models: Grok 3 Reasoning Beta and Grok 3 mini Reasoning (set to use a large but unspecified amount of computation at inference) outperformed OpenAI o3-mini (set to high “effort”), OpenAI o1, DeepSeek-R1, and Google Gemini 2 Flash Thinking. For instance, on GPQA, Grok 3 Reasoning Beta achieved 85 percent accuracy, Grok 3 mini Reasoning achieved 84 percent, and the next best model, o3-mini, achieved 80 percent accuracy.

Behind the news: Reasoning models are pushing benchmark scores steadily upward, especially in challenging areas like math and coding. Grok 3, with its ability to reason over prompts, search the web, and compile detailed reports, arrives hot on the heels of OpenAI’s Deep Research and o3-mini and Google’s Gemini 2 Flash Thinking, which offer similar capabilities.

Why it matters: Grok 3 is a substantial achievement — especially for a company that’s less than two years old — and it pushes the state of the art forward by ample margins. But its significance may go further. Research into scaling laws indicates that model performance improves predictably as training computation increases. While xAI has not disclosed the amount of processing used to train Grok 3, the number of GPUs in its cluster suggests that the company applied a massive amount.
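For intuition, here is a minimal sketch of the scaling-law relationship, using the loss form fitted in the Chinchilla work (Hoffmann et al., 2022), L(N, D) = E + A/N^α + B/D^β. The constants are the published Chinchilla fits; they illustrate the general trend only and say nothing about Grok 3’s architecture or training run.

```python
# Chinchilla-style scaling law: predicted pretraining loss as a function of
# parameter count N and training tokens D. Constants are the fits reported by
# Hoffmann et al. (2022); purely illustrative, unrelated to Grok 3 itself.
E, A, B = 1.69, 406.4, 410.7
ALPHA, BETA = 0.34, 0.28

def predicted_loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**ALPHA + B / n_tokens**BETA

# Compute scales roughly as 6 * N * D for dense transformers, so multiplying
# both N and D by sqrt(10) corresponds to roughly 10x the training compute.
base = predicted_loss(7e10, 1.5e12)                       # e.g., 70B params, 1.5T tokens
scaled = predicted_loss(7e10 * 10**0.5, 1.5e12 * 10**0.5)
print(f"predicted loss at base compute:  {base:.3f}")
print(f"predicted loss at ~10x compute:  {scaled:.3f}")
```

Under this kind of power law, predicted loss keeps dropping as compute grows, which is why a roughly 10x pretraining budget, as xAI reports for Grok 3, is a plausible route to better benchmark scores.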

We’re thinking: Grok 3’s performance makes a case for both massive compute in pretraining and additional compute at inference. Running in its usual mode, Grok 3 mini Reasoning outperformed OpenAI o3-mini set at high effort on AIME 2024, GPQA, and LiveCodeBench. With an unspecified amount of additional compute, its performance on those benchmarks shot further upward by a substantial margin.
