Anthropic’s Claude 3.7 Sonnet implements a reasoning approach that lets users decide how much thinking they want the model to do before it renders a response.
What’s new: Claude 3.7 Sonnet was trained for strong performance in coding and front-end web development, with less emphasis on math and computer-science competition problems. It implements tool use and computer use (but not web search) and lets users toggle between immediate responses and extended thinking mode, which can improve outputs by allocating a specific number of tokens to reasoning at inference. Like DeepSeek-R1 and Google Gemini Flash Thinking — and unlike OpenAI o1 — Claude 3.7 Sonnet fully displays reasoning tokens. Anthropic considers this functionality experimental, so it may change.
- Input/output: text and images in (up to 200,000 tokens), text out (up to 128,000 tokens).
- Availability/price: Via Anthropic's Free (extended thinking not available), Pro, Team, and Enterprise plans; the Anthropic API; Amazon Bedrock; and Google Cloud Vertex AI. $3/$15/$15 per million input/output/thinking tokens (see the cost sketch after this list).
- Undisclosed: parameter count, architecture, training data, training method.
- Anthropic also introduced Claude Code, a command-line tool for AI-assisted coding, which is available as a limited research preview. Claude Code can edit files, write and run tests, commit and push code to GitHub, and use command-line tools.
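Because thinking tokens bill at the output rate, a generous reasoning budget can dominate the cost of a request. A quick back-of-the-envelope calculation in Python (the token counts are illustrative, not from Anthropic):

```python
# Claude 3.7 Sonnet pricing: $3 per million input tokens, $15 per million
# output tokens; thinking tokens are billed at the same rate as output.
INPUT_RATE = 3 / 1_000_000
OUTPUT_RATE = 15 / 1_000_000

# Hypothetical request: modest prompt and answer, large thinking budget.
input_tokens, thinking_tokens, output_tokens = 10_000, 30_000, 5_000
cost = input_tokens * INPUT_RATE + (thinking_tokens + output_tokens) * OUTPUT_RATE
print(f"${cost:.3f}")  # $0.555 -- the 30,000 thinking tokens alone cost $0.45
```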
How it works: Anthropic pretrained Claude 3.7 Sonnet on a mix of public and proprietary data (which explicitly did not include Claude users' inputs and outputs). The team fine-tuned the model using constitutional AI, which encourages a model to follow a set of human-crafted rules.
- When the model’s extended thinking mode is enabled, API users can control the thinking budget by specifying a number of tokens up to 128,000; a minimal API sketch appears after this list. (The specified budget is a rough target, so the number of tokens consumed may differ.)
- Anthropic says that extended thinking mode is often more effective given a general instruction to “think deeply” than step-by-step instructions.
- Visible thinking tokens are considered a research preview while Anthropic examines how they affect user interactions with the model. The company highlights three issues: visible thinking tokens don’t reflect the model’s internal instructions that establish its character, so they can seem devoid of personality; they may not reflect the model’s actual reasoning process; and they can reveal flaws that malicious actors may exploit.
- Extended thinking mode processes tokens serially, but Anthropic is experimenting with parallel thinking, which follows multiple independent thought processes and chooses among their answers by majority vote (a sketch of this kind of voting follows below).
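The API sketch promised above shows how a thinking budget is set. It uses Anthropic’s Python SDK as documented at launch; the model snapshot name and token figures are illustrative, and max_tokens must exceed the thinking budget. Following Anthropic’s prompting advice, the prompt asks the model to think deeply rather than prescribing steps.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=20_000,  # must be larger than the thinking budget below
    thinking={"type": "enabled", "budget_tokens": 16_000},  # rough target, not exact
    messages=[{
        "role": "user",
        "content": "Think deeply: are there infinitely many primes p with p % 4 == 3?",
    }],
)

# The response interleaves visible "thinking" blocks with the final "text" answer.
for block in response.content:
    if block.type == "thinking":
        print("[thinking]", block.thinking[:200], "...")
    elif block.type == "text":
        print(block.text)
```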
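Anthropic hasn’t detailed its parallel-thinking mechanism, but the general idea resembles self-consistency decoding: sample several independent reasoning passes and keep the answer they agree on most. A minimal sketch, assuming a hypothetical ask_model callable that runs one extended-thinking query and returns a final answer string:

```python
from collections import Counter

def majority_vote(ask_model, prompt: str, n_samples: int = 8):
    """Run n_samples independent reasoning passes and return the most common
    final answer, plus the fraction of passes that agreed on it.
    An illustration of majority voting, not Anthropic's actual code."""
    answers = [ask_model(prompt) for _ in range(n_samples)]
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / n_samples
```

In practice the passes would run concurrently; the serial loop above is written for clarity only.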
Performance: Claude 3.7 Sonnet shows exceptional performance in general knowledge, software engineering, and agentic tasks.
- On the GPQA Diamond (graduate-level science questions), Claude 3.7 Sonnet achieved 84.8 percent in parallel extended thinking mode with a 64,000-token budget. By comparison, X’s Grok 3 beta achieved 84.6 percent (majority voting with 64 tries), and OpenAI’s o3-mini achieved 79.7 percent with high effort.
- On SWE-Bench Verified, which evaluates the ability to solve real-world software engineering problems, Claude 3.7 Sonnet achieved 70.3 percent without extended thinking, averaged over 16 trials. OpenAI’s o3-mini achieved 49.3 percent with high effort, and DeepSeek-R1 achieved 49.2 percent with extended thinking (32,000 tokens).
- TAU-bench evaluates agentic reasoning. On the Retail subset, which assesses performance in product recommendations and customer service, Claude 3.7 Sonnet achieved 81.2 percent without extended thinking, outperforming OpenAI’s o1 (73.5 percent). On the Airline subset, which measures multi-step reasoning in tasks like flight bookings and customer support, Claude 3.7 Sonnet achieved 58.4 percent, likewise ahead of o1 (54.2 percent).
- On AIME 2024 (competitive high-school math problems), Claude 3.7 Sonnet achieved 80.0 percent in parallel extended thinking mode with a 64,000-token budget. On this test, it underperformed o3-mini with high effort (87.3 percent) and o1 (83.3 percent).
Behind the news: Anthropic’s approach refines earlier efforts to let users control the incremental expense of computing extra tokens at inference. For instance, OpenAI o1 offers three levels of reasoning “effort,” each of which allocates more tokens to reasoning, while X’s Grok 3 offers two.
Why it matters: Test-time compute, or additional processing at inference, is powerful but expensive, and not all tasks benefit from it. So it’s helpful to let users choose how much to apply. Claude 3.7 Sonnet improves on its predecessor’s general performance and provides an ample budget for additional reasoning.
We’re thinking: The cost of inference is rising as agentic workflows and other compute-intensive tasks become more widely used. Yet the cost of AI on a per-token basis is falling rapidly. Intelligence is becoming steadily cheaper and more plentiful.