Machine Learning Research

599 Posts

Flowchart illustrates the POPE method, transitioning from guided to unguided problem-solving in reinforcement learning.

Reinforcement Learning With Hints: Privileged On-Policy Exploration (POPE) trains models to expand on partial solutions

Reinforcement learning can’t train a model to solve a difficult problem if the model doesn’t discover all the right steps.
Performance table shows Nemotron's scores across benchmarks, highlighting its strengths and weaknesses.

Nvidia’s Nemotron Goes Big: Nvidia Nemotron 3 Ultra bets on speed and openness to win customers

Nvidia’s largest-yet model is among the best-performing from a developer based in the U.S. and among the most open developed by anyone.
A line graph compares SWE-Bench Pro and DeepSWE, showing various models' performance percentages.

Agentic Tests Beyond the Bug Hunt: DeepSWE, ProgramBench, and ITBench-AA push agents harder than SWE-bench

SWE-bench, a family of benchmarks that focuses on an LLM’s ability to fix software bugs, is giving way to new tests that evaluate agent software-engineering performance in more challenging ways.
Bar chart shows Claude Fable 5's fallback rates, with ProgramBench at 100% and others varying.

Claude’s Benchmark Problems: Independent tests of Claude Fable 5 run into Anthropic's protective policies

Before Anthropic pulled its latest Claude models from circulation, even professional testers couldn’t readily tell whether they were getting a Mythos-class model or a lesser version under the same name.
Diagram illustrates LLMs processing state-coordinated media, affecting linguistic responses and predictions.

State Media Influences LLM Responses: Significant portions of AI training material reflect national propaganda

Popular large language models have adopted the biases of governments that control the free flow of information, particularly when those models generate output in the languages of countries where such governments are in power, researchers found.
Bar chart shows a sharp rise in code output per person after Claude Code's release, reaching 8x by 2026.

RSI Is the New AGI: What is recursive self-improvement, and why is everybody talking about it?

The phrase recursive self-improvement erupted on social media following an Anthropic report that tracked AI-driven gains in the company’s internal software-engineering productivity.
Chart compares performance of Composer 2.5 against Opus 4.7, GPT-5.5, and Composer 2 in benchmarks.

Cursor Fits Its Model to Its Agent: Composer 2.5 for Cursor rivals GPT-5.5's coding abilities at lower price

Cursor’s latest software engineering model rivals the performance of leading competitors like Claude Opus 4.7 and GPT-5.5 for a fraction of the price.
Claude Mythos 5 excels, achieving top scores in agentic coding and cybersecurity compared to rivals.

Behold Mythos!: Anthropic released Claude Mythos 5 and Claude Fable 5, a public version with safeguards

After months of headlines that teased a large language model with extraordinary capabilities, Anthropic launched Claude Mythos 5, which can crack software previously believed to be secure, and Claude Fable 5, a version for general use that limits what users can do in an unprecedented way.
Flowchart shows book text split, input summary, model training, and memorization testing in LLM workflow.

Fine-Tuning LLMs to Expand on Summaries Unearths Pretraining Texts: Fine-tuning can strip models of copyright alignment guidelines

Fine-tuning large language models on a seemingly benign task that would be useful to writers — expanding plot summaries into paragraphs of polished fiction — causes them to regurgitate substantial portions of books on which they were pretrained.
Flowchart depicting LLMs memorizing and responding to state media, affecting language-specific outputs.

Qwen3.7-Max Adds Speed and Power: Alibaba's latest proprietary model challenges U.S. rivals

Alibaba updated its flagship large language model for long-running agentic work, pushing it into the top rank among LLMs built in China.
Diagram showing step-by-step image creation process, featuring bears, cats, and birds as examples.

Planning Generated Images In Stages: Meta improves image models by plotting and revising generations step-by-step

Text-to-image generators that use diffusion or flow-matching typically compose a whole image at once (although they refine the whole image in steps).
An arm juggles three EU star-adorned rings, representing the EU balancing new AI regulatory amendments.

Europe Pauses Some AI Regulations: European Union regulators delay some AI Act provisions, delete others

The European Union weakened some provisions of its landmark AI Act and delayed others after businesses and policymakers argued the law made European companies less competitive.
Gemini 3.5 Flash shows improved performance, surpassing previous model scores in most benchmarks.

Gemini 3.5 Flash Pairs Smarts With Speed: Google's updated Flash levels up, approaching top models but raising prices

Google’s faster model brings substantive gains at a substantially higher price, part of a rising trend in prices per token.
The chart compares AI benchmark efforts with employment and capital in U.S. job sectors, highlighting discrepancies.

Toward Agent Benchmarks That Reflect Human Work: AI agents may not be getting better at the full range of economically valuable labor

AI agents seem to be increasingly capable of performing economically valuable tasks, but current benchmarks measure this capability only narrowly.
Diagram showing threat actor using AI to find vulnerabilities and bypass two-factor authentication.

Cybersecurity Alarms Grow Louder: Google study shows LLM-generated malware is getting harder to track and stop

An AI-generated script to bypass two-factor authentication signals a dawning era of industrial-scale cyberattacks, according to a Google report.