Machine Learning Research

586 Posts

The chart compares AI benchmark efforts with employment and capital in U.S. job sectors, highlighting discrepancies.

Toward Agent Benchmarks That Reflect Human Work: AI agents may not be getting better at the full range of economically valuable labor

AI agents seem to be increasingly capable of performing economically valuable tasks, but current benchmarks measure this capability only narrowly.

Diagram showing threat actor using AI to find vulnerabilities and bypass two-factor authentication.

Cybersecurity Alarms Grow Louder: Google study shows LLM-generated malware is getting harder to track and stop

An AI-generated script to bypass two-factor authentication signals a dawning era of industrial-scale cyberattacks, according to a Google report.

Performance data table displays metrics for conversational models, emphasizing TML-Interaction-Small's results.

Built-In Conversational Interactivity: Thinking Machines reveals its first interaction model, a new type of multimodal AI

Conversational models typically wait for a turn before they respond.

A woman in martial arts attire faces off against a cartoon lobster in a futuristic cityscape.

Hermes Agent Challenges OpenClaw: OpenClaw created a class of personal agents; upstart Hermes Agent is outworking it

OpenClaw, the immensely popular AI agent, has fast-rising competition.

Map with UK sites; flowchart depicts mammogram study steps, highlighting AI’s role alongside doctors.

AI Mammogram Diagnosis Under Real-World Conditions: Two studies test Google's breast cancer detection models in clinics

Introduced in 2020, Google’s AI system for detecting breast cancer in mammograms still hasn't been used to diagnose current patients.

Graph depicts GPT-Realtime-2's performance across sectors, competing with other speech-to-speech models.

OpenAI Challenges Speech-to-Speech Leaders: Realtime API updates audio models that reason, transcribe, and translate

An update of OpenAI’s speech-to-speech model lets developers tune the tradeoff between speed and reasoning.

Chart compares U.S. and PRC AI model performance over time, highlighting Elo scores and increasing trends.

U.S. to Evaluate Upcoming Models: U.S. government will test AI models for national security risks, other hazards prior to release

The U.S. government said it will evaluate cutting-edge models before they’re available to the public, a sharp reversal of the White House’s earlier hands-off policy.

Diagram showing sequential task learning steps with images of robotic tasks and flow arrows.

Robots That Adapt to New Tasks: Sony and university researchers train robots on new tasks without catastrophic forgetting

Neural networks can forget how to perform earlier tasks as they learn new ones.

Infographic showing Nvidia's chip design flow, highlighting placer, router, and optimization stages.

How Nvidia Uses AI to Design Chips: Chipmaker's models design circuits, verify designs, and test new layouts

Nvidia’s chief scientist dreams of telling an AI model to design a new GPU, then skiing for a couple of days while the system does the job.

Through a rainy window, a pizza worker prepares food beneath menu boards and a red neon "Pizza" sign.

ByteDance Bids for Video Leadership: ByteDance adds state-of-the-art Seedance 2.0 video to CapCut, while OpenAI retreats

As OpenAI prepares to shut down Sora, ByteDance made its own video generation model available to hundreds of millions of users.

Graphs compare human and LLM performance strategies in rock-paper-scissors, highlighted by stars.

Strategic Thinking in LLMs vs. Humans: Researchers at UT-Austin and Google model human decision-making in Rock-Paper-Scissors

While large language models can behave in human-like ways, the similarities are superficial. A simple strategy game revealed clear differences in their strategic approaches.

Table highlights Kimi K2.6's dominance in agentic tasks with 86.3 and coding at 58.6, surpassing other models.

Kimi K2.6 Challenges Open-Weights Champs: Kimi K2.6 matches open Qwen3.6 Max and DeepSeek V4, falls just behind top closed models

Moonshot AI’s updated Kimi model handles longer autonomous coding sessions and scales up its multi-agent orchestration relative to its predecessor.

GPT-5.5 leads in Terminal-Bench 2.0 with 82.7% score, highlighting performance contrast against competitors.

GPT-5.5 Outperforms, Hallucinates: OpenAI’s latest model tops leaderboards for coding, visual puzzles, and overall intelligence

The latest update of OpenAI’s flagship model sets new states of the art in important benchmarks but has difficulty distinguishing between what it does and doesn't know.

A graph shows assistant behavior shifting between helpful and role-playing, with conversation bubbles.

Assistants That Assist Consistently: Large language models can drift from helpful personas to harmful ones, but new research aims to stabilize them

Typically, large language models are trained to act as helpful, harmless, honest assistants. However, during long or emotionally charged conversations, traits can emerge that are less beneficial. Researchers devised a way to steady the assistant personas of LLMs.

A humanoid robot with teal and white elements handles metal parts in bins on a factory floor.

Humanoid Robots Work Factory Floors: Agility's Digit humanoid robots fetch and carry bins at a Schaeffler auto-parts factory, displacing humans into higher-level jobs

A small number of humanoid robots have made their way into industrial settings, where they’re roughly matching the cost of human labor and propelling some workers into higher-level roles.
