Large Language Models

15 Posts

Berkeley Function Calling Leaderboard with metrics like accuracy, latency, and relevance.
Large Language Models

Competitive Performance, Competitive Prices: Amazon introduces Nova models for text, image, and video

Amazon introduced a range of models that confront competitors head-on.
User entering ZIP code ‘94103’ in U.S. General Election ballot lookup to view contests and candidates.
Large Language Models

Voter’s Helper: Perplexity’s AI-powered U.S. election hub assists voters with verified, real-time news and insights

Some voters navigated last week’s United States elections with help from a large language model that generated output based on verified, nonpartisan information.
Model performance comparison across English, Chinese, Math, and Code tasks, with Hunyuan-Large leading.
Large Language Models

Mixture of Experts Pulls Ahead: Hunyuan-Large outshines open competitors with high benchmark scores

A new open-source large language model outperforms competitors, including the open-weights Llama 3.1 405B, on a variety of benchmarks.
MLE-Bench workflow showing competition steps for model training, testing, and leaderboard scoring.
Large Language Models

When Agents Train Algorithms: OpenAI’s MLE-bench tests AI coding agents

Coding agents are improving, but can they tackle machine learning tasks? 
LLM leaderboard with Chinese models rising in ranks.
Large Language Models

A Year of Contending Forces: State of AI report highlights 2024’s major trends and breakthroughs

A new report documents the interplay of powerful forces that drove AI over the past year: open versus proprietary technology, public versus private financing, innovation versus caution. 
Large Language Models

More, Better Open Source Options: Alibaba releases Qwen 2.5 models, raising the bar for open weight LLMs

The parade of ever more capable LLMs continues with Qwen 2.5.
Large Language Models

Reducing Memorization in LLMs: A technique that masks tokens in large language models, protecting data privacy

Studies have established that large language models can memorize text passages that appear repeatedly in their training data and regurgitate them when prompted, typically via adversarial prompts but occasionally via benign ones.
Covariant robotic arm clutching an Amazon box.
Large Language Models

Amazon Boosted by Covariant: Amazon strengthens logistics and robotics with new AI partnership

Amazon took on talent and technology from robotics startup Covariant to enhance its warehouse automation, an area critical to its core ecommerce business.
Large Language Models

High Gear for Llama 3.1 405B: SambaNova boosts Llama 3.1 performance with fast, free access to largest model

SambaNova raised the speed limit for access to the largest model in the Llama 3.1 family — and it’s free.
OpenAI's model scores on the GPQA Diamond tests in biology, chemistry, and physics, along with their overall score.
Large Language Models

OpenAI Forges Chains of Thought: OpenAI’s o1 models excel in reasoning, outperform GPT-4o in math and coding

Preliminary versions of OpenAI’s new model family were trained explicitly to think step-by-step, yielding outstanding marks in math, science, and coding — but users can’t see their reasoning steps.
Gemma Scope 2
Large Language Models

Making LLMs Explainable: Google’s Gemma Scope probes how large language models think

Researchers have probed the inner workings of individual layers of large language models. A new tool applies this approach to all layers.
Short, Medium and Long Context RAG
Large Language Models

Models Ranked for Hallucinations: Measuring language model hallucinations during information retrieval

How often do large language models make up information when they generate text based on a retrieved document? A study evaluated the tendency of popular models to hallucinate while performing retrieval-augmented generation (RAG). 
Throughput and latency at different context lengths
Large Language Models

Long Context Gets Up to Speed: AI21 Labs’ Jamba 1.5 outpaces transformers in long-text processing

A new model generates tokens faster than current transformers, especially when processing long inputs.
The SWE-bench full leaderboard shows Cosine Genie outperforming its competitors.
Large Language Models

Agentic Coding Strides Forward: Genie coding assistant outperforms competitors on SWE-bench by over 30 percent

An agentic coding assistant boosted the state of the art in an important benchmark by more than 30 percent.
Large Language Models

Conversational Robots: RFM-1, a model that enables robots to understand and act on human commands

Robots equipped with large language models are asking their human overseers for help.
