This week’s top AI stories, handpicked for you:
• A powerful new open source coding model
• MixEval reevaluates top LLMs
• Meta’s Chameleon available for research
• Microsoft drops its custom GPT Builder
But first:
Claude 3.5 Sonnet outperforms Claude 3 Opus and GPT-4o while running faster and costing less
Sonnet, the first model in the forthcoming Claude 3.5 family, is available for free on Claude.ai and through a paid API, with a 200K-token context window and pricing of $3 per million input tokens and $15 per million output tokens. On a range of benchmarks, including MMLU, GPQA-Diamond, and HumanEval, the new model outperforms Anthropic's current Opus model and beats or rivals GPT-4o. In an internal agentic coding evaluation, Claude 3.5 Sonnet solved 64% of problems, showcasing its ability to fix bugs, add functionality, and migrate codebases given natural-language instructions. (Anthropic)
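For developers who want to try it, a minimal sketch using Anthropic's Python SDK might look like the following; the model identifier is the one Anthropic published at launch, so verify the current name in the docs before relying on it:

```python
# pip install anthropic
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Send a short coding request to Claude 3.5 Sonnet.
# Model identifier assumed from Anthropic's launch docs; check for updates.
message = client.messages.create(
    model="claude-3-5-sonnet-20240620",
    max_tokens=1024,
    messages=[
        {
            "role": "user",
            "content": "Write a Python function that checks whether a string is a palindrome.",
        }
    ],
)

print(message.content[0].text)
```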
Luma AI releases Dream Machine, a new AI video tool
While its capabilities differ from those of OpenAI’s Sora, Dream Machine performs well when animating still images, capturing realistic motion, facial expressions, and emotion given the right prompts. The tool has limitations, such as object morphing and unrealistic character motion, but it offers a creative playground for AI enthusiasts to explore AI-generated video. Dream Machine is part of a new wave of powerful models that widen access to text-to-video and image-to-video generation. (Luma Labs)
New DeepSeek-Coder-V2 model matches GPT-4 Turbo in code tasks
DeepSeek-Coder-V2, an open source Mixture-of-Experts (MoE) language model available in 16-billion- and 236-billion-parameter versions, was pretrained on 6 trillion additional tokens beyond its predecessor’s corpus. The new model supports 338 programming languages with a context length of 128,000 tokens, up from 86 languages and a 16K context length. DeepSeek-Coder-V2 outperforms both its predecessor and leading general-purpose LLMs like GPT-4 Turbo on HumanEval and other code-related benchmarks. (GitHub)
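To experiment locally, a minimal sketch of loading the smaller variant with Hugging Face Transformers might look like this; the repository id deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct is our assumption from the release, so check the model card for the exact name and hardware requirements:

```python
# pip install transformers torch
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Repository id assumed from the DeepSeek release; confirm on Hugging Face.
model_id = "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",  # spread the MoE weights across available GPUs
    trust_remote_code=True,
)

prompt = "# Write a quicksort function in Python\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```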
MixEval: a new approach to evaluating large language models
MixEval and MixEval-Hard match web-mined user queries with similar questions from existing benchmarks, aiming to provide a comprehensive, impartial, and efficient assessment of LLMs. The benchmarks correlate highly with user-facing evaluations like Chatbot Arena but are far faster and cheaper to run, and they can be updated dynamically to prevent contamination over time. Currently, Claude 3.5 Sonnet leads on both MixEval and MixEval-Hard, with GPT-4o just behind. (GitHub)
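The matching idea can be illustrated with a toy sketch; this is not MixEval's actual code, and the example data and embedding model are stand-ins. The sketch embeds web-mined queries and benchmark questions, then pairs each query with its nearest benchmark item:

```python
# pip install sentence-transformers
# Illustrative sketch of benchmark mixing, NOT MixEval's implementation.
from sentence_transformers import SentenceTransformer, util

# Hypothetical data: the real pipeline mines queries from the web at scale.
web_queries = [
    "how do vaccines create immunity",
    "write a regex to match email addresses",
]
benchmark_questions = [
    "Explain how vaccination induces an adaptive immune response.",
    "Provide a regular expression that validates email addresses.",
    "What year did the French Revolution begin?",
]

model = SentenceTransformer("all-MiniLM-L6-v2")
query_emb = model.encode(web_queries, convert_to_tensor=True)
bench_emb = model.encode(benchmark_questions, convert_to_tensor=True)

# For each web query, keep the most similar benchmark question.
scores = util.cos_sim(query_emb, bench_emb)
for i, query in enumerate(web_queries):
    best = scores[i].argmax().item()
    print(f"{query!r} -> {benchmark_questions[best]!r} (sim={scores[i][best]:.2f})")
```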
Meta makes Chameleon multimodal models available for research use
Meta publicly released key components of its Chameleon 7B and 34B models, which can process both text and images using a unified tokenization approach. The models, licensed for research use only, support mixed-modal inputs but are limited to text-only output as a safety measure. Meta hopes this release will encourage the research community to develop new strategies for responsible generative modeling. (Meta)
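Unified tokenization means images are quantized into discrete tokens drawn from the same vocabulary as text tokens and interleaved in a single sequence, so one transformer attends across both modalities with no separate image encoder. A schematic sketch of that idea follows; every function is a hypothetical stub, not Meta's released interface:

```python
# Schematic illustration of early-fusion, unified tokenization.
# All functions are hypothetical stubs; see Meta's released Chameleon
# code for the actual interfaces.

def tokenize_text(text: str) -> list[int]:
    """Map text to token ids from a shared vocabulary (stub)."""
    return [hash(word) % 65536 for word in text.split()]

def tokenize_image(image_path: str) -> list[int]:
    """Quantize an image into discrete codes (e.g., via a VQ image
    tokenizer) drawn from the same shared vocabulary (stub)."""
    return [65536 + i for i in range(1024)]  # e.g., 1,024 image tokens

# Text and image tokens interleave into one sequence, so a single
# transformer models both modalities jointly.
sequence = (
    tokenize_text("Describe this chart:")
    + tokenize_image("chart.png")
    + tokenize_text("Focus on the trend after 2020.")
)
print(len(sequence), "tokens in one mixed-modal sequence")
```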
Microsoft to discontinue GPT Builder for Copilot Pro consumers
Microsoft is retiring its custom chatbot-building tool just three months after its broad rollout. The company will remove the ability to create new GPTs on July 10, 2024, and delete all existing GPTs, along with their associated data, by July 14; until then, GPT Builder users can save their custom instructions for future reference. Microsoft says it will re-evaluate its consumer Copilot strategy to prioritize core product experiences and developer opportunities. (Microsoft)
Still want to know more about what matters in AI right now?
Read this week’s issue of The Batch for in-depth analysis of news and research.
This week, Andrew Ng discussed how coding agents are evolving from novelties to widely useful tools:
“Given a coding problem that’s specified in a prompt, the workflow for a coding agent typically goes something like this: Use a large language model (LLM) to analyze the problem and potentially break it into steps to write code for, generate the code, test it, and iteratively use any errors discovered to ask the coding agent to refine its answer. But within this broad framework, a huge design space and numerous innovations are available to experiment with.”
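The loop Ng describes can be sketched in a few lines; here, call_llm and run_tests are hypothetical stand-ins for a chat-completion client and a test harness, and the structure is illustrative rather than any particular agent's implementation:

```python
# Minimal sketch of the generate-test-refine loop described above.
# `call_llm` and `run_tests` are hypothetical stubs: plug in your own
# LLM client and test harness.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def run_tests(code: str) -> tuple[bool, str]:
    raise NotImplementedError("plug in your test harness here")

def coding_agent(problem: str, max_iterations: int = 5) -> str:
    # Step 1: ask the LLM to analyze the problem and draft a solution.
    code = call_llm(f"Analyze this problem, plan steps, then write code:\n{problem}")
    for _ in range(max_iterations):
        # Step 2: test the generated code.
        passed, errors = run_tests(code)
        if passed:
            return code
        # Step 3: feed errors back so the LLM can refine its answer.
        code = call_llm(
            f"Problem:\n{problem}\n\nYour code:\n{code}\n\n"
            f"Test errors:\n{errors}\n\nFix the code."
        )
    return code  # best effort after max_iterations
```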
Read Andrew's full letter here.
Other top AI news and research stories we covered in depth included new open models from Nvidia, Alibaba, and Stability AI; the Safety, Evaluations, and Alignment Lab (SEAL) Leaderboards from Scale AI; improvements to Udio's text-to-audio generator; and adversarial diffusion distillation (ADD), a method for accelerating diffusion models.