Twice a week, Data Points brings you the top AI news in brief. This week, that includes:
- BigCodeBench’s new metrics for LLMs’ programming abilities
- Gen-3 Alpha, a new video model from Runway
- Context caching in Google’s Gemini API
- Meta’s new multitoken prediction models
But first:
Universal Music Group partners with AI startup SoundLabs for voice cloning tech
The upcoming MicDrop feature will allow Universal artists to create controlled voice models for personal use, with features including voice-to-instrument conversion and language transposition, a technique that allows voice avatars to perform in multiple languages. MicDrop will be available for artists’ use later this summer, but the resulting voice models won’t be made available to the general public. This technology aims to expand artists’ creative capabilities while maintaining ownership and control over their voice models. (Universal Music Group)
Anthropic’s Artifacts allow you to interact with generated documents
Artifacts are a new feature that allow Claude to share substantial, standalone content in a separate window from the main conversation. Artifacts are used for significant, self-contained content that users may want to edit, reuse, or reference later, like documents, code snippets, and diagrams. Users can interact with Artifacts by editing content, switching between versions, and accessing multiple Artifacts in one conversation. (Anthropic)
BigCodeBench: A new benchmark evaluating LLMs on code generation
BigCodeBench aims to provide a more rigorous and representative evaluation of LLMs’ programming capabilities than HumanEval, including variants for code completion and instruction-following scenarios. The benchmark was created through a systematic “Human-LLM collaboration process,” starting with ODEX as a seed dataset and using GPT-4 to expand short but realistic human intents and one-liners into comprehensive tasks, which were then refined by human experts. Currently the latest release of GPT-4o tops the leaderboard, followed by DeepSeek-Coder-V2 and Claude 3.5 Sonnet. (Hugging Face)
Runway introduces Gen-3 Alpha, its next video and image model
The model will enhance Runway’s existing tools for text-to-video, image-to-video, and text-to-image generation, as well as introduce new features for fine-grained control over structure, style, and motion. Gen-3 Alpha boasts improved capabilities in creating photorealistic humans and temporally precise scenes, and was developed collaboratively by artists, engineers, and research scientists to interpret a wide range of styles and cinematic terminology. The Standard plan costs $12 per editor per month, and includes 625 credits/month; Pro, Unlimited, and Enterprise plans are also available. (Runway)
Google introduces context caching for Gemini API to reduce costs
Context caching allows developers to cache input tokens for repeated use in AI workflows. This feature aims to reduce costs and potentially improve latency for scenarios involving large initial contexts and frequent, shorter requests, like recurrent queries, bug fixing, or chatbots with lengthy system instructions. The caching duration is customizable, with billing based on the number of cached tokens and storage time. However, some limitations exist, such as a minimum input token count for caching and no guaranteed latency improvements. (Google)
Meta releases multi-token prediction models noncommercially
Meta researchers have introduced a new approach to training language models using multi-token prediction, which enables models to predict multiple future tokens and token strings at once instead of one at a time. This method aims to improve model capabilities, training efficiency, and processing speed compared to traditional one-at-a-time prediction. Meta has released pre-trained models for code completion under a non-commercial license to facilitate independent research into this new technique and resulting model behavior. (Meta)
Still want to know more about what matters in AI right now?
Read last week’s issue of The Batch for in-depth analysis of news and research.
This week, Andrew Ng discussed how coding agents are evolving from novelties to widely useful tools:
“How can we test the code without requiring the user to write test cases? In a multi-agent system, each ‘agent’ is an LLM prompted to play a particular role. An interesting result from AgentCoder shows that having separate agents for writing code and generating tests results in better performance than letting a single agent do both tasks. This is presumably because, if the agent writing the code is also responsible for writing the tests, the tests might be influenced by the code and fail to consider corner cases that the code does not cover.”
Read Andrew's full letter here.
Other top AI news and research stories we covered in depth included the new open models by Nvidia, Alibaba, and Stability AI, the Safety, Evaluations, and Alignment Lab (SEAL) Leaderboards by Scale AI, improvements to Udio's text-to-audio generator, and a method called adversarial diffusion distillation (ADD) to accelerate diffusion models.