LAION cleans up its image dataset
Plus, OLMoE outcompetes smaller open models

Published: Sep 6, 2024
Reading time: 3 min read

Twice a week, Data Points brings you the latest AI news, tools, models, and research in brief. In today’s edition, you’ll find:

  • AlphaProteo, a DeepMind system that designs novel proteins
  • Updates and price drops for Command R and Command R+
  • Anthropic shows off easy software projects in Claude
  • YouTube builds system to detect synthetic music and faces

But first:

LAION updates image dataset, purges child sexual abuse links

LAION announced Re-LAION-5B, an updated version of its large-scale image-text dataset that removes links to suspected child sexual abuse material (CSAM). The organization partnered with child protection groups to filter out 2,236 potentially problematic links from the original 5.5 billion image-text pairs. Two versions are being released: a research version and a “research-safe” version with additional NSFW content removed. This update aims to provide a safer open dataset for AI researchers while maintaining reproducibility for foundation model studies. (LAION)

Ai2’s small MoE model shows power of post-training

Ai2 released OLMoE, a Mixture-of-Experts model with 1.3 billion active parameters and 6.9 billion total parameters, trained on 5 trillion tokens of curated data. The model outperforms all open models in its active parameter range and responds well to fine-tuning, showing significant improvements with optimization techniques like KTO and DPO. OLMoE’s release includes intermediate training checkpoints, an improved post-training mix, code, and training logs, all under the Apache 2.0 license. (Interconnects)

New protein design system could accelerate drug development

Google DeepMind introduced AlphaProteo, an AI system that designs novel, high-strength protein binders for biological and health research. The system achieved higher experimental success rates and 3 to 300 times better binding affinities than existing methods on seven target proteins. AlphaProteo’s ability to generate effective protein binders could accelerate drug development and research into the inner workings of diseases by reducing the time needed for experiments in these fields. (Google DeepMind)

Cohere updates and drops prices for its RAG-optimized models

Cohere unveiled upgraded versions of its Command R and Command R+ enterprise AI models, offering improvements in retrieval-augmented generation, multilingual support, and workflow automation. The new models feature enhanced performance in coding, math, reasoning, and latency, with Command R now matching the capabilities of the previous Command R+ version at a lower price point. Cohere priced the new Command R at $0.15 per million input tokens and $0.60 per million output tokens, while Command R+ costs $2.50 and $10.00 per million tokens for input and output, respectively. (Cohere)
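To make the price gap concrete, here is a minimal sketch that estimates what a workload would cost at the per-million-token prices quoted above. The dictionary keys, the function name, and the sample token counts are illustrative assumptions, not Cohere API identifiers or real usage figures.

```python
# Prices in USD per million tokens, taken from the item above.
# The keys are hypothetical labels, not official Cohere model IDs.
PRICES_PER_MILLION = {
    "command-r": {"input": 0.15, "output": 0.60},
    "command-r-plus": {"input": 2.50, "output": 10.00},
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for the given token counts."""
    price = PRICES_PER_MILLION[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 10 million input tokens, 2 million output tokens.
    for name in PRICES_PER_MILLION:
        cost = estimate_cost(name, 10_000_000, 2_000_000)
        print(f"{name}: ${cost:.2f}")
```

For that hypothetical workload, the estimate comes to about $2.70 on Command R and $45.00 on Command R+, which shows why the item highlights Command R matching the previous Command R+ at a lower price point.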

Anthropic offers developer-friendly projects to jumpstart Claude-powered applications

Anthropic released a collection of quickstart projects to help developers build applications with the Anthropic API and Claude language models. The first project is a customer support agent that demonstrates Claude’s natural language capabilities for AI-assisted support systems. The projects include setup instructions and resources, so developers can quickly create customizable applications using Anthropic’s technology. (GitHub)

YouTube develops detection tools for synthetic content

YouTube is creating two new technologies to identify AI-generated content that mimics real people. One system will detect synthetic singing voices, allowing music partners to manage AI recreations of their vocals. The other will let people in various industries detect AI-generated depictions of their faces. These tools build on YouTube’s existing Content ID system, which has processed billions of copyright claims since 2007. (YouTube)


Still want to know more about what matters in AI right now? 

Read this week’s issue of The Batch for in-depth analysis of news and research.

This week, Andrew Ng discussed how South Korea is well-positioned to become a strong AI hub, highlighting its local tech ecosystem, government support, and the wide range of opportunities across different industries:

“I’ve been consistently impressed by the thoughtful approach the Korean government has taken toward AI, with an emphasis on investment and innovation and a realistic understanding of risks without being distracted by science-fiction scenarios of harm.”

Read Andrew’s full letter here.

Other top AI news and research stories we covered in depth: a new open-weights model that generates tokens faster than current transformers; a study ranking large language models by their tendency to hallucinate during retrieval-augmented generation; Argentina’s new AI-powered national law-enforcement department that aims to detect, investigate, and predict crimes; and a new tool that makes large language models more explainable by probing every layer.

