OpenAI’s o1 models recognize and fix mistakes: Plus, explaining Reflection 70B’s replication controversy

Published: Sep 13, 2024
Reading time: 3 min

Twice a week, Data Points brings you the latest AI news, tools, models, and research in brief. In today’s edition, you’ll find:

  • Copilot adds fine-tuning for faster code completion
  • DataGemma uses RAG and RIG for fact retrieval
  • Mistral introduces its open multimodal model
  • Results of the latest summit on military AI

But first:

OpenAI releases new “Strawberry” models to solve STEM problems GPT-4o can’t

OpenAI announced o1, a new family of large language models trained with reinforcement learning to tackle difficult reasoning tasks. o1 employs a chain-of-thought approach, breaking complex problems into simpler steps and learning to recognize and correct its own mistakes. It ranks in the 89th percentile on Codeforces, places among the top 500 U.S. students in the USA Math Olympiad qualifier, and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems. OpenAI has released an early version, o1-preview, for immediate use in ChatGPT and by trusted API users, along with a smaller, less expensive version, o1-mini, which is also available through the API. (OpenAI)
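
For developers with API access, calling the new model looks much like any other chat completion. Here’s a minimal sketch using the OpenAI Python SDK; the model name o1-preview comes from the announcement, but availability and supported parameters depend on your account.

```python
# Minimal sketch of querying o1-preview via the OpenAI Python SDK.
# Assumes the openai package is installed and OPENAI_API_KEY is set;
# access to o1 models depends on your API tier.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="o1-preview",
    messages=[
        {
            "role": "user",
            "content": "A train travels 120 km in 90 minutes. What is its average speed in km/h?",
        }
    ],
)

print(response.choices[0].message.content)
```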

“I got ahead of myself,” says Reflection 70B developer

HyperWrite claimed its new Reflection 70B model was a variant of Meta’s Llama 3.1 with performance superior to other open-source models. However, independent evaluators, including Artificial Analysis, questioned these claims after they were unable to reproduce HyperWrite’s reported benchmark results. Some evidence suggested Reflection 70B might actually be based on the older Llama 3; others speculated it could be a wrapper for Anthropic’s Claude. It’s also plausible that the public version simply contained implementation errors. The controversy highlights how hard it can be to reproduce and verify performance claims in the fast-moving open model landscape. (VentureBeat)

GitHub Copilot fine-tunes models for faster, customized code completion

GitHub introduced fine-tuned models for Copilot Enterprise, allowing organizations to customize the AI assistant with their proprietary codebases and coding practices. The new feature, available in limited public beta, offers more relevant and consistent code completion support tailored to each organization’s needs. The fine-tuning process uses the LoRA (Low-Rank Adaptation) method, which adjusts a subset of the most important model parameters for efficiency. Unlike previous retrieval-augmented generation (RAG) approaches, fine-tuning can enable Copilot to deliver contextualized suggestions with the speed necessary for real-time, inline coding. (GitHub)
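
GitHub hasn’t shared the details of its training pipeline, but the LoRA method it cites is widely used in open-source tooling. The sketch below shows a generic LoRA setup with Hugging Face’s peft library, not GitHub’s actual configuration; the base model and hyperparameters are placeholders chosen for illustration.

```python
# Illustrative LoRA fine-tuning setup using Hugging Face's peft library.
# This is a generic sketch of the LoRA method, not GitHub's internal pipeline;
# the base model and hyperparameters here are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base_model_name = "bigcode/starcoder2-3b"  # placeholder code model
model = AutoModelForCausalLM.from_pretrained(base_model_name)
tokenizer = AutoTokenizer.from_pretrained(base_model_name)

# LoRA adds small low-rank adapter matrices to selected weight matrices,
# so only a tiny fraction of the model's parameters is trained.
lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # shows how few parameters LoRA trains
```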

Google tackles AI hallucinations with Data Commons integration

Google introduced DataGemma, a set of open models (based on Gemma 2 27B) designed to connect large language models with real-world data from Google’s Data Commons. The models use two approaches, Retrieval-Interleaved Generation (RIG) and Retrieval-Augmented Generation (RAG), to support accuracy and better reasoning in their responses. This development aims to address the challenge of AI hallucinations by grounding language models in trustworthy statistical information from reputable sources. (Google)
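
Google’s implementation is specific to Gemma and Data Commons, but the two grounding patterns can be contrasted in a short conceptual sketch. The helper functions below are hypothetical stubs for illustration, not DataGemma or Data Commons APIs.

```python
# Conceptual contrast between RAG and RIG grounding. The helpers below are
# hypothetical stubs, not Google's Data Commons or DataGemma APIs.
import re

def query_data_commons(query: str) -> str:
    """Stub: pretend to look up a statistic in a trusted data source."""
    return "42% (example statistic)"

def generate(prompt: str) -> str:
    """Stub: pretend to call a language model; RIG drafts embed [DC(...)] query markers."""
    return "The rate is [DC(population growth rate)] according to recent data."

def rag_answer(question: str) -> str:
    """RAG: retrieve relevant statistics first, then generate with them in context."""
    facts = query_data_commons(question)
    prompt = f"Using these statistics:\n{facts}\n\nAnswer: {question}"
    return generate(prompt)

def rig_answer(question: str) -> str:
    """RIG: generate a draft containing data queries, then resolve each query
    against the data source and splice the verified value back into the text."""
    draft = generate(question)
    for match in re.finditer(r"\[DC\((.*?)\)\]", draft):
        draft = draft.replace(match.group(0), query_data_commons(match.group(1)))
    return draft

print(rag_answer("What is the population growth rate of Kenya?"))
print(rig_answer("What is the population growth rate of Kenya?"))
```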

Mistral releases its first text and image multimodal model

French AI startup Mistral launched Pixtral 12B, a 12-billion-parameter model that can process both images and text. The model, built on Mistral’s Nemo 12B, can answer questions about multiple images of any size and perform tasks like image captioning and object counting. Benchmark scores show the model beats competing smaller models in multimodal reasoning and performance (as measured by MMLU and ChartQA). Pixtral 12B is available for download and use under an Apache 2.0 license, allowing developers to fine-tune and deploy the model without restrictions. (TechCrunch)

REAIM conference sets international guidelines for AI in tools of war

About 60 countries, including the United States, endorsed a “blueprint for action” for responsible use of artificial intelligence in military applications at a summit in Seoul. The document, which is not legally binding, builds on last year’s “call to action” and includes guidelines for risk assessments, human control, and measures to prevent AI from being used in weapons of mass destruction. China did not endorse the document, highlighting ongoing differences among stakeholders in the global discussion on military AI use. (Reuters)


Still want to know more about what matters in AI right now?

Read this week’s issue of The Batch for in-depth analysis of news and research.

This week, Andrew Ng discussed why science-fiction scenarios of AI’s emergent behavior are likely to remain fictional.

“Some people fear that AI someday will learn to deceive humans deliberately. If that ever happens, I’m sure we will see it coming from far away and have plenty of time to stop it.”

Read Andrew’s full letter here.

Other top AI news and research stories we covered in depth: Waymo highlighted its safety record, arguing that its autonomous vehicles are safer than human drivers on the same roads; 2D-to-3D mesh generation is becoming widely accessible for industries like gaming and animation; Western powers signed a legally binding AI treaty to regulate its impact on democracy and human rights; and a new automated method was developed to balance unbalanced datasets scraped from the web.


Subscribe to Data Points

Your accelerated guide to AI news and research