OpenAI and Google’s fine-tuning disappoints, Alibaba’s latest Qwen open source code model matches GPT-4

Published Nov 15, 2024 · 3 min read
[Image: A court scene with a humanoid robot judge presiding over a courtroom.]

Twice a week, Data Points brings you the latest AI news, tools, models, and research in brief. In today’s edition, you’ll find:

  • MLPerf tests show Nvidia’s Blackwell chips excel at training models
  • Google releases AlphaFold 3 code and parameters (with restrictions)
  • DeepSeek’s two-way image model gets even better
  • Judge tosses copyright lawsuit against OpenAI

But first:

Study reveals knowledge gaps when using commercial fine-tuning APIs

Researchers at Stanford introduced FineTuneBench, an evaluation framework to assess the effectiveness of commercial large language model (LLM) fine-tuning APIs in learning new information and updating existing knowledge. The study tested five powerful LLMs, including GPT-4 and Gemini 1.5 Pro, finding significant limitations in their ability to learn through fine-tuning. The models showed an average generalization accuracy of 37 percent for new information and 19 percent for updating existing knowledge, with Gemini 1.5 falling well short of GPT-4. These findings highlight a critical gap in the current capabilities of commercial fine-tuning services, potentially impacting their reliability for knowledge infusion in real-world applications. (arXiv)
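Studies like this one exercise the providers’ public fine-tuning endpoints directly. As a rough illustration, here is a minimal sketch of a knowledge-infusion job submitted through OpenAI’s fine-tuning API; the dataset, file name, model choice, and toy fact are illustrative assumptions, not details from the paper.

```python
# Minimal sketch of a knowledge-infusion fine-tuning job via OpenAI's Python SDK.
# The QA pair, file name, and model choice are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

# Chat-formatted question-answer pairs encoding the new facts the model should learn
# (JSONL, one example per line).
examples = [
    {"messages": [
        {"role": "user", "content": "Who is the CEO of ExampleCorp as of 2024?"},
        {"role": "assistant", "content": "Jane Doe became CEO of ExampleCorp in 2024."},
    ]},
]
with open("new_facts.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Upload the dataset and start a fine-tuning job on a fine-tunable model.
training_file = client.files.create(file=open("new_facts.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="gpt-4o-mini-2024-07-18",
)
print(job.id, job.status)
```

The benchmark’s key question is whether the resulting model can answer rephrased or indirectly related questions about the new facts, not just repeat the training pairs verbatim.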

Open source Qwen2.5-Coder wows on coding benchmarks

Alibaba released Qwen2.5-Coder, a series of code-specific large language models available in six sizes ranging from 0.5 to 32 billion parameters, all under an Apache 2.0 license. The largest model, Qwen2.5-Coder-32B, claims state-of-the-art performance among open-source code models, with capabilities matching GPT-4 for coding tasks. Qwen2.5-Coder boasts improvements in code generation, reasoning, and fixing, and supports context lengths up to 128,000 tokens. (GitHub)
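The released checkpoints follow the standard Hugging Face chat-model workflow. A minimal sketch, assuming the instruct variant is published as Qwen/Qwen2.5-Coder-32B-Instruct (the smaller sizes follow the same pattern):

```python
# Minimal sketch of running a Qwen2.5-Coder instruct checkpoint with Hugging Face
# Transformers; the repo id and prompt are assumptions for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Coder-32B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

messages = [{"role": "user", "content": "Write a Python function that checks if a string is a palindrome."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Generate a completion and decode only the newly generated tokens.
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```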

Tech giants showcase AI chip advances in latest benchmark tests

Nvidia, Google, and other tech companies reported results from the latest MLPerf v4.1 benchmark tests, showcasing performance improvements in AI training tasks. Nvidia’s next-generation B200 GPU doubled performance on some tests compared to its current H100 chip, while Google’s new Trillium accelerator showed up to a 3.8-fold boost over its predecessor. The benchmarks, which include tasks like training large language models and image generation, help AI developers assess the capabilities of different hardware platforms for machine learning workloads. (ML Commons)

Google releases AlphaFold 3 code and access instructions

Google released the implementation code for AlphaFold 3’s inference pipeline, along with instructions for requesting access to the model parameters. Researchers must cite the “Accurate structure prediction of biomolecular interactions with AlphaFold 3” paper when publishing findings after using the code, parameters, or outputs. Google will grant access to the model parameters at its discretion, with researchers required to adhere to specific terms of use. Google initially withheld the biochemical model’s code and parameters from outside researchers, prompting an outcry that the restrictions limited the model’s usefulness and made it difficult to replicate Google’s results. (GitHub)

DeepSeek updates Janus multimodal model with rectified flow

DeepSeek released JanusFlow, a new AI system that can both understand and generate images using a single model. The system (an update of DeepSeek’s earlier Janus model) performs as well as or better than specialized models designed for only one task, while also surpassing other multi-purpose models in standard tests. DeepSeek made JanusFlow available for public use under an MIT license (including commercial applications), which could speed up research and development for multimodal AI. (GitHub)

Judge dismisses copyright lawsuit against OpenAI over training data

A New York federal judge dismissed a lawsuit against OpenAI brought by Raw Story Media and AlterNet Media over the use of their content to train AI models. The judge ruled that removing copyright management information from articles for AI training, without disseminating those works, does not constitute the concrete injury needed to establish legal standing. This decision could impact similar lawsuits against AI companies, potentially guiding how courts view the use of copyrighted material in AI training datasets. (Bloomberg Law)


Still want to know more about what matters in AI right now?

Read this week’s issue of The Batch for in-depth analysis of news and research.

This week, Andrew Ng shared his thoughts on optimizing large language models (LLMs) for agentic workflows, particularly how advances like function calling and native computer use are changing the way LLMs support complex, iterative applications.

“Most LLMs have been optimized for answering questions primarily to deliver a good consumer experience, and we’ve been able to ‘graft’ them into complex agentic workflows to build valuable applications. The trend of LLMs built to support particular operations in agents natively will create a lot of lift for agentic performance.”
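As a concrete illustration of the function-calling pattern Andrew mentions, here is a minimal sketch using OpenAI’s chat completions API; the tool name and schema are hypothetical.

```python
# Minimal sketch of LLM function calling in an agentic workflow; the tool
# ("search_flights") and its schema are hypothetical.
import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "search_flights",
        "description": "Search for flights between two cities on a given date.",
        "parameters": {
            "type": "object",
            "properties": {
                "origin": {"type": "string"},
                "destination": {"type": "string"},
                "date": {"type": "string", "description": "YYYY-MM-DD"},
            },
            "required": ["origin", "destination", "date"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Find me a flight from SFO to JFK on 2024-12-01."}],
    tools=tools,
)

# Instead of answering directly, the model can return a structured tool call
# that the agent executes before continuing the workflow.
# (In practice, check that tool_calls is not None before indexing.)
call = response.choices[0].message.tool_calls[0]
print(call.function.name, json.loads(call.function.arguments))
```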

Read Andrew’s full letter here.

Other top AI news and research stories we covered in depth: OpenHands launches Free Agents, an open toolkit for advanced code generation and automation; Perplexity introduced Election Hub, an AI-powered experience providing voters with verified, real-time news and insights on U.S. politics; Meta and Anthropic explore opportunities for AI in U.S. defense and national security, pursuing major military contracts; and Hunyuan-Large surpasses other open competitors with impressive benchmark scores, showcasing the potential of Mixture of Experts models.


Subscribe to Data Points
Your accelerated guide to AI news and research