Replit Agent builds and deploys applications using natural language prompts, DeepSeek-V2.5’s open model blends coding and chat

Published Sep 9, 2024 · 3 min read

Twice a week, Data Points brings you the latest AI news, tools, models, and research in brief. In today’s edition, you’ll find:

  • NVIDIA’s Blackwell chips impress on hardware tests
  • Fine-tuned versions of Llama 3.1 add reflection
  • Most AI jailbreaks may not amount to much
  • New architecture extends context windows to 100M tokens

But first:

Replit introduces AI-powered coding assistant for developers

Replit launched Replit Agent, an AI-powered alternative to a traditional IDE that helps users build software projects with user-selected models and natural language prompts. The agent is currently in early access for Replit Core and Teams subscribers at no additional cost. Subscribers can access it through the web interface or mobile app, where they can describe their project ideas and collaborate with the AI to create applications from scratch. (Replit)

DeepSeek releases upgraded AI model with improved capabilities

DeepSeek unveiled DeepSeek-V2.5, an upgraded model that merges the general chat and coding abilities of its previous V2 models. The new model, released under an Apache license, shows improved performance across various benchmarks, including AlpacaEval 2.0, ArenaHard, and HumanEval Python, though it gives up some coding-specific performance. The 236 billion parameter mixture-of-experts model (with 21 billion parameters active per token) requires significant computational resources for inference, but developers have multiple ways to integrate it, including through Hugging Face’s Transformers and vLLM. (Hugging Face)
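
For developers curious what the Transformers route might look like, here is a minimal sketch. It assumes the Hugging Face model ID deepseek-ai/DeepSeek-V2.5 and hardware with enough GPU memory to shard a 236B-parameter MoE model; check the model card for the exact supported setup.

```python
# Minimal sketch: loading DeepSeek-V2.5 with Hugging Face Transformers.
# The model ID and generation settings are assumptions; see the model card
# for the documented usage and hardware requirements.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V2.5"  # assumed Hugging Face model ID
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",        # shard across available GPUs
    trust_remote_code=True,   # DeepSeek models ship custom modeling code
)

messages = [{"role": "user", "content": "Write a Python function that checks whether a string is a palindrome."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
print(tokenizer.decode(outputs[0][inputs.shape[1]:], skip_special_tokens=True))
```

vLLM is the other route the release mentions, and it is generally the better fit for serving a model of this size at scale.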

Updated MLPerf benchmark measures GPU performance and power consumption

MLCommons announced results for its latest MLPerf Inference benchmark suite, which measures machine learning hardware performance across various deployment scenarios. The latest release (version 4.1) introduced a new benchmark based on mixture of experts (MoE) model architecture and measured power consumption related to inference. NVIDIA’s new Blackwell chip took top marks for cloud solutions, while Untether AI led on the edge. MLPerf helps AI developers compare hardware performance, providing critical information for those procuring and tuning AI systems. (MLCommons)

Reflection-tuned version of Llama impresses on open model benchmarks

HyperwriteAI’s founder (with help from GlaiveAI) released Reflection Llama-3.1 70B, trained with a new technique called reflection tuning. Reflection tuning enables the system to recognize and correct mistakes in its reasoning before providing answers. Reflection Llama-3.1 70B outperforms the base version of Llama 3.1 70B and other open models on several benchmarks, including MMLU and MATH. A full report on the model’s capabilities and a 405 billion parameter version are expected later this week. (Hugging Face)
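
As a loose illustration of how reflection-style output is typically consumed downstream: the model is prompted to reason inside <thinking> tags, flag corrections inside <reflection> tags, and put its final answer inside <output> tags, so application code surfaces only the <output> section. The model ID, tag names, and system prompt below are assumptions based on the release, not a documented interface.

```python
# Hypothetical sketch of calling a reflection-tuned model and extracting only
# its final, self-corrected answer. Model ID, tags, and prompt are assumptions.
import re
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="mattshumer/Reflection-Llama-3.1-70B",  # assumed Hugging Face model ID
    device_map="auto",
)

messages = [
    {"role": "system", "content": "Reason inside <thinking> tags, correct any mistakes inside <reflection> tags, and put your final answer inside <output> tags."},
    {"role": "user", "content": "What is 17 * 24?"},
]
raw = generator(messages, max_new_tokens=512)[0]["generated_text"][-1]["content"]

# Keep only the final answer; discard the visible reasoning and corrections.
match = re.search(r"<output>(.*?)</output>", raw, re.DOTALL)
print(match.group(1).strip() if match else raw)
```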

Detailed tests show most AI jailbreaks are less effective than reported

Researchers at UC Berkeley developed a new benchmark called StrongREJECT to more accurately evaluate the effectiveness of AI jailbreaks, finding that many previously reported successful jailbreaks actually perform poorly. The benchmark includes a diverse set of 313 high-quality forbidden prompts and a state-of-the-art automated evaluator that aligns well with human judgments of jailbreak effectiveness. StrongREJECT revealed a “willingness-capabilities tradeoff” where jailbreaks that successfully bypass an AI’s safety measures often significantly degrade its ability to provide useful information. (BAIR/UC Berkeley)
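
To make the tradeoff concrete, here is a rough sketch of the scoring idea (an illustration, not the paper’s exact rubric or code): a jailbroken response only earns a high score if the model both complies and answers with specific, convincing detail, so attacks that elicit compliance while degrading answer quality still score low.

```python
# Rough sketch of a StrongREJECT-style score combining willingness and
# capability. Field names and the 1-5 rating scales are illustrative
# assumptions, not the benchmark's actual implementation.
from dataclasses import dataclass

@dataclass
class JudgedResponse:
    refused: bool        # did the model refuse the forbidden prompt?
    convincingness: int  # 1-5 rating from the automated evaluator
    specificity: int     # 1-5 rating from the automated evaluator

def harmfulness_score(r: JudgedResponse) -> float:
    """Return 0.0 for refusals; otherwise average the normalized quality ratings."""
    if r.refused:
        return 0.0
    return ((r.convincingness - 1) + (r.specificity - 1)) / 8  # map to [0, 1]

# Compliance with a vague, unconvincing answer still scores near zero.
print(harmfulness_score(JudgedResponse(refused=False, convincingness=2, specificity=1)))  # 0.125
print(harmfulness_score(JudgedResponse(refused=True, convincingness=5, specificity=5)))   # 0.0
```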

Experimental architecture significantly extends context windows

Magic introduced Long-Term Memory (LTM), an AI model architecture designed to reason over up to 100 million tokens of context during inference. LTM models use a sequence-dimension algorithm that differs from (and is reportedly far more efficient than) traditional attention mechanisms, allowing them to process ultra-long contexts with lower computational and memory requirements. The company’s first implementation, LTM-2-mini, shows potential for tasks like code generation, where access to extensive contextual information could improve performance. Such long context windows could let AI models draw on vastly more information at inference time, shifting emphasis from training on data to reasoning over information supplied in context. (Magic)


Still want to know more about what matters in AI right now?

Read last week’s issue of The Batch for in-depth analysis of news and research.

Last week, Andrew Ng discussed how South Korea is well-positioned to become a strong AI hub, highlighting its local tech ecosystem, government support, and the wide range of opportunities across different industries:

“Based on what I saw there in government, business, and academia, the nation is well positioned to become a strong AI hub. When he asked me if I would advise South Korea as a member of the Global AI Strategy Steering Group of the country’s National AI Committee, I agreed on the spot.”

Read Andrew’s full letter here.

Other top AI news and research stories we covered in depth: a new open weights model that generates tokens faster than current transformers, a study ranking large language models by their tendency to hallucinate during retrieval-augmented generation, Argentina’s new AI-powered national law-enforcement department that aims to detect, investigate, and predict crimes, and a new tool that makes large language models more explainable by probing every layer.


Subscribe to Data Points

Your accelerated guide to AI news and research