Qwen-2VL shines on vision tasks Cerebras processes up to 1,800 tokens per second

Published

Aug 30, 2024

Reading time

3 min read

Twice a week, Data Points brings you the latest AI news, tools, models, and research in brief. In today’s edition, you’ll find:

Claude’s system prompts go public
Google updates Gemini 1.5 models
Magic demos 100 million token context window
OpenAI and Anthropic give U.S. government model access

But first:

Alibaba’s new Qwen2-VL model claims to outperform GPT-4 on some vision tasks

Alibaba released Qwen2-VL, an updated version of its vision language model based on the Qwen2 language model family. Qwen2-VL is available as 2 billion and 7 billion parameter versions under an Apache 2.0 license, as well as a 72 billion parameter API version. The 72B version of Qwen2-VL reportedly outperforms GPT-4 and Claude 3.5 on several benchmarks, including MathVista, DocVQA, and RealWorldQA, while the 7B version achieves state-of-the-art performance on document understanding tasks – giving AI developers multiple options to incorporate advanced vision-language capabilities into their applications. (GitHub)

AI inference speeds up dramatically with new Cerebras offering

Cerebras Systems launched Cerebras Inference, a new AI inference solution that processes up to 1,800 tokens per second for Llama3.1 8B and 450 tokens per second for Llama3.1 70B, outperforming NVIDIA’s GPU performance by up to 20 times. The system maintains 16-bit accuracy throughout inference runs and offers pricing starting at 10 cents per million tokens. This jump in inference speed may enable developers to more easily build more complex, real-time AI applications and agents. (Cerebras)

Anthropic releases system prompts for Claude chatbots

Claude’s prompts instruct the model to encourage preferred behaviors, such as using step-by-step reasoning for math and logic tasks, avoiding recognizing human faces, or noting when its answers might be uncertain due to limited knowledge. System prompts apply only to the chatbots on Claude’s website or its mobile apps. All chatbots use a system prompt, but Anthropic has disclosed its prompts to be as transparent as possible about how its model interacts with users. (Anthropic)

Google updates Gemini models in its API for developers to test

Google introduced experimental versions of its Gemini API models, allowing developers to test new features and provide feedback. The models include updated versions of Gemini 1.5 Pro and Gemini 1.5 Flash, plus a smaller, eight billion parameter version of Gemini 1.5 Flash. The models outperform their predecessors on internal benchmarks, and Gemini 1.5 Flash 8B is unusually fast and capable for a smaller model. (Google)

New AI architecture processes 100 million tokens of context

Magic introduced LTM (Long-Term Memory), an AI model architecture designed to reason on up to 100 million tokens of context during inference. LTM models use a sequence-dimension algorithm that is significantly more efficient than traditional attention mechanisms, allowing them to process ultra-long contexts with lower computational and memory requirements. The company’s first implementation, LTM-2-mini, demonstrates the potential of this approach for tasks like code synthesis, where access to extensive contextual information could greatly improve performance. These longer context windows could enable AI models to leverage vastly more information during inference, potentially leading to a shift from training on data to reasoning over a specific, given set of information. (Magic)

NIST gains early access to Anthropic and OpenAI models for safety testing

The U.S. AI Safety Institute (part of the National Institute of Standards and Technology, or NIST) signed agreements with Anthropic and OpenAI to collaborate on AI safety research, testing, and evaluation. These agreements allow the institute to access major new models from both companies before and after public release, enabling research to evaluate the models’ capabilities, assess safety risks, and develop mitigation strategies. The partnerships build on earlier voluntary commitments from leading AI model developers and the Biden-Harris administration’s Executive Order on AI. (NIST)

Still want to know more about what matters in AI right now?

Read this week’s issue of The Batch for in-depth analysis of news and research.

This week, Andrew Ng discussed how top language models’ falling token prices are leading to new opportunities for developers:

“When building applications, I find it useful to design to where the technology is going rather than only where it has been. Based on the technology roadmaps of multiple software and hardware companies — which include improved semiconductors, smaller models, and algorithmic innovation in inference architectures — I’m confident that token prices will continue to fall rapidly.”

Read Andrew’s full letter here.

Other top AI news and research stories we covered in depth: expansion of the AI lobby, Genie’s new coding agent, how a language model and brain implants helped an ALS patient regain his speech, and a new paper on 4M-21, a multimodal model developed by researchers at Apple and EPFL.

Subscribe to Data Points