Long Context Gets Up to Speed

AI21 Labs’ Jamba 1.5 outpaces transformers in long-text processing

Chart: Throughput and latency at different context lengths

A new open weights model generates tokens faster than current transformers, especially when processing long inputs.

What’s new: AI21 Labs released Jamba 1.5, an update of its earlier Jamba. It comes in Mini and Large versions and boasts a relatively large (and validated) input context length of 256,000 tokens. The model weights are free to users whose annual recurring revenue is under $50 million and are available on several cloud platforms including Google Cloud Vertex AI, Hugging Face, and Microsoft Azure.

How it works: Jamba 1.5 is a hybrid architecture made up of transformer, mamba, and mixture of experts (MoE) layers. Unlike transformer layers, whose processing requirements grow quadratically with input length, mamba layers scale linearly with input length, with no need for workarounds like sparse attention or sliding windows. The MoE layers are composed of many fully connected sublayers, only a small number of which are used to process a given input. Jamba 1.5 Mini has roughly 50 billion parameters but uses only 12 billion at a time, while Jamba 1.5 Large has around 400 billion parameters but uses only 94 billion at a time.
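
The MoE routing idea can be sketched in a few lines. Here is a minimal, illustrative PyTorch example, not AI21’s implementation; the layer sizes, expert count, and top-2 routing are assumptions chosen for readability. It shows why the parameters active for any given token are only a fraction of the layer’s total:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Toy mixture-of-experts feed-forward layer with top-k routing (illustrative only)."""

    def __init__(self, d_model=512, d_hidden=2048, n_experts=16, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.SiLU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x):                          # x: (batch, seq, d_model)
        scores = self.router(x)                    # (batch, seq, n_experts)
        top_scores, top_idx = scores.topk(self.top_k, dim=-1)
        gates = F.softmax(top_scores, dim=-1)      # mixing weights over the chosen experts
        out = torch.zeros_like(x)
        # Each token passes through only its top-k experts; the remaining experts'
        # parameters sit idle for that token, which is why the "active" parameter
        # count is far below the total parameter count.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[..., slot] == e     # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += gates[..., slot][mask].unsqueeze(-1) * expert(x[mask])
        return out


layer = MoEFeedForward()
tokens = torch.randn(2, 8, 512)                    # (batch, seq, d_model)
print(layer(tokens).shape)                         # torch.Size([2, 8, 512])
```

Production MoE layers batch tokens per expert and add load-balancing terms, but the routing logic is the same in spirit: a small router picks a few experts per token, and the rest of the layer’s parameters sit idle for that token.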

  • The authors pretrained Jamba 1.5 on a proprietary dataset of web documents, code, books, and scientific articles. They further pretrained it on a higher proportion of longer documents to increase its ability to process long-text inputs.
  • They fine-tuned Jamba 1.5 on generated data to handle specific types of input such as instructions, conversations, longer documents, question-answer pairs, and calls to external tools.
  • Unlike transformer-based models, Jamba 1.5 showed no benefit from positional embeddings of input tokens, so it doesn’t use them (the sketch below illustrates why an order-aware recurrence can do without them).
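
To see why a state-space recurrence scales linearly with input length and gets token order for free, consider this deliberately simplified scan. It is a generic linear state-space model, not Mamba’s selective, hardware-aware version, and every shape and matrix below is made up for illustration:

```python
import torch

def ssm_scan(x, A, B, C):
    """Toy linear state-space recurrence: h_t = A h_{t-1} + B x_t, y_t = C h_t."""
    # x: (seq_len, d_in); A: (d_state, d_state); B: (d_state, d_in); C: (d_out, d_state)
    h = torch.zeros(A.shape[0])
    ys = []
    for x_t in x:               # one fixed-cost update per token -> O(seq_len) total work
        h = A @ h + B @ x_t     # the state summarizes everything seen so far, in order,
        ys.append(C @ h)        # so position is implicit; no positional embedding needed
    return torch.stack(ys)


d_in, d_state, d_out, seq_len = 4, 8, 4, 6
A = 0.9 * torch.eye(d_state)        # a stable, hand-picked transition for the demo
B = torch.randn(d_state, d_in) * 0.1
C = torch.randn(d_out, d_state) * 0.1
y = ssm_scan(torch.randn(seq_len, d_in), A, B, C)
print(y.shape)                      # torch.Size([6, 4])
```

Self-attention, by contrast, compares every token with every other token, which is where the quadratic cost comes from and why transformers rely on positional embeddings to recover token order.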

Results: Both versions of Jamba 1.5 produced output tokens faster than other models (running on identical hardware), especially given longer inputs. However, the larger version achieved lower performance on popular benchmarks than other open models.

  • With 262,144 tokens as input, Jamba 1.5 Mini generated about 62 tokens per second, LLaMA 3.1 8B generated about 41, and Mixtral 8x7B generated about 39. The difference narrowed as input length decreased. With 4,096 tokens as input, Jamba 1.5 Mini generated around 78 tokens per second, LLaMA 3.1 8B generated about 79, and Mixtral 8x7B generated about 60.
  • Both versions performed extraordinarily well on RULER, a suite of 13 tasks that assess the ability of large language models to take advantage of input context at various lengths. Jamba 1.5 Mini and Large made effective use of their full 256,000-token context length, while many competing models effectively utilized half or less of their claimed context windows.
  • Across 11 popular benchmarks, Jamba 1.5 Mini performed similarly to LLaMA 3.1 8B and Gemma 2 9B. However, Jamba 1.5 Large achieved lower performance than LLaMA 3.1 70B and Mistral Large 2 123B on nearly every benchmark. 

Behind the news: The mamba architecture, which is designed to enable processing to scale linearly with input length, has been a subject of much research since its release in late 2023. Notably, Mamba-2, Mamba-2-Hybrid, and Zamba combined mamba and attention layers with varying degrees of success.

Why it matters: The original Mamba was much faster than, and about as accurate as, transformers at sizes up to 2.8 billion parameters, but how the architecture compares to transformers at larger scales was an open question. Jamba 1.5 shows that combining mamba and transformer layers can yield higher speed in larger models, although its benchmark results don’t yet exceed those of comparably sized transformers.

We’re thinking: While hardware companies like Groq and SambaNova are accelerating LLMs, software innovations like Jamba may enable further speed-ups.
