High Gear for Llama 3.1 405B: SambaNova boosts Llama 3.1 performance with fast, free access to largest model

SambaNova raised the speed limit for access to the largest model in the Llama 3.1 family — and it’s free.

What’s new: SambaNova launched a cloud service that runs Llama 3.1 405B significantly faster than competitors. A free tier is available, to be followed later this year by paid tiers that offer higher rate limits.

How it works: SambaNova uses proprietary chips and software to accelerate model inference.

  • According to Artificial Analysis, the platform enables Llama 3.1 405B to generate 129 tokens per second (the fastest on the market) for $5/$10 per million input/output tokens. SambaNova’s own testing shows 132 tokens per second.
  • Llama 3.1 70B generates 411 tokens per second (behind Cerebras, which charges somewhat less) for $0.60/$1.20 per million input/output tokens; SambaNova’s own testing shows 461 tokens per second.
  • Llama 3.1 8B generates 998 tokens per second (also behind Cerebras, which offers a slightly lower price) for $0.10/$0.20 per million input/output tokens. (A back-of-the-envelope cost-and-latency sketch based on these figures follows this list.)
  • Unlike some competitors, SambaNova runs Llama 3.1 at 16-bit precision (technically bf16/fp32 mixed precision). Models that process at lower precision can achieve higher speeds or run on less powerful hardware but lose accuracy. 
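
To make the figures above concrete, here is a minimal back-of-the-envelope sketch in Python. The numbers are hard-coded from the Artificial Analysis figures in the list above; the `estimate` helper is purely illustrative, not part of any SambaNova API.

```python
# Back-of-the-envelope cost/latency estimates from the figures above
# (Artificial Analysis numbers; prices are USD per million input/output tokens).

FIGURES = {
    # model: (tokens_per_second, price_in_per_1m, price_out_per_1m)
    "Llama 3.1 405B": (129, 5.00, 10.00),
    "Llama 3.1 70B": (411, 0.60, 1.20),
    "Llama 3.1 8B": (998, 0.10, 0.20),
}

def estimate(model: str, input_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return (estimated cost in USD, estimated generation time in seconds)."""
    tps, price_in, price_out = FIGURES[model]
    cost = input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out
    seconds = output_tokens / tps  # ignores prompt processing and network latency
    return cost, seconds

# Example: a 2,000-token prompt that yields a 1,000-token completion on 405B.
cost, seconds = estimate("Llama 3.1 405B", 2_000, 1_000)
print(f"~${cost:.4f}, ~{seconds:.1f}s")  # ~$0.0200, ~7.8s
```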

Yes, but: SambaNova currently limits Llama 3.1’s context window to around 8,000 tokens, much less than the model’s native 128,000 tokens.
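
The practical consequence for builders: prompt plus completion must fit in roughly 8,000 tokens. A rough sketch of the check, assuming the reported limit above (the 4-characters-per-token heuristic is an approximation; a real tokenizer such as Llama 3.1’s would give exact counts):

```python
# Rough check that a prompt fits the reported ~8,000-token window.
CONTEXT_LIMIT = 8_000  # reported limit; Llama 3.1 natively supports 128,000

def fits_in_window(prompt: str, max_new_tokens: int = 1_000) -> bool:
    approx_prompt_tokens = len(prompt) / 4  # ~4 characters per token for English text
    return approx_prompt_tokens + max_new_tokens <= CONTEXT_LIMIT
```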

Behind the news: The new service arrives amid a broader race among cloud providers that have developed specialized chips to deliver fast inference. Cerebras and Groq have introduced high-speed inference services of their own.

Why it matters: Throughput, cost, performance, and latency are critical factors in practical applications of AI models. Fast inference allows more frequent API calls without stretching the time to output, which is essential for agentic workflows and real-time decision making.
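
The point about agentic workflows is easiest to see in code: each step in the loop below depends on the previous model output, so per-call generation time compounds across the chain. This is a minimal sketch using the OpenAI-compatible client pattern; the base URL, model identifier, and API key are illustrative assumptions, not confirmed SambaNova values.

```python
# A minimal sequential "agentic" loop: each call consumes the prior output,
# so total wall-clock time scales with per-call generation speed.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.sambanova.ai/v1",  # assumed endpoint
    api_key="YOUR_API_KEY",                  # placeholder
)

steps = [
    "Draft an outline for a short post about fast LLM inference.",
    "Expand the outline into a full draft.",
    "Critique the draft and list concrete improvements.",
]

context = ""
for step in steps:
    response = client.chat.completions.create(
        model="Meta-Llama-3.1-405B-Instruct",  # assumed model identifier
        messages=[{"role": "user", "content": f"{context}\n\n{step}".strip()}],
    )
    context = response.choices[0].message.content

print(context)
```

At 129 tokens per second, each thousand-token step adds roughly eight seconds, so chains of many such calls stay practical only when per-call generation is fast.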

We’re thinking: Models with open weights are now served faster than proprietary models and are nearly as capable. This may spur further adoption of open models as well as prompting strategies, such as agentic workflows, that require large numbers of output tokens.
