Fierce competition among model makers and cloud providers drove down the price of access to state-of-the-art models.
What happened: AI providers waged a price war to attract paying customers. A leading indicator: From March 2023 to November 2024, OpenAI cut the per-token prices of cloud access to its models by nearly 90 percent even as performance improved, input context windows expanded, and the models became capable of processing images as well as text.
Driving the story: Factors that pushed down prices include open source, more compute-efficient models, and excitement around agentic workflows that consume more tokens at inference. OpenAI’s GPT-4 Turbo set a baseline when it debuted in late 2023 at $10.00/$30.00 per million tokens of input/output. Top model makers slashed prices in turn: Google and OpenAI at the higher end of the market, companies in China at the lower end, and Amazon at both. Meanwhile, startups with specialized hardware offered open models at prices that dramatically undercut the giants.
- Competitive models with open weights helped drive prices down by enabling cloud providers to offer high-performance models without bearing the cost of developing or licensing them. Meta released Llama 3 70B in April, and various cloud providers offered it at an average price of $0.78/$0.95 per million input/output tokens. Llama 3.1 405B followed in July 2024; Microsoft Azure priced it at almost half the price of GPT-4 Turbo ($5.33/$16.00).
- Per-token prices for open weights models tumbled in China. In May, DeepSeek released DeepSeek V2 and soon dropped the price to $0.14/$0.28 per million tokens of input/output. Alibaba, Baidu, and Bytedance slashed prices for Qwen-Long ($0.06/$0.06), Ernie-Speed and Ernie-Lite (free), and Doubau ($0.11/$0.11) respectively.
- Makers of closed models outdid one another with lower and lower prices. In May, OpenAI introduced GPT-4o at $5.00/$15.00 per million tokens of input/output, half as much as GPT-4 Turbo. By August, GPT-4o cost $2.50/$10.00 and the newer GPT-4o mini cost $0.15/$0.60 (half as much for jobs with slower turnaround times).
- Google ultimately cut the price of Gemini 1.5 Pro to $1.25/$5.00 per million input/output tokens (twice as much for prompts longer than 128,000 tokens) and slashed Gemini 1.5 Flash to $0.075/$0.30 per million input/output tokens (twice as much for prompts longer than 128,000 tokens). As of this writing, Gemini 2.0 Flash is free to use as an experimental preview, and API prices have not been announced.
- In December, Amazon introduced the Nova family of LLMs. At launch, Nova Pro ($0.80/$3.20 per million tokens of input/output) cost much less than top models from OpenAI or Google, while Nova Lite ($0.06/$0.24) and Nova Micro ($0.035/$0.14 respectively) cost much less than GPT-4o mini. (Disclosure: Andrew Ng serves on Amazon’s board of directors.)
- Even as model providers cut their prices, startups including Cerebrus, Groq, and SambaNova designed specialized chips that enabled them to serve open weights models faster and more cheaply. For example, SambaNova offered Llama 3.1 405B for $5.00/$10.00 per million tokens of input/output, processing a blazing 132 tokens per second. DeepInfra offered the same model at a slower speed for as little as $2.70/$2.70.
Yes, but: The trend toward more processing-intensive models is challenged but not dead. In September, OpenAI introduced token-hungry models with relatively hefty price tags: o1-preview ($15.00/$60.00 per million tokens input/output) and o1-mini ($3.00/$12.00). In December, o1 arrived with a more accurate pro mode that’s available only to subscribers who are willing to pay $200 per month.
Behind the news: Prominent members of the AI community pushed against regulations that threatened to restrict open source models, which played an important role in bringing down prices. Opposition by developers helped to block California SB 1047, a proposed law that would have held developers of models above certain size limits liable for unintended harms caused by their models and required a “kill switch” that would enable developers to disable them — a problematic requirement for open weights models that anyone could modify and deploy. California Governor Gavin Newsom vetoed the bill in October.
Where things stand: Falling prices are a sign of a healthy tech ecosystem. It’s likely that in-demand models will always fetch relatively high prices, but the market is increasingly priced in pennies, not dollars, per million tokens.