The High Cost of Serving LLMs: How much does serving large language models at scale cost?

Amid the hype that surrounds large language models, a crucial caveat has receded into the background: the current cost of serving them at scale.

What’s new: As chatbots go mainstream, providers must contend with the expense of serving sharply rising numbers of users, The Washington Post reported.

The price of scaling: The transformer architecture, which is the basis of models like OpenAI’s ChatGPT, requires a lot of processing. The cost of its self-attention mechanism grows quadratically with the length of the input, and the architecture gains performance with higher parameter counts and bigger training datasets, giving developers ample incentive to raise the compute budget.

  • Hugging Face CEO Clem Delangue said that serving a large language model typically costs much more than customers pay.
  • SemiAnalysis, a newsletter that covers the chip market, in February estimated that OpenAI spent $0.0036 to process a GPT-3.5 prompt. At that rate, if Google were to use GPT-3.5 to answer the approximately 320,000 queries per second its search engine receives, its operating income would drop from $55.5 billion to $19.5 billion annually.
  • In February, Google cited savings on processing as the reason it based its Bard chatbot on a relatively small version of its LaMDA large language model.
  • Rising demand for chatbots means a greater need for the GPU chips that often process these models at scale. This demand is driving up the prices of both the chips and cloud services based on them.
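
The SemiAnalysis figures cited above can be checked with simple arithmetic. The following is a minimal sketch that uses only the numbers quoted in this article; the function and variable names are illustrative, and it assumes that every query costs the same and that traffic is constant throughout a 365-day year.

```python
# Rough check of the search-cost estimate quoted above.
# Inputs come from the article; names and the constant-traffic
# assumption are illustrative, not SemiAnalysis's actual method.

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000


def annual_serving_cost(cost_per_query: float, queries_per_second: float) -> float:
    """Yearly cost in dollars if every query costs the same amount."""
    return cost_per_query * queries_per_second * SECONDS_PER_YEAR


cost_per_query = 0.0036        # dollars per GPT-3.5 prompt (SemiAnalysis estimate)
queries_per_second = 320_000   # approximate Google search volume
operating_income = 55.5e9      # dollars per year, before the added cost

added_cost = annual_serving_cost(cost_per_query, queries_per_second)
remaining_income = operating_income - added_cost

print(f"Added serving cost: ${added_cost / 1e9:.1f} billion per year")
print(f"Remaining operating income: ${remaining_income / 1e9:.1f} billion per year")
```

Under these assumptions the added cost comes to roughly $36 billion per year, which is consistent with the drop from $55.5 billion to $19.5 billion reported above, allowing for rounding.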

Why it matters: Tech giants are racing to integrate large language models into search engines, email, document editing, and an increasing variety of other services. Serving customers may require taking losses in the short term, but winning in the market ultimately requires balancing costs against revenue.

We’re thinking: Using large language models to fulfill web searches is expensive, and because Google, Bing, and DuckDuckGo serve searches for free, there is pressure to cut the cost per query. For developers looking to call these models, however, the expense looks quite affordable. In our back-of-the-envelope calculation, the cost to generate enough text to keep someone busy for an hour is around $0.08.
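
The article does not give the inputs behind its per-hour figure, so the sketch below only shows the shape of such a back-of-the-envelope estimate. All three inputs are assumptions chosen for illustration: a reading speed of 250 words per minute, about 4/3 tokens per English word, and a price of $0.002 per 1,000 generated tokens.

```python
# Illustrative estimate of the cost of generating an hour's worth of reading.
# Every input below is an assumption, not a figure taken from the article.

WORDS_PER_MINUTE = 250          # assumed adult reading speed
TOKENS_PER_WORD = 4 / 3         # assumed average for English text
PRICE_PER_1K_TOKENS = 0.002     # assumed API price in dollars


def cost_per_reading_hour(words_per_minute: float,
                          tokens_per_word: float,
                          price_per_1k_tokens: float) -> float:
    """Dollars needed to generate as much text as one person reads in an hour."""
    words_per_hour = words_per_minute * 60
    tokens_per_hour = words_per_hour * tokens_per_word
    return tokens_per_hour / 1000 * price_per_1k_tokens


hourly_cost = cost_per_reading_hour(
    WORDS_PER_MINUTE, TOKENS_PER_WORD, PRICE_PER_1K_TOKENS
)
print(f"Estimated cost per hour of reading: ${hourly_cost:.2f}")
```

With these assumed inputs the estimate comes to about $0.04 per hour, the same order of magnitude as the figure above. Different assumptions, such as also paying for prompt tokens, would shift the result.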
