New research aims to help users select large language models in a way that minimizes expenses while maintaining output quality.
What's new: Lingjiao Chen, Matei Zaharia, and James Zou at Stanford proposed FrugalGPT, a cost-saving method that calls pretrained large language models (LLMs) sequentially, from least to most expensive, and stops when one provides a satisfactory answer.
Key insight: In many applications, a less-expensive LLM can produce satisfactory output most of the time, while a more-expensive LLM produces satisfactory output more consistently. Thus, calling multiple models selectively can cut processing costs substantially. Arrange the LLMs from least to most expensive and start with the cheapest; a separate model evaluates its output, and if the output is unsatisfactory, the system automatically calls the next, more expensive LLM, and so on (a minimal sketch of this cascade follows below).
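In code, the cascade amounts to a loop over models ordered by price. Below is a minimal Python sketch of the idea; `call_model` and `score` are hypothetical stand-ins for an LLM API call and a learned quality scorer, and the model names and thresholds are illustrative rather than the authors' settings.

```python
# Minimal cascade sketch. `call_model` and `score` are hypothetical
# stand-ins for an LLM API call and a learned quality scorer.

def cascade(prompt, models, thresholds):
    """Query models from cheapest to most expensive and return the
    first answer whose quality score clears that model's threshold."""
    answer = None
    for model, threshold in zip(models, thresholds):
        answer = call_model(model, prompt)        # one API call
        if score(prompt, answer) >= threshold:    # good enough? stop here
            return answer
    return answer  # otherwise, fall back to the most expensive model's answer

# Illustrative usage, with models ordered by price per token:
# answer = cascade("Does this headline imply gold prices rose?",
#                  models=["gpt-j", "gpt-3.5-turbo", "gpt-4"],
#                  thresholds=[0.9, 0.8, 0.0])
```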
How it works: The authors used a suite of 12 commercial LLMs, a model that evaluated their output, and an algorithm that selected and ordered them. At the time, the LLMs’ costs (which are subject to change) spanned more than two orders of magnitude: GPT-4 cost $30/$60 per 1 million tokens of input/output, while GPT-J hosted by Textsynth cost $0.20/$5 per 10 million tokens of input/output (that is, $0.02/$0.50 per 1 million).
- To classify an LLM’s output as satisfactory or unsatisfactory, the authors fine-tuned separate copies of DistilBERT on a diverse selection of datasets: one that paired news headlines with subsequent changes in the price of gold, another that labeled excerpts from court documents according to whether they rejected a legal precedent, and a third made up of questions and answers. Given an input/output pair (such as a question and answer), they fine-tuned DistilBERT to produce a high score if the output was correct and a low score if it wasn’t. The output was deemed satisfactory if its score exceeded a threshold (a sketch of such a scorer appears after this list).
- A custom algorithm (which the authors don’t describe in detail) learned to choose three of the LLMs and put them in order. For each dataset, it maximized the percentage of times the three-LLM sequence generated correct output while staying within a set budget (a naive version of this search also appears after this list).
- The first LLM received an input. If its output was unsatisfactory, the second LLM received the input. If the second LLM’s output was unsatisfactory, the third LLM received the input.
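As one way to implement the scorer, here is a hedged sketch using Hugging Face’s transformers library. It assumes a DistilBERT with a single output logit, fine-tuned on (input, output, correct-or-not) examples so that the sigmoid of the logit approximates the probability the output is correct; the threshold value is illustrative, and this is not necessarily the authors’ exact setup.

```python
# Sketch of a DistilBERT answer scorer (not the authors' exact setup).
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
# One output logit; after fine-tuning on (input, output, correct?) triples,
# sigmoid(logit) estimates the probability that the output is correct.
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=1
)

def is_satisfactory(prompt: str, answer: str, threshold: float = 0.9) -> bool:
    # Encode the input/output pair as two segments, as in sentence-pair tasks.
    inputs = tokenizer(prompt, answer, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logit = model(**inputs).logits.squeeze()
    return torch.sigmoid(logit).item() >= threshold
```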
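Since the paper leaves the selection algorithm unspecified, the sketch below shows only a naive brute-force baseline that captures the objective: among all ordered triples of LLMs and a coarse grid of score thresholds, pick the combination with the highest validation accuracy whose average cost fits the budget. `simulate`, a helper that replays a validation set through a candidate cascade and returns (accuracy, average cost), is hypothetical.

```python
# Naive stand-in for the paper's (undescribed) selection step.
# `simulate` is a hypothetical helper: it replays a validation set
# through a candidate cascade and returns (accuracy, average cost).
from itertools import permutations, product

def select_cascade(llms, validation_set, budget):
    grid = [0.5, 0.7, 0.9]                # illustrative threshold grid
    best, best_acc = None, -1.0
    for triple in permutations(llms, 3):  # every ordered choice of 3 models
        # The last model needs no threshold: its answer is the fallback.
        for thresholds in product(grid, repeat=2):
            acc, cost = simulate(triple, thresholds, validation_set)
            if cost <= budget and acc > best_acc:
                best, best_acc = (triple, thresholds), acc
    return best
```

With 12 LLMs, that’s 1,320 ordered triples times 9 threshold pairs, or roughly 12,000 candidate cascades, few enough to search exhaustively.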
Results: For each of the three datasets, the authors measured the accuracy of the best individual LLM and then found the cost at which FrugalGPT matched that accuracy. Relative to the most accurate LLM, FrugalGPT saved 98.3 percent, 73.3 percent, and 59.2 percent of the cost, respectively.
Why it matters: Many teams choose a single model to balance cost and quality (and perhaps speed). This approach offers a way to save money without sacrificing performance.
We're thinking: Not all queries require a GPT-4-class model. Now we can pick the right model for the right prompt.