Dear friends,
Fine-tuning small language models has been gaining traction over the past half year. I’d like to share my sense of when to use this technique, and also when not to, based on what I’m seeing in multiple companies.
First, while fine-tuning is an important and valuable technique, many teams that are currently using it probably could get good results with simpler approaches, such as prompting (including writing mega prompts), few-shot prompting, or simple agentic workflows.
Why shouldn’t these teams be fine-tuning? Because fine-tuning, which takes a pre-trained model and further trains it on data specific to an application, is relatively complex to implement. You need to collect training data, find a provider to run the fine-tuning (unless you want to implement it yourself), and then find a way to deploy the fine-tuned model. Because it adds extra complexity in both training and deployment, I usually resort to this technique only after I find that prompting and simple agentic workflows are not up to the task.
Having said that, there are also applications where fine-tuning is appropriate and valuable. LoRA (which adapts a model by training a small number of additional low-rank parameters while leaving the original weights unchanged) and related methods have made fine-tuning quite affordable, particularly for small models (say, 13B or fewer parameters). And the amount of data needed to get started is smaller than most people think. Depending on the application, I’ve seen good results with 100 or even fewer examples. (A rough code sketch of what this can look like appears after the list below.) Here are a few applications where I have seen fine-tuning applied successfully:
Improving accuracy of critical applications. Prompting can get you really far for many applications. But sometimes, fine-tuning helps eke out that last bit of accuracy. For example, if you are building a customer service chatbot and need it to call the right API reliably (say, to carry out transactions, issue refunds, and the like), perhaps prompting can get it to make the right API call 95% of the time. But if you struggle to raise the accuracy even with revisions to the prompt and you really need 99% accuracy, fine-tuning on a dataset of conversations and API calls (sketched in code after this list) might be a good way to get there. This is particularly true for tasks where it's hard to specify, using only language, an unambiguous rule for what to do. For example, when a customer is frustrated, should the chatbot escalate to a manager or just issue a refund? Teams often write Standard Operating Procedures (SOPs) for human workers to follow, and these SOPs can go into the prompts of models. But if it is hard to write an unambiguous SOP, and even humans need to see numerous examples before they learn what to do, fine-tuning can be a good approach. Fine-tuning also works well for many text-classification applications, for example, classifying medical records into diagnosis and procedure codes for health insurance claims.
Learning a particular style of communication. As I explain in “Generative AI for Everyone,” my team fine-tuned a model to sound like me. Many people (including myself) have idiosyncratic uses of language. There are certain words I tend to say and others I tend not to, and these idiosyncrasies are numerous and very difficult to specify in a text prompt. (By the way, the avatar at deeplearning.ai/avatar, built with RealAvatar, uses fine-tuning for this reason.) To get a system to communicate in a certain style, fine-tuning is often a superior solution to prompting alone.
Reducing latency or cost during scale-ups. I’ve seen applications where developers have successfully prompted a large model to perform a complex task. But as usage scales up, if the large model is too slow (which often happens) or too expensive (which also happens but less frequently), the team might want to use a smaller model. If, however, the performance of the smaller model isn't good enough, then fine-tuning it can help bring it up to the performance of the larger one for that narrow application. Further, the larger model (or perhaps an agentic workflow) can also be used to generate data to help with fine-tuning the small model for that task.
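To make the fine-tuning path concrete, here is a rough sketch, using Hugging Face's transformers, peft, and datasets libraries, of LoRA fine-tuning a small model on a handful of conversation-to-API-call examples like the customer-service case above. The model name, example data, and hyperparameters are illustrative placeholders I made up, not recommendations.

```python
# Rough sketch: LoRA fine-tuning a small model on conversation -> API-call examples.
# Model name, examples, and hyperparameters are illustrative placeholders.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "your-small-open-weights-model"  # hypothetical; substitute a real checkpoint

# A few labeled examples: the conversation so far, and the API call the model should emit.
# In practice, you might start with roughly 100 of these.
examples = [
    {"prompt": "Customer: My order #123 arrived broken. I want a refund.\nAgent:",
     "completion": " issue_refund(order_id='123', reason='damaged item')"},
    {"prompt": "Customer: I never received order #456. Where is it?\nAgent:",
     "completion": " get_order_status(order_id='456')"},
]

tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

def tokenize(example):
    # Train on the concatenated prompt + completion as a single sequence.
    return tokenizer(example["prompt"] + example["completion"] + tokenizer.eos_token,
                     truncation=True, max_length=512)

dataset = Dataset.from_list(examples).map(tokenize, remove_columns=["prompt", "completion"])

# LoRA trains small low-rank adapter matrices instead of the full set of model weights.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM"))

trainer = Trainer(
    model=model,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
    args=TrainingArguments(output_dir="finetuned-agent", num_train_epochs=3,
                           per_device_train_batch_size=1, learning_rate=1e-4),
)
trainer.train()
model.save_pretrained("finetuned-agent")  # saves only the small LoRA adapter weights
```

Because only the small adapter is trained and saved, it can later be loaded on top of the same base model for serving.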
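And if you don't have labeled examples to start from, the larger model you already prompted successfully can generate them, as mentioned above. Here is one possible sketch using the OpenAI Python client; the model name, system prompt, and seed requests are assumptions, and you should check your provider's terms before training on generated outputs.

```python
# Rough sketch: use a larger model to generate training data for fine-tuning a smaller one.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

seed_requests = [
    "My order arrived broken and I want my money back.",
    "I was charged twice for the same subscription.",
]

with open("synthetic_finetune_data.jsonl", "w") as f:
    for request in seed_requests:
        response = client.chat.completions.create(
            model="some-large-model",  # hypothetical: a large model that already does the task well
            messages=[
                {"role": "system",
                 "content": "You are a customer-service agent. Reply ONLY with the API call "
                            "to make, e.g. issue_refund(order_id=..., reason=...)."},
                {"role": "user", "content": request},
            ],
        )
        api_call = response.choices[0].message.content
        # Each line becomes one prompt/completion pair for fine-tuning the small model.
        f.write(json.dumps({"prompt": request, "completion": api_call}) + "\n")
```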
At the cutting edge of research, some teams are fine-tuning models to get better at a certain language. But with few exceptions, if the goal is to get an LLM to better understand a body of knowledge that is not in its training data, I find that using RAG (retrieval augmented generation) is a much simpler approach, and I still occasionally run into teams using fine-tuning where I think RAG would work better.
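For contrast, here is roughly what that RAG alternative looks like: embed the documents once, retrieve the chunks most relevant to each question, and pass them to an off-the-shelf model in the prompt. The embedding library, model, and documents below are just one illustrative choice among many.

```python
# Rough sketch of retrieval augmented generation (RAG): retrieve relevant text,
# then put it in the prompt of an off-the-shelf LLM. No fine-tuning required.
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "Refunds are issued within 5 business days of approval.",
    "Premium subscribers can escalate directly to a manager.",
    "Orders can be cancelled free of charge within 24 hours.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = embedder.encode(documents, normalize_embeddings=True)

def retrieve(question, k=2):
    # Cosine similarity reduces to a dot product on normalized embeddings.
    q = embedder.encode([question], normalize_embeddings=True)[0]
    scores = doc_embeddings @ q
    top = np.argsort(-scores)[:k]
    return [documents[i] for i in top]

question = "How long do refunds take?"
context = "\n".join(retrieve(question))
prompt = (f"Answer using only the context below.\n\n"
          f"Context:\n{context}\n\nQuestion: {question}")
print(prompt)  # this prompt is then sent to an off-the-shelf LLM
```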
Overall, my sense is that, of all the teams I see using fine-tuning, perhaps 75% could get good results using simpler techniques (like prompting or agentic workflows), but for the remaining 25%, I know of no better way to achieve their goal.
It is still technically challenging to implement fine-tuning, get the hyperparameters right, optimize the compute resources, and so on. We are lucky that more and more companies have worked hard to optimize these steps and now provide efficient fine-tuning services. Many of them allow us to fine-tune open-weights models and also download the fine-tuned weights. Some allow us to fine-tune their closed models and continue to keep the tuned weights closed. Both can be useful, but the former has the obvious advantages of portability and not having to worry that the provider will stop serving a particular model, causing a critical component of our software to become deprecated.
In conclusion, before fine-tuning, consider whether trying just a bit harder with prompting or agentic workflows could lead to a simpler solution that is easier to maintain. The vast majority of applications my teams build do not use any fine-tuning at all, but it’s a critical piece of a small minority of them.
Keep learning!
Andrew