Pretrained language models like GPT-3 have shown notable proficiency in few-shot learning. Given a prompt that includes a few example questions and answers (the shots) plus an unanswered question (the task), such models can generate an accurate answer. But there may be more behind such results than the shots alone.
What’s new: Ethan Perez, Douwe Kiela, and Kyunghyun Cho subjected GPT-style language models to a test they call true few-shot learning. They found that such models’ heralded few-shot success may depend on a well-engineered prompt. The authors are based at New York University, Facebook, and CIFAR, respectively.
Key insight: Training a machine-learning model typically requires a validation set to tune hyperparameters such as the learning rate. For GPT-style models, those hyperparameters include the prompt format. In few-shot learning with a pretrained model, the prompt typically contains a handful of examples. However, researchers often experiment extensively to find a prompt format that yields accurate responses. This amounts to stacking the deck in the model’s favor, and without it, such models don’t perform as well.
How it works: The authors evaluated four sizes of GPT-3, four sizes of GPT-2, and DistilGPT-2. They tested prompt formats from LAMA, a benchmark that comprises factual statements in a variety of formats, and LPAQA, which contains LAMA statements translated from English into a different language and back.
- LAMA provides statements in 41 categories, such as “X was born in Y,” where X is a personal name and Y is a place, and “X was created by Y,” where X is the name of a product and Y is the name of the company that makes it. It presents each statement in an average of 12 formats. For instance, “X was created by Y” is also formatted “X is developed by Y” and “X is being developed by Y.”
- The authors assembled prompts of five such statements, all in the same category and format, with the final statement missing its last word; for example, “The iPhone is being developed by _,” where the missing word is, of course, “Apple.” They provided versions of these prompts in all 120 possible orders of the five statements, always leaving the final statement incomplete, and prompted the model to fill in the blank.
- They used cross-validation to find the prompt format that, given four complete statements and one incomplete one, yielded the best performance on average (see the sketch after this list).
- For each model, they compared three measures of test-set accuracy: accuracy using the format chosen by cross-validation, accuracy using the format that performed best on the test set, and mean accuracy across all formats and categories.
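For readers who want to picture the selection step, here’s a minimal sketch in Python of cross-validation over prompt formats. It illustrates the idea rather than reproducing the authors’ code: the example statements, the format strings, and the `answers_correctly` callback (which would wrap a call to a GPT-style model) are hypothetical stand-ins, and the leave-one-out loop is one way to read how five statements yield 120 orderings.

```python
from itertools import permutations
from statistics import mean

# Hypothetical few-shot examples for one LAMA-style category ("X is developed by Y").
EXAMPLES = [("The iPhone", "Apple"), ("Windows", "Microsoft"),
            ("Android", "Google"), ("Photoshop", "Adobe"), ("Kindle", "Amazon")]

# A few of the alternative formats LAMA provides for this category.
FORMATS = [
    "{x} was created by {y}",
    "{x} is developed by {y}",
    "{x} is being developed by {y}",
]


def build_prompt(fmt, shots, query_subject):
    """Four completed statements followed by one that is missing its last word."""
    lines = [fmt.format(x=x, y=y) + "." for x, y in shots]
    lines.append(fmt.format(x=query_subject, y="").rstrip())  # leave the blank open
    return "\n".join(lines)


def cross_validated_accuracy(fmt, examples, answers_correctly):
    """Hold out each example in turn and sweep all orderings of the remaining four
    (5 held-out choices x 24 orderings = 120 prompts per format, as described above).
    `answers_correctly(prompt, answer)` should query the language model and return
    True if it fills the blank with `answer`."""
    results = []
    for i, (subject, answer) in enumerate(examples):
        shots = examples[:i] + examples[i + 1:]
        for ordering in permutations(shots):
            results.append(answers_correctly(build_prompt(fmt, ordering, subject), answer))
    return mean(results)


def select_format(formats, examples, answers_correctly):
    """True few-shot selection: choose a format using only the examples themselves."""
    return max(formats,
               key=lambda f: cross_validated_accuracy(f, examples, answers_correctly))
```

The crucial constraint is that `select_format` sees only the five examples; picking a format by peeking at test-set accuracy is precisely the deck-stacking the authors call out.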
Results: For every model tested, accuracy using the format selected by cross-validation was only marginally above the mean and significantly below accuracy using the best format. For instance, for the largest model (GPT-3 with 175 billion parameters), the format chosen by cross-validation scored about 55 percent, mean accuracy was about 54 percent, and the best format scored about 60 percent.
Why it matters: Previous claims of few-shot learning in GPT-style models left out an important variable: the amount of data used to pick a good prompt format. Choosing among 12 prompt formats boosted accuracy by around 5 percentage points; choosing among a larger set of formats could make a bigger difference. If researchers don’t report all the information that went into their results, follow-up studies are unlikely to duplicate their work.
We’re thinking: We like prompt engineering that gets things done on time. We’re less enamored with prompt engineering that muddies the water around few-shot learning.