Researchers increasingly fine-tune models on synthetic data, but generated datasets may not be sufficiently diverse. New work used agentic workflows to produce diverse synthetic datasets.
What’s new: Arindam Mitra, Luciano Del Corro, Guoqing Zheng, and colleagues at Microsoft introduced AgentInstruct, a framework for producing synthetic data for fine-tuning large language models (LLMs).
Key insight: To generate synthetic data for fine-tuning, researchers typically feed an LLM a selection of existing prompts and ask it to generate responses (and possibly further prompts). While training on the resulting dataset can improve model performance, the synthetic data’s distribution may not match that of real-world data, yielding inconsistent performance. A more methodical approach can generate data closer to the real-world distribution: First generate prompts from each example in a large, diverse dataset, then generate responses.
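A minimal sketch of this data-first ordering, assuming a hypothetical call_llm helper that wraps whatever LLM API is in use (the paper does not specify the model or prompts used):

```python
# Sketch of the data-first ordering described above. `call_llm` and the
# prompt wording are assumptions, not the authors' exact setup.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wrap your LLM API of choice here")

def generate_pairs(seed_documents: list[str]) -> list[dict]:
    pairs = []
    for doc in seed_documents:
        # Step 1: derive a new instruction from each real-world document,
        # so the prompt distribution tracks the source data's diversity.
        instruction = call_llm(
            f"Write one challenging instruction grounded in this text:\n{doc}"
        )
        # Step 2: only then generate the response that serves as the target.
        response = call_llm(f"{doc}\n\nInstruction: {instruction}")
        pairs.append({"prompt": f"{doc}\n\n{instruction}", "response": response})
    return pairs
```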
How it works: The authors generated a synthetic text dataset based on three unlabeled datasets (including code) scraped from the web. They produced new examples for 17 tasks, including natural language tasks like reading comprehension and word puzzles as well as coding, tool use, and estimating measurements.
- Using an unspecified LLM, they generated prompts (text plus an instruction) via three agentic workflows they called content transformation (which created variations on the text that offered wider latitude for generating instructions), instruction generation, and instruction refinement (which made the instructions more complicated or unsolvable). A code sketch of the pipeline follows this list.
- For each task, they manually defined a team of agents to perform each workflow. For example, for the reading comprehension task, content transformation agents transformed raw text into a poem, satire, or other stylistic or formal variation. Instruction generation agents generated questions to ask about the transformed text based on an author-defined list of 43 types of questions. Instruction refinement agents received each (text, question) pair and produced more pairs by (i) modifying the passage to make the question unanswerable, (ii) modifying the passage so the correct answer became the opposite of the original answer, or (iii) modifying the question to be more complicated or unanswerable.
- The authors combined the 22 million resulting (text, instruction) prompts with prompts used to train Orca-1, Orca-2, and Orca-Math, for a total of 25.8 million prompts. Then they generated responses, fine-tuned Mistral-7B on the combined dataset, and called the fine-tuned model Orca-3.
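The sketch below chains the three workflows for a single reading-comprehension example. The agent functions, prompts, and question types are assumptions for illustration; the authors manually defined per-task agent teams rather than using this exact code.

```python
# Hypothetical sketch of the three-stage agentic workflow for one
# reading-comprehension example. All prompts here are illustrative.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wrap your LLM API of choice here")

def content_transformation(raw_text: str) -> str:
    # Rewrite the raw text in a different style (poem, satire, etc.)
    # to widen the space of instructions that can be asked about it.
    return call_llm(f"Rewrite the following text as a satirical passage:\n{raw_text}")

def instruction_generation(passage: str, question_types: list[str]) -> list[str]:
    # Ask for one question per predefined question type.
    return [
        call_llm(f"Write a {qtype} question about this passage:\n{passage}")
        for qtype in question_types
    ]

def instruction_refinement(passage: str, question: str) -> list[tuple[str, str]]:
    # Produce harder variants: an unanswerable version of the passage
    # and a more complicated version of the question.
    harder_passage = call_llm(
        f"Modify this passage so the question becomes unanswerable:\n"
        f"Passage: {passage}\nQuestion: {question}"
    )
    harder_question = call_llm(f"Make this question more complicated:\n{question}")
    return [(harder_passage, question), (passage, harder_question)]

def build_prompts(raw_text: str, question_types: list[str]) -> list[tuple[str, str]]:
    # Chain the three workflows to turn one scraped document into many prompts.
    passage = content_transformation(raw_text)
    prompts = []
    for question in instruction_generation(passage, question_types):
        prompts.append((passage, question))
        prompts.extend(instruction_refinement(passage, question))
    return prompts
```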
Results: The authors compared Orca-3’s performance to that of competing models on 14 benchmarks. Orca-3 outperformed Mistral-7B (fine-tuned on prompts from previous versions of Orca) and Mistral-7B-Instruct (fine-tuned to follow instructions) on 13 benchmarks, in some cases by large margins: for instance, 40 percent on AGIEval, 54 percent on GSM8K, and 19 percent on MMLU. Orca-3 fell short of GPT-4 on 12 benchmarks.
Why it matters: The authors defined agentic workflows that turn text into diverse data for fine-tuning models. Their framework offers a pattern for AI engineers who want to build synthetic datasets for other tasks.
We’re thinking: We’re excited to see agentic workflows find applications that a wide variety of AI developers might put to use!