Literary works are challenging to translate. Their length, cultural nuances, idiomatic expressions, and each author’s individual style call for skills beyond swapping words in one language for semantically equivalent words in another. Researchers built a machine translation system to address these issues.
What’s new: Minghao Wu and colleagues at Monash University, University of Macau, and Tencent AI Lab proposed TransAgents, which uses a multi-agent workflow to translate novels from Chinese to English. You can try a demo here.
Key insight: Prompting a large language model (LLM) to translate literature often results in subpar quality. Employing multiple LLMs to mimic human roles involved in translation breaks down this complex problem into more tractable parts. For example, separate LLMs (or instances of a single LLM) can act as agents that take on roles such as translator and localization specialist, and they can check and revise each other’s work. An agentic workflow raises unsolved problems such as how to evaluate individual agents’ performance and how to measure translation quality. This work offers a preliminary exploration.
How it works: TransAgents prompted pretrained LLMs to act like a translation company working on a dataset of novels. The set included 20 Chinese novels, each containing 20 chapters, accompanied by human translations into English.
- GPT-4 Turbo generated text descriptions of 30 workers. Each description specified attributes such as role, areas of specialty, education, years of experience, nationality, gender, and pay scale. The authors prompted 30 instances of GPT-4 Turbo to take on one of these personas. Two additional instances acted as the company’s CEO and personnel manager (or “ghost agent” in the authors’ parlance).
- Given a project, the system assembled a team. First it prompted the CEO to select a senior editor, taking into account the source and target languages and the worker profiles. The personnel manager evaluated the CEO’s choices and, if it judged them suboptimal, prompted the CEO to reconsider. Then the system prompted the CEO and senior editor to select the rest of the team, talking back and forth until they agreed on a junior editor, translator, localization specialist, and proofreader.
- Next the system generated a guide document to be included in every prompt going forward. The junior editor generated and the senior editor refined a summary of each chapter and a glossary of important terms and their translations in the target language. Given the chapter summaries, the senior editor synthesized a plot summary. In addition, the senior editor generated guidelines for tone, style, and target audience using a randomly chosen chapter as reference.
- The team members collaborated to translate the novel chapter by chapter (a sketch of this chain follows the list). The translator proposed an initial translation. The junior editor reviewed it for accuracy and adherence to the guidelines. The senior editor evaluated the work so far and revised it. The localization specialist adapted the text to fit the audience’s cultural context. The proofreader checked for language errors. Then the junior and senior editors critiqued the work of the localization specialist and proofreader and revised the draft accordingly.
- Finally, the senior editor reviewed the work, assessing the quality of each chapter and ensuring smooth transitions between chapters.
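In code, the chapter loop amounts to chaining role-specific prompts, with each agent revising the previous agent’s output. Below is a minimal sketch, not the authors’ implementation: the `chat` helper, the persona prompts, and the use of OpenAI’s chat-completions API with the gpt-4-turbo model are assumptions for illustration.

```python
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY environment variable is set

def chat(system: str, user: str) -> str:
    """One agent turn: a persona-specific system prompt plus the task at hand."""
    response = client.chat.completions.create(
        model="gpt-4-turbo",  # stand-in; the paper prompts GPT-4 Turbo via its own setup
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return response.choices[0].message.content

# Toy personas; the paper's worker profiles also specify specialty, experience,
# nationality, gender, and pay scale.
ROLES = {
    "translator": "You are a veteran Chinese-to-English literary translator.",
    "junior_editor": "You are a junior editor checking accuracy and adherence to the guide.",
    "senior_editor": "You are a senior editor who revises drafts for overall quality.",
    "localization": "You are a localization specialist adapting text for English-speaking readers.",
    "proofreader": "You are a proofreader who corrects language errors.",
}

def translate_chapter(chapter: str, guide: str) -> str:
    """Run one chapter through the translate -> edit -> localize -> proofread chain."""
    draft = chat(
        ROLES["translator"],
        f"Guide:\n{guide}\n\nTranslate the following chapter into English:\n{chapter}",
    )
    for role in ("junior_editor", "senior_editor", "localization", "proofreader"):
        draft = chat(
            ROLES[role],
            f"Guide:\n{guide}\n\nRevise this draft and return the full revised text:\n{draft}",
        )
    return draft
```

In the actual system, the shared guide document (glossary, chapter and plot summaries, style guidelines) rides along in every prompt, and the editors critique the localization and proofreading passes before the final revision.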
Results: Professional translators compared TransAgents’ output with that of human translators and GPT-4 Turbo in a blind test. One said TransAgents “shows the greatest depth and sophistication,” while another praised its “sophisticated wording and personal flair” that “effectively conveys the original text’s mood and meaning.”
- Human judges, who read short translated passages without referring to the original texts, preferred TransAgents’ output, on average, to that of human translators and GPT-4 Turbo, though more for fantasy romance novels (which they preferred 77.8 percent of the time) than science fiction (39.1 percent of the time).
- GPT-4 Turbo, which did refer to the original texts while comparing TransAgents’ translations with the work of human translators and its own translations, also preferred TransAgents on average.
- TransAgents’ outputs were not word-for-word translations of the inputs but less-precise interpretations. Accordingly, the system fared poorly on d-BLEU, a traditional metric that scores a translation against a reference text by matching sequences of words (higher is better; the metric is sketched below). TransAgents achieved a d-BLEU score of 25, well below GPT-4 Turbo's 47.8 and Google Translate's 47.3.
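For readers unfamiliar with the metric: d-BLEU is BLEU computed at the document level, so matching word sequences are counted across a whole chapter or book rather than sentence by sentence. Here is a minimal sketch using the sacrebleu library; treating each document as a single string and the toy example are assumptions, not the authors’ evaluation script.

```python
import sacrebleu  # pip install sacrebleu

def d_bleu(system_doc: str, reference_doc: str) -> float:
    """Document-level BLEU: score the whole translated document as one segment."""
    result = sacrebleu.corpus_bleu([system_doc], [[reference_doc]])
    return result.score

# Toy example; real inputs would be entire translated chapters and their references.
hypothesis = "The moon rose over the quiet courtyard, and the old gate creaked."
reference = "The moon rose above the silent courtyard as the old gate creaked."
print(f"d-BLEU: {d_bleu(hypothesis, reference):.1f}")
```

Because the metric rewards exact word-sequence overlap with a reference, freer interpretations like TransAgents’ score low even when human readers prefer them.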
Why it matters: While machine translation of ordinary text and conversations has made great strides in the era of LLMs, literary translation remains a frontier. An agentic workflow that breaks down the task into subtasks and delegates them to separate LLM instances makes the task more manageable and appears to produce results that appeal to human judges (and an LLM as well). That said, this is preliminary work that suggests a need for new ways to measure the quality of literary translations.
We’re thinking: Agentic workflows raise pressing research questions: What is the best way to divide a task for different agents to tackle? How much does the specific prompt at each stage affect the final output? Good answers to questions like these will lead to powerful applications.