
Published Jun 12, 2024
The LLM Will See You Now: AMIE, a chatbot that outperforms doctors in diagnostic conversations

A critical step in diagnosing illnesses is a conversation between doctor and patient to assemble a medical history, discuss approaches to managing symptoms, and so on. Can a large language model play the doctor’s role? Researchers trained one to do surprisingly well.

What's new: Articulate Medical Intelligence Explorer (AMIE), a chatbot built by Google researchers Tao Tu, Anil Palepu, Mike Schaekermann, and colleagues, showed better diagnostic ability and bedside manner than doctors in conversations with patients. The conversations covered a range of complaints including cardiovascular, respiratory, gastroenterological, neurological, urological, obstetric, and gynecological conditions.

Key insight: A pretrained LLM that’s fine-tuned on conversations between doctors and patients can learn to mimic the doctor’s role. However, such models are limited: available datasets of real-world medical conversations don’t cover the full range of medical scenarios, and they include ambiguities, interruptions, implicit references, and the like, all of which pose difficulties for learning. Conversations generated by a pretrained LLM can cover more conditions in more articulate language, so, after fine-tuning on real-world conversations, further tuning on generated conversations can improve performance. In addition, critiquing the “doctor’s” performance after each conversation can improve its ability to render diagnoses, suggest plans for managing symptoms, empathize with patients, and otherwise perform its role.

How it works: The authors fine-tuned a pretrained PaLM-2 on medical multiple-choice questions that describe symptoms, possible causes, and evidence for the correct diagnosis, as well as datasets for tasks like summarizing and continuing medical dialogs. They further fine-tuned the model on its own output via a self-play loop (a code sketch follows the list below).

  • Given a medical condition, the authors searched the web to retrieve background information about symptoms, management, and patient demographics. Using that information, they prompted PaLM-2 to generate a patient scenario like scenarios used to assess real-world medical interviewing skills. 
  • The authors prompted separate instances of PaLM-2 to play doctor and patient. They fed the generated scenario to the patient and prompted the models to produce a conversation. After each turn, a third instance of PaLM-2 decided whether the conversation was over based on whether the doctor had given a diagnosis and the patient had further questions (or either had said “goodbye”). 
  • Given the generated conversation, a fourth instance of PaLM-2 generated a critique of the doctor model’s empathy, professionalism, repetition, conversation flow, factual accuracy, and whether the doctor had asked questions that led to a diagnosis.  
  • Given the critique, the doctor initiated a second iteration of its conversation with the patient. 
  • The authors fine-tuned PaLM-2 to predict the next token in the second conversation. Then they repeated the process from the beginning a number of times, generating fresh conversations and fine-tuning the model. 
  • At inference, users conversed with the doctor model. Once the conversation was complete, the authors prompted the model to list 10 potential diagnoses.
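
The following is a minimal sketch of that self-play loop, assuming hypothetical helpers: call_llm and fine_tune are placeholders standing in for PaLM-2 inference and next-token fine-tuning, and the prompts paraphrase the roles described above rather than quoting the authors' actual prompts.

```python
# Illustrative sketch of AMIE-style self-play data generation (not the authors' code).
# call_llm and fine_tune are hypothetical placeholders.

def call_llm(prompt: str) -> str:
    """Placeholder for a call to the base model (PaLM-2 in the paper)."""
    return "..."

def fine_tune(model, dialogs):
    """Placeholder for fine-tuning the model to predict the next token in each dialog."""
    return model

def generate_scenario(condition: str, web_notes: str) -> str:
    # Ground a patient vignette in background information retrieved from the web.
    return call_llm(
        f"Using these notes on {condition}:\n{web_notes}\n"
        "Write a realistic patient scenario for a medical interview."
    )

def self_play_dialog(scenario: str, critique: str = "") -> str:
    """Doctor and patient instances converse; a moderator instance decides when to stop."""
    transcript = []
    while True:
        history = "\n".join(transcript)
        doctor = call_llm(
            f"You are the doctor. Critique of your previous attempt (may be empty): {critique}\n"
            f"Conversation so far:\n{history}\nDoctor:"
        )
        patient = call_llm(
            f"You are the patient in this scenario:\n{scenario}\n"
            f"Conversation so far:\n{history}\nDoctor: {doctor}\nPatient:"
        )
        transcript += [f"Doctor: {doctor}", f"Patient: {patient}"]
        done = call_llm(
            "Has the doctor given a diagnosis, and does the patient have no further "
            "questions (or has either said goodbye)?\n"
            + "\n".join(transcript)
            + "\nAnswer yes or no:"
        )
        if done.strip().lower().startswith("yes") or len(transcript) > 40:
            return "\n".join(transcript)

def self_play_round(model, conditions_with_notes):
    """One outer iteration: generate critiqued conversations, then fine-tune on them."""
    training_dialogs = []
    for condition, notes in conditions_with_notes:
        scenario = generate_scenario(condition, notes)
        first_pass = self_play_dialog(scenario)
        critique = call_llm(
            "Critique the doctor's empathy, professionalism, repetition, conversation "
            f"flow, factual accuracy, and diagnostic questioning:\n{first_pass}"
        )
        second_pass = self_play_dialog(scenario, critique)
        training_dialogs.append(second_pass)
    return fine_tune(model, training_dialogs)
```

In the paper, this outer loop repeats a number of times, with each round's fine-tuned model generating the conversations for the next round.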

Results: Specialist physicians evaluated the doctor model’s performance in 149 conversations with human actors who played patients based on scenarios supplied by clinical providers. They compared the model’s performance with that of 20 primary care physicians, based on the physicians’ own conversations with the same actors.

  • The model included the correct diagnosis among its top three in about 90 percent of cases. The physicians included the correct diagnosis among their top three in 77 percent of cases. 
  • Specialist physicians also rated the conversations on 32 subjective qualities including relationship fostering, responding to emotions, understanding patient concerns, and explaining relevant information accurately. AMIE rated higher on 28 of the 32. For instance, the specialists rated AMIE’s responses to emotions as favorable or very favorable about 83 percent of the time, compared to 31 percent of the time for the physicians. 
  • The actors also rated the conversations they had with AMIE and the physicians on 26 qualities including whether the doctor had explained the condition and treatment, appeared honest and trustworthy, expressed caring and commitment, and valued the patient as a person. AMIE outperformed the physicians on 24 of the 26. For instance, the actors said that AMIE valued them as people 79 percent of the time, while the physicians did so 59 percent of the time.

Why it matters: LLMs can generate fine-tuning data that improves their own performance. By drawing on relevant, factually correct medical information from the web, LLMs can generate realistic conversations at scale, even in a highly technical, high-stakes discipline like medicine and despite the risk of dangerous hallucinations. Used as fine-tuning data, this output enables LLMs to converse with humans more effectively.

We're thinking: AI promises to spread intelligence far and wide. As the authors acknowledge, further work remains to demonstrate this work’s efficacy, ethics, security, and regulatory compliance in a clinical setting. Yet it’s an exciting glimpse of a world in which medical intelligence is fast, cheap, and widely available.
