Landmark test of clinical reasoning finds AI outperformed physicians

Harvard Medical School | 04-30-2026

In one of the largest studies to compare artificial intelligence with physicians across a wide array of clinical reasoning tasks, including tasks drawn from real emergency department data, a team of physicians and computer scientists at Harvard Medical School and Beth Israel Deaconess Medical Center evaluated whether an AI system could do what physicians do every day: review a messy patient chart and use that information to determine a diagnosis and next steps.

In a new study published April 30, 2026, in Science, co-senior authors Arjun (Raj) Manrai, assistant professor of biomedical informatics at HMS, and Adam Rodman, MD, MPH, a hospitalist and clinical researcher at BIDMC, and their team report that a large language model (LLM) outperformed physicians across many common clinical reasoning tasks, including emergency room decisions, identifying likely diagnoses, and choosing next steps in management.

The LLM’s performance indicated that long‑standing ways of testing medical AI may no longer capture current systems’ performance, pointing to a possible turning point for the field.

“We tested the AI model against virtually every benchmark, and it eclipsed both prior models and our physician baselines,” said co-senior author Manrai. “However, this does not mean AI will necessarily improve care—how and where it should be deployed remain understudied, and we desperately need rigorous prospective trials to evaluate the impact of AI on clinical practice.”

“Models are increasingly capable,” said Peter Brodeur, MD, MA, the study’s co‑first author. “We used to evaluate models with multiple-choice tests; now they are consistently scoring close to 100 percent and we can’t track progress anymore because we’re already at the ceiling.”

Incorporating standards first created in the 1950s to train and evaluate doctors, the researchers compared how an AI system performed against hundreds of clinicians. The comparisons included case study diagnostic challenges, reasoning exercises, and real emergency department cases.

In one of their experiments, the investigators tasked the LLM with evaluating patients at various points in a standard emergency department setting, ranging from early triage to later admission decisions. At each stage, the model was given only the information available at that point — drawn directly from real‑world electronic health records — and asked to generate likely diagnoses and suggest what should happen next.

“To better understand real-world performance, we needed to test performance early in the patient course, when clinical data is sparse,” said co-first author Thomas Buckley, a doctoral student in the Harvard Kenneth C. Griffin Graduate School of Arts and Sciences, a Dunleavy Fellow in HMS’ AI in Medicine PhD program, and a member of Manrai’s lab.

Unlike many prior studies, the team did not smooth out the messiness of real‑world care before testing the AI. The emergency department cases were presented exactly as they appeared in the electronic health record. “We didn’t pre‑process the data at all,” Rodman said. “The model is literally just processing data as it exists in the health record.”

At the early decision points in the real-world emergency department cases, the model matched or exceeded attending physicians in diagnostic accuracy.

That result surprised even the researchers.

“I thought it was going to be a fun experiment but that it wouldn’t work that well,” Rodman said. “That was not at all what happened.”

The results make the case that medical AI is ready to be studied the same way as all new medical interventions — through carefully controlled clinical trials in real care settings. The researchers are clear that their results do not suggest that AI systems are ready to practice medicine autonomously, or that physicians can be removed from the diagnostic process.

“A model might get the top diagnosis right but also suggest unnecessary testing that could expose a patient to harm,” said Brodeur. “Humans should be the ultimate baseline when it comes to evaluating performance and safety.”


Source:
Materials provided by Harvard Medical School. Content may be edited for clarity, style, and length. For more details, including the full list of authors and their affiliations, please consult the journal article.
