June 18, 2026

Medical AI scores high on exams but stumbles on real patient care, new benchmark finds

Fact checked by: Keith A. Reynolds

Key Takeaways

  • BRIDGE evaluates LLM comprehension of authentic clinical text from EHR notes, case reports, and patient-physician consultations rather than standardized, textbook-like exam prompts.
  • A large performance gap persists: top models can score about 92 on medical exams yet reach only 44.8% on real-world clinical tasks, indicating limited practical language understanding.

A new Mass General Brigham benchmark, BRIDGE, found the top-performing AI model struggled on tasks built from electronic health records and patient visits.

Large language models (LLMs) that score near the top of standardized medical exams perform far worse on the everyday language of patient care, according to a new benchmark from researchers at Mass General Brigham.

The benchmark, called BRIDGE, measures how accurately artificial intelligence (AI) models interpret real clinical text, including the notes physicians enter in electronic health records (EHRs). The findings were published in Nature Biomedical Engineering.

There was a notable gap between the two settings. The highest-performing model scored as high as 92 on standardized medical exams but earned just 44.8% on BRIDGE, meaning even the strongest performer handled fewer than half of the benchmark's clinical tasks. The researchers said the drop reflects gaps in the models' grasp of the nuanced language used in health care settings.

"Unlike many existing medical AI benchmarks, BRIDGE focuses on real-world clinical data sources that better reflect the complexity of real-world care," said Jie Yang, Ph.D., FACMI, FAMIA, the study's senior author and a researcher in the division of pharmacoepidemiology and pharmacoeconomics at Mass General Brigham. "BRIDGE can help clinicians select the right AI tools while guiding developers in improving model performance."

How BRIDGE tests AI on real clinical work

Medical AI models have typically been judged on licensing exam questions, which rely on standardized phrasing and textbook knowledge. Yang and colleagues built BRIDGE to test models against the material clinicians actually work with: text pulled from EHRs, clinical case reports and patient-physician consultations.

Using the benchmark, the team evaluated 95 LLMs on real-world clinical tasks that span the patient care continuum. Those tasks covered 14 clinical specialties and included triage, information extraction, diagnosis, prognosis and billing coding. Performance varied widely from one specialty to the next.

BRIDGE is also multilingual, drawing on clinical text in nine languages. The researchers said that lets them pinpoint where models fall short for non-English-speaking patients and steer development toward more accurate and equitable tools.

The team, made up of Yang, co-senior author Joshua Lin, M.D., M.P.H., Sc.D., and co-first authors Jiageng Wu and Bowen Gu, also built a public leaderboard that is updated on an ongoing basis and now compares 107 models on the same clinical tasks.