June 18, 2026

Medical AI scores high on exams but stumbles on real patient care, new benchmark finds

Key Takeaways

BRIDGE evaluates LLM comprehension of authentic clinical text from EHR notes, case reports, and patient-physician consultations rather than standardized, textbook-like exam prompts.
A large performance delta persists: top models can score ~92 on medical exams yet achieve only 44.8% on real-world clinical tasks, indicating limited practical language understanding.
Ninety-five LLMs were tested across 14 specialties on tasks including triage, information extraction, diagnosis, prognosis, and billing coding, with substantial inter-specialty variability.
Multilingual coverage across nine languages enables identification of failure modes affecting non-English care and supports development of more equitable clinical NLP systems.
A continuously updated public leaderboard compares 107 models on the same BRIDGE tasks to inform clinical adoption and track model improvements.

A new Mass General Brigham benchmark, BRIDGE, found the top-performing AI model struggled on tasks built from electronic health records and patient visits.

Large language models (LLMs) that score near the top of standardized medical exams perform far worse on the everyday language of patient care, according to a new benchmark from researchers at Mass General Brigham.

The benchmark, called BRIDGE, measures how accurately artificial intelligence (AI) models interpret real clinical text, including the notes physicians enter in electronic health records (EHRs). The findings were published in Nature Biomedical Engineering.

The two settings produced sharply different results. The highest-performing model scored as high as 92 on standardized medical exams but earned just 44.8% on BRIDGE, meaning even the strongest performer handled fewer than half of the benchmark's clinical tasks. The researchers said the drop reflects gaps in the models' grasp of the nuanced language used in health care settings.

"Unlike many existing medical AI benchmarks, BRIDGE focuses on real-world clinical data sources that better reflect the complexity of real-world care," said Jie Yang, Ph.D., FACMI, FAMIA, the study's senior author and a researcher in the division of pharmacoepidemiology and pharmacoeconomics at Mass General Brigham. "BRIDGE can help clinicians select the right AI tools while guiding developers in improving model performance."

How BRIDGE tests AI on real clinical work

Medical AI models have typically been judged on licensing exam questions, which rely on standardized phrasing and textbook knowledge. Yang and colleagues built BRIDGE to test models against the material clinicians actually work with: text pulled from EHRs, clinical case reports and patient-physician consultations.

Using the benchmark, the team evaluated 95 LLMs on real-world clinical tasks that span the patient care continuum. Those tasks covered 14 clinical specialties and included triage, information extraction, diagnosis, prognosis and billing coding. Performance varied widely from one specialty to the next.

BRIDGE is also multilingual, drawing on clinical text in nine languages. The researchers said that lets them pinpoint where models fall short for non-English-speaking patients and steer development toward more accurate and equitable tools.

The team, made up of Yang, co-senior author Joshua Lin, M.D., M.P.H., Sc.D., and co-first authors Jiageng Wu and Bowen Gu, also built a public leaderboard, updated on an ongoing basis, that now compares 107 models on the same clinical tasks.
