Commentary|Articles|April 16, 2026

Can AI think like a physician?

A new Mass General Brigham study tested 21 large language models on the full clinical workflow and found a striking gap between the AI's ability to reach a final diagnosis and its ability to reason through one.

When a patient walks into a physician's office with a vague complaint, the diagnostic process rarely starts with a clear answer. It starts with uncertainty — a list of possibilities, a set of questions, a series of judgment calls.

That early, open-ended reasoning is precisely what artificial intelligence (AI) still cannot do well, according to new research published in JAMA Network Open.

The study, led by researchers at Mass General Brigham's MESH Incubator, put 21 off-the-shelf large language models (LLMs), including the latest versions of ChatGPT, Claude, Gemini, Grok and DeepSeek, through a stepwise clinical workflow using 29 standardized patient vignettes.

Although the models arrived at a correct final diagnosis more than 90% of the time when given complete patient information, they failed to generate an appropriate differential diagnosis more than 80% of the time. The gap between those two numbers is, the researchers argue, the gap between a search engine and a clinician.

To better capture that distinction, the team developed a new benchmark called PrIME-LLM. It scores models across five domains of clinical reasoning (differential diagnosis, diagnostic testing, final diagnosis, management and miscellaneous reasoning) and penalizes uneven performance rather than masking it through averaging.

Medical Economics spoke with corresponding author Marc Succi, M.D., executive director of the MESH Incubator at Mass General Brigham, about what the findings mean for physicians who are already using these tools in their practices.

The following interview has been edited for length and clarity.

What question sparked this research?

This was really a follow-up to our original study in February 2023, which was the first study of its kind on generative artificial intelligence chatbots. We wanted to see whether today’s frontier large language models — the publicly available ones, which are far more likely to be used than vendor-specific models — can actually reason through a patient case across the full clinical workflow and not just answer isolated medical questions.

This matters because many evaluations still overstate readiness by focusing on narrow benchmarks and final answers, but they do not really account for how a medical visit unfolds in real life. You do not just need a final answer. You need something like a differential diagnosis to inform what information gets collected and how you get to that final answer. We thought the current popular way of evaluating these models was not a proper real-world test.

Can you briefly walk through the study design and explain the “PrIME-LLM” metric that was used?

PrIME-LLM is a summary metric we developed to capture balanced performance across the clinical reasoning domains. If you just average results, a model can be really weak in one important area and still look better than it actually is.

So we split performance into domains like differential diagnosis, diagnostic testing, final diagnosis and management. The average can hide models that are really weak in one section. But the PrIME-LLM metric works off the area under a shape, so it rewards balanced, consistent performance and penalizes uneven performance.

In health care, it is all about predictability and consistency. So we think the methodology we developed is how these models should be evaluated.
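To make the "area under a shape" idea concrete, here is a minimal Python sketch. It is not the study's published formula, which this interview does not spell out. It assumes the shape is a radar-chart polygon with one axis per domain and each domain score scaled from 0 to 1. The function name `radar_area_score` and all of the scores are hypothetical, chosen only for illustration.

```python
import math
from statistics import mean


def radar_area_score(scores):
    """Area of the radar polygon traced by `scores`, as a fraction of the
    area reached when every domain scores 1.0.

    Assumption: this is one plausible reading of "area under a shape",
    not the PrIME-LLM formula itself.
    """
    n = len(scores)
    if n < 3:
        raise ValueError("a polygon needs at least three axes")
    if any(s < 0 or s > 1 for s in scores):
        raise ValueError("scores must be between 0 and 1")
    # Each pair of neighboring axes spans a triangle of area
    # 0.5 * a * b * sin(2*pi/n); the polygon is the sum of those triangles.
    wedge = 0.5 * math.sin(2 * math.pi / n)
    area = wedge * sum(scores[i] * scores[(i + 1) % n] for i in range(n))
    max_area = wedge * n  # every score equal to 1.0
    return area / max_area


# Hypothetical scores: two models with the same plain average.
balanced = {
    "differential diagnosis": 0.7,
    "diagnostic testing": 0.7,
    "final diagnosis": 0.7,
    "management": 0.7,
    "miscellaneous": 0.7,
}
uneven = {
    "differential diagnosis": 0.1,  # weak early-stage reasoning
    "diagnostic testing": 0.9,
    "final diagnosis": 0.9,
    "management": 0.7,
    "miscellaneous": 0.9,
}

for name, model in (("balanced", balanced), ("uneven", uneven)):
    values = list(model.values())
    print(name, round(mean(values), 3), round(radar_area_score(values), 3))
```

Both hypothetical models average 0.7, but the balanced one scores about 0.49 on the area measure and the uneven one about 0.45, so the weak differential-diagnosis axis is no longer hidden. One caveat of this particular sketch: a radar polygon's area depends on the order in which the axes are arranged, so a real metric would need to fix that order or otherwise account for it.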

What did the study find?

The main finding is that there is a big gap between final diagnosis and early-stage reasoning. The models did well on final diagnosis and fairly well on patient management, but differential diagnosis was clearly where they were weakest. Reasoning models were better, but even the strongest models were still poor in differential diagnosis specifically.

That matters because differential diagnosis is really the art of medicine. That is when you have limited, uncertain information — maybe just a one-line complaint from a patient — and you have to put together a list of possible diagnoses. Those possibilities inform what tests you order, how fast the patient gets to a final diagnosis and the cost of the visit.

So yes, a model may get to the right final diagnosis. But at what cost? Did it order too many tests? Did it delay care for a stroke patient by an extra hour because it was working through differentials that should not have been there? That is why the gap between differential diagnosis and final diagnosis is so important.

Was there anything in the results that surprised you?

I think what stood out was just how large the gap was between the failure rate on differential diagnosis and the performance on final diagnosis.

And compared with our original study, which used the same set of questions back in February 2023, differential diagnosis was also by far the worst-performing section there. Three years later, there really has not been a lot of performance improvement in differential diagnosis. The models are a little better overall, obviously, but they still cannot reason with limited information the way a doctor can.

You tested 21 different models, including familiar names like ChatGPT, Claude, Gemini and Grok. Were there meaningful differences between them?

They were similar, to be honest.

The reasoning models from each of those families did better than the non-reasoning ones. Grok and ChatGPT tended to be at the top when averaged across the whole family, but overall performance among the latest models was fairly similar.

Physicians are already using these tools. Based on this research, what should they be comfortable using AI for right now, and what should give them pause?

I would be comfortable using artificial intelligence for low-risk, high-feasibility tasks. That includes things like ambient documentation, visit notes, summarization, patient-friendly explanations and billing.

If you get those wrong, that is not great, and maybe you lose some money, but a patient does not suffer.

That is where I think artificial intelligence is good right now, and where it is genuinely helping. But once you start moving into higher-risk uses — clinical decision support, responding to patient messages, ordering lab tests, renewing medications, psychiatric medications — that is where you need to stop and look at the level of performance very critically and in multiple ways, not just trust what the vendors say.

So for me, low-risk, high-feasibility is where I like to see artificial intelligence right now.

A lot of people talk about keeping a human in the loop. But in practice, especially after a long day, that review can become more of a formality. How do you keep that from happening?

I don’t know that you can ensure it never happens. The human cannot just rubber-stamp the output, but the real issue is liability. How much risk is that physician willing to take?

You already see this in medicine when physicians sign off on nurse practitioner notes or approve resident documentation. Some are going to be more critical and spend more time reviewing than others. So I think this comes down to individual risk tolerance.

That said, physicians have to actively interrogate and critically appraise the output of artificial intelligence. For ambient documentation, if it ever becomes the standard, I think the tool should have to explain itself, not just hand over an output. That could help reduce some of the cognitive overload on physicians. There may be small design choices like that that help. But at the end of the day, I think it comes back to how much liability risk the doctor is willing to take.

Most of our audience is in primary care. If there is one takeaway they should keep from this study, what is it?

Do not mistake a final answer or a final diagnosis from one of these models for reliable, safe clinical reasoning.

Be aware that the gap between the initial presentation and the final diagnosis is where uncertainty lives, and that is also where the highest risk for error and safety problems exists. So if you are using artificial intelligence in those higher-risk areas — differential diagnosis and similar tasks — you have to be especially critical.

And this study looked at off-the-shelf models. Even so, off-the-shelf models are by far what patients are most likely to use, and what many physicians are most likely to have access to. Most doctors, especially most primary care doctors, do not have the infrastructure to deploy the latest narrow medical reasoning models. They are far more likely to be using ChatGPT or Claude, so I think these findings apply broadly.

Is there anything else from the study you think is important to keep in mind?

It’s important to remember that this study reflects baseline reasoning ability. It is not assessing fine-tuned medical models. And the reason we did that is because of the usability problem. Inevitably, someone will ask, “Why didn’t you use this extremely narrow, fine-tuned medical model?” But that is not what most people are actually using.
