
Can AI think like a physician?
A new Mass General Brigham study tested 21 large language models on the full clinical workflow and found a striking gap between the AI's ability to reach a final diagnosis and its ability to reason through one.
When a patient walks into a physician's office with a vague complaint, the diagnostic process rarely starts with a clear answer. It starts with uncertainty — a list of possibilities, a set of questions, a series of judgment calls.
That early, open-ended reasoning is precisely what
The study, led by researchers at Mass General Brigham's MESH Incubator, put 21 off-the-shelf large language models (LLMs), including the latest versions of
Although the models arrived at a correct final diagnosis more than 90% of the time when given complete patient information, they failed to generate an appropriate differential diagnosis more than 80% of the time. The gap between those two numbers is, the researchers argue, the gap between a search engine and a clinician.
To better capture that distinction, the team developed a new benchmark called PrIME-LLM, which scores models across five domains of clinical reasoning (differential diagnosis, diagnostic testing, final diagnosis, management and miscellaneous reasoning) and penalizes uneven performance rather than masking it through averaging.
Medical Economics spoke with corresponding author Marc Succi, M.D., executive director of the MESH Incubator at Mass General Brigham, about what the findings mean for physicians who are already using these tools in their practices.
The following interview has been edited for length and clarity.
What question sparked this research?
This was really a follow-up to our
This matters because many evaluations still overstate readiness by focusing on narrow benchmarks and final answers, but they do not really account for how a medical visit unfolds in real life. You do not just need a final answer. You need something like a differential diagnosis to inform what information gets collected and how you get to that final answer. We thought the current popular way of evaluating these models was not a proper real-world test.
Can you briefly walk through the study design and explain the “PrIME-LLM” metric that was used?
PrIME-LLM is a summary metric we developed to capture balanced performance across the clinical reasoning domains. If you just average results, a model can be really weak in one important area and still look better than it actually is.
So we split performance into domains like differential diagnosis, diagnostic testing, final diagnosis and management. The average can hide models that are really weak in one section. But the PrIME-LLM metric works off the area under a shape, so it rewards balanced, consistent performance and penalizes uneven performance.
In health care, it is all about predictability and consistency. So we think the methodology we developed is how these models should be evaluated.
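To make the averaging problem concrete, here is a minimal sketch of an area-based score. The interview does not give the PrIME-LLM formula, so this is not the study's method: it assumes the "shape" is a radar-chart polygon with one axis per domain, and the function names and the scores are hypothetical, chosen only for illustration.

```python
import math


def mean_score(scores):
    """Plain average of per-domain scores, each in [0, 1]."""
    return sum(scores) / len(scores)


def radar_area_score(scores):
    """Area of the radar-chart polygon, normalized so a perfect model scores 1.

    ASSUMPTION: this is an illustrative stand-in, not the published
    PrIME-LLM formula. With n equally spaced axes, the polygon area is
    0.5 * sin(2*pi/n) * sum(r_i * r_{i+1}); dividing by the area of the
    all-ones polygon leaves sum(r_i * r_{i+1}) / n.
    """
    n = len(scores)
    if n < 3:
        raise ValueError("a polygon needs at least three axes")
    if any(not 0.0 <= s <= 1.0 for s in scores):
        raise ValueError("scores must lie in [0, 1]")
    return sum(scores[i] * scores[(i + 1) % n] for i in range(n)) / n


# Hypothetical profiles, not data from the study. Domain order:
# differential diagnosis, diagnostic testing, final diagnosis,
# management, miscellaneous.
balanced = [0.7, 0.7, 0.7, 0.7, 0.7]
uneven = [0.1, 0.9, 0.9, 0.8, 0.8]  # weak only on differential diagnosis

# Both profiles have the same average, so the mean cannot tell them apart.
print(math.isclose(mean_score(balanced), mean_score(uneven)))

# The area-based score ranks the uneven profile lower.
print(radar_area_score(uneven) < radar_area_score(balanced))
```

In this sketch the two profiles share the same average, but the one with a single weak domain encloses a smaller area. One caveat of this particular construction: the area depends on the order of the axes, so it should be read as an illustration of the idea rather than a validated metric.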
What did the study find?
The main finding is that there is a big gap between final diagnosis and early-stage reasoning. The models did well on final diagnosis and fairly well on patient management, but differential diagnosis was clearly where they were weakest. Reasoning models were better, but even the strongest models were still poor in differential diagnosis specifically.
That matters because differential diagnosis is really the art of medicine. That is when you have limited, uncertain information — maybe just a one-line complaint from a patient — and you have to put together a list of possible diagnoses. Those possibilities inform what tests you order, how fast the patient gets to a final diagnosis and the cost of the visit.
So yes, a model may get to the right final diagnosis. But at what cost? Did it order too many tests? Did it delay care for a stroke patient by an extra hour because it was working through differentials that should not have been there? That is why the gap between differential diagnosis and final diagnosis is so important.
Was there anything in the results that surprised you?
I think what stood out was just how large the gap was between the failure rate on differential diagnosis and the performance on final diagnosis.
And in our original study, which used the same set of questions back in February 2023, differential diagnosis was also by far the worst-performing section. Three years later, there really has not been a lot of performance improvement in differential diagnosis. The models are a little better overall, obviously, but they still cannot reason with limited information the way a doctor can.
You tested 21 different models, including familiar names like ChatGPT, Claude, Gemini and Grok. Were there meaningful differences between them?
They were similar, to be honest.
The reasoning models from each of those families did better than the non-reasoning ones. Grok and ChatGPT tended to be at the top when averaged across the whole family, but overall performance among the latest models was fairly similar.
Physicians are already using these tools. Based on this research, what should they be comfortable using AI for right now, and what should give them pause?
I would be comfortable using artificial intelligence for low-risk, high-feasibility tasks. That includes things like ambient documentation, visit notes, summarization, patient-friendly explanations and billing.
If you get those wrong, that is not great, and maybe you lose some money, but a patient does not suffer.
That is where I think artificial intelligence is good right now, and where it is genuinely helping. But once you start moving into higher-risk uses — clinical decision support, responding to patient messages, ordering lab tests, renewing medications, psychiatric medications — that is where you need to stop and look at the level of performance very critically and in multiple ways, not just trust what the vendors say.
So for me, low-risk, high-feasibility is where I like to see artificial intelligence right now.
A lot of people talk about keeping a human in the loop. But in practice, especially after a long day, that review can become more of a formality. How do you keep that from happening?
I don’t know that you can ensure it never happens. The human cannot just rubber-stamp the output, but the real issue is liability. How much risk is that physician willing to take?
You already see this in medicine when physicians sign off on nurse practitioner notes or approve resident documentation. Some are going to be more critical and spend more time reviewing than others. So I think this comes down to individual risk tolerance.
That said, physicians have to actively interrogate and critically appraise the output of artificial intelligence. For ambient documentation, if it ever becomes the standard, I think the tool should have to explain itself, not just hand over an output. That could help reduce some of the cognitive overload on physicians. There may be small design choices like that that help. But at the end of the day, I think it comes back to how much liability risk the doctor is willing to take.
Most of our audience is in primary care. If there is one takeaway they should keep from this study, what is it?
Do not mistake a final answer or a final diagnosis from one of these models for reliable, safe clinical reasoning.
Be aware that the gap between the initial presentation and the final diagnosis is where uncertainty lives, and that is also where the highest risk for error and safety problems exists. So if you are using artificial intelligence in those higher-risk areas — differential diagnosis and similar tasks — you have to be especially critical.
And this study looked at off-the-shelf models. Even then, off-the-shelf models are by far what
Is there anything else from the study you think is important to keep in mind?
It’s important to remember that this study reflects baseline reasoning ability. It is not assessing fine-tuned medical models. And we did that because of the usability problem. Inevitably, someone will ask, “Why didn’t you use this extremely narrow, fine-tuned medical model?” But that is not what most people are actually using.






