News
Article
Researchers at Mount Sinai say simple prompts can reduce dangerous hallucinations from artificial intelligence models, but even the best safeguards don’t eliminate the risk.
© Antony Weerut - stock.adobe.com
Large language models (LLMs) used in clinical settings are alarmingly prone to repeating and elaborating on false medical information, according to a new study published August 2 in Communications Medicine.
The multi-modal analysis, conducted by researchers at the Icahn School of Medicine at Mount Sinai, found that artificial intelligence (AI) chatbots hallucinated fabricated diseases, lab values and clinical signs in up to 83% of simulated cases when no safety measures were in place.
The study tested six popular LLMs against 300 physician-designed vignettes, each containing a single false medical detail. Without any safeguards, the models not only accepted the fake information, but often proceeded to expand on it, producing confident explanations for non-existent conditions.
“What we saw across the board is that AI chatbots can be easily misled by false medical details, whether those errors are intentional or accidental,” said Mahmud Omar, M.D., lead author of the study and an independent consultant with the research team. “They not only repeated the misinformation but often expanded on it, offering confident explanations for non-existent conditions.”
Researchers embedded fictional terms into clinical scenarios — imaginary syndromes or fabricated lab tests — and asked the chatbots to interpret them. Under default settings, hallucination rates ranged from 50% to 82.7% across the six models.
The worse performer, a model known as Distilled-DeepSeek, hallucinated in more than 80% of cases. GPT-4o, OpenAI’s flagship model, performed the best, with a 53% hallucination rate under default conditions. Even so, that figure dropped to just 23% when researchers added a simple mitigation prompt — a one-line caution reminding the model that the input could contain inaccuracies.
“Even a single made-up term could trigger a detailed, decisive response based entirely on fiction,” said Eyal Klang, M.D., co-corresponding author and chief of generative AI in the Windreich Department of Artificial Intelligence and Human Health at the Icahn School of Medicine at Mount Sinai. “But we also found that the simple, well-timed safety reminder built into the prompt made an important difference, cutting those errors nearly in half.”
Changing model settings like lowering the temperature — a term that refers to how creative or cautious the AI’s responses should be — had no real impact on the results. Lowering a model’s temperature is often thought to make the model less likely to speculate, but in this case, it did little to reduce the spread of false and fabricated information.
The team defined a hallucination as any response that endorsed, elaborated on or treated the fictional detail as valid medical information. In contrast, a non-hallucinated answer was one that expressed uncertainty, flagged the input as potentially incorrect or avoided referencing the fake element altogether.
Prompt engineering, that is, crafting more precise or cautious instructions to steer AI responses toward safer and more accurate outputs, emerged as the most effective safeguard. Across all models, the hallucination rate dropped from a baseline average of 66% to 44% with the use of the mitigation prompt.
This study raises major safety concerns as AI tools become increasingly integrated into health care.
A single erroneous prompt — even if the result of a typo, copy-forward error or misheard symptom — could lead to convincingly incorrect outputs. The study’s authors argue that current AI systems lack the built-in skepticism required for clinical use.
“Our study shines a light on a blind spot in how current AI tools handle misinformation, especially in health care,” said Girish N. Nadkarni, M.D., M.P.H., co-corresponding senior author, chair of the Windreich Department of Artificial Intelligence and Human Health. “It underscores a critical vulnerability in how today’s AI systems deal with misinformation in health settings. A single misleading phrase can prompt a confident yet entirely wrong answer.”
The researchers plan to build on this study by testing models against real patient records and developing more robust and comprehensive safeguards. They also recommend using their fake-term method as a low-cost stress test before deploying AI tools in clinical settings.
“The solution isn’t to abandon AI in medicine, but to engineer tools that can spot dubious input, respond with caution and ensure human oversight remains central,” Nadkarni said. “We’re not there yet, but with deliberate safety measures, it’s an achievable goal.”
Stay informed and empowered with Medical Economics enewsletter, delivering expert insights, financial strategies, practice management tips and technology trends — tailored for today’s physicians.