
ChatGPT Health missed half of medical emergencies in first independent safety test
Key Takeaways
- A structured Mount Sinai evaluation used 60 vignettes across 21 specialties, with urgency levels assigned by physicians using guidelines from 56 medical societies. Each vignette was tested under 16 contextual variations, producing 960 total ChatGPT Health interactions.
- Undertriage was concentrated in nuanced high-acuity states, with 52% of true emergencies downgraded to delayed outpatient care despite explanatory text sometimes recognizing red-flag features.
A Mount Sinai study found the consumer AI chatbot undertriaged 52% of cases that physicians agreed required emergency care.
Less than two months after
The answer, according to a study published February 23 in
Investigators at the
The study also flagged serious problems with the platform’s suicide-crisis safeguards. ChatGPT Health is designed to surface a banner directing users to the 988 Suicide and Crisis Lifeline when they describe thoughts of self-harm. But researchers found the alerts fired inconsistently, sometimes appearing in lower-risk conversations while failing to trigger when users described specific plans for self-harm.
How researchers tested ChatGPT Health
The team created 60 structured clinical scenarios spanning 21 medical specialties, ranging from minor conditions appropriate for home care to true medical emergencies. Three independent physicians determined the correct urgency level for each case using guidelines from 56 medical societies.
Each scenario was then tested under 16 different contextual conditions, including variations in patient race, sex, social dynamics and barriers to care, such as lack of health insurance or transportation. In total, the team conducted 960 interactions with ChatGPT Health.
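The arithmetic behind the 960 figure is a full crossing of scenarios and contexts: 60 × 16 = 960. A minimal sketch of that crossing follows. The factor names and the assumption of four two-level factors are hypothetical, chosen only because 2 × 2 × 2 × 2 matches the reported 16 conditions; the article does not describe the study's exact design.

```python
from itertools import product

# Hypothetical factors and levels. The article names the kinds of context that
# were varied but not the exact design; four binary factors are assumed here
# only because 2 * 2 * 2 * 2 = 16 matches the reported number of conditions.
CONTEXT_FACTORS = {
    "race": ["group_a", "group_b"],
    "sex": ["female", "male"],
    "social_dynamics": ["absent", "present"],
    "barrier_to_care": ["absent", "present"],
}


def build_contexts(factors):
    """Return every combination of factor levels as a list of dicts."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]


def build_interactions(n_scenarios, contexts):
    """Pair each scenario ID (1..n_scenarios) with every context."""
    return [
        (scenario_id, context)
        for scenario_id in range(1, n_scenarios + 1)
        for context in contexts
    ]


contexts = build_contexts(CONTEXT_FACTORS)
interactions = build_interactions(60, contexts)
print(len(contexts), len(interactions))  # prints "16 960"
```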
Performance followed an inverted U-shaped pattern, with the most dangerous failures concentrated at clinical extremes.
Among emergencies, the system undertriaged 52% of cases. On the other end of the spectrum, it overtriaged nearly 65% of nonurgent cases, recommending a physician visit when home care would have been sufficient. It performed best in the middle of the severity range.
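Undertriage and overtriage rates of this kind can be computed by comparing the tool's recommendation with the physician-assigned level on an ordinal scale. The sketch below is illustrative only: the four level names, the `triage_rates` function and the toy cases are assumptions made for the example, not the study's actual categories, code or data.

```python
# Hypothetical ordinal urgency scale; higher numbers mean more urgent care.
LEVELS = {"home_care": 0, "routine_visit": 1, "urgent_visit": 2, "emergency": 3}


def triage_rates(cases):
    """Compute undertriage among emergencies and overtriage among nonurgent cases.

    cases: list of (physician_level, tool_level) label pairs.
    """
    emergencies = [(gold, pred) for gold, pred in cases if gold == "emergency"]
    nonurgent = [(gold, pred) for gold, pred in cases if gold == "home_care"]

    # Undertriage: the tool recommends a lower level of care than physicians did.
    under = sum(LEVELS[pred] < LEVELS[gold] for gold, pred in emergencies)
    # Overtriage: the tool recommends a higher level of care than physicians did.
    over = sum(LEVELS[pred] > LEVELS[gold] for gold, pred in nonurgent)

    return {
        "undertriage_rate": under / len(emergencies) if emergencies else None,
        "overtriage_rate": over / len(nonurgent) if nonurgent else None,
    }


# Made-up toy cases for demonstration; these are NOT the study's data.
toy_cases = [
    ("emergency", "emergency"),
    ("emergency", "routine_visit"),
    ("emergency", "urgent_visit"),
    ("emergency", "emergency"),
    ("home_care", "routine_visit"),
    ("home_care", "home_care"),
    ("home_care", "urgent_visit"),
    ("home_care", "routine_visit"),
]

rates = triage_rates(toy_cases)
print(rates)  # {'undertriage_rate': 0.5, 'overtriage_rate': 0.75}
```

Restricting each rate to its own denominator (emergencies for undertriage, nonurgent cases for overtriage) is what makes figures like 52% and 65% directly comparable to the way the article reports them.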
“ChatGPT Health performed well in textbook emergencies, such as stroke or severe allergic reactions,” said study lead author Ashwin Ramaswamy, M.D., M.P.P., an instructor of urology at the Icahn School of Medicine at Mount Sinai. “But it struggled in more nuanced situations where the danger is not immediately obvious, and those are often the cases where clinical judgment matters most.”
Researchers noted that the large language model (LLM) often appeared to recognize danger in its own explanations but still reassured the patients. In one asthma scenario, the tool correctly identified early warning signs of respiratory failure in its written analysis, but advised waiting rather than seeking emergency treatment.
“Any doctor, and any person who’s gone through any degree of training, would say that that patient needs to go to the emergency [room],” Ramaswamy told
Suicide safeguard failures
The suicide-risk findings were among the study’s most alarming. Girish N. Nadkarni, M.D., M.P.H., the study’s senior author and chief AI officer of the Mount Sinai Health System, said the system’s crisis alerts appeared to be “inverted relative to clinical risk, appearing more reliably for lower-risk scenarios than for cases when someone shared how they intended to hurt themselves.”
He added, “In real life, when someone talks about exactly how they would harm themselves, that’s a sign of more immediate and serious danger, not less.”
The findings are even more concerning given OpenAI’s own disclosure that more than 1 million ChatGPT users each week send messages with explicit indicators of suicidal planning or intent, and an estimated 560,000 weekly users show possible signs of psychosis or mania.
OpenAI pushes back on methodology
A spokesperson for OpenAI said the company welcomed research on AI in health care but argued that the study did not reflect how ChatGPT Health is typically used or designed to function. The chatbot is built for multiturn conversations where patients can answer follow-up questions and provide additional context, the spokesperson said, rather than delivering a single triage recommendation from a single prompt.
The OpenAI spokesperson added that ChatGPT Health remains available to only a limited number of users and that the company is still working to improve the model’s safety and reliability before a broader rollout.
40 million people use ChatGPT every day for health advice
The study arrives against the backdrop of rapid consumer adoption. In January 2026, OpenAI’s data showed that more than 40 million people use ChatGPT for health-related questions every day, and roughly one in four of the platform’s 800-million-plus weekly users asks at least one health question in a given week.
Approximately seven in 10 of those health conversations happen outside normal clinic hours, and more than 580,000 health-related messages per week come from communities that qualify as “hospital deserts” — areas more than a 30-minute drive from a hospital.
ChatGPT Health, which
OpenAI has described it as a tool for “support, not diagnosis,” and emphasized that it is not intended for clinical decision-making.
For physicians, though, the distinction between support and decision-making can blur quickly when patients use the tool to decide whether to drive to an emergency room at 11 p.m.
What patients are saying about ChatGPT Health
In a recent conversation with Medical Economics on
“We should be worried, and we need to be acutely aware,” Aznavorian said. “Many times, patients will come into their physician or come into the emergency room and say, ‘Well,
Aznavorian noted that the tool has a real educational upside, helping patients prepare better questions for their physicians, but cautioned that symptoms can mimic one another in ways that require hands-on clinical assessment to sort out. “It might tell them their information is not a cardiac-related issue when it could be a cardiac-related issue,” she warned.
Amber Maraccini, Ph.D., vice president and head of health care and life sciences at Medallia, offered a more optimistic framing in an earlier
“AI, it’s not that it can make mistakes — it will make mistakes,” Maraccini said. “It’s not just a hypothetical, it’s a reality. And so, I think the best thing that we can do to empower patients is to educate them on those disclaimers. We need to make sure that we’re talking to patients about these tools that they’re ultimately going to be exploring, whether or not you’re promoting them.”
Maraccini said the real red flag for physicians is when a patient tells them it’s
Calls for independent evaluation
Isaac S. Kohane, M.D., Ph.D., chair of the Department of Biomedical Informatics at Harvard Medical School, who was not involved with the Mount Sinai study, said the findings underscore a systemic gap.
“LLMs have become patients’ first stop for medical advice — but in 2026, they are least safe at the clinical extremes, where judgment separates missed emergencies from needless alarm,” Kohane said. “When millions of people are using an AI system to decide whether they need emergency care, the stakes are extraordinarily high. Independent evaluation should be routine, not optional.”
John Mafi, M.D., M.P.H., an associate professor of medicine and primary care physician at UCLA David Geffen School of Medicine, echoed that sentiment. “The message of this study is that before you roll something like this out, to make life-affecting decisions, you need to rigorously test it in a controlled trial, where you’re making sure that the benefits outweigh the harms,” Mafi told
The Mount Sinai team plans to continue evaluating updated versions of ChatGPT Health and other consumer-facing AI tools, with future research expanding into pediatric care, medication safety and non-English-language use.
The study authors do not suggest that patients abandon AI health tools altogether. However, for worsening or concerning symptoms, patients should seek medical care directly, rather than relying solely on the guidance of AI chatbots. In cases involving thoughts of self-harm, individuals should contact the 988 Suicide and Crisis Lifeline or go to an emergency room.





