News|Articles|March 6, 2026

ChatGPT Health missed half of medical emergencies in first independent safety test

Fact checked by: Keith A. Reynolds, Cheney Gazzam Baltz

Key Takeaways

A structured Mount Sinai evaluation used 60 vignettes across 21 specialties, physician-derived urgency levels from 56 society guidelines and 16 context permutations, producing 960 total ChatGPT Health interactions.
Undertriage was concentrated in nuanced high-acuity states, with 52% of true emergencies downgraded to delayed outpatient care despite explanatory text sometimes recognizing red-flag features.
Overtriage affected nearly 65% of nonurgent cases, recommending clinician visits when home care was appropriate, whereas midacuity scenarios showed comparatively better alignment with physician consensus.
Suicide-risk safeguards misfired, with 988 prompts appearing more reliably in lower-risk chats and sometimes failing to trigger when users described specific self-harm plans.
High-volume real-world use (tens of millions of daily health queries, many after hours and in hospital deserts) heightens risk and strengthens demands for controlled trials and ongoing independent surveillance.

A Mount Sinai study found the consumer AI chatbot undertriaged 52% of cases that physicians agreed required emergency care.

Less than two months after OpenAI launched ChatGPT Health — a dedicated consumer health tool that invites patients to sync medical records and ask questions about their care — researchers have delivered the first independent verdict on whether the artificial intelligence (AI) platform can safely help people decide when to go to the emergency room.

The answer, according to a study published February 23 in Nature Medicine, is that it cannot do so reliably.

Investigators at the Icahn School of Medicine at Mount Sinai found that ChatGPT Health undertriaged 52% of cases that three independent physicians agreed required emergency treatment. The tool correctly handled textbook emergencies such as stroke and anaphylaxis, but had difficulty with more ambiguous presentations, including diabetic ketoacidosis and impending respiratory failure, where it advised patients to see a physician within 24 to 48 hours instead of going to the emergency room.

The study also flagged serious problems with the platform’s suicide-crisis safeguards. ChatGPT Health is designed to surface a banner directing users to the 988 Suicide and Crisis Lifeline when they describe thoughts of self-harm. But researchers found the alerts fired inconsistently, sometimes appearing in lower-risk conversations while failing to trigger when users described specific plans for self-harm.

How researchers tested ChatGPT Health

The team created 60 structured clinical scenarios spanning 21 medical specialties, ranging from minor conditions appropriate for home care to true medical emergencies. Three independent physicians determined the correct urgency level for each case using guidelines from 56 medical societies.

Each scenario was then tested under 16 different contextual conditions, including variations in patient race, sex, social dynamics and barriers to care, such as lack of health insurance or transportation. In total, the team conducted 960 interactions with ChatGPT Health.
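
The 960 total follows directly from crossing every scenario with every contextual condition. As a minimal sketch of that arithmetic, the Python below builds the full grid. The identifiers (`scenario_01`, `condition_01` and so on) are hypothetical placeholders, not the study's actual labels, and the code makes no claim about how the researchers constructed their 16 conditions.

```python
from itertools import product

N_SCENARIOS = 60   # clinical scenarios reported in the article
N_CONDITIONS = 16  # contextual conditions reported in the article

# Hypothetical placeholder IDs; the study's real labels are not given here.
scenarios = [f"scenario_{i:02d}" for i in range(1, N_SCENARIOS + 1)]
conditions = [f"condition_{j:02d}" for j in range(1, N_CONDITIONS + 1)]

# Every scenario is paired with every condition exactly once.
interactions = list(product(scenarios, conditions))

print(len(interactions))  # 60 * 16 = 960
```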

Performance followed an inverted U-shaped pattern, with the most dangerous failures concentrated at clinical extremes.

Among emergencies, the system undertriaged 52% of cases. On the other end of the spectrum, it overtriaged nearly 65% of nonurgent cases, recommending a physician visit when home care would have been sufficient. It performed best in the middle of the severity range.
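
To make the terms concrete, the sketch below shows one way undertriage and overtriage could be scored against a physician reference. It assumes a simplified three-level ordinal scale (`HOME_CARE`, `CLINICIAN_VISIT`, `EMERGENCY`) chosen for illustration; the study's actual urgency categories and scoring code are not described in this article, and the sample pairs are invented toy data, not study results.

```python
from enum import IntEnum


class Urgency(IntEnum):
    """Hypothetical ordinal urgency scale (an assumption, not the study's)."""
    HOME_CARE = 0
    CLINICIAN_VISIT = 1
    EMERGENCY = 2


def classify(reference: Urgency, recommended: Urgency) -> str:
    """Compare a tool's recommendation with the physician reference level."""
    if recommended < reference:
        return "undertriage"   # tool advised less urgent care than needed
    if recommended > reference:
        return "overtriage"    # tool advised more urgent care than needed
    return "concordant"


def error_rate(pairs, level: Urgency, error: str) -> float:
    """Share of cases at a given reference level that show the given error."""
    outcomes = [classify(ref, rec) for ref, rec in pairs if ref == level]
    if not outcomes:
        return 0.0
    return outcomes.count(error) / len(outcomes)


# Invented toy data as (reference, recommended) pairs; NOT study results.
toy_pairs = [
    (Urgency.EMERGENCY, Urgency.EMERGENCY),
    (Urgency.EMERGENCY, Urgency.CLINICIAN_VISIT),
    (Urgency.HOME_CARE, Urgency.CLINICIAN_VISIT),
    (Urgency.HOME_CARE, Urgency.HOME_CARE),
    (Urgency.CLINICIAN_VISIT, Urgency.CLINICIAN_VISIT),
]

print(error_rate(toy_pairs, Urgency.EMERGENCY, "undertriage"))
print(error_rate(toy_pairs, Urgency.HOME_CARE, "overtriage"))
```

On a scale like this, undertriage can only occur above the lowest level and overtriage only below the highest, which is why the two error rates in the study are reported for opposite ends of the severity range.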

“ChatGPT Health performed well in textbook emergencies, such as stroke or severe allergic reactions,” said study lead author Ashwin Ramaswamy, M.D., M.P.P., an instructor of urology at the Icahn School of Medicine at Mount Sinai. “But it struggled in more nuanced situations where the danger is not immediately obvious, and those are often the cases where clinical judgment matters most.”

Researchers noted that the large language model (LLM) often appeared to recognize danger in its own explanations but still reassured patients. In one asthma scenario, the tool correctly identified early warning signs of respiratory failure in its written analysis, but advised waiting rather than seeking emergency treatment.

“Any doctor, and any person who’s gone through any degree of training, would say that that patient needs to go to the emergency [room],” Ramaswamy told NBC News.

Suicide safeguard failures

The suicide-risk findings were among the study’s most alarming. Girish N. Nadkarni, M.D., M.P.H., the study’s senior author and chief AI officer of the Mount Sinai Health System, said the system’s crisis alerts appeared to be “inverted relative to clinical risk, appearing more reliably for lower-risk scenarios than for cases when someone shared how they intended to hurt themselves.”

He added, “In real life, when someone talks about exactly how they would harm themselves, that’s a sign of more immediate and serious danger, not less.”

The findings are even more concerning given OpenAI’s own disclosure that more than 1 million ChatGPT users each week send messages with explicit indicators of suicidal planning or intent, and an estimated 560,000 weekly users show possible signs of psychosis or mania.

OpenAI pushes back on methodology

A spokesperson for OpenAI said the company welcomed research on AI in health care but argued that the study did not reflect how ChatGPT Health is typically used or designed to function. The chatbot is built for multiturn conversations where patients can answer follow-up questions and provide additional context, the spokesperson said, rather than delivering a single triage recommendation from a single prompt.

The OpenAI spokesperson added that ChatGPT Health remains available to only a limited number of users and that the company is still working to improve the model’s safety and reliability before a broader rollout.

40 million people use ChatGPT every day for health advice

The study arrives against the backdrop of rapid consumer adoption. In January 2026, OpenAI’s data showed that more than 40 million people use ChatGPT for health-related questions every day, and roughly one in four of the platform’s 800-million-plus weekly users asks at least one health question in a given week.

What patients are saying about ChatGPT Health

In a recent conversation with Medical Economics on an episode of “Off the Chart: A Business of Medicine Podcast,” Rosemarie Aznavorian, D.N.P., RN, executive vice president and chief clinical officer at MedPro Healthcare Staffing, said clinicians should take patient use of these tools seriously and meet it without judgment.

“We should be worried, and we need to be acutely aware,” Aznavorian said. “Many times, patients will come into their physician or come into the emergency room and say, ‘Well, I checked ChatGPT, and I’ve done X, Y and Z, and this is what it’s telling me it is.’ And if the diagnosis is actually different, it can sometimes cause dismay [in] the patient, because they’re going to be receiving care that they weren’t expecting to receive.”

Aznavorian noted that the tool has a real educational upside by helping patients prepare better questions for their physicians but cautioned that symptoms can mimic one another in ways that require hands-on clinical assessment to sort out. “It might tell them their information is not a cardiac-related issue when it could be a cardiac-related issue,” she warned.

Amber Maraccini, Ph.D., vice president and head of health care and life sciences at Medallia, offered a more optimistic framing in an earlier episode of “Off the Chart.” She described ChatGPT Health as a tool that, when implemented thoughtfully, can help patients arrive at appointments with better questions and less anxiety. But she also drew a clear line.

“AI, it’s not that it can make mistakes — it will make mistakes,” Maraccini said. “It’s not just a hypothetical, it’s a reality. And so, I think the best thing that we can do to empower patients is to educate them on those disclaimers. We need to make sure that we’re talking to patients about these tools that they’re ultimately going to be exploring, whether or not you’re promoting them.”

Maraccini said the real red flag for physicians is when a patient tells them it’s easier to interact with an AI than with their doctor. “To me, that’s more of a red flag on the relationship and the communication skills with the provider,” she said. “As we see patients leaning more into technology and AI, it almost heightens and increases the expectation for clinicians to lean into the human connection point of health care.”

Calls for independent evaluation

Isaac S. Kohane, M.D., Ph.D., chair of the Department of Biomedical Informatics at Harvard Medical School, who was not involved with the Mount Sinai study, said the findings underscore a systemic gap.

“LLMs have become patients’ first stop for medical advice — but in 2026, they are least safe at the clinical extremes, where judgment separates missed emergencies from needless alarm,” Kohane said. “When millions of people are using an AI system to decide whether they need emergency care, the stakes are extraordinarily high. Independent evaluation should be routine, not optional.”

John Mafi, M.D., M.P.H., an associate professor of medicine and primary care physician at UCLA David Geffen School of Medicine, echoed that sentiment. “The message of this study is that before you roll something like this out, to make life-affecting decisions, you need to rigorously test it in a controlled trial, where you’re making sure that the benefits outweigh the harms,” Mafi told NBC News.

The Mount Sinai team plans to continue evaluating updated versions of ChatGPT Health and other consumer-facing AI tools, with future research expanding into pediatric care, medication safety and non-English-language use.

The study authors do not suggest that patients abandon AI health tools altogether. However, for worsening or concerning symptoms, patients should seek medical care directly, rather than relying solely on the guidance of AI chatbots. In cases involving thoughts of self-harm, individuals should contact the 988 Suicide and Crisis Lifeline or go to an emergency room.
