Even a small typo can throw off AI medical advice, MIT study says

Key Takeaways

  • AI models in healthcare are sensitive to grammar, formatting, and tone, impacting treatment recommendations and potentially disadvantaging vulnerable groups.
  • LLMs showed a 7-9% increase in recommending self-management over medical care when patient messages were stylistically altered.

MIT researchers find that large language models may shortchange women and vulnerable patients based on how clinical inquiries are typed.

Artificial intelligence (AI) models used to help triage patient messages may be far more sensitive to grammar, formatting and tone than previously believed, with disproportionate impacts on women and other vulnerable groups, a new Massachusetts Institute of Technology (MIT) study suggests.

The findings raise new concerns about fairness, safety and clinical oversight as large language models (LLMs) like OpenAI’s GPT-4 are deployed in clinical settings to help determine whether a patient should self-manage, come in for a visit or receive additional resources.

“[This] is strong evidence that models must be audited before use in health care — which is a setting where they are already in use,” said Marzyeh Ghassemi, Ph.D., senior author of the study and an associate professor at MIT. “LLMs are flexible and performant enough on average that we might think this is a good use case.”

Style over substance

The research — to be presented this week at the Association for Computing Machinery (ACM) Conference on Fairness, Accountability and Transparency — tested how nine stylistic and structural changes in patient messages impacted LLM treatment recommendations across more than 6,700 clinical scenarios. The changes included realistic variations: typos, dramatic language, extra white space, informal grammar and swapped or removed gender markers.

To test the effects, researchers employed a three-step process (a rough code sketch of the pipeline follows the list below):

  • First, they created modified versions of patient messages by introducing small but realistic changes like typos or informal phrasing.
  • Then, they ran each original and altered message through an LLM to collect treatment recommendations.
  • Finally, they compared the LLM’s responses to the original and altered messages, looking at consistency, accuracy and disparities across subgroups. Human-validated answers were used as a benchmark.
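
The researchers’ code isn’t reproduced in the article, but the perturb-and-compare idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study’s implementation: the perturbation functions, the audit helper and the query_model callable are hypothetical stand-ins, and a full audit would also score each response against the clinician-validated labels.

```python
# Rough sketch of a perturb-and-compare audit for LLM triage recommendations.
# Illustrative only: `query_model` stands in for whatever model client is
# being audited, and these perturbations are simplified examples of the
# stylistic changes described in the study (typos, extra white space, etc.).
import random
from typing import Callable

def add_typos(text: str, rate: float = 0.03) -> str:
    """Swap a small fraction of letters to mimic realistic typos."""
    chars = list(text)
    for i, c in enumerate(chars):
        if c.isalpha() and random.random() < rate:
            chars[i] = random.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)

def add_extra_whitespace(text: str) -> str:
    """Pad sentence breaks with extra spaces, another stylistic change."""
    return text.replace(". ", ".   ")

PERTURBATIONS = {"typos": add_typos, "whitespace": add_extra_whitespace}

def audit(messages: list[dict], query_model: Callable[[str], str]) -> dict:
    """Run original and perturbed messages through the model and record how
    often the triage recommendation flips, broken down by patient subgroup.

    Each message dict is expected to hold 'text' and 'group' (e.g. patient
    gender) so disparities can be compared across subgroups.
    """
    flip_rates: dict[str, dict[str, float]] = {}
    for name, perturb in PERTURBATIONS.items():
        flips_by_group: dict[str, list[int]] = {}
        for m in messages:
            baseline = query_model(m["text"])          # recommendation for the original message
            altered = query_model(perturb(m["text"]))  # recommendation for the perturbed message
            flips_by_group.setdefault(m["group"], []).append(int(altered != baseline))
        flip_rates[name] = {g: sum(v) / len(v) for g, v in flips_by_group.items()}
    return flip_rates
```

In a setup like this, a caller passes a list of annotated messages and a thin wrapper around the model under test; disparities then surface as different flip rates across subgroups, the sort of breakdown the researchers argue should precede clinical deployment.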

Even though the clinical content was identical across versions, the models’ recommendations differed significantly. Across all four models tested, including GPT-4, LLMs were 7-9% more likely to recommend self-management instead of medical care when messages were stylistically perturbed.

The most dramatic shifts came when messages included colorful or uncertain language, suggesting that patients with health anxiety or limited English proficiency may be at greater risk of being advised to stay home even when care is warranted.

Researchers also found that LLMs were more likely to reduce care recommendations for female patients than for male patients, even when gender cues were removed. Extra white space alone increased reduced-care errors by more than 5% for female patients.

“In research, we tend to look at aggregated statistics, but there are a lot of things that are lost in translation,” said Abinitha Gourabathina, lead author of the study and a graduate student in the MIT Department of Electrical Engineering and Computer Science (EECS). “We need to look at the direction in which these errors are occurring — not recommending visitation when you should is much more harmful than doing the opposite.”

In conversational formats meant to simulate patient interactions with AI chatbots, clinical accuracy dropped by roughly 7% when messages were perturbed. The scenarios most affected involved free-form patient inputs that echo real-world communication.

The team evaluated four different models on static and conversational datasets spanning oncology, dermatology and general medicine. Real clinicians had previously annotated each case with validated answers.

What it means

The study highlights what researchers describe as “brittleness” in AI medical reasoning — small, non-clinical differences in how a patient writes can steer care decisions in ways that clinicians would not.

Human clinicians, by contrast, were not swayed by the same changes. In follow-up work under review, the researchers found that altering the style or tone of a message did not affect clinicians’ judgment, further underscoring the fragility of LLMs.

Researchers say their findings support more rigorous auditing and subgroup testing before deploying LLMs in high-stakes settings, especially for patient-facing tools.

“This is perhaps unsurprising — LLMs were not designed to prioritize patient medical care,” Ghassemi said. “… we don’t want to optimize a health care system that only works well for patients in specific groups.”
