MIT researchers find that large language models may shortchange women and vulnerable patients based on how clinical inquiries are typed.
Artificial intelligence (AI) models used to help triage patient messages may be far more sensitive to grammar, formatting and tone than previously believed, with disproportionate impacts on women and other vulnerable groups, a new Massachusetts Institute of Technology (MIT) study suggests.
The findings raise new concerns about fairness, safety and clinical oversight as large language models (LLMs) like OpenAI’s GPT-4 are deployed in clinical settings to help determine whether a patient should self-manage, come in for a visit or receive additional resources.
“[This] is strong evidence that models must be audited before use in health care — which is a setting where they are already in use,” said Marzyeh Ghassemi, Ph.D., senior author of the study and an associate professor at MIT. “LLMs are flexible and performant enough on average that we might think this is a good use case.”
The research — to be presented this week at the Association for Computing Machinery (ACM) Conference on Fairness, Accountability and Transparency — tested how nine stylistic and structural changes in patient messages impacted LLM treatment recommendations across more than 6,700 clinical scenarios. The changes included realistic variations: typos, dramatic language, extra white space, informal grammar and swapped or removed gender markers.
To test the effects, researchers employed a three-step process: they started with clinician-validated patient messages, applied the stylistic and structural perturbations to create altered copies, and then compared each model's treatment recommendations for the perturbed messages against its recommendations for the originals.
Although all clinical content was the same, the models' responses were significantly different. Across all four models tested, including GPT-4, the LLMs were 7% to 9% more likely to recommend self-management instead of medical care when messages were perturbed.
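As a rough illustration of this kind of perturb-and-compare audit, the sketch below applies a few simplified stylistic changes to a patient message and checks whether a model's triage label flips. It is not the study's code: the `query_triage_model` callable is a hypothetical stand-in for an LLM query, and the perturbation functions are deliberately crude examples of the categories described above.

```python
# Illustrative perturb-and-compare audit sketch (not the study's actual code).
# `query_triage_model(text)` is a hypothetical function that returns a triage
# label such as "self-manage", "visit" or "resources".
import random

def add_typos(text: str, rate: float = 0.05) -> str:
    """Randomly drop characters to mimic typos."""
    return "".join(c for c in text if random.random() > rate)

def add_whitespace(text: str) -> str:
    """Insert extra white space between sentences."""
    return text.replace(". ", ".   ")

def remove_gender_markers(text: str) -> str:
    """Crudely neutralize common gendered pronouns."""
    swaps = {" she ": " they ", " he ": " they ", " her ": " their ", " his ": " their "}
    for old, new in swaps.items():
        text = text.replace(old, new)
    return text

PERTURBATIONS = [add_typos, add_whitespace, remove_gender_markers]

def audit(messages, query_triage_model):
    """Return the fraction of cases where a stylistic change flips the model's label."""
    flips, total = 0, 0
    for msg in messages:
        baseline = query_triage_model(msg)
        for perturb in PERTURBATIONS:
            total += 1
            if query_triage_model(perturb(msg)) != baseline:
                flips += 1
    return flips / total if total else 0.0
```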
The most dramatic changes came when messages included colorful or uncertain language, suggesting patients with health anxiety or non-native English fluency may be at greater risk of being advised to stay home even when care is warranted.
Researchers also found that LLMs were more likely to reduce care recommendations for female patients than for male patients, even when gender cues were removed from the messages. The inclusion of extra white space increased reduced-care errors by more than 5% for female patients.
“In research, we tend to look at aggregated statistics, but there are a lot of things that are lost in translation,” said Abinitha Gourabathina, lead author of the study and a graduate student in the MIT Department of Electrical Engineering and Computer Science (EECS). “We need to look at the direction in which these errors are occurring — not recommending visitation when you should is much more harmful than doing the opposite.”
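A directional, subgroup-level breakdown of that kind could be computed along the lines of the sketch below. The field names ("group", "model_label", "clinician_label") and the label strings are assumptions for illustration, not the study's data schema; the point is simply to count the harmful direction of error, recommending self-management when a visit was warranted, separately for each patient group.

```python
from collections import defaultdict

def reduced_care_error_rate(cases):
    """Per-group rate of reduced-care errors: the model says "self-manage"
    when the clinician-validated answer is "visit".

    Each case is assumed to be a dict with keys "group" (e.g. patient gender),
    "model_label" and "clinician_label".
    """
    errors = defaultdict(int)
    counts = defaultdict(int)
    for case in cases:
        group = case["group"]
        counts[group] += 1
        if case["clinician_label"] == "visit" and case["model_label"] == "self-manage":
            errors[group] += 1
    return {group: errors[group] / counts[group] for group in counts}
```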
In conversational formats meant to simulate patient exchanges with AI chatbots, clinical accuracy dropped by roughly 7% when messages were perturbed. The scenarios most affected involved free-form patient inputs, echoing real-world communications.
The team evaluated four different models on static and conversational datasets spanning oncology, dermatology and general medicine. Real clinicians had previously annotated each case with validated answers.
The study highlights what researchers describe as “brittleness” in AI medical reasoning — small, non-clinical differences in how a patient writes can steer care decisions in ways that clinicians would not.
Human physicians were not swayed by the same changes: in follow-up work currently under review, the researchers found that altering the style or tone of a message did not affect clinicians' recommendations, further underscoring the fragility of the LLMs.
Researchers say their findings support more rigorous auditing and subgroup testing before deploying LLMs in high-stakes settings, especially for patient-facing tools.
“This is perhaps unsurprising — LLMs were not designed to prioritize patient medical care,” Ghassemi said. “… we don’t want to optimize a health care system that only works well for patients in specific groups.”