Researchers analyze ChatGPT for summarizing medical records.
The ChatGPT artificial intelligence (AI) program gets better at summarizing patient notes when physicians refine the prompts used to get it started.
Researchers at Stanford and Duke universities compared histories of present illnesses (HPIs) written by the popular AI program to those written by senior internal medicine residents. ChatGPT's output improved over three rounds of prompt refinement, and in the final comparison, grades for ChatGPT and the residents differed by less than a point on a 15-point composite scale.
“These findings underscore the potential of chatbots to aid clinicians with medical documentation,” said the research letter published in JAMA Internal Medicine.
In January and February, the research team used ChatGPT to generate HPIs based on three patient interview scripts portraying different types of chest pain.
ChatGPT generated 10 HPIs per script; those were evaluated for errors and were considered acceptable if they were free of errors. Then the researchers modified the prompt and repeated the process twice, with the acceptance rate growing from 10% to 43% over the three rounds.
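The acceptance rate described above is simple arithmetic: the share of generated HPIs with zero errors. The Python sketch below is illustrative only. Three scripts at 10 HPIs each gives 30 summaries per round, but the per-summary error counts are hypothetical, chosen to reproduce the reported 10% and 43% figures, and the function name `acceptance_rate` is an assumption, not something from the study.

```python
def acceptance_rate(error_counts):
    """Return the share of summaries that contain zero errors."""
    if not error_counts:
        raise ValueError("need at least one summary")
    accepted = sum(1 for n in error_counts if n == 0)
    return accepted / len(error_counts)


# Hypothetical error counts per summary (not data from the study).
# Each round has 3 scripts x 10 HPIs = 30 summaries.
round_one = [0] * 3 + [1] * 27      # 3 error-free summaries out of 30
round_three = [0] * 13 + [2] * 17   # 13 error-free summaries out of 30

print(f"Round 1 acceptance: {acceptance_rate(round_one):.0%}")
print(f"Round 3 acceptance: {acceptance_rate(round_three):.0%}")
```

Under these assumed counts, 3 of 30 acceptable summaries corresponds to 10% and 13 of 30 rounds to 43%.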
From the final round, they picked one HPI per script to compare with four written by resident physicians. A total of 30 internal medicine attending physicians blindly evaluated the HPIs for level of detail, succinctness, and organization.
When guessing if the author was human or AI, the reviewing physicians were correct 61% of the time.
For ChatGPT, the most common error was adding patient ages and genders, which none of the scripts specified. The program also added other information not present in the scripts, an error known as a hallucination.
ChatGPT’s “performance was heavily dependent on prompt quality,” the study said. “Without robust prompt engineering, the chatbot frequently reported information in the HPIs that was not present in the source dialogue.”
The researchers noted hallucinations occurred in prior tests of AI models.
“The generation of hallucinations in the medical record is clearly of great concern,” the study said. ChatGPT is a large language model (LLM), a type of generative AI.
“Before LLMs can be safely used in the clinical environment, close collaboration between clinicians and AI developers is needed to ensure that prompts are effectively engineered to optimize output accuracy,” the study said.
The researchers noted the study was limited by the version of ChatGPT available. A commentary said that version “used a massive data set comprised of published books, journals, and other Internet-based sources that were available through September 2021.”
As of March, “the program can access contemporaneous information from around the web,” said the editor’s note by Eric Ward, MD, and Cary Gross, MD.
ChatGPT was released in November 2022, and medical researchers began evaluating it almost immediately, they said. That research will continue.
“A new era is unfolding,” Ward and Gross said.
The research letter “Comparison of History of Present Illness Summaries Generated by a Chatbot and Senior Internal Medicine Residents” and editor’s note “Evolving Methods to Assess Chatbot Performance in Health Sciences Research” were published in JAMA Internal Medicine.