December 23, 2025

Are AI scribes safe?

Author: Todd Shryock
Fact checked by: Chris Mazzolini

Key Takeaways

  • AI scribes can introduce errors like hallucinations, omissions, and misinterpretations, affecting patient care and safety.
  • Disparities in transcription accuracy for diverse demographics highlight the need for equitable AI scribe performance.

Researchers raise concerns that the rapid uptake of AI-powered scribes may be outrunning proper oversight

Artificial intelligence–powered medical scribes are rapidly moving from pilot projects to everyday clinical tools, promising to ease one of medicine’s most persistent pain points: documentation. By automatically capturing and summarizing clinician–patient conversations, these systems aim to give physicians back time, reduce burnout, and allow more focus on patient care. Adoption has been swift. An estimated 30% of physician practices now use some form of AI scribe, reflecting both the intensity of the documentation burden and the allure of a technological fix.

But that rapid uptake is also raising red flags. In a recent commentary published in npj Digital Medicine, researchers from Columbia University and the University of Eastern Finland warn that AI scribes are being deployed faster than the evidence, standards, and oversight needed to ensure they are safe and equitable. While early results suggest benefits, the authors argue that key questions about accuracy, reliability, and bias remain unresolved. Speech recognition systems underpinning AI scribes have been shown to perform less accurately for Black patients compared with White patients, and similar issues may affect people with non-standard accents or limited English proficiency—raising concerns about incomplete or distorted clinical records.

Compounding these issues, many AI scribes are classified as administrative tools rather than medical devices, allowing them to bypass U.S. Food and Drug Administration regulation altogether. The researchers argue that this regulatory gap leaves both patients and clinicians exposed, particularly as clinical decision-making increasingly relies on AI-generated documentation.

Medical Economics spoke with Maxim Topaz, PhD, RN, MA, associate professor at the Columbia University School of Nursing and Data Science Institute, senior research scientist at VNS Health, and lead author of the commentary, about what the current research tells us, where the risks lie, and what safeguards, ranging from validation standards to greater vendor transparency, are needed to ensure these tools enhance care without undermining trust.

Medical Economics: What types of clinical errors or omissions are most likely to occur with current AI scribe systems, and how might those errors impact patient care?

Maxim Topaz: We see four main types of errors. First, AI hallucinations: the system fabricates information. We've seen cases where AI documented physical exams that never happened or created diagnoses out of thin air. Second, omissions: essential things get left out. A patient mentions chest pain, but it doesn't make it into the note. Third, misinterpretations: the AI misinterprets the context. A patient reports discontinuing a medication, and it is documented as a new prescription. Fourth, speaker mix-ups: the system confuses who said what, so patient statements get attributed to the clinician or vice versa. The safety implications are real. A normal-appearing examination could mask a serious condition. A missing symptom could delay diagnosis. An incorrect medication note could lead to an adverse event. Studies report hallucination rates around 1-3%, which sounds low until you multiply it by millions of encounters.
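
To put those percentages in perspective, here is a minimal back-of-the-envelope sketch in Python. The annual encounter volume is an assumed figure for illustration; only the 1-3% hallucination range comes from the studies Topaz cites.

```python
# Back-of-the-envelope illustration: a small per-note hallucination rate
# still produces a large absolute number of flawed notes at scale.
# The encounter volume below is an assumption, not a figure from the commentary.

annual_encounters = 1_000_000  # assumed yearly visit volume across a large health system

for hallucination_rate in (0.01, 0.03):  # the 1-3% range cited above
    affected_notes = annual_encounters * hallucination_rate
    print(f"{hallucination_rate:.0%} hallucination rate -> ~{affected_notes:,.0f} notes "
          f"with fabricated content per year")

# Output:
# 1% hallucination rate -> ~10,000 notes with fabricated content per year
# 3% hallucination rate -> ~30,000 notes with fabricated content per year
```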

Medical Economics: Based on your findings, which parts of a clinical encounter (history, exam details, medication changes, assessment/plan) are most vulnerable to AI inaccuracy?

Topaz: Physical exams seem especially prone to hallucinations. Systems have been caught documenting entire examinations that never occurred. That's dangerous because a busy clinician reviewing an AI note might miss fabricated content that appears plausible. Medication documentation is tricky, too, especially when dose changes or stopping a medication come up conversationally. The assessment and plan sections are vulnerable because they require clinical reasoning that AI can approximate but not truly perform. But here's what often gets overlooked: patient-reported symptoms and concerns. Our research found that approximately 50% of patient problems discussed aloud were never documented, even by human clinicians. AI scribes with unclear filtering rules may make this worse, not better. Nuanced conversations about patient preferences or social factors are also easy for AI to miss or oversimplify.

Medical Economics: Your commentary highlights disparities in transcription accuracy for Black patients and those with non-standard accents. What safeguards should clinicians insist on from vendors to ensure equitable transcription and documentation across demographic groups?

Topaz: Our research and others have shown that these systems perform worse for Black patients and people with non-standard accents. This is unsurprising when you consider how AI models are trained, but it's a serious problem. Clinicians should require that vendors provide accurate, disaggregated data by patient demographics: race, ethnicity, primary language, and accent. Ask vendors to explain how they tested their system across diverse populations and what they're doing to fix disparities. Make regular performance reports by demographic group a contract requirement, not an optional extra. And insist on a clear process for reporting and fixing problems when you find them. Without this transparency, we're deploying tools that systematically create worse documentation for patients who already face health care disparities. That's not acceptable.

Medical Economics: What can individual practices do now to assess whether their AI scribe is performing inconsistently across patient populations?

Topaz: Don't wait for vendors to give you this data. Start doing your own audits. Pull a diverse sample of charts and compare the AI-generated notes to what was actually said in the encounter. Track accuracy by patient demographics. Set up a simple system for clinicians to flag errors and look for patterns in who's affected. Pay special attention to notes from encounters with patients who have accents, speak English as a second language, or come from underserved communities. Compare clinicians' note quality to determine whether some communication styles are more effective with the AI than others. Ask patients whether the notes reflect what they actually discussed. Make this ongoing, not a one-time thing. The goal is to catch systematic problems before they harm patients.
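
For practices that want to operationalize this kind of audit, the sketch below is one minimal way to tally reviewer-flagged errors by demographic group. The record structure, field names, and numbers are hypothetical; a real audit would compare each AI-generated note against what was actually said in the encounter.

```python
from collections import defaultdict

# Hypothetical audit log: each entry is one AI-generated note reviewed by a
# clinician, with the number of documentation errors they flagged.
# Field names and values are illustrative only.
reviewed_notes = [
    {"accent": "none",         "primary_language": "English", "errors_found": 0},
    {"accent": "non-standard", "primary_language": "English", "errors_found": 2},
    {"accent": "non-standard", "primary_language": "Spanish", "errors_found": 1},
    # ... drawn from a diverse sample of charts
]

def error_rate_by_group(notes, group_key):
    """Average flagged errors per note, broken out by one demographic field."""
    totals = defaultdict(lambda: [0, 0])  # group -> [total errors, note count]
    for note in notes:
        totals[note[group_key]][0] += note["errors_found"]
        totals[note[group_key]][1] += 1
    return {group: errors / count for group, (errors, count) in totals.items()}

overall = sum(n["errors_found"] for n in reviewed_notes) / len(reviewed_notes)
print(f"Practice-wide average: {overall:.1f} errors per note")
for group, rate in error_rate_by_group(reviewed_notes, "accent").items():
    # Groups sitting well above the practice-wide average warrant a closer look.
    print(f"Accent group '{group}': {rate:.1f} errors per note")
```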

Medical Economics: If an AI scribe introduces an error that harms a patient, where does liability currently fall: on the clinician, the organization, or potentially the vendor?

Topaz: Right now, liability mostly falls on clinicians and health care organizations, while vendors stay protected. Here's why: AI scribes are usually classified as administrative tools, not medical devices, so they skip FDA oversight entirely. Vendor contracts typically include liability limits that push responsibility onto health care organizations. Professional groups in other countries have started calling for clearer rules about who's accountable when AI causes harm, but we're not there yet. Until the legal landscape changes, clinicians need to understand that they're likely liable for documentation errors, regardless of whether AI made them. That makes careful review of every AI-generated note essential, not optional. Organizations should also ensure their malpractice coverage explicitly covers AI-assisted documentation.

Medical Economics: What sort of regulatory framework would you recommend to balance innovation with patient safety?

Topaz: We need action at multiple levels. At the federal level, the FDA needs to close the loophole that lets AI scribes avoid oversight by calling themselves administrative tools. These systems directly affect clinical documentation, so they should meet real safety standards. Before going to market, vendors should show accuracy testing across diverse populations, report their hallucination and error rates, and be transparent about limitations. State health departments should set implementation standards. Hospitals need internal governance to deploy and monitor these tools. However, this cannot be solely top-down regulation. Physicians and nurses should be involved in validation studies. Administrators need to ensure proper training. Vendors should prioritize transparency. And patients should know what's happening and give informed consent. Everyone has a role.

Medical Economics: How much review time should clinicians realistically expect to spend verifying AI-generated notes?

Topaz: Here's the honest answer: the evidence is a mess. I recently compared the major AI scribe studies from 2025 and found six studies measuring time savings in six different ways. The numbers don't add up. A large Swedish study found clinicians self-reported spending 4.7 minutes on notes, but objective editing time was only 93 seconds. That's a threefold gap. UCLA's trial found 41 seconds saved per note. Stanford found 0.57 minutes. So what's really happening? AI scribes seem to reduce clinicians' burden without dramatically cutting actual time. That's valuable, but it's not the same as productivity gains. My concern is that organizations will raise patient volume expectations based on promised efficiency gains, only to have clinicians squeezed when the time savings don't materialize. Plan to read every AI-generated note carefully. Don't assume accuracy.

Medical Economics: If you were advising a small or medium-sized practice, what are the top three questions they should ask an AI scribe vendor?

Topaz: First: Show me your accuracy data. What are your hallucination rates, omission rates, and how does the system perform across different patient populations? Don't accept vague claims about being highly accurate. Ask whether independent researchers validated the system or only the company. Second: How does your system work, and what are its limitations? If a vendor can't or won't explain when their technology is likely to make errors, that's a red flag. You need to understand the blind spots. Third: What happens to my patients' data? Who owns the recordings? How are they stored? Can they be used for AI training or sold to third parties? Many patients don't expect their clinical conversations to be used for commercial AI development. Also, ask about liability provisions and what training support they provide for reviewing AI-generated content.

Medical Economics: Are there particular specialties or clinical scenarios where AI scribes appear to be safer, or riskier, based on current evidence?

Topaz: The evidence is still coming in, but some patterns are emerging. Straightforward, predictable encounters appear to be lower risk: routine dermatology follow-ups and primary care wellness visits. The clinical content is structured and easier for AI to capture. Higher-risk scenarios include encounters with multiple speakers, such as family meetings or visits with interpreters, where the system struggles to track who said what. Psychiatry and behavioral health visits are concerning because nuance and subtext matter so much. Emergency settings where information flows fast may overwhelm the technology. Pediatric visits can get complicated when kids, parents, and clinicians are all talking at once. And any encounter with patients who have accents or speech differences is higher risk given the performance disparities we've documented. Honestly, most of medicine requires precise documentation, so caution is warranted broadly.

Medical Economics: Should clinicians disclose to patients when an AI scribe is being used during the visit? If so, what's the best way to frame that conversation?

Topaz: Yes, absolutely. Patients have a right to know their conversations are being recorded and processed by AI. Beyond the ethical reasons, disclosure might actually help: patients may speak more clearly or flag when something is really important. Keep it simple and non-alarming. Something like: 'I use an AI assistant that listens to our conversation and helps me with notes, so I can focus on you instead of typing. Everything stays confidential as part of your medical record. Is that okay?' That explains it without scaring anyone. Be prepared to proceed without the AI if a patient declines. And know that legal requirements vary by state; some require two-party consent for recording.

Medical Economics: What level of accuracy and transparency do AI scribes need to reach before you would consider them ready for widespread, safe clinical adoption?

Topaz: I'd want to see several things. First, hallucination rates near zero for safety-critical content like medications, allergies, and diagnoses. The current 1-3% rates are too high when you're talking about millions of patient encounters. Second, equal performance across demographic groups, with transparent data showing no systematic disparities by race, language, or accent. Third, real transparency about how the system works and when it's likely to fail, not black-box technology. Fourth, systems in place for ongoing monitoring, error reporting, and improvement. Fifth, updated regulations that actually hold vendors accountable. The bottom line: the question isn't whether to adopt these tools. They have real potential to reduce burnout and documentation burden. The question is how to adopt them responsibly. Right now, we can't confidently say that AI scribes improve care without creating new risks. We need to get there.
