Abstract
Background
Clinical documentation is vital for effective communication, legal accountability and the continuity of care in healthcare. Traditional documentation methods, such as manual transcription, are time-consuming, prone to errors and contribute to clinician burnout. AI-driven transcription systems utilizing automatic speech recognition (ASR) and natural language processing (NLP) aim to automate and enhance the accuracy and efficiency of clinical documentation. However, the performance of these systems varies significantly across clinical settings, necessitating a systematic review of the published studies.
Methods
A comprehensive search of MEDLINE, Embase, and the Cochrane Library identified studies evaluating AI transcription tools in clinical settings, covering all records up to February 16, 2025. Inclusion criteria encompassed studies involving clinicians using AI-based transcription software, reporting outcomes such as accuracy (e.g., Word Error Rate), time efficiency and user satisfaction. Data were extracted systematically, and study quality was assessed using the QUADAS-2 tool. Due to heterogeneity in study designs and outcomes, a narrative synthesis was performed, with key findings and commonalities reported.
Results
Twenty-nine studies met the inclusion criteria. Reported word error rates ranged widely, from 0.087 (8.7%) in controlled dictation settings to over 50% in conversational or multi-speaker scenarios. F1 scores spanned 0.416 to 0.856, reflecting variability in accuracy. Although some studies highlighted reductions in documentation time and improvements in note completeness, others noted increased editing burdens, inconsistent cost-effectiveness and persistent errors with specialized terminology or accented speech. Recent LLM-based approaches offered automated summarization features, yet often required human review to ensure clinical safety.
Conclusions
AI-based transcription systems show potential to improve clinical documentation but face challenges in accuracy, adaptability and workflow integration. Refinements in domain-specific training, real-time error correction and interoperability with electronic health records are critical for their effective adoption in clinical practice. Future research should also focus on next-generation “digital scribes” incorporating LLM-driven summarization and repurposing of text.
Clinical trial number
Not applicable.
Supplementary Information
The online version contains supplementary material available at 10.1186/s12911-025-03061-0.
Keywords: Artificial intelligence, Speech recognition, Digital scribe, Ambient scribe, Clinical documentation, Accuracy
Background
Clinical documentation, defined as the systematic recording of a patient’s medical history, diagnoses, treatment plans and care provided, remains a cornerstone of effective healthcare. It is critical in ensuring accurate communication among healthcare providers, legal accountability, and continuity of care [1]. However, traditional documentation methods, such as manual note-taking or transcription, are often labour-intensive, prone to errors and can detract from the quality of patient-clinician interactions [2, 3]. These inefficiencies not only contribute to clinician burnout but also risk compromising the accuracy of medical records and patient safety.
In recent years, Artificial Intelligence (AI) has begun to transform clinical documentation through the use of advanced technologies like automatic speech recognition (ASR), large language models (LLM) and natural language processing (NLP) [4, 5]. These AI-driven transcription tools automate the process of converting spoken language into structured electronic medical records (EMRs), thereby alleviating the burden of manual data entry [5]. By streamlining this process, AI transcription systems offer the potential to improve the accuracy and completeness of clinical documentation while allowing clinicians to focus more on patient care and communication.
Despite the promise of AI in this domain, the effectiveness of AI transcription tools remains inconsistent across different clinical settings [4]. Studies report varying levels of accuracy, time savings and user satisfaction [4, 5]. While some tools demonstrate significant improvements in documentation speed and precision, others face challenges with speech recognition (SR) errors, the need for manual post-editing and inconsistencies in real-world clinical use [4, 5]. These mixed outcomes highlight the complexity of integrating AI tools into diverse healthcare environments and underscore the need for a thorough evaluation of their performance.
This review aims to synthesize the current evidence on AI transcription tools, focusing on their accuracy, efficiency and usability in clinical practice. By examining the successes and challenges of implementing these technologies, the review seeks to provide insights that can guide the development and integration of AI-driven documentation systems, ultimately shaping the future of clinical workflows and improving the quality of patient care.
Methods
This review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [6] for identifying, selecting and synthesizing evidence, and its protocol was developed and registered in PROSPERO (registration number CRD42024597200).
Search strategy
A comprehensive literature search was conducted on February 16, 2025, across multiple electronic databases, including MEDLINE (via OVID), Embase and the Cochrane Library, covering all records published up to that date. The search strategy was developed in consultation with medical information experts to identify studies that evaluated the performance of AI-based medical transcription software in clinical settings. Keywords and the following Medical Subject Headings (MeSH) were applied: “Artificial Intelligence”, “Digital Scribe”, “Medical Transcription”, “Speech Recognition”, “Natural Language Processing”, “Electronic Health Records” and “Clinical Documentation”. Details of the search strategy can be found in the Supplementary Material (Table S1).
Additionally, grey literature was searched via the Google search engine to capture relevant non-peer-reviewed studies. To further enhance the comprehensiveness of the search, forward and backward citation searching of relevant studies was performed to identify additional literature that may not have been captured in the initial search.
Inclusion and exclusion criteria
The inclusion criteria for this systematic review were as follows: the population of interest comprised clinicians, such as physicians and nurses, who used AI-based transcription software for clinical documentation. The intervention of focus was the use of AI-driven transcription tools, which may include technologies such as ASR, LLM and NLP systems. Eligible studies had to report one or more key outcomes, such as transcription accuracy (measured through Word Error Rate or WER), time savings, clinician satisfaction or the impact on patient care. The review included empirical studies of various designs, including randomized controlled trials (RCTs), cohort studies, cross-sectional studies, comparative evaluations and proof-of-concept studies. Only studies published in English or with an English translation, and indexed up to February 16, 2025, were considered.
Studies that did not involve AI-based transcription tools were excluded, as were studies conducted in clinical settings without a physician facing a patient (e.g., laboratory-based evaluations and reports generated by radiologists and/or pathologists) and studies focusing on non-English language transcription. Conference abstracts, editorials, commentaries and opinion pieces that did not provide empirical data were also excluded.
Study selection
All identified studies were imported into Covidence (Veritas Health Innovation, Melbourne, Australia) to facilitate the screening process. Four independent reviewers (J.J.W.N., E.W., C.X.L.G. and G.Z.N.S.) screened titles and abstracts to exclude studies that did not meet the inclusion criteria. Studies passing this initial screening underwent a full-text review by two independent reviewers (J.J.W.N. and E.W.). Discrepancies in study inclusion were resolved through discussion, with arbitration by a third, senior reviewer (Q.X.N., H.K.T. or S.S.N.G.) if necessary.
Data extraction
Data were extracted from each included study using a standardized data extraction form developed for this review. The extracted data encompassed study characteristics, including the software or model used, the type of AI model, study design, clinical setting and country in which the study was conducted. Details about the study population, sample size, whether the study was vendor-initiated, the reference standard and the comparator type were also recorded. Key performance metrics, such as F1 score, precision, recall and WER, were extracted along with paper-specific outcomes related to AI transcription proficiency and key findings. Novel features of the AI transcription tools were documented to provide a comprehensive overview of the studies.
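For clarity, these performance metrics follow their standard definitions (stated here for reference; the formulas below are generic rather than taken from any single included study):

WER = (S + D + I) / N, where S, D and I are the numbers of substituted, deleted and inserted words relative to a reference transcript of N words
Precision = TP / (TP + FP) and Recall = TP / (TP + FN), computed over extracted items such as words or clinical concepts
F1 = 2 × Precision × Recall / (Precision + Recall), the harmonic mean of precision and recall

Under these definitions, a WER of 0.10 corresponds to roughly 10 transcription errors per 100 reference words, and WER can exceed 1 when insertions are frequent.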
Assessment of risk of bias and study quality
The Quality Assessment of Diagnostic Accuracy Studies (QUADAS-2) tool [7] was used to systematically assess the risk of bias and applicability of the studies included in our review. QUADAS-2 is a widely used tool that evaluates risk of bias across four key domains: patient selection, index test, reference standard and flow and timing [7]. Each domain was independently assessed for potential bias by reviewing key study characteristics and determining if the conduct or interpretation of results could have introduced bias. We also evaluated whether the applicability of each study matched our review question.
For each domain, the risk of bias was rated as either “low,” “high,” or “unclear,” depending on the completeness and clarity of the study’s reported methods. Specifically, we looked at factors such as the selection of patients or datasets, how the index test was performed and interpreted, whether the reference standard was appropriate, and if there were any exclusions that could have influenced the results. Any discrepancies were resolved through discussion among the reviewers (E.W., J.J.W.N. and X.Z.). This careful assessment ensured that the studies included in the analysis were reliable and applicable to our research objectives.
Data synthesis
Given that a meta-analysis was not feasible due to anticipated heterogeneity in the study design, interventions and outcomes, a narrative synthesis was conducted, as guided by Popay et al. [8]. Findings were narratively synthesized by summarizing key outcomes such as transcription accuracy, clinician satisfaction, impact on patient care, and usability, and by identifying common patterns across the studies.
Results
Literature retrieval
A total of 5,244 records were initially identified through database searches. After removing 1,011 duplicates using Covidence, 4,233 studies were screened based on titles and abstracts. During this screening phase, 4,210 studies were excluded for not meeting the inclusion criteria, leaving 60 studies for full-text review. All of these studies were retrieved for detailed assessment. After applying the inclusion and exclusion criteria, 25 studies were included. As illustrated in Fig. 1, an additional four studies were identified through the forward and backward citation searching, bringing the final total to 29 studies. The key study characteristics and findings of the 29 studies [4, 9–36] are summarised in Tables 1 and 2, respectively.
Fig. 1.
PRISMA flowchart showing the study selection process
Table 1.
Key study characteristics for studies included in this review
Study | Software/Model | Type of AI Model | Study Design | Clinical Setting | Country | Inpatient/ Outpatient | Sample Size | Vendor initiated |
---|---|---|---|---|---|---|---|---|
Zick et al., 2001 [9] | Dragon NaturallySpeaking Medical Suite version 4.0 | ASR and NLP | Comparative evaluation | Emergency Department (ED) | USA | Inpatient | 47 | Yes |
Happe et al., 2003 [10] | Speech recognition: Naturally speaking 5 professional version, IBM via voice 8 professional version; Automatic indexing: Nomindex | NLP | Comparative evaluation | Hospital medical reports of inpatients admitted for coronography | France | Inpatient | 28 | Yes |
Mohr et al., 2003 [11] | Domain (medical specialty) acoustic and language models were built on site using LTI model-building software provided by the vendor (Linguistics Technology, Inc, Edina, MN) | ASR | RCT | Endocrinology and Psychiatry | USA | Outpatient | 3052 | No |
Issenman et al., 2004 [12] | Dragon Naturally Speaking version 6 | ASR | Comparative evaluation | Paediatric gastroenterology | Canada | Outpatient | 72 | No |
Almario et al., 2015 [13] | Automated Evaluation of Gastrointestinal Symptoms (AEGIS) | NLP | Cross-sectional | Gastrointestinal clinics | USA | Outpatient | 75 | Yes |
Almario et al., 2015 [14] | AEGIS | NLP | Cross-sectional | Gastrointestinal clinics | USA | Outpatient | 75 | Yes |
Suominen et al., 2015 [15] | Speech recognition: Dragon Medical 11.0; Information extraction: Conditional random field model | Dragon Medical 11.0: ASR; CRF: Probabilistic graphical model | Comparative observational and comparative experimental | Synthetic dataset of handover records | Australia | Inpatient | 101 | No |
Hodgson et al., 2017 [16] | Nuance Dragon Medical 360 Network Edition UK (version 2.0, 12.51.200.072) SR software | NLP | RCT with a within-subjects design | Standardized clinical documentation likely seen in ED physicians | Australia | Inpatient | 35 | No |
Kodish-Wachs et al., 2018 [17] | ASR engines: Bing Speech API (BING), Google Cloud Speech API (Google), IBM Speech to Text (IBM), Azure Media Indexer (MAVIS), Azure Media Indexer 2 Preview (MAVIS v2), Nuance.SpeechAnywhere (Nuance), Amazon Transcribe Preview (Transcribe), Mozilla DeepSpeech (DeepSpeech); NLP engines (used to extract clinical concepts): commercially available NLP service and open source NLP service (CLiX NOTES, Clinithink, Bridgend CF31 1LH, UK) and biomedical annotator (https://bioportal.bioontology.org/annotator) | ASR | Systematic comparison | Simulated clinical scenarios likely seen in ambulatory primary care practice | USA | Outpatient | 34 | No |
Lybarger et al., 2018 [18] | Logistic regression was selected for the sentence-level edit detection model; the linear-chain continual random fields (CRF) modelling framework was selected for the word-level edit detection model | Logistic regression/CRF: Probabilistic graphical model | Error detection and correction models | Inpatient progress notes created by resident and attending internal medicine physicians | USA | Inpatient | 669 | No |
Zhou et al., 2018 [19] | Trans_BBN Trans_I2B2 | Transfer learning using CRF model | Comparative evaluation | Synthetic nursing handover data | Australia | Inpatient | 301 | Yes |
Goss et al., 2019 [20] | Dragon Medical 360 | ASR | Controlled observational | Clinicians from Brigham and Women’s Hospital were surveyed | USA | Mixed | 245 | No |
Blackley et al., 2020 [21] | Dragon Medical One | ASR | Controlled observational | Simulated primary care clinical encounters | Australia | Outpatient | 40 | No |
Van Woensel et al., 2022 [22] | Mozilla DeepSpeech; CMU Sphinx | DeepSpeech: RNN; CMU Sphinx: Hidden Markov Model | Comparative evaluation | ED and non-ED clinical notes | Canada | Mixed | 107 | No |
Balloch et al., 2024 [4] | TORTUS AI | ASR, LLM | Simulated consultations | Clinic | UK | Outpatient | 8 | Yes |
Bundy et al., 2024 [23] | DAX Copilot | ASR, LLM, NLP | Qualitative study | Clinic | USA | Outpatient | 12 | Yes |
Cao et al., 2024 [24] | Dragon Ambient eXperience (DAX) | ASR, LLM, NLP | Pre-post evaluation | Dermatology | USA | Outpatient | 12 | Yes |
Haberle et al., 2024 [25] | DAX Copilot | ASR, LLM, NLP | Cohort study | Clinic | USA | Outpatient | 99 | Yes |
Islam et al., 2024 [26] | Extended Long Short-Term Memory (LSTM) | NLP, RNN | System evaluation | Diabetes clinical visit | Bangladesh | Outpatient | 102 | Yes |
Liu et al., 2024 [27] | DAX Copilot | ASR, LLM, NLP | Nonrandomized trial | Clinic | USA | Outpatient | 85 | Yes |
Misurac et al., 2024 [28] | Ambient Artificial Intelligence Tool | NLP | Controlled observational | Patient-clinician conversations across multiple sub-specialties (e.g. paediatrics, family medicine, internal medicine, O&G) | USA | Outpatient | 38 | No |
Owens et al., 2024 [29] | DAX Copilot | ASR, LLM, NLP | Prospective observational study | Primary care | USA | Outpatient | 304 | Yes |
Sezgin et al., 2024 [30] | BART-Large-CNN, PEGASUS-PubMed, T5-small, T5-base | ASR, NLP, LLM | Comparative evaluation | ED consultation calls | USA | Outpatient | 100 | No |
van Buchem et al., 2024 [31] | Autoscriber (transformer-based speech-to-text model, fine-tuned on proprietary clinical data for transcription and a mixture of large language models such as GPT-3.5 and GPT-4, combined with a tailored prompt structure and additional rules for summarization) | NLP, LLM | System evaluation | Simulated internal medicine consultation, focused on various presentations of chest pain | Netherlands | Outpatient | 430 | No |
Biro et al., 2025 [32] | Ambient Digital Scribe (ADS) | LLM | Instrument validation | Dialogue scripts based on real encounters from various specialties (e.g. otolaryngology, internal medicine, family medicine, pediatrics and urgent care) | USA | Outpatient | 11 | No |
Duggan et al., 2025 [33] | DAX Copilot | ASR, LLM, NLP | Pre-post quality improvement study | Mix of medical specialties | USA | Outpatient | 46 | Yes |
Ma et al., 2025 [34] | DAX Copilot | ASR, LLM, NLP | Prospective quality improvement study | Ambulatory settings | USA | Outpatient | 45 | Yes |
Moryousef et al., 2025 [35] | Multiple AI scribes (including Nabla, Scribe MD) | NLP | Evaluation study | Urology | Canada | Outpatient | 20 | No |
Shah et al., 2025 [36] | DAX Copilot | ASR, LLM, NLP | Prospective quality improvement study | Ambulatory settings | USA | Outpatient | 48 | Yes |
Abbreviations: AEGIS: Automated Evaluation of Gastrointestinal Symptoms; ASR: Automatic Speech Recognition; CRF: Conditional random field; DAX: Dragon Ambient eXperience; LLM: Large language model; RNN: Recurrent Neural Network
Table 2.
Key study findings for studies reviewed
Study | Reference Standard | Comparator Type | Subcategories | Performance Metric (F1 score, Precision, Recall, WER) | AI Transcription Proficiency (paper-specific outcomes) | Key Findings | Novel Features | |
---|---|---|---|---|---|---|---|---|
Zick et al., 2001 [9] | Voice recognition chart was used as a basis for the traditional transcription chart | Standard Care |
- Accuracy (%) - Average no. errors/chart - Average dictation and correction time (min) - Average turnaround time for receipt of a completed document (min) - Throughput words/min |
NR |
Difference between voice recognition and transcription (95% CI) 1.2 (0.8–1.5) 1.3 (0.67–1.88) 0.12 (–0.34–0.58) 35.95 (24.59–47.31) 40.4 (34.4–46.39) |
Charts dictated using the voice recognition program were considerably less costly than the manually transcribed charts; computer voice recognition is nearly as accurate as traditional transcription, has a much shorter turnaround time and is less expensive than traditional transcription. Recommended its use as a tool for physician charting in the ED. | The authors recommended using templates and macros to insert standard examination findings, as this can reduce transcription time and the likelihood of transcription errors; this could also ensure that only clinically relevant portions of the exam are included, making documentation more efficient and accurate. | |
Happe et al., 2003 [10] | Set of keywords manually extracted from the initial document. | Gold standard |
“Naturally speaking” SR tool: - Medical Vocabulary - General vocabulary - Misspelling - Numbers - Punctuation marks - Total |
Precision 0.73 Recall 0.90 |
Baseline (“at blank”) 8.23 3.59 2.34 6.02 0.36 5.67 Trained (“with learning”) 4.31* 2.67* 1.87* 7.14 0.54 4.21* |
IBM via voice with medical vocabulary outperformed all other configurations using naturally speaking. Index evaluation of Nomindex is comparable to previous evaluations. At the current stage, the precision (73%) is too low to rely on this indexing but the recall (90%) is high enough to use it as a machine-aided indexing tool. |
Study noted that indexing errors need improvement, especially those pertaining to word ambiguities and abbreviations; recommend further refinement of the indexing tools used in ASR systems to mitigate these errors. The authors recognize the limitations in handling non-English terms; suggest developing ASR systems with multilingual capabilities, which could increase usability in diverse clinical settings. |
|
“IBM via voice” SR tool: - Medical Vocabulary - General vocabulary - Misspelling - Numbers - Punctuation marks - Total |
Baseline (“at blank”) 3.68 1.13 0.52 2.31 0.43 1.80 Trained (“with learning”) 2.16 1.22 0.66 1.19 0.00 1.71 |
|||||||
Mohr et al., 2003 [11] | Productivity was compared via the total time to complete transcription using standard transcription versus speech recognition (SR) | Standard care |
- Endocrinology - Psychiatry transcriptionists - Psychiatry secretaries |
NR |
Productivity for SR as a percentage of productivity for standard transcription (95% CI) 87.3 (83.3, 92.3) 63.3 (54.0, 74.0) 55.8 (44.6, 68.0) |
Overall, secretaries and transcriptionists using SR could complete less dictation than transcriptionists using standard transcription. SR was better for secretaries who were slower and, perhaps, for longer dictation jobs. |
Performance varies depending on the speaker. Individual users’ speech patterns, such as accent and speed, impact transcription productivity and accuracy. Identifying a subset of users for whom transcription technology works well and optimising the tool for various speech characteristics could improve its efficiency. Productivity can be targeted for specific tasks. Certain tasks, like longer clinical notes, showed improved productivity with SR. Focusing on optimizing the system for specific document types or task lengths could improve overall system efficacy. |
|
Issenman et al., 2004 [12] | Transcription of physician dictated letter by a medical transcriptionist | Gold standard |
- Voice Recognition Software (VRS) - Control |
WER* 0.087 0.00027 |
Total time in minutes per letter (baseline)* 14 3.3 Total time in minutes per letter (trained)* 9.6 3.3 Cost comparison* $15,290/ year $6970/year |
VRS decreases turnaround time of physicians’ notes from 1 week to 1 day. The use of VRS was 66% less efficient in total time, and costs twice as much as conventional transcription. VRS is best suited for practices in which highly repetitive phrases reoccur. At this time, VRS does not seem to have reached the threshold for general adoption. | The study suggests a blended approach where a transcriptionist reviews and corrects initial voice-recognition outputs; could improve efficiency by reducing the need for physicians to spend extensive time on transcription correction while ensuring high accuracy. | |
Almario et al., 2015 [13] | Physician generated History of Presenting Illness (HPI) | Blinded comparison |
- Overall impression - Completeness - Relevance - Organisation - Succinctness - Comprehensibility - Number of Medicare-recommended elements present in HPI |
NR |
Mean of Physician HPI Ratings (SD) 2.80 (0.75) 2.73 (0.75) 3.04 (0.68) 2.80 (0.80) 3.17 (0.60) 2.97 (0.79) 5.27 (1.52) Mean of Computer-generated HPI Ratings (SD) 3.68 (0.61)* 3.70 (0.59)* 3.82 (0.54)* 3.66 (0.63)* 3.55 (0.69)* 3.66 (0.66)* 6.05 (0.98)* |
Blinded raters deemed the computer-generated HPIs to be of higher quality, more comprehensive, better organized, and with greater relevance compared to physician-documented HPIs. These results offer initial proof-of-principle that a computer can create meaningful and clinically relevant HPI. |
The study noted that the computer-generated HPIs were more complete and organized but still required improvements in accuracy and consistency, especially for detailed patient histories. Further development was needed for seamless integration of AEGIS with EHRs to ensure that computer-generated HPIs align well with real-time physician documentation workflows, which are sensitive to time constraints and accuracy demands. |
|
Almario et al., 2015 [14] | Number of positive alarm features documented in physician generated History of Presenting Illness (HPI) | Blinded comparison |
- All patients - Patients presenting for an initial visit - Patients who completed AEGIS within 1 week of their clinic visit |
NR |
Median number of positive alarm features in Physician HPIs (interquartile range) 0 (0–1) 0 (0–1) 0 (0–1) Median number of positive alarm features in AEGIS HPIs (interquartile range) 1 (0–2)* 1 (0–2)* 1 (0–2)* |
Computer generated HPIs documented more alarm features than physician generated HPIs. Physicians may be under reporting alarm features in GI clinics. Yet, greater documentation of red flags is not shown to improve patient outcomes. |
More advanced algorithms necessary to improve the accuracy of detecting and documenting GI symptoms. Specifically, machine learning models could be further developed to ensure the automatic detection of all relevant clinical features. Current systems may miss certain symptoms due to variability in patient language; suggest enhancing systems to handle diverse phrasing, which could prevent underreporting by AEGIS. |
|
Suominen et al., 2015 [15] | Original written, free-form text documents by the registered nurse | Gold standard |
- Irrelevant text - Macro-averaged over 35 nonempty categories in a form |
F1 0.856 0.702 Precision 0.794 0.759 Recall 0.929 0.653 WER (mean) 0.43 |
NR | Cascaded SR and IE to fill out a handover form for clinical proofing and sign-off provide a way to make clinical documentation more effective and efficient; it also improves accessibility and availability of existing documents in clinical judgment, situational awareness, and decision making. |
Improving accuracy in identifying relevant information in clinical narratives was a key focus; this could involve better information extraction techniques for categorizing free-form speech into structured formats. The study suggests further testing with real-world clinical scenarios to ensure robustness. |
|
Hodgson et al., 2017 [16] | Standardized version of a commercial electronic health record system | Gold standard |
- Simple task Keyboard and Mouse (KBM) - Simple task SR - Complex task KBM - Complex task SR |
NR |
Mean task completion time (s) 112.38 131.44 170.48 201.84 Potential for patient harm (major errors) 2 29 11 21 Typographical errors 142 133 71 119 |
Using SR for clinical documentation was significantly slower, with task completion times being 18.11% longer compared to using KBM. SR also resulted in a higher number of errors, with 390 errors observed in SR compared to 245 in KBM, including errors with potential patient harm. While task interruptions did not significantly impact performance, the increased error rates and slower documentation times highlight safety and efficiency concerns with SR in its current form. |
The study highlighted the high error rates associated with SR, particularly for complex clinical tasks, recommending further improvements in SR technology to reduce errors caused by misinterpretation of clinical terms. It was noted that current SR systems were not optimally integrated with EHR systems, leading to workflow inefficiencies and errors. |
|
Kodish-Wachs et al., 2018 [17] |
Professionally transcribed and annotated recordings with speaker and time index. Extracted clinical concepts using the same commercially-available NLP engine and open source NLP. |
Gold standard |
- Bing Speech API (BING), - Google Cloud Speech API (Google), - IBM Speech to Text (IBM), - Azure Media Indexer (MAVIS), - Azure Media Indexer 2 - Preview (MAVIS v2), - Nuance.SpeechAnywhere (Nuance), - Amazon Transcribe Preview (Transcribe), - Mozilla DeepSpeech (DeepSpeech) |
F1 > 0.60 > 0.60 > 0.60 > 0.60 > 0.60 > 0.60 > 0.60 > 0.40 Precision > 0.80 > 0.80 > 0.60 > 0.60 > 0.80 > 0.80 > 0.80 > 0.40 |
Recall > 0.40 > 0.60 > 0.60 > 0.60 > 0.60 > 0.40 > 0.60 > 0.20 WER 49% 44% 38% 41% 35% 58% NR 65% |
NR | The achievable performance of contemporary ASR engines, when applied to conversational clinical speech as measured by WER and clinical concept extraction, is disappointing with WER of approximately 50% and concept extraction rates of approximately 60%. Limited number of use cases where this level of performance is adequate. |
The study highlights low recall rates for clinical concepts in primary care settings. This could be improved by incorporating better domain-specific vocabulary and adapting ASR engines specifically for medical language. Recognizing the differences in error rates between doctors and patients, the study recommends fine-tuning ASR systems for diverse speaker roles and ensuring better handling of complex dialogues between multiple speakers. |
Lybarger et al., 2018 [18] |
Word-level gold standard labels were determined based on the keep or delete labels from the note alignments. Sentence-level gold standard labels were determined as follows: sentences were labeled as delete when all word-level labels were delete and sentences were labeled as keep when at least one word-level label in the sentence was kept. |
Gold standard |
Sentence level performance: - Structure - Words - combined VGEENS |
F1 0.39 0.44 0.48 Precision 0.35 0.37 0.40 Recall 0.44 0.55 0.59 |
NR | A substantial number of sentence- and word-level edits in inpatient progress notes can be automatically detected with a small false detection rate; promising results that warrant further exploration, including the procurement of additional training data. |
The study points out the need for systems to automatically flag likely errors in ASR output, which would help reduce the workload of manual editing and improve transcript quality. Using medical context to detect and correct errors more accurately, especially for disrupted speech and ambiguous terms, is recommended for enhancing transcription quality. |
|
Word-level performance: - Words - combined external |
F1 0.28 0.31 Precision 0.23 0.25 Recall 0.36 0.42 |
NR | ||||||
Zhou et al., 2018 [19] | NICTA Synthetic Nursing Handover Data in written and voice form. | Gold standard |
- Trans_BBN - Trans_I2B2 |
F1 0.416 0.392 Precision 0.498 0.481 Recall 0.419 0.390 |
NR | DL system using pretrained word representations as the input, and the proposed transfer learning technique, is able to achieve better performance. Transferring knowledge from general deep models to specific tasks in healthcare helps gain a significant improvement. |
Need for automated error detection and correction in transcription, as current systems often produce clinically significant errors that could impact patient care. The study also suggests incorporating multiple rounds of annotation and validation processes to improve transcription accuracy. Training on domain-specific terminologies and including more advanced error correction models may mitigate transcription inaccuracies, particularly for complex medical terms. |
|
Goss et al., 2019 [20] | NR | Survey-based self-assessment of SR accuracy, efficiency, and satisfaction | Efficiency and satisfaction | NR |
Percentage of user agreement 77.1 |
69% of respondents used SR for 75–100% of their patients. Higher satisfaction was linked to greater efficiency and fewer errors. Odds of satisfaction increased as user efficiency increased and as the number of errors and editing time decreased. Future studies should examine the influence of other factors (e.g., accent) on SR usability and accuracy. |
Some clinicians provided positive feedback, sharing that they could spend more time consulting and dealing with their patients’ issues rather than bearing the burden of documentation. However, some clinicians preferred human transcriptionists who could process what was spoken, providing a summarised transcript that did not require additional time to proofread for errors and misheard words. The study documented the use of SR in the context of creating notes in the electronic health record. However, newer systems can be directly utilised in the exam room or on a mobile device to capture the clinical encounter. |
|
Blackley et al., 2020 [21] | Annotated and analysed recordings that captured information about the documentation process, errors and corrections. | Blinded comparison |
- Uncorrected errors - Corrected errors - Documentation time - Quality score† |
NR |
Dictation mean 1.5 4.1 4 m 23s 7.7 Typing mean 2.9 33.9* 5 m 18s 6.6* |
No evidence was found that SR is significantly more efficient or accurate for creating clinical notes despite clinicians feeling that SR saves time, increases efficiency and accuracy. SR generated documents had errors that had to be manually corrected. |
Even with SR experience, participants encountered frequent SR errors, particularly with medical terms, names, and abbreviations. Dictation required extensive correction efforts, highlighting the need for improved error detection and correction mechanisms to alleviate the editing burden. This improvement would support clinicians who may be less adept at real-time editing while dictating. Although SR-produced notes were more comprehensive, they sometimes included redundant information, suggesting a need to optimize SR systems for conciseness without losing detail. |
|
Van Woensel et al., 2022 [22] | Written files of the dataset were used for comparison | Gold standard |
- Full length - Short, fixed-length - Short, var-length |
WER CMU Sphinx 0.7 (baseline), 0.41 (trained) 0.76 (baseline), 0.57 (trained) 0.53 (baseline), 0.38 (trained) Mozilla DeepSpeech 0.48 (baseline), 0.28 (trained) 0.71 (baseline), 0.43 (trained) 0.46 (baseline), 0.28 (trained) |
NR | Mozilla DeepSpeech outperforms CMU Sphinx in clinical transcription accuracy, with notable improvements in WER when using custom-trained language models. Short variable-length audio recordings, split on detected silences, also demonstrate transcription accuracy comparable to full-length recordings; DeepSpeech offers faster processing times, indicating its potential for real-time clinical applications, although concerns about generalizability and responsiveness remain. |
There is a need for custom-trained language models, especially for clinical environments. Custom LMs significantly improve accuracy over generic models, suggesting that transcription accuracy can be further enhanced by incorporating clinician-specific vocabulary and data. Continuous adaptation of the language model with newly transcribed data could address current limitations in handling various speech styles, accents, and clinical terminologies, especially in emergency settings where rapid and context-specific language use is prevalent. |
|
Balloch et al., 2024 [4] | Sheffield Assessment Instrument for Letters | Standard EHR documentation | Clinical documentation quality | NR | AI-produced documentation had higher SAIL scores | AI documentation improved quality, reduced consultation length by 26.3%, and decreased clinician task load. | Utilized a pipeline of AI models, including LLMs and speech-to-text transcription | |
Bundy et al., 2024 [23] | NR | Traditional manual documentation (dictation, typing) | NR | NR | NR |
Most physicians felt DAX Copilot reduced their workload, especially those who dictated notes after work. The AI tool allowed for more patient engagement during visits. Not all encounters were suitable for AI documentation: some physicians found it useful for complex visits, while others preferred it for routine ones. Errors in transcription and AI-generated content required physician review and edits. Some physicians worried that DAX Copilot’s efficiency would lead to higher patient loads.
Explored real-time AI-driven clinical documentation using GPT-4. Highlighted issues such as AI errors, over-documentation, and potential physician burnout risks. Findings contributed to Atrium Health’s decision to expand its use to 2,500 licenses across specialties. |
|
Cao et al., 2024 [24] | Time in notes per visit/week | Pre-DAX versus Post-DAX | Productivity metrics | NR | Time spent per day in EMRs decreased from 90.1 to 70.3 min | DAX users reported reduced documentation stress, improved accuracy and increased patient satisfaction. | AI scribe adapted to clinician habits over time, specialty-specific notes | |
Harbele et al., 2024 [25] | Provider engagement survey | Control versus AI-assisted | Provider engagement and documentation burden | NR | Positive trend in engagement but increased after-hours EHR time | AI documentation showed no significant benefit to patient experience or productivity but improved provider engagement. | Examined CPT code submission rates and documentation timeliness | |
Islam et al., 2024 [26] | Handwritten scribe | Non-blinded comparison | NR | NR |
Mean ratings for scribe (mean ± SD): 4.33 ± 0.022 Number of attempts for performing the assigned tasks for all patients (mean ± SD): 1.285 ± 0.034 Task Completion Time for performing the assigned tasks for all patients (mean minutes ± SD): 2.28 ± 0.51 Number of requests for help in performing the assigned tasks for all patients: 0.283 ± 0.024 Overall satisfaction (out of 5): 4.67 Easiness to use (out of 5): 4.22 Easiness to learn (out of 5): 4.27 Recommend to Others (out of 5): 4.75
Voice-driven intelligent system for generating medical scribes and prescriptions, demonstrating ease of use and proven viability; employed both extractive and abstractive summarization, selecting an LSTM model for higher accuracy over traditional NLP approaches. |
This study establishes a hybrid method that simultaneously generates and evaluates a digital scribe and electronic prescription for diabetes. The system developed in the study provides the option to review and edit the system-generated scribes and prescriptions, reducing the chance of error.
|
Liu et al., 2024 [27] | Clinician survey on EHR experience | Pre-intervention versus post-intervention | Time in documentation, frustration levels | NR | Decreased documentation time, frustration, and after-hours EHR use | Clinicians using AI documentation tools experienced lower frustration and less time spent on documentation. | Measured outcomes using AMA Organizational Biopsy EHR-specific survey | |
Misurac et al., 2024 [28] | NR | Pre and post-intervention |
Burnout score (Stanford Professional Fulfillment Index) Work exhaustion score Interpersonal disengagement score Professional fulfillment |
NR |
4.16 (pre), 3.16 (post)* 5.0 (pre), 4.2 (post) 3.6 (pre), 2.5 (post)* 6.1 (pre), 6.5 (post) |
Significant reduction in burnout as measured by the Stanford Professional Fulfillment Index (PFI); achieved by reducing the burden of documentation and cognitive load required of clinicians during a consultation |
Reported significant reduction in burnout among physicians, with burnout rates decreasing from 69% to 43%. The study achieved a 92% survey completion rate, indicating strong engagement and usability of ambient AI technology among participants, supporting its potential for broader implementation in healthcare settings.
|
Owens et al., 2024 [29] | NR | With versus without DAX | Patient satisfaction and engagement | NR | No significant difference in Patient-Doctor Relationship Questionnaire-9 (PDRQ-9) scores | Patients perceived less clinician distraction with AI scribing, but no improvement in perceived engagement | Used both open-label and masked study designs | |
Sezgin et al., 2024 [30] | Nurse summary notes from EMR | Zero-shot versus Fine-tuned models | Summarization accuracy and recall | ROUGE-1, ROUGE-2, ROUGE-L | BART-Large-CNN had highest performance: ROUGE-1 F1 = 0.49, ROUGE-2 F1 = 0.23, ROUGE-L F1 = 0.35 | Fine-tuned models significantly outperformed zero-shot models; BART-Large-CNN was best for ED consultations | Two-stage evaluation: ROUGE metric comparison and manual annotation for information recall accuracy | |
van Buchem et al., 2024 [31] | Highest scoring manual summary | Blinded comparison |
Manual Automatic summaries (edited by humans) Automatic summaries |
Recall-Oriented Understudy for Gisting Evaluation-1 F1 score in % (IQR): 47.3 (42.5–56.4) 40.6 (35.0-45.4) 32.3 (27.0-37.4) P value: <0.001 |
Modified Physician Documentation Quality Instrument (PDQI-9), overall (IQR): Manual: 31 (27–33)* AS edited: 29 (26–33)* AS: 25 (22–28)* P value: <0.001 |
The study explores the impact of a digital scribe system on the clinical documentation process, demonstrating that the system reduces summarization time while maintaining summary quality through collaborative editing; it highlights the potential of digital scribe systems to address the challenges of clinical documentation.
Collaboration between the system and students leads to the best results, with a decrease in time spent on summarising in combination with a similar quality when compared to manual summarisation. Automatic summaries had higher word count and lower lexical diversity. |
|
Biro et al., 2025 [32] | Expert-reviewed transcripts from real patient encounters | Comparison of ADS generated notes with expert-reviewed transcripts | Omission, addition, wrong output, misplaced/ irrelevant text |
WER (SD) 2.9 (2.7) |
Errors by ADS (mean) 44 5 5 9.5 |
Omission type errors, whereby the tool leaves out key information from its response, were the most common, which poses safety risks; clinicians may struggle to identify omission errors due to reliance on memory recall. Standardised evaluation frameworks and real-world testing required to mitigate AI-related safety concerns. | Study was independently conducted and focused on real-world usability and safety concerns rather than just technical accuracy. AI tools require continuous independent testing, as when tested with real-world scenarios, many errors arose that could compromise patient safety. Given that proprietary AI algorithms frequently evolve, ongoing safety assessments are essential. | |
Duggan et al., 2025 [33] | Time in notes per appointment | Pre-post intervention | Time efficiency and clinician burden | NR | 20.4% less time spent in notes per appointment | AI-assisted documentation reduced after-hours work by 30% and increased same-day appointment closure by 9.3%. | Mixed-method evaluation including survey feedback and usability scales | |
Ma et al., 2025 [34] | EHR usage time metrics | Baseline versus AI intervention | Time spent on notes, documentation burden | NR | Median daily documentation reduced by 6.89 min | AI scribing led to significant time savings but had variability across users. | Integrated AI-generated SmartSections in EHR workflows | |
Moryousef et al., 2025 [35] | Standardized consultation notes | Multiple AI scribes | Accuracy and usability | NR | Nabla had the highest accuracy (68%) | 75% of urologists found AI scribes useful; concerns about accuracy remain. | Assessed multiple AI scribe platforms for clinical documentation | |
Shah et al., 2025 [36] | NR | Pre-intervention versus post-intervention | Physician burnout and usability | NR | Burnout decreased significantly (-1.94 points) | AI documentation improved perceptions of efficiency and reduced task load. | Assessed usability using Stanford physician survey |
Abbreviations: ADS: Ambient Digital Scribe; CPT: Current Procedural Terminology; DAX: Dragon Ambient eXperience; EHR: electronic health record; EMR: electronic medical record; HPI: History of presenting illness; IQR: Interquartile range; KBM: keyboard and mouse; NR: not reported; LSTM: Long Short-Term Memory; SD: standard deviation; SR: speech recognition; VRS: voice recognition software; WER: word error rate
*indicates statistical significance
(†Adapted scoring system for HPI and Assessment & Plan (AP): Clarity (of HPI and AP), completeness (of HPI and AP), concision (of HPI and AP), sufficiency (of HPI and AP), prioritisation (of AP))
Study results
Table 1 provides an overview of each study’s design, setting, participant information, AI transcription tools and indication of vendor involvement (or not). Study designs ranged from RCTs [11, 16] to comparative or observational studies [9, 10, 12, 20, 21], with a growing number of recent publications employing qualitative or pre-post approaches to capture both performance metrics and user perspectives [23, 28, 33–36]. These studies spanned diverse environments including emergency departments (EDs), inpatient wards, specialty outpatient clinics (e.g., gastroenterology, urology, dermatology) and simulated clinical scenarios.
Key findings
Accuracy and error rates
While some systems demonstrate impressive precision and recall (Happe et al. achieved a precision of 0.73 and recall of 0.90 in a specialized vocabulary environment [10], and Suominen et al. reached F1 scores of up to 0.856 for nursing tasks [15]), other studies highlight notable shortcomings. For instance, Lybarger et al. reported a much lower maximum F1 score of 0.49 [18], and Zhou et al. found an F1 score of 0.416 in nursing contexts despite real-world training data [19]. Similarly, WER ranged from as low as 0.087 in controlled scenarios (Issenman et al. [12]) to more than 2.9 in real-time, multi-specialty outpatient encounters (Biro et al. [32]). van Buchem et al. demonstrated modest ROUGE F1 scores (0.32 unedited vs. 0.41 human-edited) for automated summaries [31]; the fact that human editing still improved these outputs underscores the potential but also the current limitations of LLM-driven summarization.
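To make these figures concrete, the sketch below shows how WER is typically computed by aligning an ASR hypothesis against a reference transcript using a word-level edit distance. It is an illustrative implementation rather than code used by any included study, and the example sentences are hypothetical.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # deleting all reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# Hypothetical example sentences (not drawn from any included study).
reference = "patient denies chest pain and shortness of breath"
hypothesis = "patient denies chest pain in shortness of breath"
print(f"WER = {wer(reference, hypothesis):.3f}")  # one substitution over eight words = 0.125
```

On this definition, the 0.087 reported by Issenman et al. [12] corresponds to roughly 9 errors per 100 dictated words, whereas WERs near or above 0.5, as observed in conversational multi-speaker settings [17], indicate that around half of the words are transcribed incorrectly.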
Workflow efficiency and time savings
Results on workflow efficiency were mixed. While Zick et al. [9] and Issenman et al. [12] observed decreased turnaround time (from days to hours or minutes), Hodgson et al. [16] and Blackley et al. [21] found that post-editing often negated potential time gains. More recent LLM-based systems (e.g., Bundy et al. [23], Ma et al. [34]) claim to shorten overall documentation time for certain specialties, but these claims were often based on small sample sizes or single-site studies, limiting generalizability.
Cost implications
Cost-effectiveness was inconclusive across the included studies. Early work in EDs (Zick et al. [9]) suggested significant cost savings with ASR, whereas Issenman et al. [12] found that voice recognition could be twice as expensive in pediatric gastroenterology. These differences highlight how cost can vary based on clinical setting, complexity of cases and existing staffing models.
Clinical documentation quality and patient care
Studies such as Almario et al. [13, 14] showed that AI-assisted documentation captured more clinically relevant red flags than physician-typed notes. Other research (e.g., Kodish-Wachs et al. [17], Sezgin et al. [30]) observed that high error rates or poor summarization fidelity could pose risks to patient safety, especially if omissions go unnoticed by overburdened clinicians. Notably, more recent digital scribes and LLM-based solutions are designed to create structured summaries (e.g., SOAP notes), although these still generally require human review for accuracy.
Clinician satisfaction, burnout and adoption
Goss et al. [20] identified higher satisfaction when clinicians encountered fewer transcription errors and minimal editing demands. Misurac et al. [28] and Shah et al. [36] further reported decreased burnout levels following the adoption of AI tools, indicating a potential benefit for clinician well-being. Despite improved satisfaction in some quarters, many clinicians expressed reluctance to rely fully on AI scribing, citing concerns about real-time error correction and the need for manual review (e.g., Bundy et al. [23], Moryousef et al. [35]).
Risk of bias and study quality
Most of the included studies were assessed as having a low risk of bias and low applicability concerns, as shown in Table 3 and illustrated in Figs. 2 and 3. For studies judged to have a moderate or high risk of bias or applicability concerns, the greatest contributing factor was patient selection, followed by the index test. This was possibly due to unclear patient selection criteria in some studies and controlled test environments in others, which may bias the index tests. The predominantly low risk of bias and applicability concerns across the QUADAS-2 assessment supports the reliability of the findings on AI-based transcription tools in medical domains (Figs. 2 and 3). However, some studies, such as Zick et al. [9], Blackley et al. [21], Hodgson et al. [16], and Kodish-Wachs et al. [17], had a high risk of bias in patient selection and applicability concerns, which may limit generalizability. Unclear reference standards in studies such as Bundy et al. [23] and Goss et al. [20] suggest potential gaps in validation. While the overall low risk in flow and timing strengthens confidence in the results, variability in methodological rigor underscores the need for standardised evaluation in future studies to ensure consistent and reliable conclusions.
Table 3.
Detailed risk of bias assessment for the various studies using QUADAS-2 tool
Study | Risk of bias | Applicability concerns | |||||
---|---|---|---|---|---|---|---|
Patient Selection | Index Test | Reference Standard | Flow and Timing | Patient Selection | Index Test | Reference Standard | |
Almario et al., 2015 [13] | LOW | LOW | LOW | LOW | LOW | LOW | LOW |
Almario et al., 2015 [14] | LOW | LOW | LOW | LOW | LOW | LOW | LOW |
Biro et al., 2025 [32] | LOW | LOW | LOW | LOW | LOW | LOW | LOW |
Blackley et al., 2020 [21] | HIGH | LOW | LOW | LOW | HIGH | LOW | LOW |
Balloch et al., 2024 [4] | LOW | LOW | LOW | LOW | LOW | LOW | LOW |
Bundy et al., 2024 [23] | LOW | LOW | UNCLEAR | UNCLEAR | LOW | LOW | UNCLEAR |
Cao et al., 2024 [24] | LOW | LOW | LOW | LOW | LOW | LOW | LOW |
Duggan et al., 2025 [33] | LOW | LOW | LOW | LOW | LOW | LOW | LOW |
Goss et al., 2019 [20] | LOW | UNCLEAR | UNCLEAR | UNCLEAR | LOW | LOW | UNCLEAR |
Haberle et al., 2024 [25] | LOW | LOW | LOW | LOW | LOW | LOW | LOW |
Happe et al., 2003 [10] | UNCLEAR | UNCLEAR | LOW | HIGH | LOW | LOW | LOW |
Hodgson et al., 2017 [16] | HIGH | HIGH | LOW | LOW | HIGH | HIGH | LOW |
Islam et al., 2024 [26] | LOW | LOW | LOW | LOW | LOW | LOW | LOW |
Issenman et al., 2004 [12] | HIGH | LOW | LOW | LOW | UNCLEAR | LOW | LOW |
Kodish-Wachs et al., 2018 [17] | HIGH | HIGH | LOW | LOW | HIGH | HIGH | LOW |
Liu et al., 2024 [27] | LOW | LOW | LOW | LOW | LOW | LOW | LOW |
Lybarger et al., 2018 [18] | UNCLEAR | LOW | LOW | LOW | LOW | LOW | LOW |
Ma et al., 2025 [34] | LOW | LOW | LOW | LOW | LOW | LOW | LOW |
Misurac et al., 2024 [28] | LOW | LOW | LOW | LOW | LOW | LOW | LOW |
Mohr et al., 2003 [11] | LOW | LOW | LOW | UNCLEAR | HIGH | LOW | LOW |
Moryousef et al., 2025 [35] | LOW | LOW | LOW | LOW | LOW | LOW | LOW |
Owens et al., 2024 [29] | LOW | LOW | LOW | LOW | LOW | LOW | LOW |
Sezgin et al., 2024 [30] | LOW | LOW | LOW | LOW | LOW | LOW | LOW |
Shah et al., 2025 [36] | LOW | LOW | LOW | LOW | LOW | LOW | LOW |
Suominen et al., 2015 [15] | LOW | LOW | LOW | LOW | LOW | LOW | LOW |
van Buchem et al., 2024 [31] | LOW | LOW | LOW | UNCLEAR | LOW | LOW | LOW |
Van Woensel et al., 2022 [22] | UNCLEAR | LOW | LOW | LOW | LOW | LOW | LOW |
Zhou et al., 2018 [19] | LOW | LOW | LOW | LOW | LOW | LOW | LOW |
Zick et al., 2001 [9] | HIGH | HIGH | HIGH | LOW | HIGH | LOW | LOW |
Fig. 2.
Stacked bar chart displaying risk of bias of the studies reviewed using QUADAS-2 tool
Fig. 3.
Stacked bar chart for applicability concerns based on QUADAS-2 tool
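For readers who wish to reproduce summaries of this kind from the judgments in Table 3, the following is a minimal sketch assuming pandas and matplotlib; it is not the authors’ original plotting code and uses only an illustrative subset of Table 3.

```python
import pandas as pd
import matplotlib.pyplot as plt

# QUADAS-2 risk-of-bias judgments per study and domain (subset of Table 3 for illustration).
judgments = pd.DataFrame(
    {
        "Patient Selection":  ["LOW", "HIGH", "UNCLEAR", "LOW"],
        "Index Test":         ["LOW", "HIGH", "UNCLEAR", "UNCLEAR"],
        "Reference Standard": ["LOW", "LOW",  "LOW",     "UNCLEAR"],
        "Flow and Timing":    ["LOW", "LOW",  "HIGH",    "UNCLEAR"],
    },
    index=["Almario 2015 [13]", "Hodgson 2017 [16]", "Happe 2003 [10]", "Goss 2019 [20]"],
)

# Tally each judgment within each domain and convert to percentages of studies.
counts = judgments.apply(pd.Series.value_counts).fillna(0)
percent = counts / counts.sum() * 100

fig, ax = plt.subplots(figsize=(8, 3))
left = pd.Series(0.0, index=percent.columns)
for rating, colour in [("LOW", "tab:green"), ("UNCLEAR", "tab:orange"), ("HIGH", "tab:red")]:
    values = percent.loc[rating] if rating in percent.index else pd.Series(0.0, index=percent.columns)
    ax.barh(percent.columns, values, left=left, color=colour, label=rating)
    left = left + values
ax.set_xlabel("Proportion of studies (%)")
ax.legend(loc="lower right")
plt.tight_layout()
plt.show()
```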
Discussion
Our review identified 29 studies that investigated the applications of ASR and NLP in medical transcription across a variety of clinical settings. The included studies spanned environments such as EDs, inpatient wards, specialized clinics (e.g., gastroenterology, psychiatry, and endocrinology) and even simulated scenarios replicating ambulatory primary care workflows. Owing to this diversity and the significant heterogeneity in study designs, sample sizes and performance metrics, direct comparisons were challenging. Nonetheless, the findings underscore the wide-ranging potential of AI-based transcription technology in healthcare and highlight certain common challenges to overcome in order to advance the field.
A broad array of AI models and software systems emerged from the review, from older ASR-based tools, such as Dragon NaturallySpeaking Medical Suite and Dragon Medical One [9, 12, 15, 16, 20], to more advanced products that incorporate LLMs, including DAX Copilot and GPT-4–driven systems [31, 33, 34]. The newer studies tend to describe ambient AI scribes, which not only transcribe but also summarize and repurpose clinical notes. Such systems may overcome some limitations of standalone SR tools. While these technologies differ in their underlying architectures, their shared aim is to automate or accelerate clinical documentation by converting speech into text, extracting medically relevant content, and in some cases summarizing or repurposing notes for different uses. Such variety in technological approaches was mirrored by the diversity of clinical settings in which these tools were tested. Some studies relied on synthetic data sets, such as nursing handover records, while others evaluated real-world interactions in high-pressure environments like the ED. This variety demonstrates the adaptability of AI transcription solutions but also reveals that performance is heavily context dependent. Tools that excel in structured, repetitive ED workflows may struggle with varied discussions in multi-specialty clinics or with more complex, freeform patient-doctor dialogues. Likewise, whether a study was conducted using real or simulated encounters also influenced performance, as differences in setting and complexity affect metrics such as WER or F1 scores.
In general, the accuracy of AI-driven transcription remains mixed. WER varied from as low as 0.087 in highly controlled settings (Issenman et al. [12]) to over 50% in conversational or multi-speaker encounters (Kodish-Wachs et al. [17]). Some tools achieved more favorable precision/recall in domain-specific contexts, particularly when leveraging specialty vocabularies (e.g., Happe et al. [10], Suominen et al. [15]). However, others (e.g., Lybarger et al. [18]) highlighted persistent transcription errors that require substantial manual correction. It is, however, also important to interpret performance estimates from older systems cautiously, as rapid technological advances—particularly in neural network and transformer-based models—have likely rendered those results outdated or less generalizable to current AI transcription capabilities.
Besides accuracy, studies conveyed mixed evidence concerning time efficiency. Zick et al. and Issenman et al. both reported substantial reductions in documentation turnaround times [9, 12], whereas newer research from Blackley et al. and Hodgson et al. found negligible or even negative impacts once clinicians’ editing tasks were factored in [16, 21]. Similarly, cost analyses yielded no consensus. Zick et al. posited that voice recognition could be up to 100 times less expensive than manual transcription [9], but Issenman et al. found it to be more costly in a pediatric gastroenterology context [12]. As these examples illustrate, site-specific factors (such as the prevalence of templated text, local staff costs and volume of standard phrases) likely determine the effectiveness of AI transcription.
Although AI transcription systems do not directly deliver patient care, they can indirectly influence clinical outcomes by improving documentation completeness and quality. Almario et al. reported a higher identification of red-flag symptoms in AI-drafted notes [13, 14], while others showed that accurate automated transcription may reduce cognitive load on clinicians. Nevertheless, any potential gains are offset by persistent concerns about error rates. High WER or omissions, as highlighted by Kodish-Wachs et al. [17], remain a threat to real-time decision-making, and this problem has not disappeared with the advent of LLM-based scribes, as seen in Bundy et al., van Buchem et al. and Biro et al. [23, 31, 32]. The subsequent post-editing burden also continues to challenge clinicians’ time management, particularly in busy and dynamic outpatient settings with a variety of patient presentations. Moreover, the efficiency gains from AI transcription are not guaranteed and initial investment costs can be prohibitive [37]. Clinicians’ opinion, acceptance and burnout also surfaced as important considerations for AI adoption. Surveys by Goss et al. [20] and interventions assessed by Misurac et al. [28] revealed that while some clinicians appreciate the potential reduction in documentation burdens, many remain cautious, dissatisfied with high error rates, or concerned about the reliability of AI-generated transcripts.
Several recurring challenges in this space require further attention. First, transcription accuracy often degrades with longer or more complex audio, which suggests the need for incremental or real-time correction features. Second, accented or non-native speech frequently leads to transcription errors, highlighting the need for accent adaptation or multi-accent training modules [11, 20, 22, 38]. While systems such as AEGIS and NOMINDEX demonstrated high accuracy in specific clinical environments [39], their performance may not generalize well across diverse settings, particularly where speech patterns differ; this is especially true in multinational healthcare systems or regions with a high proportion of non-native English speakers. Third, the training of specialty-specific AI models can be hampered by privacy concerns, as clinicians and institutions may be reluctant to share sensitive patient transcripts for model fine-tuning. Fourth, only a minority of tools currently offer robust real-time error correction, meaning that any short-term gains in typing speed may be negated by lengthy revision processes. Beyond technical refinement, the slow pace of AI transcription adoption in healthcare may reflect deeper structural barriers, including regulatory scrutiny over patient safety, a fragmented EHR environment that impedes easy integration, and unclear financial incentives. Moreover, as some studies have indicated (e.g., Issenman et al. [12]), frustrated or unreceptive physicians may be unwilling to adopt new documentation technologies, especially if these tools require significant training or produce large volumes of errors. Further progress will depend on resolving issues of accuracy, accent variability, system interoperability and cost. Future research should also incorporate advanced evaluation metrics such as ROUGE (Recall-Oriented Understudy for Gisting Evaluation) and BLEU (Bilingual Evaluation Understudy) [40] to systematically assess the quality of AI-generated summaries beyond simple WER.
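To make the contrast with WER concrete, the sketch below implements deliberately simplified, unigram-only versions of ROUGE (recall of reference words) and BLEU (clipped precision with a brevity penalty). Published evaluations use the full multi-n-gram formulations, typically via established packages, and the example note and summary here are invented for illustration.

```python
# Simplified, unigram-only illustrations of ROUGE and BLEU for summary quality;
# not the full algorithms and not tied to any tool evaluated in this review.
import math
from collections import Counter


def rouge1_recall(reference: str, summary: str) -> float:
    """Fraction of reference words that also appear in the summary (counts clipped)."""
    ref_counts = Counter(reference.lower().split())
    sum_counts = Counter(summary.lower().split())
    overlap = sum(min(count, sum_counts[word]) for word, count in ref_counts.items())
    return overlap / max(sum(ref_counts.values()), 1)


def bleu1(reference: str, candidate: str) -> float:
    """Clipped unigram precision multiplied by BLEU's brevity penalty."""
    ref_tokens, cand_tokens = reference.lower().split(), candidate.lower().split()
    ref_counts, cand_counts = Counter(ref_tokens), Counter(cand_tokens)
    clipped = sum(min(count, ref_counts[word]) for word, count in cand_counts.items())
    precision = clipped / max(len(cand_tokens), 1)
    brevity = 1.0 if len(cand_tokens) >= len(ref_tokens) else math.exp(1 - len(ref_tokens) / len(cand_tokens))
    return brevity * precision


if __name__ == "__main__":
    # Hypothetical clinician-written note and AI-generated summary
    clinician_note = "asthma exacerbation started salbutamol and oral prednisolone follow up in one week"
    ai_summary = "asthma exacerbation treated with salbutamol and prednisolone review in one week"
    print(f"ROUGE-1 recall: {rouge1_recall(clinician_note, ai_summary):.3f}")  # 8 of 12 reference words recalled -> 0.667
    print(f"BLEU-1:         {bleu1(clinician_note, ai_summary):.3f}")          # ~0.664 after brevity penalty
```

Unlike WER, which penalizes every deviation from a verbatim reference, these overlap-based scores tolerate paraphrasing, which is why they are better suited to judging the content of AI-generated summaries.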
Ambient scribe workflows already in limited clinical use in the UK since early 2025 (e.g., Tortus, Heidi) [41] may represent the logical next step for AI documentation: integrating LLMs to summarize and repurpose transcripts without requiring pristine input accuracy. Transcription, in this sense, is merely a transitional stage toward more comprehensive AI scribe solutions that ultimately address patient-clinician interactions holistically.
Limitations of the review
This review is not without limitations. Firstly, it searched three major databases (MEDLINE, Embase and the Cochrane Library) as well as grey literature but did not consult IEEE Xplore, a key database for engineering and technology-related research, including AI and machine learning. This exclusion may have limited the review’s ability to capture relevant studies on AI transcription systems, particularly those focused on technical innovations in SR and NLP (although such studies may not yet have been applied in healthcare). Secondly, most of the included studies were short-term evaluations or proof-of-concept studies conducted in controlled environments or with small sample sizes. There is a lack of long-term, real-world data on the sustained use of AI transcription tools in clinical practice, so this review cannot fully assess the long-term impact of AI transcription on clinician efficiency, patient care outcomes or system-wide healthcare improvements. Thirdly, given the narrative synthesis approach, this review could not draw strong, statistically powered conclusions about the overall effectiveness of AI transcription tools. Lastly, the review focused primarily on outcomes such as accuracy, time savings and clinician satisfaction, without addressing other potentially important dimensions such as cost-effectiveness, user training requirements or implementation barriers. These additional factors could significantly affect the adoption and success of AI transcription tools in clinical practice, but they were not consistently reported in the studies reviewed.
Conclusions
This systematic review found that AI SR and transcription software has the potential to improve clinical documentation, enhance workflow efficiency and reduce the documentation burden on clinicians. Tools designed for specific medical domains can achieve high levels of accuracy, as evidenced by systems such as AEGIS and NOMINDEX, which outperformed manual documentation. However, there was significant variability in the performance of AI SR and transcription tools across software platforms and clinical environments, with general-purpose SR systems often producing high error rates and requiring time-consuming manual corrections. This variability highlights that AI transcription software is still in a developmental phase, with considerable room for refinement, particularly in adapting systems to accented speech and complex medical language and in improving real-time error correction, before widespread adoption can be achieved. Future work should also expand the scope beyond transcription alone, exploring end-to-end AI scribe capabilities and evaluating their real-world effectiveness.
Acknowledgements
We thank Dr Clyve Yu Leon Yaow and Mr Ansel Shao Pin Tang for helping to develop and refine the search strategies for this review.
Author contributions
All authors have made substantial contributions to all the following: (1) the conception and design of the study, or acquisition of data, or analysis and interpretation of data; (2) drafting the article or revising it critically for important intellectual content; (3) final approval of the version to be submitted. No writing assistance was obtained in the preparation of the manuscript. The manuscript, including related data, figures and tables, has not been previously published, and the manuscript is not under consideration elsewhere. Conceptualization, Design and Methodology: K.X.Z., Q.X.N.; Data Curation: K.X.Z., C.X.L.G., G.Z.N.S., S.S.N.G., Q.X.N., J.J.W.N., E.W., X.Z.; Formal Analysis: K.X.Z., C.X.L.G., G.Z.N.S., Q.X.N., J.J.W.N., E.W., X.Z., S.S.N.G., H.K.T.; Investigation: C.X.L.G., G.Z.N.S., K.X.Z., Q.X.N., J.J.W.N., E.W., X.Z., H.K.T., S.S.N.G.; Supervision: H.K.T., S.S.N.G., Q.X.N.; Writing – original draft: Q.X.N., J.J.W.N., E.W., X.Z.; Writing – review & editing: K.X.Z., H.K.T., S.S.N.G., Q.X.N., J.J.W.N., E.W., X.Z.
Data availability
No datasets were generated or analysed during the current study.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Joel Jia Wei Ng, Eugene Wang, Xinyan Zhou and Kevin Xiang Zhou contributed equally and should be considered as co-first authors.
References
- 1. Kuhn T, Basch P, Barr M, Yackel T, Medical Informatics Committee of the American College of Physicians. Clinical documentation in the 21st century: executive summary of a policy position paper from the American College of Physicians. Ann Intern Med. 2015;162(4):301–3. 10.7326/M14-2128.
- 2. Moy AJ, Schwartz JM, Chen R, Sadri S, Lucas E, Cato KD, Rossetti SC. Measurement of clinical documentation burden among physicians and nurses using electronic health records: a scoping review. J Am Med Inf Assoc. 2021;28(5):998–1008. 10.1093/jamia/ocaa325.
- 3. Gesner E, Dykes PC, Zhang L, Gazarian P. Documentation burden in nursing and its role in clinician burnout syndrome. Appl Clin Inf. 2022;13(5):983–90. 10.1055/s-0042-1757157.
- 4. Balloch J, Sridharan S, Oldham G, Wray J, Gough P, Robinson R, Sebire NJ, Khalil S, Asgari E, Tan C, Taylor A, Pimenta D. Use of an ambient artificial intelligence tool to improve quality of clinical documentation. Future Healthc J. 2024;11(3):100157. 10.1016/j.fhj.2024.100157.
- 5. Perkins SW, Muste JC, Alam T, Singh RP. Improving clinical documentation with artificial intelligence: a systematic review. Perspect Health Inform Manage. 2024;21(2):1–25.
- 6. Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, Shamseer L, Tetzlaff JM, Akl EA, Brennan SE, Chou R, Glanville J, Grimshaw JM, Hróbjartsson A, Lalu MM, Li T, Loder EW, Mayo-Wilson E, McDonald S, McGuinness LA, Stewart LA, Thomas J, Tricco AC, Welch VA, Whiting P, Moher D. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021;372:n71. 10.1136/bmj.n71.
- 7. Whiting PF, Rutjes AW, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, Leeflang MM, Sterne JA, Bossuyt PM, QUADAS-2 Group. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med. 2011;155(8):529–36. 10.7326/0003-4819-155-8-201110180-00009.
- 8. Popay J, Roberts H, Sowden A, Petticrew M, Arai L, Rodgers M, Britten N, Roen K, Duffy S. Guidance on the conduct of narrative synthesis in systematic reviews: a product from the ESRC Methods Programme. Version 1. 2006;1(1):b92.
- 9. Zick RG, Olsen J. Voice recognition software versus a traditional transcription service for physician charting in the ED. Am J Emerg Med. 2001;19(4):295–8. 10.1053/ajem.2001.24487.
- 10. Happe A, Pouliquen B, Burgun A, Cuggia M, Le Beux P. Automatic concept extraction from spoken medical reports. Int J Med Inf. 2003;70(2–3):255–63. 10.1016/s1386-5056(03)00055-8.
- 11. Mohr DN, Turner DW, Pond GR, Kamath JS, De Vos CB, Carpenter PC. Speech recognition as a transcription aid: a randomized comparison with standard transcription. J Am Med Inf Assoc. 2003;10(1):85–93. 10.1197/jamia.m1130.
- 12. Issenman RM, Jaffer IH. Use of voice recognition software in an outpatient pediatric specialty practice. Pediatrics. 2004;114(3):e290–3. 10.1542/peds.2003-0724-L.
- 13. Almario CV, Chey W, Kaung A, Whitman C, Fuller G, Reid M, Nguyen K, Bolus R, Dennis B, Encarnacion R, Martinez B, Talley J, Modi R, Agarwal N, Lee A, Kubomoto S, Sharma G, Bolus S, Chang L, Spiegel BM. Computer-generated vs. physician-documented history of present illness (HPI): results of a blinded comparison. Am J Gastroenterol. 2015;110(1):170–9. 10.1038/ajg.2014.356.
- 14. Almario CV, Chey WD, Iriana S, Dailey F, Robbins K, Patel AV, Reid M, Whitman C, Fuller G, Bolus R, Dennis B, Encarnacion R, Martinez B, Soares J, Modi R, Agarwal N, Lee A, Kubomoto S, Sharma G, Bolus S, Chang L, Spiegel BM. Computer versus physician identification of gastrointestinal alarm features. Int J Med Inf. 2015;84(12):1111–7. 10.1016/j.ijmedinf.2015.07.006.
- 15. Suominen H, Johnson M, Zhou L, Sanchez P, Sirel R, Basilakis J, Hanlen L, Estival D, Dawson L, Kelly B. Capturing patient information at nursing shift changes: methodological evaluation of speech recognition and information extraction. J Am Med Inf Assoc. 2015;22(e1):e48–66. 10.1136/amiajnl-2014-002868.
- 16. Hodgson T, Magrabi F, Coiera E. Efficiency and safety of speech recognition for documentation in the electronic health record. J Am Med Inf Assoc. 2017;24(6):1127–33. 10.1093/jamia/ocx073.
- 17. Kodish-Wachs J, Agassi E, Kenny P 3rd, Overhage JM. A systematic comparison of contemporary automatic speech recognition engines for conversational clinical speech. AMIA Annu Symp Proc. 2018;2018:683–9.
- 18. Lybarger K, Ostendorf M, Yetisgen M. Automatically detecting likely edits in clinical notes created using automatic speech recognition. AMIA Annu Symp Proc. 2018;2017:1186–95.
- 19. Zhou L, Blackley SV, Kowalski L, Doan R, Acker WW, Landman AB, Kontrient E, Mack D, Meteer M, Bates DW, Goss FR. Analysis of errors in dictated clinical documents assisted by speech recognition software and professional transcriptionists. JAMA Netw Open. 2018;1(3):e180530. 10.1001/jamanetworkopen.2018.0530.
- 20. Goss FR, Blackley SV, Ortega CA, Kowalski LT, Landman AB, Lin CT, Meteer M, Bakes S, Gradwohl SC, Bates DW, Zhou L. A clinician survey of using speech recognition for clinical documentation in the electronic health record. Int J Med Inf. 2019;130:103938. 10.1016/j.ijmedinf.2019.07.017.
- 21. Blackley SV, Schubert VD, Goss FR, Al Assad W, Garabedian PM, Zhou L. Physician use of speech recognition versus typing in clinical documentation: a controlled observational study. Int J Med Inf. 2020;141:104178. 10.1016/j.ijmedinf.2020.104178.
- 22. Van Woensel W, Taylor B, Abidi SSR. Towards an adaptive clinical transcription system for in-situ transcribing of patient encounter information. Stud Health Technol Inf. 2022;290:158–62. 10.3233/SHTI220052.
- 23. Bundy H, Gerhart J, Baek S, Connor CD, Isreal M, Dharod A, Stephens C, Liu TL, Hetherington T, Cleveland J. Can the administrative loads of physicians be alleviated by AI-facilitated clinical documentation? J Gen Intern Med. 2024;39(15):2995–3000. 10.1007/s11606-024-08870-z.
- 24. Cao DY, Silkey JR, Decker MC, Wanat KA. Artificial intelligence-driven digital scribes in clinical documentation: pilot study assessing the impact on dermatologist workflow and patient encounters. JAAD Int. 2024;15:149–51. 10.1016/j.jdin.2024.02.009.
- 25. Haberle T, Cleveland C, Snow GL, Barber C, Stookey N, Thornock C, Younger L, Mullahkhel B, Ize-Ludlow D. The impact of Nuance DAX ambient listening AI documentation: a cohort study. J Am Med Inf Assoc. 2024;31(4):975–9. 10.1093/jamia/ocae022.
- 26. Islam MN, Mim ST, Tasfia T, Hossain MM. Enhancing patient treatment through automation: the development of an efficient scribe and prescribe system. Inf Med Unlocked. 2024;45:101456.
- 27. Liu TL, Hetherington TC, Stephens C, McWilliams A, Dharod A, Carroll T, Cleveland JA. AI-powered clinical documentation and clinicians’ electronic health record experience: a nonrandomized clinical trial. JAMA Netw Open. 2024;7(9):e2432460. 10.1001/jamanetworkopen.2024.32460.
- 28. Misurac J, Knake LA, Blum JM. The effect of ambient artificial intelligence notes on provider burnout. Appl Clin Inf. 2024. 10.1055/a-2461-4576.
- 29. Owens LM, Wilda JJ, Grifka R, Westendorp J, Fletcher JJ. Effect of ambient voice technology, natural language processing, and artificial intelligence on the patient-physician relationship. Appl Clin Inf. 2024;15(4):660–7. 10.1055/a-2337-4739.
- 30. Sezgin E, Sirrianni JW, Kranz K. Evaluation of a digital scribe: conversation summarization for emergency department consultation calls. Appl Clin Inf. 2024;15(3):600–11. 10.1055/a-2327-4121.
- 31. van Buchem MM, Kant IMJ, King L, Kazmaier J, Steyerberg EW, Bauer MP. Impact of a digital scribe system on clinical documentation time and quality: usability study. JMIR AI. 2024;3:e60020. 10.2196/60020.
- 32. Biro J, Handley JL, Cobb NK, Kottamasu V, Collins J, Krevat S, Ratwani RM. Accuracy and safety of AI-enabled scribe technology: instrument validation study. J Med Internet Res. 2025;27:e64993. 10.2196/64993.
- 33. Duggan MJ, Gervase J, Schoenbaum A, Hanson W, Howell JT 3rd, Sheinberg M, Johnson KB. Clinician experiences with ambient scribe technology to assist with documentation burden and efficiency. JAMA Netw Open. 2025;8(2):e2460637. 10.1001/jamanetworkopen.2024.60637.
- 34. Ma SP, Liang AS, Shah SJ, Smith M, Jeong Y, Devon-Sand A, Crowell T, Delahaie C, Hsia C, Lin S, Shanafelt T, Pfeffer MA, Sharp C, Garcia P. Ambient artificial intelligence scribes: utilization and impact on documentation time. J Am Med Inf Assoc. 2025;32(2):381–5. 10.1093/jamia/ocae304.
- 35. Moryousef J, Nadesan P, Uy M, Matti D, Guo Y. Assessing the efficacy and clinical utility of artificial intelligence scribes in urology. Urology. 2025;196:12–7. 10.1016/j.urology.2024.11.061.
- 36. Shah SJ, Devon-Sand A, Ma SP, Jeong Y, Crowell T, Smith M, Liang AS, Delahaie C, Hsia C, Shanafelt T, Pfeffer MA, Sharp C, Lin S, Garcia P. Ambient artificial intelligence scribes: physician burnout and perspectives on usability and documentation burden. J Am Med Inf Assoc. 2025;32(2):375–80. 10.1093/jamia/ocae295.
- 37. Joseph J, Moore ZEH, Patton D, O’Connor T, Nugent LE. The impact of implementing speech recognition technology on the accuracy and efficiency (time to complete) clinical documentation by nurses: a systematic review. J Clin Nurs. 2020;29(13–14):2125–37. 10.1111/jocn.15261.
- 38. Koenecke A, Nam A, Lake E, Nudell J, Quartey M, Mengesha Z, Toups C, Rickford JR, Jurafsky D, Goel S. Racial disparities in automated speech recognition. Proc Natl Acad Sci U S A. 2020;117(14):7684–9. 10.1073/pnas.1915768117.
- 39. Blackley SV, Huynh J, Wang L, Korach Z, Zhou L. Speech recognition for clinical documentation from 1990 to 2018: a systematic review. J Am Med Inf Assoc. 2019;26(4):324–38. 10.1093/jamia/ocy179.
- 40. Gardner N, Khan H, Hung C-C. Definition modeling: literature review and dataset analysis. Appl Comput Intell. 2022;2:83–98. 10.3934/aci.2022005.
- 41. Lawton J. NHS AI trial hailed as ‘remarkable’ and most ‘transformative’ tech in 15 years [Internet]. 2025 [cited 2025 Mar 10]. Available from: https://www.dailystar.co.uk/news/latest-news/nhs-ai-trial-hailed-remarkable-34423254