
Benchmarks for AssemblyAI’s Speech-to-text Models

Benchmarks are an important first step in any Speech-to-text evaluation. Below we cover the current benchmarks for our models so you can assess whether you should run your own evaluation.

Pre-recorded Speech-to-text English Benchmarks

Our most up-to-date English benchmarks are included below. The most recent update was October 2025.

Dataset                   WER (%)                      Hallucination Rate (%)
Overall Performance       Mean: 6.2% | Median: 6.5%    0.58%
commonvoice               6.51%                        -
earnings21                9.44%                        -
librispeech_test_clean    1.88%                        -
librispeech_test_other    3.10%                        -
meanwhile                 4.48%                        -
tedlium                   7.28%                        -
rev16                     10.42%                       -
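
Word error rate (WER) is the number of substitutions, deletions, and insertions needed to turn the model's transcript into the reference, divided by the number of words in the reference. If you want to reproduce a number like the ones above on your own audio, the sketch below uses the open-source jiwer package to score a hypothesis against a reference; jiwer and the toy transcripts are assumptions for illustration, not part of our benchmark tooling.

```python
# A minimal sketch for scoring transcripts on your own data.
# Assumes the open-source `jiwer` package (pip install jiwer); it is not
# an AssemblyAI tool. The transcripts below are toy placeholders.
import jiwer

reference = "the quick brown fox jumps over the lazy dog"
hypothesis = "the quick brown fox jumped over a lazy dog"

# Apply the same light normalization to both sides so casing and
# punctuation differences don't inflate the error rate.
def normalize(text: str) -> str:
    return " ".join(text.lower().replace(",", "").replace(".", "").split())

score = jiwer.wer(normalize(reference), normalize(hypothesis))
print(f"WER: {score:.2%}")  # 2 substitutions over 9 reference words ≈ 22.22%
```

How you normalize text (casing, punctuation, number formatting) can move WER by several points, so make sure every model you compare is scored with the same normalization.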
Multilingual Benchmarks

Our most up-to-date multilingual benchmarks are included below. The most recent update was June 2025.

This benchmark uses the FLEURS dataset, a commonly used multilingual audio dataset.

Language Code    Language         WER (%)
Average          All Languages    6.76%
de               German           4.99%
en               English          4.38%
es               Spanish          2.95%
fi               Finnish          10.10%
fr               French           7.71%
hi               Hindi            7.38%
it               Italian          3.29%
ja               Japanese         7.79%
ko               Korean           14.54%
nl               Dutch            7.79%
pl               Polish           6.63%
pt               Portuguese       4.80%
ru               Russian          5.80%
tr               Turkish          8.12%
uk               Ukrainian        7.42%
vi               Vietnamese       9.75%
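
If you want to spot-check these numbers for a language you care about, FLEURS is publicly available on the Hugging Face Hub. The sketch below is an assumption-laden starting point: it relies on the `datasets` package, the `google/fleurs` dataset id, and per-language config names such as "fr_fr"; check the dataset card before depending on any of them.

```python
# A minimal sketch for pulling FLEURS test audio for one language.
# Assumes the Hugging Face `datasets` package with its audio extra
# (pip install "datasets[audio]") and the `google/fleurs` dataset id with
# per-language configs like "fr_fr"; verify names on the dataset card.
from datasets import load_dataset

fleurs_fr = load_dataset("google/fleurs", "fr_fr", split="test", streaming=True)

for example in fleurs_fr.take(3):
    audio = example["audio"]              # dict with "array" and "sampling_rate"
    reference = example["transcription"]  # reference transcript for WER scoring
    print(audio["sampling_rate"], reference)
```

From there you can transcribe each clip with the model under test and score the results with the same WER procedure shown earlier.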
Common Benchmark Challenges with Pre-recorded Audio

Benchmark Gaming / Overfitting

Models are often trained on publicly available datasets—sometimes the very same datasets used for evaluation.

When this happens, the model becomes overfit to the evaluation set and will show artificially strong performance on standard WER tests. This makes WER potentially misleading, as real-world performance on unseen audio will be significantly worse than performance on audio the model encountered during training.

Many models are now trained on the same datasets used for popular benchmarks, allowing developers to inflate their reported performance through overfitting. This is why we strongly recommend running evaluations on your own datasets to identify the best model for your specific use case.
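
As a concrete starting point, the sketch below transcribes a local file with the AssemblyAI Python SDK and scores the result against your own reference transcript using the same jiwer approach as above; the API key placeholder, file path, and reference text are illustrative and need to be replaced with your own.

```python
# A minimal sketch of a do-it-yourself evaluation on your own audio.
# Assumes the AssemblyAI Python SDK (pip install assemblyai) and the
# `jiwer` package; the path and reference below are placeholders.
import assemblyai as aai
import jiwer

aai.settings.api_key = "YOUR_API_KEY"

# Transcribe one of your own files and compare against a trusted reference.
transcript = aai.Transcriber().transcribe("path/to/your_audio.mp3")
hypothesis = transcript.text

reference = "your human-verified transcript for the same file"
print(f"WER: {jiwer.wer(reference.lower(), hypothesis.lower()):.2%}")
```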

External Benchmarks

If you wish to check out third-party benchmarks for pre-recorded audio, we’d recommend the Hugging Face ASR Leaderboard.

Keep in mind that many models listed on leaderboards are transcription-only and may require self-hosting and/or additional features such as speaker diarization and automatic language detection to perform like AssemblyAI in production.

Streaming Speech-to-text English Benchmarks

Our most up-to-date English streaming benchmarks are included below. The most recent update was October 2025.

Dataset                   WER (%)                      Emission Latency (ms)
Overall Performance       Mean: 8.5% | Median: 7.8%    Median: 256.41ms | P90: 579ms
commonvoice               11.81%                       -
earnings21                12.37%                       -
librispeech_test_clean    2.71%                        -
librispeech_test_other    5.82%                        -
meanwhile                 6.73%                        -
tedlium                   7.81%                        -
rev16                     12.99%                       -
Multilingual Benchmarks

Our most up-to-date multilingual streaming benchmarks are included below. The most recent update was October 2025.

Language Code    Language         WER (%)    Emission Latency (ms)
Average          All Languages    11.58%     Median: 451ms | P90: 669ms
en               English          12.94%     -
es               Spanish          9.81%      -
de               German           13.99%     -
fr               French           16.53%     -
it               Italian          7.36%      -
pt               Portuguese       9.83%      -
Common Benchmark Challenges in Streaming

TTFT / TTFB Latency Gaming

In the streaming space, speed is everything. To achieve lower TTFT (time to first token) metrics, some providers emit tokens into the stream before any audio is actually spoken. This creates the appearance of a faster model, but these early tokens are hallucinations designed to game the benchmark.

In this scenario, TTFT becomes a misleading measure of latency. When you stream real audio into the model, getting an accurate first token will be much slower than the benchmark suggests.
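
The most reliable defence is to measure first-token latency yourself on real audio and to check that early partial transcripts actually match what was spoken. The harness below is a hypothetical sketch: `stream_transcripts` stands in for whatever streaming client you use and is assumed to yield partial transcript strings while audio chunks are consumed; it is not a real AssemblyAI or third-party API.

```python
# A hypothetical harness for measuring time-to-first-token (TTFT) yourself.
# `stream_transcripts(audio_chunks)` stands in for your streaming client and
# is assumed to yield partial transcript strings as audio is sent.
import time
from typing import Callable, Iterable

def measure_ttft(
    audio_chunks: Iterable[bytes],
    stream_transcripts: Callable[[Iterable[bytes]], Iterable[str]],
) -> float:
    """Seconds from the start of streaming to the first non-empty partial."""
    start = time.monotonic()
    for partial in stream_transcripts(audio_chunks):
        if partial.strip():  # ignore empty keep-alive / placeholder messages
            return time.monotonic() - start
    raise RuntimeError("stream ended without producing any transcript")
```

Comparing the first non-empty partial against the audio itself (does it contain words that had actually been spoken by that point?) is a simple way to spot hallucinated filler tokens emitted purely to win on TTFT.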

Because of TTFT/TTFB gaming and overfitting, some providers intentionally emit hallucinated tokens so their models appear “fastest” on leaderboards. This is why we highly recommend running your own evaluations on your own datasets to find the best model for your use case.

Benchmark Gaming / Overfitting

The same overfitting problem described above for pre-recorded audio applies to streaming models: when a model is trained on the same publicly available datasets used for evaluation, its reported WER is artificially strong, and real-world performance on unseen audio will be significantly worse.

As with pre-recorded audio, we strongly recommend running evaluations on your own datasets to identify the best model for your specific use case.

External Benchmarks

For third-party benchmarks, we’d recommend the Coval Speech-to-Text Playground.

Want to run a benchmark?

We’d be happy to help! AssemblyAI has a benchmarking tool to help you run a custom evaluation against your real audio files. Feel free to contact us for more information.

You can also run your own benchmarks using the Hugging Face framework, which provides a GitHub repo with full instructions.