Benchmarks for AssemblyAI’s Speech-to-text Models
Benchmarks are an important first step in any Speech-to-text evaluation. Below we cover our models' current benchmarks so you can assess whether you should run your own evaluation.
Our most up-to-date English benchmarks for pre-recorded audio are included below; they were last updated in October 2025.
Our most up-to-date multilingual benchmarks for pre-recorded audio are included below; they were last updated in June 2025.
These multilingual benchmarks use FLEURS, a widely used multilingual speech dataset.
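If you want to reproduce or extend a multilingual evaluation yourself, FLEURS is available on the Hugging Face Hub. A minimal loading sketch, assuming the `datasets` library and the `google/fleurs` dataset ID (pick the per-language config you care about):

```python
# Minimal sketch: loading a FLEURS split for your own multilingual evaluation.
# Assumes the Hugging Face `datasets` library and the `google/fleurs` dataset ID;
# "en_us" is one per-language config, swap in the languages you care about.
from datasets import load_dataset

fleurs_test = load_dataset("google/fleurs", "en_us", split="test")

for sample in fleurs_test.select(range(3)):
    # Each sample carries the audio (under "audio") plus a reference transcript.
    print(sample["path"], "->", sample["transcription"][:60])
```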
Models are often trained on publicly available datasets, sometimes the very same datasets used for evaluation. When that happens, the model overfits the evaluation set and posts artificially strong word error rate (WER) scores. This makes WER potentially misleading: real-world accuracy on unseen audio will be significantly worse than accuracy on audio the model saw during training. Because many models are now trained on the datasets behind popular benchmarks, reported performance can be inflated by overfitting. This is why we strongly recommend running evaluations on your own datasets to identify the best model for your specific use case.
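For context, WER is the word-level edit distance between a model's transcript and a human reference, divided by the number of reference words. Here is a minimal illustrative implementation (not AssemblyAI's internal scoring code, and without the text normalization a production evaluation would apply):

```python
# Minimal WER sketch: word-level Levenshtein distance over reference length.
# Illustrative only; production evaluations typically also normalize text
# (casing, punctuation, numerals) before scoring.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the quick brown fox", "the quick brown box"))  # 0.25
```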
If you wish to check out third-party benchmarks for pre-recorded audio, we’d recommend the Hugging Face ASR Leaderboard.
Keep in mind that many models on these leaderboards are transcription-only: to perform like AssemblyAI in production, they may require self-hosting and/or additional features such as speaker diarization and automatic language detection.
Our most up-to-date English streaming benchmarks are included below; they were last updated in October 2025.
Our most up-to-date multilingual streaming benchmarks are included below; they were last updated in October 2025.
In the streaming space, speed is everything. To achieve lower TTFT (time to first token) metrics, some providers emit tokens into the stream before any audio is actually spoken. This creates the appearance of a faster model, but these early tokens are hallucinations designed to game the benchmark.
In this scenario, TTFT becomes a misleading measure of latency. When you stream real audio into the model, getting an accurate first token will be much slower than the benchmark suggests.
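If you measure TTFT yourself, timestamp the first audio chunk you send and the first non-empty transcript you receive, and be suspicious of text that arrives before any speech could plausibly have been processed. A rough sketch against a hypothetical async streaming client (the client object, its `connect()` and `send_audio()` methods, and the message shape are assumptions for illustration, not any specific provider's API):

```python
# Sketch of measuring TTFT against a hypothetical streaming client.
# `client.connect()`, `session.send_audio()`, and the message fields are
# illustrative assumptions; adapt them to the SDK you are actually testing.
import asyncio
import time

async def measure_ttft(client, audio_chunks, chunk_interval_s=0.05):
    """Seconds from the first audio chunk sent to the first non-empty transcript."""
    first_audio_sent = None

    async with client.connect() as session:

        async def send_audio():
            nonlocal first_audio_sent
            for chunk in audio_chunks:
                if first_audio_sent is None:
                    first_audio_sent = time.monotonic()
                await session.send_audio(chunk)
                await asyncio.sleep(chunk_interval_s)  # simulate real-time pacing

        sender = asyncio.create_task(send_audio())
        try:
            async for message in session:
                text = (message.get("text") or "").strip()
                if not text:
                    continue
                if first_audio_sent is None:
                    # Text arrived before any audio was sent: almost certainly hallucinated.
                    raise RuntimeError("Received a transcript before sending audio")
                return time.monotonic() - first_audio_sent
        finally:
            sender.cancel()
```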
Between TTFT/TTFB gaming and overfitting, some models are tuned to look good on leaderboards, for example by emitting hallucinated tokens to appear “fastest”, rather than to perform well on real audio. This is why we highly recommend running your own evaluations on your own datasets to choose the best model for your use case.
For third-party streaming benchmarks, we’d recommend the Coval Speech-to-Text Playground.
We’d be happy to help you run your own evaluation: AssemblyAI has a benchmarking tool for running a custom evaluation against your real audio files. Feel free to contact us for more information.
You can also run your own benchmarks using the Hugging Face framework, which provides a GitHub repo with full instructions.
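Whichever route you take, a custom evaluation mostly comes down to transcribing your own files and scoring the output against human references. A minimal sketch; `transcribe` is a hypothetical placeholder for whichever API or model you are testing, and WER is computed with the open-source `jiwer` package:

```python
# Minimal custom-evaluation sketch. `transcribe` is a hypothetical placeholder
# for the provider/model under test; `jiwer` is an open-source WER library.
import jiwer

def transcribe(audio_path: str) -> str:
    # Call the API or model you are evaluating and return its transcript.
    raise NotImplementedError

def evaluate(pairs):
    """pairs: iterable of (audio_path, human_reference_transcript)."""
    references, hypotheses = [], []
    for audio_path, reference in pairs:
        references.append(reference)
        hypotheses.append(transcribe(audio_path))
    # Corpus-level WER across the whole set, not an average of per-file WERs.
    return jiwer.wer(references, hypotheses)

# Example usage with your own data:
# score = evaluate([("call_01.wav", "thanks for calling how can i help"), ...])
# print(f"WER: {score:.2%}")
```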