Cohere Transcribe: Speech Recognition

My worry is that ASR will end up like OCR. If a large multimodal AI system is good enough (latency-wise), the advantage of domain understanding eats the other technologies alive.

In OCR, even when the characters are poorly scanned, the deep domain understanding these large multimodal AIs have allows them to work out what the document actually meant: "this is going to be the order ID, because in the million invoices I have seen before the order ID is normally below the order date", and so on. My worry is that the same thing is going to happen in ASR.

The problem with many STT models is that they seem to mostly be trained on perfectly-accented speech and struggle a lot with foreign accents so I’m curious to try this one as a Frenchman with a rather French English accent.

So far, the best I have found while testing models for my language learning app (Copycat Cafe) is Soniox. All others performed badly for non-native accents. The worst were Whisper-based models, because they hallucinate when they misunderstand and tend to come up with random phrases that have nothing to do with the topic.

> Limitations

> Timestamps/Speaker diarization. The model does not feature either of these.

What a shame. Is WhisperX still the best choice if you want timestamps/diarization?

Unfortunately, this model does not seem to support a custom vocabulary, word boosting or an additional prompt.

I can't say enough nice things about Cohere's services. I migrated over to their embedding model a few months ago for clip-style embeddings and it's been fantastic.

It has the most crisp, steady P50 of any external service I've used in a long time.

To clarify, this is SOTA in its size category, right? It's not better than Parakeet, for example?

Dumb question, but if this is "open source" is there source code somewhere? Or does that term mean something different in the world of models that must be trained to be useful?

I had to set up Fireflies for our company recently. Cool tool, but I'm sending dozens of internal meetings to an American company. Our ISO inspector wouldn't be pleased to know.

This is a good option. Will check it out.

Just today I shipped support for this in Whisper Memos: https://whispermemos.com/changelog/2026-04-cohere-transcribe

Accurate and fast model, very happy with it so far!

How hard could it be to train other European language(-s)?

It's great that this is Apache 2.0 licensed - several of Cohere's other models are licensed free for non-commercial use only.

Multimodal models are way better.

This is both good and bad. Good ASR can often understand low quality / garbled speech that I could not figure out, but it also "over corrects" sometimes and replaces correct but low prior words with incorrect but much more common ones.

With OCR, the risk is that you get another Xerox[1] incident where all your data looks plausible but is incorrect. Hope you kept the originals!

(This is why for my personal doc scans, I use OCR only for full text search, but retain the original raw scans forever)

[1] https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...

This is exactly the case today. Multimodal LLMs like gpt-4o-transcribe are way better than traditional ASR, not only because of deeper understanding but because of the ability to actually prompt it with your company's specific terminology, org chart, etc.

For example, if the prompt includes that Caitlin is an accountant and Kaitlyn is an engineer, if you transcribe "Tell Kaitlyn to review my PR" it will know who you're referring to. That's something WER doesn't really capture.

BTW, I built an open-source Mac tool for using gpt-4o-transcribe with an OpenAI API key and custom prompts: https://github.com/corlinp/voibe

Why are you 'worried' about it? Shouldn't we strive for better technology even if it means some will 'lose'?

Files can be downloaded here: https://huggingface.co/CohereLabs/cohere-transcribe-03-2026/...

And someone has already converted it to onnx format: https://huggingface.co/eschmidbauer/cohere-transcribe-03-202... - so it can be run on CPU instead of GPU.

The most commonly used definition is just available weights.

This kind of makes sense because "compiling" (training) the model costs prohibitively much, and we can still benefit from the artifacts.

I presume it means the model itself.

There are many open-source STT models that can run locally on a Mac with good performance, such as Whisper and Parakeet.

Even in the commercial space, there’s a lack of production-grade ASR APIs that support diarization and word-level timestamps.

My experiences with Google’s Chirp have been horrendous: it sometimes skips sections of speech entirely, hallucinates speech where the audio contains only noise, and produces unreliable word-level timestamps. And all of this is even with their new audio prefiltering feature.

AWS works slightly better, but also has trouble keeping word-level timestamps in sync.

Whisper is nice but hallucinates regularly.

OpenAI’s new transcription models deliver accurate output but do not support word-level timestamps…

A lot of this could be worked around by sending the resulting transcripts through a few layers of post processing, but… I just want to pay for an API that is reliable and saves me from doing all that work.

WhisperX is not a model but a software package built around Whisper and some other models, including diarization and alignment ones. Something similar will be built around the Cohere Transcribe model, maybe even just an integration to WhisperX itself.

I would try Qwen-ASR: https://qwen.ai/blog?id=qwen3asr

See the very bottom of the page for a transcription with timestamps.

There is also: https://github.com/linto-ai/whisper-timestamped

It doesn't use an extra model (so it supports every language that works with Whisper out of the box and uses less memory); it works by applying Dynamic Time Warping to cross-attention weights.
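
For readers unfamiliar with Dynamic Time Warping, here is a minimal sketch of the idea on a toy cost matrix. It is not whisper-timestamped's code: the matrix values are made up, and the assumption is that in the real tool the costs would be derived from Whisper's cross-attention weights (low cost where a token attends strongly to an audio frame).

```python
def dtw_path(cost):
    """Return the cheapest monotonic path through a (tokens x frames) cost matrix."""
    n, m = len(cost), len(cost[0])
    inf = float("inf")
    # acc[i][j] = minimal accumulated cost of any path ending at (i, j)
    acc = [[inf] * m for _ in range(n)]
    acc[0][0] = cost[0][0]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                acc[i - 1][j] if i > 0 else inf,                  # next token, same frame
                acc[i][j - 1] if j > 0 else inf,                  # same token, next frame
                acc[i - 1][j - 1] if i > 0 and j > 0 else inf,    # advance both
            )
            acc[i][j] = cost[i][j] + best_prev

    # Backtrack from the last cell to the first, always taking the cheapest predecessor.
    i, j = n - 1, m - 1
    path = [(i, j)]
    while (i, j) != (0, 0):
        candidates = []
        if i > 0 and j > 0:
            candidates.append((acc[i - 1][j - 1], i - 1, j - 1))
        if i > 0:
            candidates.append((acc[i - 1][j], i - 1, j))
        if j > 0:
            candidates.append((acc[i][j - 1], i, j - 1))
        _, i, j = min(candidates)
        path.append((i, j))
    path.reverse()
    return path


def token_frame_spans(path, n_tokens):
    """Collapse a DTW path into a (first_frame, last_frame) span per token."""
    spans = {}
    for tok, frame in path:
        first, last = spans.get(tok, (frame, frame))
        spans[tok] = (min(first, frame), max(last, frame))
    return [spans[t] for t in range(n_tokens)]


# Hypothetical costs for 3 tokens over 6 audio frames (think: 1 - attention weight).
toy_cost = [
    [0.1, 0.2, 0.9, 0.9, 0.9, 0.9],
    [0.9, 0.9, 0.1, 0.2, 0.9, 0.9],
    [0.9, 0.9, 0.9, 0.9, 0.2, 0.1],
]
print(token_frame_spans(dtw_path(toy_cost), 3))
# [(0, 1), (2, 3), (4, 5)]
```

Multiplying the frame indices by the duration of one frame would turn these spans into timestamps; the frame duration depends on the model and is left out here.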

Can you clarify? I tested a few and they are rubbish and don't have the same features.

Looking at the ASR leaderboard (https://huggingface.co/spaces/hf-audio/open_asr_leaderboard), Parakeet (.6B) is still near the top on speed, but about 10th on WER.

Well, to clarify, it is both larger than Parakeet in parameter count (Parakeet is available in 0.6B and 1.1B, while this is 2B params) and performs better than it on the benchmarks that Hugging Face publishes on the Open ASR Leaderboard.

If you have to ask, you don't really need the answer.

Finding or creating training code doesn't seem too difficult. You would need a pretty decent amount of high-quality training data (many hours of audio), a few hours of high-end data center GPU compute, and many iterations to get it right.

It includes several European languages.

"Better" isn't just about increasing benchmark numbers. Often, it's more important that a system fails safely than how often it fails. Automatic speech recognition that guesses when the input is unclear will occasionally be right and therefore have a lower word error rate, but if it's important that the output be correct, it might be better to insert "[unintelligible]" and have a human double-check.

Ideally, you'd be able to specify exactly what you want: do you want to write out filled pauses ("aaah", "umm")? Do you want a transcription of the disfluencies (restarts, etc.) or just a cleaned-up version?
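
A minimal sketch of the "fail safely" idea, assuming the recognizer exposes a per-word confidence score. That is an assumption made for illustration: nothing in this thread or the announcement says Cohere Transcribe returns such scores, and the function name, threshold, and sample values are hypothetical.

```python
def mask_low_confidence(words, threshold=0.6, placeholder="[unintelligible]"):
    """Replace each run of low-confidence words with a single placeholder.

    `words` is a list of (text, confidence) pairs; the confidence values
    are assumed to come from the recognizer (hypothetical here).
    """
    out = []
    for text, confidence in words:
        if confidence >= threshold:
            out.append(text)
        elif not out or out[-1] != placeholder:
            # Start of a low-confidence run: emit one placeholder for the whole run.
            out.append(placeholder)
    return " ".join(out)


# Hypothetical recognizer output.
hypothesis = [
    ("send", 0.97), ("the", 0.97), ("invoice", 0.91),
    ("to", 0.88), ("kaitlyn", 0.41), ("today", 0.93),
]
print(mask_low_confidence(hypothesis))
# send the invoice to [unintelligible] today
```

The threshold trades word error rate against the number of spans a human has to double-check, which is exactly the trade-off described above.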

It's better in terms of WER. It's not better in terms of not making shit up that sounds plausible.

Probably the answer is simply to tweak the metric so it's a bit smarter than WER: allow "unclear" output, which is penalised less than actually incorrect answers. I'd be surprised if nobody has done that.
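
One way such a metric could look, as a rough sketch: a word-level edit distance in which an "[unclear]" token in the hypothesis costs less than a wrong word. The token name, the 0.5 cost, and the function name are all assumptions for illustration; this is not an established metric that I can point to.

```python
UNCLEAR = "[unclear]"


def abstention_aware_wer(reference, hypothesis, unclear_cost=0.5):
    """Weighted WER where emitting UNCLEAR is cheaper than emitting a wrong word."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # dp[i][j] = minimal cost of turning ref[:i] into hyp[:j]
    dp = [[0.0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = float(i)  # every missing reference word is a full-cost deletion
    for j in range(1, len(hyp) + 1):
        dp[0][j] = dp[0][j - 1] + (unclear_cost if hyp[j - 1] == UNCLEAR else 1.0)

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            token_cost = unclear_cost if hyp[j - 1] == UNCLEAR else 1.0
            match_or_sub = dp[i - 1][j - 1] + (0.0 if ref[i - 1] == hyp[j - 1] else token_cost)
            deletion = dp[i - 1][j] + 1.0
            insertion = dp[i][j - 1] + token_cost
            dp[i][j] = min(match_or_sub, deletion, insertion)
    return dp[len(ref)][len(hyp)] / len(ref)


reference = "tell kaitlyn to review my pr"
confident_but_wrong = "tell caitlin to review my pr"
abstaining = "tell [unclear] to review my pr"

print(round(abstention_aware_wer(reference, confident_but_wrong), 4))  # 0.1667
print(round(abstention_aware_wer(reference, abstaining), 4))           # 0.0833
```

With `unclear_cost=1.0` both hypotheses score the same, which is how plain WER would see them. Any cost below 1.0 rewards abstaining over a plausible-sounding guess, while a transcript made only of "[unclear]" tokens still scores poorly.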

Isn't Elevenlabs the best in this?

Can you comment on overall quality? Their models tend to be a bit smaller and less performant overall.

My baseline was Jina, a Chinese model provider. I had major issues with their reliability. I have no comparison to provide in terms of offline metrics, as I had to do an emergency migration because their inference service had extended downtimes.

My experience with Cohere and interacting with their sales engineers has been boring, and I say that in the most flattering way possible. Embeddings are a core service at this point, like VMs and DBs. They just need to work and work well, and that's what they're selling.

Ahh thanks, I confused my parameter count, thanks. I guess Parakeet is 0.6B, I was somehow thinking 6B.

Thanks, I don't know how much to trust benchmarks so I figured I'd ask.

Just a warning that plain WhisperX is more accurate and Whisper-timestamped has many weird quirks.

Cohere is announcing Transcribe, a state-of-the-art automatic speech recognition (ASR) model that is open source and available today for download.

Speech is rapidly becoming a core modality for AI-enabled workloads and automations — from meeting transcription and speech analytics to real-time customer support agents.

Our objective was straightforward: push the frontier of dedicated ASR model accuracy under practical conditions. The model was trained from scratch with a deliberate focus on minimizing word error rate (WER), while keeping production readiness top-of-mind. In other words, not just a research artifact, but a system designed for everyday use.

Cohere Transcribe reflects that intent. It is available for open-source use with full infrastructure control, maintains a manageable inference footprint suitable for practical GPU and local utilization, delivers best-in-class serving efficiency, and is also available via Model Vault — Cohere’s secure, fully managed model inference platform.

Cohere Transcribe currently ranks #1 for accuracy on HuggingFace’s Open ASR Leaderboard, setting a new benchmark for real-world transcription performance.

This marks our zero-to-one in bringing high-performance speech recognition into enterprise AI workflows. Read on to learn more.

Model overview

Name	cohere-transcribe-03-2026
Architecture	conformer-based encoder-decoder
Input	audio waveform → log-Mel spectrogram
Output	transcribed text
Model size	2B
Model	a large Conformer encoder extracts acoustic representations, followed by a lightweight Transformer decoder for token generation
Training objective	standard supervised cross-entropy on output tokens; trained from scratch
Languages	trained on 14 languages. European: English, French, German, Italian, Spanish, Portuguese, Greek, Dutch, Polish; APAC: Chinese (Mandarin), Japanese, Korean, Vietnamese; MENA: Arabic
License	Apache 2.0

Image 1: Cohere Transcribe is an open-weights Conformer ASR model converting speech audio into text across 14 supported languages.

Model performance

Accuracy

Cohere Transcribe is the latest standard for English speech recognition accuracy. It leads the HuggingFace Open ASR Leaderboard with an average word error rate of just 5.42%, outperforming all open- and closed-source dedicated ASR alternatives, including Whisper Large v3, ElevenLabs Scribe v2, and Qwen3-ASR-1.7B. This captures the model’s versatile capability across real-world speech tasks, such as robustness to multiple-speaker environments, boardroom-style acoustics (e.g. AMI dataset), and diverse accents (e.g. Voxpopuli dataset).

Model	Average WER	AMI	Earnings 22	Gigaspeech	LS clean	LS other	SPGISpeech	Tedlium	Voxpopuli
Cohere Transcribe	5.42	8.13	10.86	9.34	1.25	2.37	3.08	2.49	5.87
Zoom Scribe v1	5.47	10.03	9.53	9.61	1.63	2.81	1.59	3.22	5.37
IBM Granite 4.0 1B Speech	5.52	8.44	8.48	10.14	1.42	2.85	3.89	3.10	5.84
NVIDIA Canary Qwen 2.5B	5.63	10.19	10.45	9.43	1.61	3.10	1.90	2.71	5.66
Qwen3-ASR-1.7B	5.76	10.56	10.25	8.74	1.63	3.40	2.84	2.28	6.35
ElevenLabs Scribe v2	5.83	11.86	9.43	9.11	1.54	2.83	2.68	2.37	6.80
Kyutai STT 2.6B	6.40	12.17	10.99	9.81	1.70	4.32	2.03	3.35	6.79
OpenAI Whisper Large v3	7.44	15.95	11.29	10.02	2.01	3.91	2.94	3.86	9.54
Voxtral Mini 4B Realtime 2602	7.68	17.07	11.84	10.38	2.08	5.52	2.42	3.79	8.34

Image 2: the Hugging Face Open ASR Leaderboard as of 03.26.2026. This is a widely used, standardized benchmark evaluating automatic speech recognition systems across curated datasets using word error rate (WER) as the primary metric, computed over normalized reference-hypothesis alignments, where lower WER indicates higher transcription fidelity. See the live leaderboard here.
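
To make the metric concrete, here is a minimal sketch of WER over normalized text, plus a check of how the "Average WER" column relates to the per-dataset columns. The `normalize` function is a crude stand-in chosen for illustration (the leaderboard's actual normalization is more involved), and the sample sentences are made up. The unweighted-mean reading of the average column is an inference from the table above, not something the caption states.

```python
import re


def normalize(text):
    """Lowercase and drop punctuation; a simplified stand-in for a real ASR normalizer."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return text.split()


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row-by-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at zero cost
            ))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of eight.
print(wer("The order ID is below the order date.", "the order id is above the order date"))
# 0.125

# Per-dataset WERs for Cohere Transcribe, copied from the table above.
cohere_scores = [8.13, 10.86, 9.34, 1.25, 2.37, 3.08, 2.49, 5.87]
print(round(sum(cohere_scores) / len(cohere_scores), 2))
# 5.42
```

The second result matches the 5.42 in the table, which suggests the average is an unweighted mean across the eight datasets rather than a mean weighted by dataset size.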

Critically, these gains aren’t limited to benchmark datasets. We see the same state-of-the-art performance carried over into human evaluations, where trained reviewers assess transcription quality across real-world audio for accuracy, coherence, and usability. Consistency across both evaluation methods reinforces that Cohere Transcribe’s performance translates reliably from controlled tests to practical enterprise settings.

Bar chart showing transcription win rates (%) by model: ElevenLabs Scribe v2 (51%), Qwen3-ASR-1.7B (55%), Voxtral Mini 3B Realtime 2507 (55%), Zoom Scribe v1 (56%), OpenAI Whisper Large v3 (64%), NVIDIA Canary Qwen 2.5B (67%), IBM Granite 4.0 1B Speech (78%), with an average of 61%.

Image 3: human preference evaluation of model transcripts in English. In a pairwise comparison, annotators were asked to express preferences for generations which primarily preserved meaning - but also avoided hallucination, correctly identified named entities, and provided verbatim transcripts with appropriate formatting. A score of 50% or higher indicates that Cohere Transcribe was preferred on average in the head-to-head comparison.

Bar chart showing transcription win rates (%) for three ASR models—Qwen3-ASR-1.7B, OpenAI Whisper Large v3, and Voxtral Mini 4B Realtime—across six languages: Italian (60%, 55%, 58%), French (51%, 51%, 54%), German (44%, 52%, 49%), Spanish (48%, 52%, 43%), Portuguese (48%, 41%, 40%), and Japanese (70%, 66%, 64%).

Image 4: human evaluation of ASR accuracy for a selection of supported languages. A score of 50% or higher indicates that Cohere Transcribe was preferred on average in the head-to-head comparison.

Throughput

In production settings, ASR systems must operate under strict latency and throughput constraints; even if accurate, slow or resource-intensive transcription can directly impact user experience, operational efficiency, and cost.

Transcribe extends the Pareto frontier, delivering state-of-the-art accuracy (low WER) while sustaining best-in-class throughput (high RTFx) within the 1B+ parameter model cohort.

Scatter plot comparing seven ASR models by word error rate (accuracy, lower is better) versus throughput. Cohere Transcribe, NVIDIA Canary Qwen 2.5B, and IBM Granite show higher throughput at lower error rates, while Whisper Large v3 and Voxtral Realtime have higher error rates with lower throughput.

Image 5: throughput (RTFx) vs accuracy (WER) plot for leading models larger than 1B in size. RTFx (real-time factor multiple) measures how fast an audio model processes its input relative to real time.
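
As a rough illustration of the two ideas in this section, the sketch below computes RTFx and picks out the models on a Pareto frontier of WER versus RTFx. All model names and numbers in it are hypothetical placeholders, not values read from the plot.

```python
def rtfx(audio_seconds, processing_seconds):
    """Real-time factor multiple: seconds of audio handled per second of compute."""
    return audio_seconds / processing_seconds


def pareto_front(models):
    """Return names of models that no other model beats on both WER and RTFx.

    `models` is a list of (name, wer, rtfx) tuples; lower WER and higher RTFx are better.
    """
    front = []
    for name, model_wer, model_speed in models:
        dominated = any(
            other_wer <= model_wer
            and other_speed >= model_speed
            and (other_wer < model_wer or other_speed > model_speed)
            for _, other_wer, other_speed in models
        )
        if not dominated:
            front.append(name)
    return front


# Ten minutes of audio transcribed in two seconds (hypothetical timing).
print(rtfx(600.0, 2.0))
# 300.0

# Hypothetical (name, WER %, RTFx) points.
models = [
    ("model_a", 5.4, 500.0),
    ("model_b", 5.6, 400.0),
    ("model_c", 6.0, 900.0),
    ("model_d", 7.4, 150.0),
]
print(pareto_front(models))
# ['model_a', 'model_c']
```

A model "extends the frontier" in this sense when adding it to the list removes at least one previously undominated model, or when it lands in a region no existing model covers.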

“We’re genuinely impressed with what Cohere has built with Transcribe. The speed is exceptional — turning minutes of audio into usable transcripts in seconds — and it immediately unlocks new possibilities for real-time products and workflows.

In our testing, the model handled everyday speech very well and delivered strong, reliable transcription quality. The overall experience has been smooth and easy to work with. We’re excited to be partnering with Cohere and to continue exploring what we can build with this technology.”

Paige Dickie, Vice-President, Radical Ventures

Zero to one, and beyond.

We are working towards deeper integration of Cohere Transcribe with North, Cohere’s AI agent orchestration platform. With planned updates, Cohere Transcribe will evolve from a high-accuracy transcription model into a broader foundation for enterprise speech intelligence.

Getting started.

Cohere Transcribe is now available for download on Hugging Face. Follow the setup instructions to run the model locally, or even in edge environments.

You can also access Cohere Transcribe via our API for free, low-setup experimentation subject to rate limits. See the documentation for usage details and integration guidance.

For production deployment without rate limits, provision a dedicated Model Vault. This enables low-latency, private cloud inference without having to manage infrastructure. Pricing is calculated per hour-instance, with discounted plans for longer-term commitments. Contact our team to discuss your requirements.

Key contributors: Julian Mack (Member of Technical Staff), Ekagra Ranjan (Member of Technical Staff), Cassie Cao (Product Manager), Bharat Venkitesh (Manager of Technical Staff), Pierre Harvey Richemond (Manager of Technical Staff).
