TTS Benchmark 2026: What 44,000 Human Votes Reveal About Where Text-to-Speech Models Actually Fail

Executive Summary

Voice AI has improved dramatically. Most frontier text-to-speech systems now sound natural enough that listeners often struggle to distinguish between them in straightforward scenarios.

But sounding human and being trusted are not the same thing.

Gemini 3.1 Flash TTS achieved the highest overall win rate at 86.9%, with consistent dominance across all 6 languages and all 4 use case categories.
Nearly 40% of all evaluations ended in ties — suggesting increasing convergence among frontier models on basic voice quality.
xAI Grok TTS showed substantially higher hallucination rates than competing systems — approximately 3.9× Gemini's rate — particularly outside English.
Microsoft Azure Dragon HD Omni performed competitively in English (48.6% win rate) but collapsed in Japanese (12.6%), a 36 percentage point gap within the same model.
Structured content — numbers, booking IDs, tracking codes, payment references — produced the highest defect rates across all models and all languages.
Severe mispronunciation is the most damaging defect (5.1% win rate when present). Mild mispronunciation is the most forgiven (18.1%). Human listeners penalize reliability failures far more than stylistic imperfections.

The frontier challenge in voice AI is no longer naturalness. It is reliability.

The Evaluation at a Glance

Metric	Value
Total human votes	44,387
Models evaluated	9
Languages covered	English, Hindi, Arabic, Japanese, Portuguese, Vietnamese
Use case categories	4
Votes with written comments	21,756 (49.0%)
Matchups ending in a tie	17,682 (39.8%)
Total defect flags raised	67,263

Introduction

If you are building a text-to-speech model today, you already know the standard benchmark numbers. You have seen automated scores. You have run internal evaluations. You know how your model performs on the metrics machines can measure.

But many of the failures that matter most to users are invisible in those metrics. A benchmark can tell you whether two audio clips are acoustically similar. It cannot tell you that a customer-support voice added a word that was never present in the script. It cannot tell you that a booking ID became difficult to understand in Japanese. It cannot tell you that a model sounds authoritative in English but robotic in Hindi.

That is exactly what VoiceArena's evaluation data captures. Across 44,387 human pairwise evaluations covering 9 leading TTS models and 6 languages, evaluators did more than select winners — nearly half wrote comments explaining their decisions in plain language. The result is one of the most detailed human-preference datasets available for multilingual text-to-speech evaluation.

What emerges is a clear trend: the gap between leading models is increasingly defined not by voice quality alone, but by reliability. The strongest systems consistently preserve meaning, pronunciation, pacing, and content fidelity across languages. The weaker systems fail in ways that are specific, measurable, and often surprisingly consistent.

How VoiceArena Evaluates

Most automated TTS benchmarks measure things machines can measure — word error rate, pitch consistency, spectral similarity. These are useful. They are also insufficient. They cannot capture what a Japanese listener hears when a voice mispronounces a kanji compound. They cannot capture the moment a customer support voice reads out a booking ID incorrectly and the listener asks it to repeat.

VoiceArena uses blind pairwise comparison. Two models read the same sentence. A human evaluator listens to both outputs and selects the better one. Model identities are hidden — removing brand bias and forcing evaluation to focus entirely on perceived quality.

Each sentence was drawn from one of four production-oriented categories:

Category	What It Tests
General conversation & AI assistants	Natural conversation and dialogue
Customer support & contact center	IDs, confirmations, codes, transactional content
Media & entertainment	Narration, commentary, storytelling
Content creation & education	Explanations, instructional content, structured speech

Evaluators could also tag specific failure modes across seven defect types: unnatural/robotic voice, mild mispronunciation, severe mispronunciation, irregular pacing, hallucinated extra content, missing words, and noise or distortion.

49% of all evaluations included a written comment — nearly 22,000 human observations explaining exactly what the evaluator heard. These comments reveal not only which model won, but why — and that diagnostic layer is what separates VoiceArena's data from automated benchmark scores alone.

The Leaderboard — What 44,000 Votes Decided

Chart 1 — Overall Rankings

TTS Model Win Rates — All Languages Combined

Head-to-head win rate across all pairwise matchups. Ties excluded. 44,387 total evaluations.

Gemini leads by 15.8 percentage points over the nearest competitor

Source: approved_votes_*.csv, all 6 language files. Fields: model_a_name, model_b_name, outcome. Win rate = wins ÷ (wins + losses). Ties excluded from denominator.

#	Model	Organisation	Win Rate	Wins	Losses	Ties
1	Gemini 3.1 Flash TTS	Google DeepMind	86.9%	6,851	1,031	4,855
2	Eleven v3	ElevenLabs	71.1%	5,300	2,156	5,284
3	Bulbul-V3	Sarvam AI	53.1%	823	726	924
4	S2 Pro	Fish Audio	42.4%	403	548	851
5	Sonic-3	Cartesia	40.3%	3,192	4,720	4,947
6	GPT-4o-mini TTS	OpenAI	39.7%	3,025	4,597	5,070
7	Speech 2.8 HD	MiniMax	39.6%	1,976	3,013	3,148
8	Grok TTS	xAI	37.6%	2,775	4,611	5,280
9	Dragon HD Omni	Microsoft Azure	30.8%	2,360	5,303	5,005

Three observations stand out. First, the gap at the top is unusually large: Gemini wins nearly nine out of every ten decisive matchups, while ElevenLabs establishes itself as a clear second at 71.1%. Below Sarvam AI's Bulbul-V3 at 53.1%, five models cluster between 37.6% and 42.4%, and Microsoft trails at 30.8%.

Second, the high tie rate is meaningful. Nearly 40% of all evaluations ended without a clear winner — suggesting real convergence in voice quality among frontier systems. The differentiators emerge in harder conditions: complex sentences, structured data, non-English languages.

Third, Sarvam AI's position deserves careful interpretation. Bulbul-V3 ranks third overall, ahead of OpenAI, Microsoft, Cartesia, and xAI. However, Sarvam participated in substantially fewer evaluations (2,473 appearances) than frontier models evaluated at 12,000+ appearances. The directional signal is real and consistent, but should be read with wider uncertainty than higher-volume models.
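
The win-rate definition used in the leaderboard (wins ÷ (wins + losses), ties excluded) can be sketched in a few lines of Python. This is a minimal illustration, not VoiceArena's actual pipeline: the field names model_a_name, model_b_name, and outcome come from the source note above, but the outcome labels "a_wins", "b_wins", and "tie" are assumptions, and the toy rows are invented for the example.

```python
from collections import defaultdict

def win_rates(votes):
    """Return {model: win rate}, with ties excluded from the denominator."""
    wins = defaultdict(int)
    losses = defaultdict(int)
    for vote in votes:
        a, b = vote["model_a_name"], vote["model_b_name"]
        outcome = vote["outcome"]  # labels below are assumed, not from the dataset
        if outcome == "a_wins":
            wins[a] += 1
            losses[b] += 1
        elif outcome == "b_wins":
            wins[b] += 1
            losses[a] += 1
        # Any other outcome (for example "tie") adds to neither tally.
    models = set(wins) | set(losses)
    return {m: wins[m] / (wins[m] + losses[m]) for m in models}

# Toy rows for illustration only; these are not benchmark data.
toy_votes = [
    {"model_a_name": "X", "model_b_name": "Y", "outcome": "a_wins"},
    {"model_a_name": "Y", "model_b_name": "X", "outcome": "tie"},
    {"model_a_name": "Y", "model_b_name": "X", "outcome": "b_wins"},
    {"model_a_name": "X", "model_b_name": "Y", "outcome": "b_wins"},
]
rates = win_rates(toy_votes)

# The same formula applied to Gemini's published counts from the table.
gemini_pct = round(100 * 6851 / (6851 + 1031), 1)
```

Because ties are dropped, a model's win rate here describes only its decisive matchups; the tie counts in the table have to be read alongside it.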

Why Gemini Wins — Consistently, Across Everything

Chart 2 — Gemini Multilingual Performance

Gemini 3.1 Flash TTS — Win Rate by Language

Gemini's highest win rate appears in Japanese (91.5%) rather than English (80.9%) — an uncommon pattern in multilingual TTS evaluation.

727W/172L in English · 1,380W/128L in Japanese

Source: Rows where model_a_name or model_b_name contains "Gemini". Win rate = wins ÷ (wins + losses) per language. Ties excluded. n: EN=899, HI=1,516, VI=873, PT=1,239, AR=1,847, JA=1,508.

Language	Win Rate	Wins	Losses
Japanese	91.5%	1,380	128
Arabic	89.7%	1,657	190
Portuguese	88.5%	1,097	142
Vietnamese	85.6%	747	126
Hindi	82.0%	1,243	273
English	80.9%	727	172

Gemini's performance is not driven by a single strength. It does not dominate one language while remaining competitive elsewhere — the advantage holds across the entire benchmark, and in some languages it strengthens.

Japanese is one of the hardest languages for TTS systems: complex pitch accent rules, mixed scripts combining hiragana, katakana, and kanji, and number reading conventions that differ fundamentally from phonetic languages. The fact that Gemini performs most strongly here suggests multilingual capability is a core design property, not a feature layered on top of an English-first architecture.

The same consistency appears across use cases. Customer support, filled with tracking numbers, booking IDs, and payment references, maintained an 87.2% Gemini win rate, nearly identical to its performance in simpler categories.

The audio is clear, very expressive and natural sounding. The pacing is good.

Very clear, good pace, numbers are easy to understand.

Good timing, clear pronunciation, natural delivery.

The evidence points toward a simple conclusion: Gemini does not merely sound better. It fails less often. And when it does fail, those failures tend to be less consequential than the ones observed in competing systems.

When Reliability Breaks: Grok's Hallucination Problem

Chart 3 — Content Reliability

Hallucination Rate by Model

% of model appearances where evaluators flagged extra content not present in the original script.

Grok: 1,323 flags / 12,666 appearances = 10.45% — approximately 3.9× Gemini's rate (2.69%)

Source: For each model appearance, checked whether defect_tags_on_a or defect_tags_on_b contained "hallucination_extra_content". Rate = flagged ÷ total appearances. Each vote row contributes two model appearances.

Model	Hallucination Rate	Flagged	Appearances
xAI Grok TTS	10.45%	1,323	12,666
Sarvam AI Bulbul-V3	6.51%	161	2,473
Microsoft Azure Dragon HD Omni	6.46%	818	12,668
MiniMax Speech 2.8 HD	6.06%	493	8,137
Cartesia Sonic-3	6.03%	775	12,859
OpenAI GPT-4o-mini TTS	5.77%	732	12,692
ElevenLabs Eleven v3	4.32%	551	12,740
Google DeepMind Gemini 3.1 Flash	2.69%	343	12,737
Fish Audio S2 Pro	2.22%	40	1,802
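
The per-appearance counting described in the source note can be sketched as follows. This is an illustrative reading rather than the benchmark's own code: the field names and the tag "hallucination_extra_content" come from the note, while the comma-separated tag format, the exact-token match, and the second tag name in the toy data are assumptions.

```python
from collections import Counter

TAG = "hallucination_extra_content"

def hallucination_rates(votes, sep=","):
    """Return {model: share of appearances flagged with TAG}.

    Each vote row contributes two appearances, one per side.
    The tag separator is an assumption about the file format.
    """
    appearances = Counter()
    flagged = Counter()
    for vote in votes:
        sides = (
            (vote["model_a_name"], vote.get("defect_tags_on_a") or ""),
            (vote["model_b_name"], vote.get("defect_tags_on_b") or ""),
        )
        for model, raw_tags in sides:
            appearances[model] += 1
            tags = {t.strip() for t in raw_tags.split(sep)}
            if TAG in tags:
                flagged[model] += 1
    return {m: flagged[m] / appearances[m] for m in appearances}

# Toy rows for illustration only; "irregular_pacing" is a hypothetical tag name.
toy_votes = [
    {"model_a_name": "X", "model_b_name": "Y",
     "defect_tags_on_a": "hallucination_extra_content,irregular_pacing",
     "defect_tags_on_b": ""},
    {"model_a_name": "Y", "model_b_name": "X",
     "defect_tags_on_a": "", "defect_tags_on_b": ""},
]
toy_rates = hallucination_rates(toy_votes)

# Ratio of the published Grok and Gemini rates from the table.
ratio = (1323 / 12666) / (343 / 12737)
```

Note that the denominator is every appearance, including ties, so these rates are not directly comparable to the tie-excluded win rates in the leaderboard.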

There is a category of TTS failure that is fundamentally different from sounding robotic or mispronouncing a word. It occurs when the model does not say what it was given — it adds, alters, or invents content. A robotic voice may be unpleasant. A voice that changes information becomes untrustworthy.

The more revealing result emerges when Grok's hallucination rate is broken down by language:

Language	Hallucination Rate	Flagged	Appearances
Arabic	18.0%	478	2,653
Hindi	10.5%	248	2,371
Japanese	10.3%	227	2,195
Portuguese	9.7%	214	2,198
Vietnamese	8.9%	132	1,490
English	1.4%	24	1,759

In English, hallucinations appear at 1.4% — broadly manageable. In Arabic, nearly one in five Grok outputs was flagged for introducing content that did not exist in the source text.

Added an extra word and did not pronounce the Arabic word correctly.

Some words have been omitted and some other words have been added.

Read the flight number incorrectly.

Wrong number in saying 1422.

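Win rates broadly track this pattern. The two languages with the lowest hallucination rates are also the two with the highest win rates: English (61.9%) and Vietnamese (48.0%). Among the remaining four the ordering is less clean (Hindi 36.7%, Arabic 32.9%, Portuguese 32.1%, Japanese 27.5%), so hallucination is clearly not the only factor. Even so, the data supports an association between content-fidelity failures and reduced human preference, and a model's English performance may not accurately reflect its reliability profile in other languages.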
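
One rough way to gauge how closely Grok's per-language win rate tracks its hallucination rate is a rank correlation over the six language rows. The sketch below is not part of the VoiceArena analysis; Spearman's rho is chosen here purely for illustration, and the helper names are hypothetical. The inputs are the published per-language figures.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward. Assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation using the no-ties shortcut formula."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Grok TTS, per language, from the figures reported in this section.
languages = ["English", "Vietnamese", "Portuguese", "Japanese", "Hindi", "Arabic"]
hallucination_pct = [1.4, 8.9, 9.7, 10.3, 10.5, 18.0]
win_rate_pct = [61.9, 48.0, 32.1, 27.5, 36.7, 32.9]

rho = spearman(hallucination_pct, win_rate_pct)
print(f"Spearman rho = {rho:.2f}")
```

On these six points rho comes out at about -0.54: a negative relationship, but not a perfectly ordered one. With only six languages this is a descriptive check, not a significance test.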

Microsoft's Multilingual Gap

Chart 4 — Microsoft Multilingual Performance

Microsoft Azure Dragon HD Omni — Win Rate by Language

Win rate falls from 48.6% in English to 12.6% in Japanese, a 36 percentage point collapse within the same model.

Loses nearly 7 of 8 decisive matchups in Japanese (176W / 1,226L)

Source: Rows where model_a_name or model_b_name contains "Microsoft". Win rate = wins ÷ (wins + losses) per language. Ties excluded. n (decisive matchups): JA=1,402, HI=1,601, AR=1,649, PT=1,254, VI=877, EN=880.

Language	Win Rate	Wins	Losses	Robotic Voice Flag Rate
Japanese	12.6%	176	1,226	42.6%
Hindi	29.2%	467	1,134	36.1%
Arabic	33.0%	544	1,105	25.6%
Portuguese	33.2%	416	838	36.9%
Vietnamese	37.5%	329	548	37.6%
English	48.6%	428	452	13.2%

In Japanese, Microsoft loses nearly seven out of eight decisive matchups. Across all evaluated models and defect categories, Microsoft also accumulated the highest flag count on any single defect type: 4,072 robotic-voice flags. That pattern concentrates heavily in non-English languages, where the robotic-voice flag rate rises from 13.2% in English to between 25.6% (Arabic) and 42.6% (Japanese), roughly two to three times the English rate.

Mispronunciation and unnatural tone. Robotic.

Severe mispronounced words. Robotic voice.

Read it too fast, sounding like a robot.

The voice sounds more robotic despite maintaining clarity.

The broader finding is worth stating plainly: supporting a language is not the same as mastering it. A model may technically generate speech in dozens of languages. What matters in practice is whether users consistently prefer that speech when compared against competing systems. On that measure, Microsoft performs very differently depending on language.

The Universal Failure: Structured Content

Before examining which defects hurt win rates most, it is worth identifying the one category of input that consistently causes problems across every model in this benchmark.

Not casual conversation. Not emotional speech. Not technical vocabulary. Structured content.

Numbers, tracking IDs, booking references, payment codes, legal identifiers, timestamps, and mixed alphanumeric strings repeatedly emerged as some of the most challenging inputs across the evaluation. Customer-support scenarios — the category most heavily populated with transactional content — generated the highest defect rates observed anywhere in the benchmark at 52.2%.

The number was read with unnatural pauses between digits.

Wrong number in saying 1422.

The acronym was pronounced as a word instead of individual letters.

Flight number VJ-2284 was read incorrectly.

License number was not pronounced correctly.

Wrong Arabic number form used.

These failures appeared across models, languages, and use cases. Even Gemini — the strongest overall performer — occasionally struggled with acronym handling and structured references. No model is immune.

That matters because structured content is not an edge case. It appears most frequently in the applications where reliability matters most: customer support, financial services, healthcare, logistics, and navigation. A conversational error may be annoying. A booking reference error can be costly.

Humans Forgive Pronunciation Imperfections. They Don't Forgive Wrong Words.

Chart 5 — Original Finding

Win Rate When a Specific Defect Is Flagged on That Model's Output

When a human evaluator flags a defect — how often does the flagged model still win? Baseline (no defect flagged): 45.1% win rate.

Baseline — no defect flagged: 45.1% (20,602 wins / 45,689 appearances)

Severe mispronunciation reduces win rate to 5.1%. Mild mispronunciation: 18.1%.

Source: For each defect tag, identified model appearances where that tag was present. Win rate = wins ÷ total tagged appearances. Raw counts: Hallucination 435W/5,236 · Severe Mispr. 539W/10,641 · Missing Word 411W/4,044 · Robotic 1,771W/16,873 · Pacing 1,215W/10,499 · Noise 642W/3,748 · Mild Mispr. 2,932W/16,222. Baseline: 20,602W/45,689.

Defect	Win Rate When Present	Drop From Baseline	Raw Count
Severe mispronunciation	5.1%	−40.0 pts	539W / 10,641
Hallucination (extra content)	8.3%	−36.8 pts	435W / 5,236
Missing word	10.2%	−34.9 pts	411W / 4,044
Unnatural / robotic voice	10.5%	−34.6 pts	1,771W / 16,873
Irregular pacing	11.6%	−33.5 pts	1,215W / 10,499
Noise / distortion	17.1%	−28.0 pts	642W / 3,748
Mild mispronunciation	18.1%	−27.0 pts	2,932W / 16,222
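
The table can be reproduced directly from the raw counts in the source note. The sketch below is only a worked check of the arithmetic; the variable and function names are illustrative. Unlike the leaderboard, this win rate keeps ties in the denominator (wins ÷ total tagged appearances), which is why the no-defect baseline is 45.1% rather than something closer to 50%.

```python
# Wins and appearances where no defect was flagged (from the source note).
BASELINE = (20_602, 45_689)

# Wins and tagged appearances per defect type (from the source note).
RAW_COUNTS = {
    "Severe mispronunciation": (539, 10_641),
    "Hallucination (extra content)": (435, 5_236),
    "Missing word": (411, 4_044),
    "Unnatural / robotic voice": (1_771, 16_873),
    "Irregular pacing": (1_215, 10_499),
    "Noise / distortion": (642, 3_748),
    "Mild mispronunciation": (2_932, 16_222),
}

def pct(wins, appearances):
    """Win rate in percent, rounded to one decimal; ties stay in the denominator."""
    return round(100 * wins / appearances, 1)

baseline_pct = pct(*BASELINE)

# (defect, win rate when present, drop from baseline), lowest win rate first.
table = sorted(
    ((name, pct(w, n), round(pct(w, n) - baseline_pct, 1))
     for name, (w, n) in RAW_COUNTS.items()),
    key=lambda row: row[1],
)

# Summing tagged appearances gives the total number of defect flags raised.
total_flags = sum(n for _, n in RAW_COUNTS.values())

for name, rate, drop in table:
    print(f"{name:32s} {rate:5.1f}%  {drop:+.1f} pts")
```

The summed tagged appearances come to 67,263, matching the total defect flags reported in the overview table.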

Every evaluated defect is associated with a substantial reduction in win probability. But the hierarchy is revealing. Severe mispronunciation is associated with the lowest win rate, 5.1%: a model flagged for it wins only about 1 in 20 of those matchups.

At the opposite end sits mild mispronunciation. Human evaluators clearly notice it. They mention it frequently. Yet models flagged for mild mispronunciation still win far more often than models that alter, omit, or invent content.

This distinction reveals something important about how listeners evaluate voice systems. A minor pronunciation imperfection is interpreted as a quality issue. A wrong word is interpreted as a reliability issue. One affects polish. The other affects trust. For model builders, the implication is direct: polishing minor pronunciation while allowing severe mispronunciation or content-fidelity failures is optimizing the wrong objective.

What This Means for Model Builders

1. Structured-Content Robustness Is Becoming a Competitive Advantage

Numbers, booking references, tracking IDs, payment codes, and mixed-language strings remain difficult for every evaluated model. These failures appear most frequently in the domains where accuracy matters most: customer support, healthcare, finance, logistics. The model that handles a Portuguese payment reference or a Japanese booking code with the same confidence as conversational English will create a meaningful deployment advantage.

2. Multilingual Depth Matters More Than Multilingual Coverage

Many modern TTS systems support dozens of languages. That alone is no longer a differentiator. Gemini's strongest performance appears in Japanese rather than English. Microsoft's performance drops 36 points between English and Japanese. Grok's hallucination profile changes substantially depending on language. Coverage is easy to list. Depth is harder to achieve.

3. Hallucination Is a Trust Problem, Not a Quality Problem

A robotic voice may reduce user satisfaction. A hallucinated word changes meaning. These are not equivalent failures. Healthcare systems cannot invent dosage instructions. Financial systems cannot alter payment references. Customer-support systems cannot change booking identifiers. As voice AI integrates into critical workflows, trust will become an increasingly important competitive factor.

Limitations

This analysis relies on evaluator-reported defect annotations collected during pairwise human evaluations. Several limitations should be considered when interpreting the findings.

First, multiple defects may co-occur within the same output. Defect-specific win rates should therefore be interpreted as associations rather than isolated causal effects.

Second, models were evaluated at different volumes. Systems with fewer appearances have wider uncertainty ranges than models evaluated thousands of times. Sarvam AI's results in particular should be read with this in mind.
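
To make "wider uncertainty" concrete, one option is a confidence interval on each model's decisive win rate. The sketch below uses a 95% Wilson score interval, which is a technique chosen here for illustration and not something the benchmark itself reports. It treats each decisive vote as an independent trial, which is a simplification given that opponents and languages vary.

```python
import math

def wilson_interval(wins, decisive, z=1.96):
    """Wilson score interval for a win rate over decisive (non-tie) matchups."""
    p = wins / decisive
    denom = 1 + z * z / decisive
    centre = (p + z * z / (2 * decisive)) / denom
    half_width = z * math.sqrt(
        p * (1 - p) / decisive + z * z / (4 * decisive ** 2)
    ) / denom
    return centre - half_width, centre + half_width

# Wins and losses taken from the leaderboard table.
sarvam = wilson_interval(823, 823 + 726)
gemini = wilson_interval(6851, 6851 + 1031)

sarvam_width = sarvam[1] - sarvam[0]
gemini_width = gemini[1] - gemini[0]

print(f"Bulbul-V3: {sarvam[0]:.3f} to {sarvam[1]:.3f}")
print(f"Gemini:    {gemini[0]:.3f} to {gemini[1]:.3f}")
```

Under these assumptions the Bulbul-V3 interval is about five percentage points wide, against roughly one and a half for Gemini, which is the sense in which lower-volume results deserve more cautious reading.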

Finally, human preference is inherently subjective. Pairwise evaluation captures perceived quality rather than an objective measure of correctness.

The Question This Data Raises

Ten years ago, the challenge in speech synthesis was making machines sound human. The research agenda was clear: improve naturalness.

Today, most frontier systems can sound human. The challenge has shifted.

Across 44,387 human evaluations, listeners repeatedly penalized models not for lacking expressiveness — but for violating expectations of reliability. Misreading a number. Adding a word. Dropping a word. Failing to preserve meaning across languages.

Nearly 40% of all matchups ended in a tie. Human listeners increasingly struggle to separate frontier systems on voice quality alone. Voice quality is becoming commoditized. Trust is not.

The model that earns trust — that says exactly what it was given, in any language, across any content type, without adding, removing, or distorting — will not simply top a leaderboard. It will power the applications people actually depend on.

Voice quality brought TTS to the frontier. Reliability will determine who stays there.
