TTS Benchmark · June 2026

What 44,000 Human Votes Reveal About Where Text-to-Speech Models Actually Fail

An analysis of VoiceArena's multilingual benchmark across 9 leading TTS models, 6 languages, and 44,387 human pairwise evaluations.

VoiceArena Research · June 2026

Executive Summary

Voice AI has improved dramatically. Most frontier text-to-speech systems now sound natural enough that listeners often struggle to distinguish between them in straightforward scenarios.

But sounding human and being trusted are not the same thing.

The frontier challenge in voice AI is no longer naturalness. It is reliability.

The Evaluation At A Glance
Metric | Value
Total human votes | 44,387
Models evaluated | 9
Languages covered | English, Hindi, Arabic, Japanese, Portuguese, Vietnamese
Use case categories | 4
Votes with written comments | 21,756 (49.0%)
Matchups ending in a tie | 17,682 (39.8%)
Total defect flags raised | 67,263

Introduction

If you are building a text-to-speech model today, you already know the standard benchmark numbers. You have seen automated scores. You have run internal evaluations. You know how your model performs on the metrics machines can measure.

But many of the failures that matter most to users are invisible in those metrics. A benchmark can tell you whether two audio clips are acoustically similar. It cannot tell you that a customer-support voice added a word that was never present in the script. It cannot tell you that a booking ID became difficult to understand in Japanese. It cannot tell you that a model sounds authoritative in English but robotic in Hindi.

That is exactly what VoiceArena's evaluation data captures. Across 44,387 human pairwise evaluations covering 9 leading TTS models and 6 languages, evaluators did more than select winners — nearly half wrote comments explaining their decisions in plain language. The result is one of the most detailed human-preference datasets available for multilingual text-to-speech evaluation.

What emerges is a clear trend: the gap between leading models is increasingly defined not by voice quality alone, but by reliability. The strongest systems consistently preserve meaning, pronunciation, pacing, and content fidelity across languages. The weaker systems fail in ways that are specific, measurable, and often surprisingly consistent.

How VoiceArena Evaluates

Most automated TTS benchmarks measure things machines can measure — word error rate, pitch consistency, spectral similarity. These are useful. They are also insufficient. They cannot capture what a Japanese listener hears when a voice mispronounces a kanji compound. They cannot capture the moment a customer support voice reads out a booking ID incorrectly and the listener asks it to repeat.

VoiceArena uses blind pairwise comparison. Two models read the same sentence. A human evaluator listens to both outputs and selects the better one. Model identities are hidden — removing brand bias and forcing evaluation to focus entirely on perceived quality.

Each sentence was drawn from one of four production-oriented categories:

Category | What It Tests
Gen-conv & AI assistants | Natural conversation and dialogue
Customer support & contact center | IDs, confirmations, codes, transactional content
Media & entertainment | Narration, commentary, storytelling
Content creation & education | Explanations, instructional content, structured speech

Evaluators could also tag specific failure modes across seven defect types: unnatural/robotic voice, mild mispronunciation, severe mispronunciation, irregular pacing, hallucinated extra content, missing words, and noise or distortion.

49% of all evaluations included a written comment — nearly 22,000 human observations explaining exactly what the evaluator heard. These comments reveal not only which model won, but why — and that diagnostic layer is what separates VoiceArena's data from automated benchmark scores alone.

The Leaderboard — What 44,000 Votes Decided

Chart 1 — Overall Rankings
TTS Model Win Rates — All Languages Combined
Head-to-head win rate across all pairwise matchups. Ties excluded. 44,387 total evaluations.
Gemini leads by 15.8 percentage points over the nearest competitor
Source: approved_votes_*.csv, all 6 language files. Fields: model_a_name, model_b_name, outcome. Win rate = wins ÷ (wins + losses). Ties excluded from denominator.
# | Model | Organisation | Win Rate | Wins | Losses | Ties
1 | Gemini 3.1 Flash TTS | Google DeepMind | 86.9% | 6,851 | 1,031 | 4,855
2 | Eleven v3 | ElevenLabs | 71.1% | 5,300 | 2,156 | 5,284
3 | Bulbul-V3 | Sarvam AI | 53.1% | 823 | 726 | 924
4 | S2 Pro | Fish Audio | 42.4% | 403 | 548 | 851
5 | Sonic-3 | Cartesia | 40.3% | 3,192 | 4,720 | 4,947
6 | GPT-4o-mini TTS | OpenAI | 39.7% | 3,025 | 4,597 | 5,070
7 | Speech 2.8 HD | MiniMax | 39.6% | 1,976 | 3,013 | 3,148
8 | Grok TTS | xAI | 37.6% | 2,775 | 4,611 | 5,280
9 | Dragon HD Omni | Microsoft Azure | 30.8% | 2,360 | 5,303 | 5,005
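
The win-rate definition in the source note (wins ÷ (wins + losses), ties excluded) can be sketched in a few lines of Python. This is a minimal illustration, not VoiceArena's actual pipeline: the column names come from the source note, but the sample rows, the model names, and the outcome labels "a", "b", and "tie" are assumptions made for the example.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the shape of approved_votes_*.csv. Column names follow
# the source note; the outcome labels "a", "b", "tie" are an assumption.
SAMPLE_CSV = """model_a_name,model_b_name,outcome
Model X,Model Y,a
Model X,Model Y,tie
Model Y,Model X,a
Model X,Model Z,a
Model Z,Model Y,b
"""

def tally(rows):
    """Count wins, losses, and ties per model from pairwise vote rows."""
    win_counts, loss_counts, tie_counts = Counter(), Counter(), Counter()
    for row in rows:
        a, b, outcome = row["model_a_name"], row["model_b_name"], row["outcome"]
        if outcome == "tie":
            tie_counts[a] += 1
            tie_counts[b] += 1
        elif outcome == "a":
            win_counts[a] += 1
            loss_counts[b] += 1
        elif outcome == "b":
            win_counts[b] += 1
            loss_counts[a] += 1
    return win_counts, loss_counts, tie_counts

def win_rate(wins, losses):
    """Win rate over decisive matchups only; ties never enter the denominator."""
    decisive = wins + losses
    return wins / decisive if decisive else float("nan")

win_counts, loss_counts, tie_counts = tally(csv.DictReader(io.StringIO(SAMPLE_CSV)))
for model in sorted(set(win_counts) | set(loss_counts) | set(tie_counts)):
    rate = win_rate(win_counts[model], loss_counts[model])
    print(model, win_counts[model], loss_counts[model], tie_counts[model], round(rate, 3))

# Applying the same formula to a leaderboard row: Gemini, 6,851 wins, 1,031 losses.
print(round(100 * win_rate(6851, 1031), 1))
```

Because ties are dropped from the denominator, a model that ties often can post a high win rate on comparatively few decisive votes, which is why the Ties column is worth reading alongside the rate.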

Three observations stand out. First, the gap at the top is unusually large: Gemini wins nearly nine out of every ten decisive matchups, while ElevenLabs establishes itself as a clear second at 71.1%. Further down, four models cluster tightly between 37.6% and 40.3% (Sonic-3, GPT-4o-mini TTS, Speech 2.8 HD, and Grok TTS).

Second, the high tie rate is meaningful. Nearly 40% of all evaluations ended without a clear winner — suggesting real convergence in voice quality among frontier systems. The differentiators emerge in harder conditions: complex sentences, structured data, non-English languages.

Third, Sarvam AI's position deserves careful interpretation. Bulbul-V3 ranks third overall, ahead of OpenAI, Microsoft, Cartesia, and xAI. However, Sarvam participated in substantially fewer evaluations (2,473 appearances) than frontier models evaluated at 12,000+ appearances. The directional signal is real and consistent, but should be read with wider uncertainty than higher-volume models.
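
One way to make "wider uncertainty" concrete is a Wilson score interval on each model's decisive-matchup win rate. The sketch below uses the wins and losses from the leaderboard table. Treating every vote as an independent trial is an assumption made only for illustration (the same evaluators and sentences recur), and the choice of interval is ours, not part of the VoiceArena methodology.

```python
import math

def wilson_interval(wins, losses, z=1.96):
    """Wilson score interval (95% by default) for wins / (wins + losses)."""
    n = wins + losses
    p = wins / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

# (wins, losses) copied from the leaderboard table.
leaderboard = {
    "Gemini 3.1 Flash TTS": (6851, 1031),
    "Bulbul-V3": (823, 726),
    "S2 Pro": (403, 548),
}

for model, (w, l) in leaderboard.items():
    low, high = wilson_interval(w, l)
    print(f"{model}: {w / (w + l):.1%} (95% interval {low:.1%} to {high:.1%})")
```

Under these assumptions the interval for Bulbul-V3 is several times wider than the one for Gemini, which matches the caution above about lower-volume models.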

Why Gemini Wins — Consistently, Across Everything

Chart 2 — Gemini Multilingual Performance
Gemini 3.1 Flash TTS — Win Rate by Language
Gemini's highest win rate appears in Japanese (91.5%) rather than English (80.9%) — an uncommon pattern in multilingual TTS evaluation.
727W/172L in English · 1,380W/128L in Japanese
Source: Rows where model_a_name or model_b_name contains "Gemini". Win rate = wins ÷ (wins + losses) per language. Ties excluded. n: EN=899, HI=1,516, VI=873, PT=1,239, AR=1,847, JA=1,508.
Language | Win Rate | Wins | Losses
Japanese | 91.5% | 1,380 | 128
Arabic | 89.7% | 1,657 | 190
Portuguese | 88.5% | 1,097 | 142
Vietnamese | 85.6% | 747 | 126
Hindi | 82.0% | 1,243 | 273
English | 80.9% | 727 | 172

Gemini's performance is not driven by a single strength. It does not dominate one language while remaining merely competitive elsewhere; the advantage holds across the entire benchmark, and in every non-English language its win rate is higher than in English.

Japanese is one of the hardest languages for TTS systems: complex pitch accent rules, mixed scripts combining hiragana, katakana, and kanji, and number reading conventions that differ fundamentally from phonetic languages. The fact that Gemini performs most strongly here suggests multilingual capability is a core design property, not a feature layered on top of an English-first architecture.

The same consistency appears across use cases. Customer support, a category filled with tracking numbers, booking IDs, and payment references, maintained an 87.2% Gemini win rate, nearly identical to its performance in simpler categories.

"The audio is clear, very expressive and natural sounding. The pacing is good."
"Very clear, good pace, numbers are easy to understand."
"Good timing, clear pronunciation, natural delivery."

The evidence points toward a simple conclusion: Gemini does not merely sound better. It fails less often. And when it does fail, those failures tend to be less consequential than the ones observed in competing systems.

When Reliability Breaks: Grok's Hallucination Problem

Chart 3 — Content Reliability
Hallucination Rate by Model
% of model appearances where evaluators flagged extra content not present in the original script.
Grok: 1,323 flags / 12,666 appearances = 10.45% — approximately 3.9× Gemini's rate (2.69%)
Source: For each model appearance, checked whether defect_tags_on_a or defect_tags_on_b contained "hallucination_extra_content". Rate = flagged ÷ total appearances. Each vote row contributes two model appearances.
Model | Hallucination Rate | Flagged | Appearances
xAI Grok TTS | 10.45% | 1,323 | 12,666
Sarvam AI Bulbul-V3 | 6.51% | 161 | 2,473
Microsoft Azure Dragon HD Omni | 6.46% | 818 | 12,668
MiniMax Speech 2.8 HD | 6.06% | 493 | 8,137
Cartesia Sonic-3 | 6.03% | 775 | 12,859
OpenAI GPT-4o-mini TTS | 5.77% | 732 | 12,692
ElevenLabs Eleven v3 | 4.32% | 551 | 12,740
Google DeepMind Gemini 3.1 Flash | 2.69% | 343 | 12,737
Fish Audio S2 Pro | 2.22% | 40 | 1,802
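
The per-appearance counting described in the source note can be sketched as follows. The field names and the tag "hallucination_extra_content" are taken from that note. The sample rows, the model names, the other tag names, and the assumption that tags are stored as a comma-separated string are all hypothetical.

```python
from collections import Counter

TAG = "hallucination_extra_content"

# Hypothetical vote rows. Storing tags as a comma-separated string is an
# assumption about the file format, as are the non-hallucination tag names.
votes = [
    {"model_a_name": "Model X", "model_b_name": "Model Y",
     "defect_tags_on_a": "hallucination_extra_content,irregular_pacing",
     "defect_tags_on_b": ""},
    {"model_a_name": "Model Y", "model_b_name": "Model X",
     "defect_tags_on_a": "hallucination_extra_content",
     "defect_tags_on_b": ""},
    {"model_a_name": "Model X", "model_b_name": "Model Z",
     "defect_tags_on_a": "",
     "defect_tags_on_b": "missing_words"},
]

def has_tag(field, tag):
    """True if the comma-separated tag field contains the given tag."""
    return tag in [t.strip() for t in field.split(",") if t.strip()]

def hallucination_rates(rows):
    """Rate = flagged appearances / total appearances, per model."""
    appearances, flagged = Counter(), Counter()
    for row in rows:
        # Each vote row contributes two appearances: side a and side b.
        for side in ("a", "b"):
            model = row[f"model_{side}_name"]
            appearances[model] += 1
            if has_tag(row[f"defect_tags_on_{side}"], TAG):
                flagged[model] += 1
    return {model: flagged[model] / appearances[model] for model in appearances}

rates = hallucination_rates(votes)
print(rates)

# The headline ratio, recomputed from the table counts for Grok and Gemini.
grok_rate = 1323 / 12666
gemini_rate = 343 / 12737
ratio = grok_rate / gemini_rate
print(f"{grok_rate:.2%} vs {gemini_rate:.2%}, ratio {ratio:.1f}x")
```

Note that the denominator is appearances, ties included, so these rates are not directly comparable to the tie-excluded win rates on the leaderboard.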

There is a category of TTS failure that is fundamentally different from sounding robotic or mispronouncing a word. It occurs when the model does not say what it was given — it adds, alters, or invents content. A robotic voice may be unpleasant. A voice that changes information becomes untrustworthy.

The more revealing result emerges when Grok's hallucination rate is broken down by language:

Language | Hallucination Rate | Flagged | Appearances
Arabic | 18.0% | 478 | 2,653
Hindi | 10.5% | 248 | 2,371
Japanese | 10.3% | 227 | 2,195
Portuguese | 9.7% | 214 | 2,198
Vietnamese | 8.9% | 132 | 1,490
English | 1.4% | 24 | 1,759

In English, hallucinations appear at 1.4% — broadly manageable. In Arabic, nearly one in five Grok outputs was flagged for introducing content that did not exist in the source text.

"Added an extra word and did not pronounce the Arabic word correctly."
"Some words have been omitted and some other words have been added."
"Read the flight number incorrectly."
"Wrong number in saying 1422."

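Grok's win rate is highest where hallucinations are rarest: English 61.9%, Vietnamese 48.0%, Hindi 36.7%, Arabic 32.9%, Portuguese 32.1%, Japanese 27.5%. The relationship is not strictly monotonic. Japanese has the lowest win rate despite a mid-range hallucination rate of 10.3%, and Arabic has the highest hallucination rate but not the lowest win rate. Even so, the data is consistent with an association between content-fidelity failures and reduced human preference, and it shows that a model's English performance may not accurately reflect its reliability profile in other languages.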
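
A rank correlation gives a rough sense of how closely Grok's per-language win rates track its hallucination rates. The sketch below computes Spearman's coefficient from the twelve figures quoted in this section. The choice of statistic is ours rather than VoiceArena's, and with only six languages it is a descriptive summary, not a significance test.

```python
# Grok TTS figures quoted in this section:
# language -> (hallucination rate %, win rate %)
grok = {
    "English": (1.4, 61.9),
    "Vietnamese": (8.9, 48.0),
    "Portuguese": (9.7, 32.1),
    "Japanese": (10.3, 27.5),
    "Hindi": (10.5, 36.7),
    "Arabic": (18.0, 32.9),
}

def ranks(values):
    """Rank values from 1 (smallest) upward; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman(xs, ys):
    """Spearman rank correlation via 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

hallucination = [h for h, _ in grok.values()]
win = [w for _, w in grok.values()]
rho = spearman(hallucination, win)
print(f"Spearman rho = {rho:.2f}")  # prints "Spearman rho = -0.54"
```

The coefficient is negative, so languages with more hallucination flags tend to have lower win rates, but it is well short of -1, so the ordering is far from perfect.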

Microsoft's Multilingual Gap

Chart 4 — Microsoft Multilingual Performance
Microsoft Azure Dragon HD Omni — Win Rate by Language
Win rate falls from 48.6% in English to 12.6% in Japanese — a 36 percentage point collapse across the same model.
Loses nearly 7 of 8 decisive matchups in Japanese (176W / 1,226L)
Source: Rows where model_a_name or model_b_name contains "Microsoft". Win rate = wins ÷ (wins + losses) per language. Ties excluded. n: JA=1,402, HI=1,601, AR=1,649, PT=1,254, VI=877, EN=880.
Language | Win Rate | Wins | Losses | Robotic Voice Flag Rate
Japanese | 12.6% | 176 | 1,226 | 42.6%
Hindi | 29.2% | 467 | 1,134 | 36.1%
Arabic | 33.0% | 544 | 1,105 | 25.6%
Portuguese | 33.2% | 416 | 838 | 36.9%
Vietnamese | 37.5% | 329 | 548 | 37.6%
English | 48.6% | 428 | 452 | 13.2%
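
The per-language figures can be cross-checked against the overall leaderboard with a short script. The sketch below recomputes each win rate from the wins and losses in the table and confirms that the language rows add up to Microsoft's overall totals of 2,360 wins and 5,303 losses. All numbers are copied from this report; the variable and function names are only illustrative.

```python
# Microsoft Azure Dragon HD Omni: language -> (wins, losses), from the table.
by_language = {
    "Japanese": (176, 1226),
    "Hindi": (467, 1134),
    "Arabic": (544, 1105),
    "Portuguese": (416, 838),
    "Vietnamese": (329, 548),
    "English": (428, 452),
}
OVERALL = (2360, 5303)  # (wins, losses) from the overall leaderboard

def win_rate_pct(wins, losses):
    """Win rate in percent over decisive matchups (ties excluded)."""
    return 100 * wins / (wins + losses)

total_wins = sum(w for w, _ in by_language.values())
total_losses = sum(l for _, l in by_language.values())
totals_match = (total_wins, total_losses) == OVERALL
print("Language rows sum to leaderboard totals:", totals_match)

for language, (w, l) in by_language.items():
    print(f"{language}: n={w + l}, win rate {win_rate_pct(w, l):.1f}%")

# English-to-Japanese gap in percentage points.
gap = win_rate_pct(*by_language["English"]) - win_rate_pct(*by_language["Japanese"])
print(f"English minus Japanese: {gap:.1f} points")
```

A check like this is cheap to run whenever per-language breakdowns are published next to an aggregate, since a mismatch usually points to a filtering or transcription error.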

In Japanese, Microsoft loses nearly seven out of eight decisive matchups. Its 4,072 robotic-voice flags are the highest total any model accumulated on a single defect type. That pattern concentrates heavily in non-English languages, where the robotic-voice flag rate runs roughly two to three times the English rate of 13.2% and peaks at 42.6% in Japanese.

"Mispronunciation and unnatural tone. Robotic."
"Severe mispronounced words. Robotic voice."
"Read it too fast, sounding like a robot."
"The voice sounds more robotic despite maintaining clarity."

The broader finding is worth stating plainly: supporting a language is not the same as mastering it. A model may technically generate speech in dozens of languages. What matters in practice is whether users consistently prefer that speech when compared against competing systems. On that measure, Microsoft performs very differently depending on language.

The Universal Failure: Structured Content

Before examining which defects hurt win rates most, it is worth identifying the one category of input that consistently causes problems across every model in this benchmark.

Not casual conversation. Not emotional speech. Not technical vocabulary. Structured content.

Numbers, tracking IDs, booking references, payment codes, legal identifiers, timestamps, and mixed alphanumeric strings repeatedly emerged as some of the most challenging inputs across the evaluation. Customer-support scenarios — the category most heavily populated with transactional content — generated the highest defect rates observed anywhere in the benchmark at 52.2%.

"The number was read with unnatural pauses between digits."
"Wrong number in saying 1422."
"The acronym was pronounced as a word instead of individual letters."
"Flight number VJ-2284 was read incorrectly."
"License number was not pronounced correctly."
"Wrong Arabic number form used."

These failures appeared across models, languages, and use cases. Even Gemini — the strongest overall performer — occasionally struggled with acronym handling and structured references. No model is immune.

That matters because structured content is not an edge case. It appears most frequently in the applications where reliability matters most: customer support, financial services, healthcare, logistics, and navigation. A conversational error may be annoying. A booking reference error can be costly.

Humans Forgive Pronunciation Imperfections. They Don't Forgive Wrong Words.

Chart 5 — Original Finding
Win Rate When a Specific Defect Is Flagged on That Model's Output
When a human evaluator flags a defect — how often does the flagged model still win? Baseline (no defect flagged): 45.1% win rate.
Baseline — no defect flagged: 45.1% (20,602 wins / 45,689 appearances)
Severe mispronunciation reduces win rate to 5.1%. Mild mispronunciation: 18.1%.
Source: For each defect tag, identified model appearances where that tag was present. Win rate = wins ÷ total tagged appearances. Raw counts: Hallucination 435W/5,236 · Severe Mispr. 539W/10,641 · Missing Word 411W/4,044 · Robotic 1,771W/16,873 · Pacing 1,215W/10,499 · Noise 642W/3,748 · Mild Mispr. 2,932W/16,222. Baseline: 20,602W/45,689.
Defect | Win Rate When Present | Drop From Baseline | Raw Count
Severe mispronunciation | 5.1% | −40.0 pts | 539W / 10,641
Hallucination (extra content) | 8.3% | −36.8 pts | 435W / 5,236
Missing word | 10.2% | −34.9 pts | 411W / 4,044
Unnatural / robotic voice | 10.5% | −34.6 pts | 1,771W / 16,873
Irregular pacing | 11.6% | −33.5 pts | 1,215W / 10,499
Noise / distortion | 17.1% | −28.0 pts | 642W / 3,748
Mild mispronunciation | 18.1% | −27.0 pts | 2,932W / 16,222
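
The table can be reproduced directly from the raw counts in the source note. The sketch below applies the stated formula (wins ÷ total tagged appearances) and subtracts each result from the no-defect baseline. The counts are copied from this report; the variable names are illustrative only.

```python
# (wins, tagged appearances) per defect, from the source note.
defects = {
    "Hallucination (extra content)": (435, 5236),
    "Severe mispronunciation": (539, 10641),
    "Missing word": (411, 4044),
    "Unnatural / robotic voice": (1771, 16873),
    "Irregular pacing": (1215, 10499),
    "Noise / distortion": (642, 3748),
    "Mild mispronunciation": (2932, 16222),
}
BASELINE = (20602, 45689)  # wins and appearances with no defect flagged

def rate_pct(wins, appearances):
    """Win rate in percent over ALL tagged appearances, ties included."""
    return 100 * wins / appearances

baseline_rate = rate_pct(*BASELINE)

# Sort from most damaging (lowest win rate) to least damaging.
table = sorted(
    ((name, rate_pct(w, n), baseline_rate - rate_pct(w, n))
     for name, (w, n) in defects.items()),
    key=lambda row: row[1],
)

print(f"Baseline: {baseline_rate:.1f}%")
for name, rate, drop in table:
    print(f"{name}: {rate:.1f}% (drop {drop:.1f} pts)")

# The tagged appearances add up to the total number of defect flags raised.
total_flags = sum(n for _, n in defects.values())
print("Total defect flags:", total_flags)
```

Two details matter when reading these numbers. The denominator here includes ties, unlike the leaderboard win rate, so 45.1% is the right comparison point rather than 50%. And because one output can carry several tags, the same appearance may be counted under more than one defect.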

Every evaluated defect is associated with a substantial reduction in win probability. But the hierarchy is revealing. Severe mispronunciation is associated with the lowest win rate — 5.1%. A model flagged for severe mispronunciation wins only 1 in 20 matchups.

At the opposite end sits mild mispronunciation. Human evaluators clearly notice it. They mention it frequently. Yet models flagged for mild mispronunciation still win far more often than models that alter, omit, or invent content.

This distinction reveals something important about how listeners evaluate voice systems. A minor pronunciation imperfection is interpreted as a quality issue. A wrong word is interpreted as a reliability issue. One affects polish. The other affects trust. For model builders, the implication is direct: polishing away mild pronunciation imperfections while allowing content-fidelity failures is optimizing the wrong objective.

What This Means for Model Builders

1. Structured-Content Robustness Is Becoming a Competitive Advantage

Numbers, booking references, tracking IDs, payment codes, and mixed-language strings remain difficult for every evaluated model. These failures appear most frequently in the domains where accuracy matters most: customer support, healthcare, finance, logistics. The model that handles a Portuguese payment reference or a Japanese booking code with the same confidence as conversational English will create a meaningful deployment advantage.

2. Multilingual Depth Matters More Than Multilingual Coverage

Many modern TTS systems support dozens of languages. That alone is no longer a differentiator. Gemini's strongest performance appears in Japanese rather than English. Microsoft's performance drops 36 points between English and Japanese. Grok's hallucination profile changes substantially depending on language. Coverage is easy to list. Depth is harder to achieve.

3. Hallucination Is a Trust Problem, Not a Quality Problem

A robotic voice may reduce user satisfaction. A hallucinated word changes meaning. These are not equivalent failures. Healthcare systems cannot invent dosage instructions. Financial systems cannot alter payment references. Customer-support systems cannot change booking identifiers. As voice AI integrates into critical workflows, trust will become an increasingly important competitive factor.

Limitations

This analysis relies on evaluator-reported defect annotations collected during pairwise human evaluations. Several limitations should be considered when interpreting the findings.

First, multiple defects may co-occur within the same output. Defect-specific win rates should therefore be interpreted as associations rather than isolated causal effects.

Second, models were evaluated at different volumes. Systems with fewer appearances have wider uncertainty ranges than models evaluated thousands of times. Sarvam AI's results in particular should be read with this in mind.

Finally, human preference is inherently subjective. Pairwise evaluation captures perceived quality rather than an objective measure of correctness.

The Question This Data Raises

Ten years ago, the challenge in speech synthesis was making machines sound human. The research agenda was clear: improve naturalness.

Today, most frontier systems can sound human. The challenge has shifted.

Across 44,387 human evaluations, listeners repeatedly penalized models not for lacking expressiveness — but for violating expectations of reliability. Misreading a number. Adding a word. Dropping a word. Failing to preserve meaning across languages.

Nearly 40% of all matchups ended in a tie. Human listeners increasingly struggle to separate frontier systems on voice quality alone. Voice quality is becoming commoditized. Trust is not.

The model that earns trust — that says exactly what it was given, in any language, across any content type, without adding, removing, or distorting — will not simply top a leaderboard. It will power the applications people actually depend on.

Voice quality brought TTS to the frontier. Reliability will determine who stays there.
