Google's Gemini 3.1 Flash TTS Adds Natural Language Voice Controls and 70-Language Support

April 15, 2026 | Source: Google AI Blog

Google just shipped Gemini 3.1 Flash TTS, a new text-to-speech model aimed at developers building voice applications. It debuted with an Elo score of 1,211 (Elo is the rating system chess uses to rank relative strength) on the Artificial Analysis TTS leaderboard, placing it at the top of the quality-versus-cost chart among competing services.
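For a sense of what an Elo gap means, here is a minimal sketch of the standard chess Elo expected-score formula. Whether the Artificial Analysis leaderboard uses the same scale constants (base 10, divisor 400) is an assumption, and the 1,111 rating for the comparison model is purely hypothetical.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard chess Elo: expected share of head-to-head wins for A over B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# 1211 is the score reported in the article; 1111 is a made-up competitor
# rating, used only to show what a 100-point gap implies under chess Elo.
print(round(expected_score(1211, 1111), 2))  # about 0.64
```

Under those assumptions, a 100-point lead corresponds to the higher-rated model being preferred in roughly 64% of pairwise comparisons.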

The standout addition is Audio Tags: natural language instructions you write directly in your text to control how the voice sounds. Instead of learning SSML (Speech Synthesis Markup Language, the technical tag format most TTS APIs use), you write something like "[slowly, with gravity]" or "[upbeat, fast pace]" and the model interprets it. This removes the need for specialized markup knowledge to get expressive output.
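As a rough illustration, the sketch below composes tagged text in Python. The bracket syntax follows the two examples quoted above; the helper name `with_audio_tag` is hypothetical, and the snippet only builds a string. It does not call the Gemini API, whose request format is not described here.

```python
def with_audio_tag(text: str, *directions: str) -> str:
    """Prefix text with one bracketed, comma-separated direction tag."""
    cleaned = [d.strip() for d in directions if d.strip()]
    if not cleaned:
        return text  # no directions given: leave the text untouched
    return f"[{', '.join(cleaned)}] {text}"


# Hypothetical two-line script; each line carries its own delivery direction.
script = "\n".join([
    with_audio_tag("We gather here today to remember.", "slowly", "with gravity"),
    with_audio_tag("And now, the weather!", "upbeat", "fast pace"),
])
print(script)
```

The point of the comparison with SSML is that the directions stay readable as plain English inside the text itself, so a writer can edit them without touching any markup.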

What else shipped:

Multi-speaker dialogue - distinct voices for multiple characters in a single generation, without stitching separate API calls together
Audio Profiles - per-speaker voice settings that stay consistent across sessions, useful for branded voice applications or game characters
Scene direction - environmental context that shapes delivery, so the model delivers lines differently for a "crowded café" versus a "boardroom presentation"
SynthID watermarking - all outputs carry an imperceptible watermark identifying them as AI-generated audio
70+ language support with localized expressiveness, not just translated words but adapted delivery per language
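To make the multi-speaker, profile, and scene features concrete, here is one way an application might organize a dialogue before sending it off in a single request. Everything in this sketch is an assumption: `SpeakerProfile`, `Scene`, and the rendered text layout are hypothetical app-side structures, not the format Google's API expects.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SpeakerProfile:
    """Hypothetical app-side record of a speaker's persistent voice settings."""
    name: str
    style: str  # free-text description, e.g. "warm, unhurried"


@dataclass
class Scene:
    """A scene direction plus the ordered lines spoken in it."""
    setting: str
    turns: list[tuple[SpeakerProfile, str]] = field(default_factory=list)

    def say(self, speaker: SpeakerProfile, line: str) -> "Scene":
        self.turns.append((speaker, line))
        return self  # returning self allows chained calls

    def render(self) -> str:
        """Flatten the scene into one plain-text script (layout is made up)."""
        # Deduplicate speakers by name, keeping first-seen order.
        speakers = {s.name: s for s, _ in self.turns}
        header = [f"Scene: {self.setting}"]
        header += [f"Voice for {s.name}: {s.style}" for s in speakers.values()]
        body = [f"{s.name}: {line}" for s, line in self.turns]
        return "\n".join(header + [""] + body)


ana = SpeakerProfile("Ana", "warm, unhurried")
ben = SpeakerProfile("Ben", "dry, clipped")
scene = (
    Scene("boardroom presentation")
    .say(ana, "Shall we start with the numbers?")
    .say(ben, "If we must.")
    .say(ana, "We must.")
)
print(scene.render())
```

Keeping profiles as reusable objects mirrors the idea of voice settings that stay consistent across sessions, while the single rendered script reflects generating all speakers in one call rather than stitching separate ones together.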

According to Google's announcement, the model is in preview via the Gemini API and Google AI Studio, with enterprise access through Vertex AI and integration in Google Vids. Pricing has not been announced.

The main competitors are ElevenLabs, which has dominated the expressive TTS market, and OpenAI's TTS API. Google hasn't published direct comparisons against either service. But if Google prices this model like its other Flash models, which run significantly cheaper than its flagship models, it could make high-volume voice generation substantially less expensive than ElevenLabs for developers who don't need that platform's actor-based voice marketplace.

Source

Google AI Blog: "Gemini 3.1 Flash TTS: the next generation of expressive AI speech"
