Google's Gemini 3.1 Flash TTS Adds Natural Language Voice Controls and 70-Language Support

Gemini 3.1 Flash TTS: the next generation of expressive AI speech
Image: Google

Google just shipped Gemini 3.1 Flash TTS, a new text-to-speech model aimed at developers building voice applications. It debuted with an Elo score of 1,211 on the Artificial Analysis TTS leaderboard, which ranks models using Elo, the rating system chess uses to measure relative strength. That puts it at the top of the quality-versus-cost chart among competing services.
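
For a sense of what an Elo gap means, here is a minimal sketch of the standard Elo expected-score formula. It assumes the leaderboard behaves like textbook Elo, which the announcement does not confirm, and the rival rating of 1,111 is a made-up number used only for illustration.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Chance that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


# 1,211 is the score reported in the article.
# 1,111 is a hypothetical rival rating, not a real leaderboard entry.
preference = elo_expected_score(1211, 1111)
print(f"{preference:.2f}")  # prints 0.64
```

Under those assumptions, a 100-point lead means listeners would be expected to prefer the higher-rated model in roughly 64% of head-to-head comparisons.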

The standout addition is Audio Tags: natural-language instructions written directly in the input text to control how the voice sounds. Instead of learning SSML (Speech Synthesis Markup Language, the XML-based tag format most TTS APIs use), you write something like "[slowly, with gravity]" or "[upbeat, fast pace]" and the model interprets it, so expressive output no longer requires specialized markup knowledge.
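
To make the contrast concrete, the sketch below builds a tagged script as a plain string and sets it next to an SSML fragment. The bracket placement (a tag in front of the sentence it affects) is inferred from the examples above rather than from Google's documentation, and `tagged` is a hypothetical helper, not part of any SDK.

```python
def tagged(direction: str, text: str) -> str:
    """Prefix text with a bracketed natural-language direction (assumed syntax)."""
    direction = direction.strip()
    if not direction:
        return text  # no direction given: leave the text untouched
    return f"[{direction}] {text}"


# Audio Tag style: plain text with inline directions.
script = " ".join([
    tagged("slowly, with gravity", "The results are in."),
    tagged("upbeat, fast pace", "And we won!"),
])
print(script)
# [slowly, with gravity] The results are in. [upbeat, fast pace] And we won!

# Rough SSML counterpart for the first sentence. The prosody element
# can set the rate, but has no direct way to say "with gravity".
ssml = '<speak><prosody rate="slow">The results are in.</prosody></speak>'
```

The resulting string would be sent as the text input to the TTS request. The request itself is omitted here because the preview model's identifier and parameters are not given in the announcement.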

What else shipped:

  • Multi-speaker dialogue - distinct voices for multiple characters in a single generation, without stitching separate API calls together
  • Audio Profiles - per-speaker voice settings that stay consistent across sessions, useful for branded voice applications or game characters
  • Scene direction - environmental context that shapes delivery, so the model reads lines differently for a "crowded café" versus a "boardroom presentation"
  • SynthID watermarking - all outputs carry an imperceptible watermark identifying them as AI-generated audio
  • 70+ language support - localized expressiveness, with delivery adapted per language rather than just translated words
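
As a rough illustration of how multi-speaker dialogue and scene direction could combine in a single input, the sketch below renders a short script. The "Scene:" header and "Speaker: line" layout are hypothetical conventions chosen for readability, not a format documented by Google, and `Line` and `render_dialogue` are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class Line:
    speaker: str
    text: str
    direction: str = ""  # optional Audio Tag-style direction


def render_dialogue(scene: str, lines: list[Line]) -> str:
    """Render a scene note plus speaker-labelled lines as one script string."""
    out = [f"Scene: {scene}"] if scene else []
    for line in lines:
        tag = f"[{line.direction}] " if line.direction else ""
        out.append(f"{line.speaker}: {tag}{line.text}")
    return "\n".join(out)


print(render_dialogue(
    "boardroom presentation",
    [
        Line("Host", "Let's look at the quarter.", "measured, confident"),
        Line("Analyst", "Revenue is up twelve percent."),
    ],
))
```

Keeping the whole exchange in one string mirrors the single-generation workflow described above, as opposed to synthesizing each speaker separately and stitching the audio together.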

According to Google's announcement, the model is in preview via the Gemini API and Google AI Studio, with enterprise access through Vertex AI and integration in Google Vids. Pricing has not been announced.

The main competition comes from ElevenLabs, which has dominated the expressive TTS market, and from OpenAI's TTS API. Google hasn't published direct comparisons against either service. But if Google prices this like its other Flash models, which run significantly cheaper than its flagship models, high-volume voice generation could become substantially less expensive than on ElevenLabs for developers who don't need that platform's actor-based voice marketplace.