
Mistral Open-Sources Voxtral TTS: A 3.4B-Parameter Speech Model That Fits on a Smartwatch

March 26, 2026


A text-to-speech model that runs on a smartwatch and clones your voice from a five-second clip. That's Voxtral TTS, Mistral's newest open-weight release, and it's a direct shot at ElevenLabs, PlayHT, and every other cloud-dependent voice API.

What Mistral Actually Built

Voxtral TTS is a three-part system: a 3.4-billion-parameter transformer decoder backbone (built on Ministral 3B, the same base as their transcription model), a 390-million-parameter flow-matching acoustic transformer, and a 300-million-parameter neural audio codec that Mistral developed in-house.

The practical numbers: 90-millisecond time-to-first-audio on a typical input, generation at roughly six times real-time speed, and support for nine languages - English, French, German, Spanish, Dutch, Portuguese, Italian, Hindi, and Arabic. Feed it a five-second voice sample and it reproduces subtle accents, inflections, and speech irregularities. It can also switch between languages mid-stream without losing the voice characteristics, which matters for dubbing and real-time translation.
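To put the speed figure in concrete terms, here is a minimal sketch. It assumes "six times real-time" means six seconds of audio are produced per second of compute; the function name and default value are illustrative and not part of any Mistral API.

```python
def generation_time_s(audio_seconds: float, realtime_factor: float = 6.0) -> float:
    """Compute time needed to synthesize `audio_seconds` of speech.

    Assumption: realtime_factor = seconds of audio produced per second of compute.
    """
    if realtime_factor <= 0:
        raise ValueError("realtime_factor must be positive")
    return audio_seconds / realtime_factor


def can_stream_without_gaps(realtime_factor: float) -> bool:
    """Playback never stalls if audio is generated faster than it is played."""
    return realtime_factor > 1.0


# A 30-second reply at the quoted ~6x speed.
compute_s = generation_time_s(30.0)
print(f"30 s of speech needs about {compute_s:.1f} s of compute")
print("Streams without gaps:", can_stream_without_gaps(6.0))
```

Under that reading, a 30-second reply takes about 5 seconds to generate in full. With streaming, the listener would only wait for the quoted time-to-first-audio, not the whole clip.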

Mistral is releasing the full model weights under an open license. Any company can download them, run the model on its own hardware - from a server rack down to a smartphone - and never send a single audio frame to a third party.

Who Should Care

The on-device angle is what separates this from the pack. ElevenLabs and OpenAI's TTS both require cloud API calls, which means network latency, per-minute costs, and data leaving your infrastructure. Running Voxtral TTS locally on a phone or laptop removes the network round trip, the per-minute fees, and the data exposure.

For voice AI assistants, customer support bots, and accessibility tools, that's a meaningful shift. A company building a voice agent can now self-host the entire speech pipeline - Voxtral Transcribe for speech-to-text, their LLM for reasoning, and Voxtral TTS for the response - without any external API dependency.
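The three-stage pipeline described above can be sketched as plain function composition. Everything in this example is hypothetical: the `VoiceAgent` class and the stub functions are stand-ins for illustration, not Mistral's APIs. A real deployment would replace the stubs with calls to locally hosted models.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoiceAgent:
    """Hypothetical wrapper that chains three self-hosted stages."""
    transcribe: Callable[[bytes], str]   # speech-to-text stage
    reason: Callable[[str], str]         # LLM stage
    synthesize: Callable[[str], bytes]   # text-to-speech stage

    def respond(self, audio_in: bytes) -> bytes:
        text_in = self.transcribe(audio_in)
        reply_text = self.reason(text_in)
        return self.synthesize(reply_text)


# Stub stages so the sketch runs without any models installed.
# They treat "audio" as UTF-8 text purely for demonstration.
def stub_transcribe(audio: bytes) -> str:
    return audio.decode("utf-8")


def stub_reason(text: str) -> str:
    return f"You said: {text}"


def stub_synthesize(text: str) -> bytes:
    return text.encode("utf-8")


agent = VoiceAgent(stub_transcribe, stub_reason, stub_synthesize)
print(agent.respond(b"hello"))
```

Keeping each stage behind a simple callable means any one of them can be swapped, for example a different LLM, without touching the other two.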

The smartwatch claim is ambitious. Running a 3.4B-parameter backbone (about 4.1B parameters once the acoustic transformer and codec are counted) on wrist hardware will require aggressive quantization (compressing the model to use less memory), and real-world quality at that compression level remains to be tested. But even at the smartphone tier, this is a strong open-weight alternative to the paid voice APIs that currently dominate the market.
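A rough back-of-the-envelope estimate shows why quantization matters here. The sketch below uses the parameter counts quoted above and assumes all three components are stored at the same precision. It counts weights only and ignores activations, caches, and runtime overhead, so real memory use would be higher. The precision levels are illustrative, not formats Mistral has announced.

```python
# Parameter counts as quoted in the article.
PARAMS = {
    "decoder_backbone": 3.4e9,
    "acoustic_transformer": 390e6,
    "audio_codec": 300e6,
}

# Illustrative precisions (assumption: uniform across all components).
BITS_PER_PARAM = {"fp16": 16, "int8": 8, "int4": 4}


def weight_gigabytes(num_params: float, bits_per_param: int) -> float:
    """Storage for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


total_params = sum(PARAMS.values())
footprints = {
    name: weight_gigabytes(total_params, bits)
    for name, bits in BITS_PER_PARAM.items()
}

print(f"Total parameters: {total_params / 1e9:.2f}B")
for name, gb in footprints.items():
    print(f"{name}: ~{gb:.1f} GB of weights")
```

By this estimate the weights take roughly 8.2 GB at 16-bit, 4.1 GB at 8-bit, and 2.0 GB at 4-bit precision. That is why a phone looks plausible and a watch, with far less memory, is the harder target.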
