
Google Quietly Built Multi-Token Prediction Into Gemma 4 - Community Found It First

April 7, 2026


Community researchers digging into Gemma 4's internals found something Google hadn't put in the release notes: multi-token prediction.

MTP - multi-token prediction - is a technique where the model produces several output tokens per forward pass instead of one. A token is roughly three-quarters of a word. Standard models run one full forward pass through their neural network for every token generated; MTP reduces the total number of passes required for a given output, which can increase generation speed by 20-50% depending on implementation and hardware. The output quality stays the same - it's a speed optimization at the processing layer, not a change to what the model knows or how it reasons.
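To make the arithmetic concrete, here is a minimal Python sketch of how an average tokens-per-pass figure translates into pass counts and speedup. The function names, the 1.5 tokens-per-pass average, and the 1.1 relative pass cost used in the note below are illustrative assumptions, not measured Gemma 4 figures.

```python
import math


def forward_passes(output_tokens: int, tokens_per_pass: float) -> int:
    """Forward passes needed to emit `output_tokens` tokens.

    `tokens_per_pass` is the assumed average number of tokens produced
    per pass: 1.0 for standard decoding, above 1.0 with MTP.
    """
    if output_tokens < 0:
        raise ValueError("output_tokens must be non-negative")
    if tokens_per_pass < 1.0:
        raise ValueError("tokens_per_pass must be at least 1.0")
    return math.ceil(output_tokens / tokens_per_pass)


def estimated_speedup(tokens_per_pass: float, pass_cost_ratio: float = 1.0) -> float:
    """Throughput relative to one-token-per-pass decoding.

    `pass_cost_ratio` is an assumed cost of one MTP pass divided by the
    cost of one standard pass (MTP passes may be slightly more expensive).
    """
    if pass_cost_ratio <= 0:
        raise ValueError("pass_cost_ratio must be positive")
    return tokens_per_pass / pass_cost_ratio
```

Under these assumed numbers, `forward_passes(1000, 1.5)` gives 667 passes instead of 1,000, and `estimated_speedup(1.5, 1.1)` comes out to about 1.36, a gain of roughly 36% that sits inside the 20-50% range quoted above.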

DeepSeek made MTP a documented feature in its V3 and R1 model releases, publishing architecture details about how it was implemented. Google appears to have included equivalent functionality in Gemma 4 without a comparable announcement. Developers using Gemma 4 may have been getting MTP benefits without knowing the mechanism behind them - or missing out on them entirely due to framework configuration.

Getting MTP Working in Practice

For developers running Gemma 4 locally, MTP support depends on whether your inference framework - the software layer that runs the model on your hardware - has implemented it. LM Studio and Ollama are two of the most common local tools, and whether MTP is active varies by version and config.

If MTP isn't enabled in your setup, generation speed may be slower than the hardware could theoretically support. The practical check: look at your framework's recent changelog for MTP or speculative decoding mentions, verify the feature is active in the model config, and compare tokens-per-second before and after enabling it.
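The before-and-after comparison can be scripted. The sketch below is framework-agnostic: `generate` is a hypothetical callable you would write yourself to wrap whatever interface your tool exposes, and it is assumed to return the number of tokens produced for a prompt. Nothing here is an LM Studio or Ollama API.

```python
import time
from typing import Callable, Sequence


def tokens_per_second(
    generate: Callable[[str], int],
    prompts: Sequence[str],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Average generation throughput over a set of prompts.

    `generate` is a user-supplied (hypothetical) wrapper that runs one
    prompt through the local model and returns the generated token count.
    `clock` is injectable so the timing logic can be tested.
    """
    total_tokens = 0
    start = clock()
    for prompt in prompts:
        total_tokens += generate(prompt)
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return total_tokens / elapsed


def percent_change(baseline: float, candidate: float) -> float:
    """Relative change of `candidate` over `baseline`, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (candidate - baseline) / baseline * 100.0
```

Run the same prompts with the same generation settings once with the feature off and once with it on, then pass both throughput figures to `percent_change`. Repeating each run a few times and discarding the first helps keep model-loading time out of the comparison.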

An Underdocumented Release

The MTP discovery fits a broader pattern with Gemma 4's launch. Community researchers have also surfaced details about the model's mixture-of-experts routing - the system that determines which portions of the model are active for any given input - that weren't front-and-center in Google's official materials.

For a model positioned partly at local and open-source developers - people who will, by definition, dig into the architecture - releasing features without documenting them creates unnecessary friction. Users who don't know to look for MTP won't configure it. Benchmark comparisons shared across the community may be measuring different effective configurations without realizing it.

The model itself is capable. The open question is how many developers are running a slower default configuration without realizing it.

