llama.cpp Adds Multi-Token Prediction, Speeding Up Local Model Inference

May 16, 2026

llama.cpp, the most popular open-source library for running large language models on your own hardware, just merged support for multi-token prediction (MTP). The change landed via PR #22673 on May 16 and is now in the main branch.

Multi-token prediction is a technique where a model proposes several tokens - the small text chunks AI models work with, roughly a word or part of a word each - per step instead of one. Standard generation works one token at a time: predict a token, add it to the context, predict the next. MTP lets the model draft multiple tokens at once using a secondary prediction head, then verify them with the main model in one pass. Drafted tokens the main model agrees with are kept, so each verification pass can yield more than one token.
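The draft-then-verify loop can be illustrated with a toy sketch. This is not llama.cpp's code or API: `main_next` and `draft_next_k` are hypothetical stand-ins for the main model and the secondary prediction head, decoding is assumed to be greedy, and the "single verification pass" is simulated position by position rather than run as one batched forward pass.

```python
from typing import List, Tuple

Token = int


def main_next(context: List[Token]) -> Token:
    """Hypothetical main model: a deterministic next-token rule."""
    return (sum(context) * 31 + len(context)) % 50


def draft_next_k(context: List[Token], k: int) -> List[Token]:
    """Hypothetical draft head: guesses k tokens, deliberately wrong at some positions."""
    ctx = list(context)
    guesses = []
    for _ in range(k):
        tok = main_next(ctx)
        if len(ctx) % 4 == 0:  # inject an occasional wrong guess
            tok = (tok + 1) % 50
        guesses.append(tok)
        ctx.append(tok)
    return guesses


def generate_sequential(prompt: List[Token], n: int) -> List[Token]:
    """Baseline: one token per main-model step."""
    ctx = list(prompt)
    for _ in range(n):
        ctx.append(main_next(ctx))
    return ctx[len(prompt):]


def generate_with_drafts(prompt: List[Token], n: int, k: int = 3) -> Tuple[List[Token], int]:
    """Draft k tokens, keep the prefix the main model agrees with, repeat."""
    ctx = list(prompt)
    passes = 0
    while len(ctx) - len(prompt) < n:
        draft = draft_next_k(ctx, k)
        passes += 1  # counts as one verification pass
        accepted: List[Token] = []
        for tok in draft:
            expected = main_next(ctx + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # main model's correction replaces the bad guess
                break
        else:
            # Every draft token matched, so the pass also yields one extra token.
            accepted.append(main_next(ctx + accepted))
        ctx.extend(accepted)
    return ctx[len(prompt):][:n], passes


if __name__ == "__main__":
    tokens, passes = generate_with_drafts([1, 2, 3], 20)
    print(tokens == generate_sequential([1, 2, 3], 20), passes)
```

In this toy setup the output matches plain one-token-at-a-time decoding exactly, because every kept token is one the main model would have produced anyway; the gain is that fewer main-model passes are needed whenever the drafts are mostly right.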

The practical result is faster text output. Models that already support MTP at the architecture level - DeepSeek V3 and DeepSeek R1 being the most notable - should generate text noticeably quicker when run locally through llama.cpp. No configuration required: if the model you load supports MTP, llama.cpp uses it automatically.

For users running local models on a gaming PC or Apple Silicon Mac, this is a meaningful speed improvement that doesn't require newer hardware. Cloud providers have used similar inference optimizations internally for some time; the merge brings llama.cpp up to date with those techniques for the self-hosted crowd.
