llama.cpp, the most popular open-source library for running large language models on your own hardware, just merged support for multi-token prediction (MTP). The change landed via PR #22673 on May 16 and is now in the main branch.
Multi-token prediction is a technique where a model proposes several tokens per step instead of one. (Tokens are the small text chunks AI models work with, roughly a word or part of a word each.) Standard generation works one token at a time: predict a token, add it to the context, predict the next. With MTP, a secondary prediction head drafts several tokens ahead, and the main model then checks the whole draft in a single pass. Drafted tokens that match what the main model would have produced are kept; the first mismatch and everything after it are discarded, so the speedup comes without the draft head's mistakes reaching the output.
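To make the draft-then-verify loop concrete, here is a minimal Python sketch. It is not llama.cpp's implementation: `main_model`, `draft_head`, and `mtp_step` are hypothetical toy stand-ins, tokens are plain integers, and decoding is assumed to be greedy. A real engine checks every drafted position in one batched forward pass; the sketch calls the toy main model once per position only to keep the logic readable.

```python
from typing import List, Tuple

Token = int


def main_model(context: List[Token]) -> Token:
    """Toy 'main model' (assumption): the next token is last + 1, modulo 10."""
    return (context[-1] + 1) % 10


def draft_head(context: List[Token], k: int) -> List[Token]:
    """Toy draft head (assumption): same rule, but deliberately wrong
    whenever the correct token would be a multiple of 4."""
    ctx = list(context)
    draft = []
    for _ in range(k):
        guess = (ctx[-1] + 1) % 10
        if guess % 4 == 0:
            guess = (guess + 1) % 10  # inject a drafting mistake
        draft.append(guess)
        ctx.append(guess)
    return draft


def mtp_step(context: List[Token], k: int) -> List[Token]:
    """One draft-and-verify step: returns the tokens to append."""
    accepted: List[Token] = []
    for tok in draft_head(context, k):
        expected = main_model(context + accepted)
        if tok != expected:
            # First mismatch: keep the main model's token, drop the rest.
            accepted.append(expected)
            return accepted
        accepted.append(tok)
    # Whole draft accepted: the verification pass yields one extra token.
    accepted.append(main_model(context + accepted))
    return accepted


def generate(prompt: List[Token], n_new: int, k: int = 3) -> Tuple[List[Token], int]:
    """Generate n_new tokens; also report how many verify steps were needed."""
    context = list(prompt)
    steps = 0
    while len(context) - len(prompt) < n_new:
        context.extend(mtp_step(context, k))
        steps += 1
    return context[: len(prompt) + n_new], steps


def generate_sequential(prompt: List[Token], n_new: int) -> List[Token]:
    """Baseline: one main-model call per generated token."""
    context = list(prompt)
    for _ in range(n_new):
        context.append(main_model(context))
    return context


if __name__ == "__main__":
    tokens, steps = generate([0], 12)
    print(tokens == generate_sequential([0], 12), steps)
```

The point of the sketch is that the output is identical to one-token-at-a-time generation while the number of main-model steps drops. How large that drop is depends on how often the draft is right.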
The practical result is faster text output. Models that already support MTP at the architecture level - DeepSeek V3 and DeepSeek R1 being the most notable - should generate text noticeably quicker when run locally through llama.cpp. No configuration required: if the model you load supports MTP, llama.cpp uses it automatically.
For users running local models on a gaming PC or Apple Silicon Mac, this is a meaningful speed improvement that doesn't require newer hardware. Cloud providers have used similar inference optimizations internally for some time; the merge brings those techniques to the self-hosted crowd.