Open Source

Llama.cpp Fixes Gemma 4's Broken KV Cache That Was Eating All Your VRAM

April 4, 2026 1 min read


Running Google's Gemma 4 locally just went from impractical to viable. An update to llama.cpp, one of the most popular tools for running large language models on your own hardware, fixes a bug that caused Gemma 4's KV cache to consume far more VRAM than it should.

The KV cache is the memory a model uses to keep track of your conversation as it generates responses. Every token (roughly a word or word fragment) the model processes gets stored there, so the cache grows as conversations get longer. With Gemma 4, this cache was broken in llama.cpp: it allocated far more video memory than it should have, to the point where even high-end GPUs couldn't handle normal conversations.
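To get a feel for how the cache scales, here is a rough back-of-the-envelope sketch using the standard estimate for a plain transformer decoder: two tensors (keys and values) per layer, each storing one vector per attention head for every cached token. The model shape below is hypothetical and is not Gemma 4's actual configuration, and real runtimes (including llama.cpp) can differ because of cache quantization, sliding-window layers, and other optimizations.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_elem=2):
    """Estimate KV cache size for a plain transformer decoder.

    Keys and values (hence the factor of 2) are stored for every layer,
    each holding n_kv_heads * head_dim elements per cached token.
    bytes_per_elem=2 assumes 16-bit cache entries.
    """
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem


# Hypothetical model shape for illustration only, NOT Gemma 4's real config.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 32, 8, 128

for n_tokens in (1_024, 8_192, 32_768):
    size_gib = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, n_tokens) / 2**30
    print(f"{n_tokens:>6} tokens -> {size_gib:.3f} GiB")
```

With these made-up numbers, each token costs 128 KiB of cache, so a 32,768-token context needs about 4 GiB. The growth is linear in context length, which is why a bug that over-allocates the cache can exhaust VRAM quickly.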

For anyone running Gemma 4 locally, this is the update you've been waiting for. The fix means the model now uses VRAM at expected levels, putting it back in line with what your hardware should actually handle. If you tried Gemma 4 on launch day, hit a wall of out-of-memory errors, and shelved it, now's the time to pull the latest llama.cpp build and try again.

