Running Google's Gemma 4 locally just went from impractical to viable. An update to llama.cpp, one of the most popular tools for running large language models on your own hardware, fixes a bug that caused Gemma 4's KV cache to consume far more VRAM than it should.
The KV cache is the memory a model uses to keep track of your conversation as it generates responses. Every token (roughly a word or word fragment) the model processes gets stored there, so the cache grows as conversations get longer. With Gemma 4, llama.cpp handled this cache incorrectly: it allocated far more video memory than the model actually needs, to the point where even high-end GPUs couldn't handle normal conversations.
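To get a feel for why cache size matters, here is a rough back-of-the-envelope sketch in Python. It uses the standard estimate for a transformer with full attention in every layer: one key tensor and one value tensor per layer, each scaling with context length. The function name and the model dimensions are made up for illustration and are not Gemma 4's real configuration; models that mix in sliding-window layers or quantize the cache will deviate from this.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem=2):
    """Rough KV cache size for a transformer using full attention in every layer."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    # bytes_per_elem=2 assumes a 16-bit cache (e.g. f16).
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Hypothetical dimensions, for illustration only (not Gemma 4's actual config).
for n_ctx in (4096, 32768, 131072):
    size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, n_ctx=n_ctx)
    print(f"{n_ctx:>7} tokens -> {size / 2**30:.2f} GiB")
```

With these assumed numbers the cache grows linearly from 0.50 GiB at 4,096 tokens to 16.00 GiB at 131,072 tokens, so an allocation that overshoots by even a small multiple can exhaust a GPU quickly.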
For anyone running Gemma 4 locally, this is the update you've been waiting for. The fix brings the model's VRAM use back to expected levels, in line with what your hardware can actually handle. If you tried Gemma 4 on launch day, hit a wall of out-of-memory errors, and shelved it, now's the time to pull the latest llama.cpp build and try again.