A consumer GPU running a 26-billion-parameter AI model draws somewhere between 150 and 350 watts of power, depending on the hardware. A developer just ran a model of the same size on 4 watts.
The model is Google's Gemma 4 26B in its A4B form - a 4-bit quantized version, meaning the model's weights are compressed to roughly a quarter of the memory that 16-bit weights would need, trading a small amount of output quality for a much smaller footprint. The hardware is a Rockchip NPU, a specialized processor designed for neural network workloads and separate from the main CPU, found in affordable single-board computers like the Orange Pi 5 and Radxa Rock 5B. These boards cost $80 to $200 and are normally used for lightweight video processing or image recognition, not for running models at this parameter count.
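To put a rough number on that footprint, here is a back-of-the-envelope sketch. It assumes a 16-bit baseline and exactly 4 bits per weight. Neither figure comes from the developer's write-up, and real 4-bit formats also store scale factors, so actual files run somewhat larger. It counts the weights only, not the extra memory a model needs while it runs.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


PARAMS = 26e9  # 26 billion parameters, taken from the model name

# Assumption: the uncompressed baseline stores each weight in 16 bits.
fp16_gb = weight_memory_gb(PARAMS, 16)
# Assumption: exactly 4 bits per weight, ignoring per-block scale factors.
q4_gb = weight_memory_gb(PARAMS, 4)

print(f"16-bit weights: {fp16_gb:.1f} GB")     # 52.0 GB
print(f"4-bit weights:  {q4_gb:.1f} GB")       # 13.0 GB
print(f"ratio:          {q4_gb / fp16_gb:.2f}")  # 0.25
```

Under those assumptions the weights shrink from about 52 GB to about 13 GB, which is the "roughly a quarter" figure.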
The developer made it work using a custom fork of llama.cpp - the open-source inference engine (software that runs a trained AI model to generate responses, as opposed to training one from scratch) that has become the standard for running large language models on consumer hardware without a dedicated GPU. Rockchip NPU support doesn't exist in the mainline project, so this required writing custom low-level code to interface with the chip's AI acceleration features.
The Power Math
The gap between 4 watts and 300 watts adds up quickly if you're running models continuously. At an electricity rate of about $0.12 per kWh, which is what these figures imply for the US average, a 300-watt GPU setup running 24 hours a day costs roughly $315 per year. Running around the clock at 4 watts costs around $4 per year - close to a 99% reduction.
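The arithmetic behind those figures is simple enough to check. The sketch below assumes a flat rate of $0.12 per kWh, a value back-calculated from the $315 figure rather than quoted from any utility, and assumes both devices draw constant power all year.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_cost_usd(watts: float, usd_per_kwh: float) -> float:
    """Yearly electricity cost of a device drawing constant power."""
    kwh_per_year = watts * HOURS_PER_YEAR / 1000
    return kwh_per_year * usd_per_kwh


RATE = 0.12  # assumed USD per kWh, inferred from the article's $315 figure

gpu_cost = annual_cost_usd(300, RATE)
npu_cost = annual_cost_usd(4, RATE)
reduction = 1 - npu_cost / gpu_cost

print(f"300 W GPU: ${gpu_cost:.2f} per year")  # $315.36
print(f"4 W NPU:   ${npu_cost:.2f} per year")  # $4.20
print(f"Reduction: {reduction:.1%}")           # 98.7%
```

The percentage reduction does not depend on the rate at all, since it is just 1 - 4/300. Only the dollar amounts change if your electricity costs more or less.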
The tradeoff is speed. The Rockchip NPU won't match a dedicated GPU on throughput, and the 4-bit compression reduces output quality slightly compared to the full-precision model. For applications like local document processing, a private AI assistant running on a home server, or edge inference in IoT devices where you need consistent responses rather than peak performance, the math starts to look compelling.
Open Source Doing the Hard Work
The custom llama.cpp fork is publicly available, though it's early-stage. Running it requires comfort with compiling from source and debugging hardware-level issues - this isn't a one-click install. Bringing Rockchip NPU support into the mainline llama.cpp project would significantly lower the barrier for everyone. That process typically starts exactly like this: someone demonstrates it's possible, publishes the results and code, and the pressure builds for official support.
Google designed the Gemma 4 architecture with edge deployment as an explicit goal, which partly explains why it's a viable candidate for this kind of experiment. Architecture decisions made during model design - layer sizes, attention patterns, quantization behavior - matter enormously when you're trying to run on hardware this constrained. Gemma 4's design made this easier than it would have been with most other models at this parameter count.