1.15 gigabytes. That's the total memory footprint of PrismML's new Bonsai 8B model, a full large language model that fits comfortably on an iPhone. For comparison, a standard 8-billion-parameter model stored at 16-bit precision needs roughly 16 GB (and twice that at 32-bit), which rules out most consumer hardware without cloud offloading.
PrismML, a startup out of Caltech led by electrical engineering professor Babak Hassibi, released Bonsai today in three sizes (1.7B, 4B, and 8B parameters) under the Apache 2.0 license. The models are available now on GitHub and Hugging Face.
How 1-Bit Weights Actually Work
In a typical LLM, each "weight" (one of billions of tiny numerical values the model uses to process language) is stored as a 16-bit or 32-bit floating point number. That precision adds up fast across billions of parameters.
Bonsai takes a radical approach: each weight is reduced to just its sign, positive or negative (+1 or -1), with a single shared scaling factor for each group of weights. That's the "1-bit" part. Instead of storing a precise number like 0.0342, the model just stores "positive" and adjusts the magnitude at the group level.
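To make the idea concrete, here is a minimal Python sketch of sign-plus-scale quantization for a single group of weights. PrismML has not described in this article how it computes the shared scaling factor, so the sketch assumes the mean absolute value of the group, a common choice in binary-weight research. The function names, the group size of four, and the sample values are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def quantize_group(weights: List[float]) -> Tuple[List[int], float]:
    """Reduce one group of weights to signs plus a single shared scale."""
    # Assumption: the shared scale is the mean absolute value of the group.
    scale = sum(abs(w) for w in weights) / len(weights)
    # Each weight keeps only its sign: +1 or -1 (one bit of information).
    signs = [1 if w >= 0 else -1 for w in weights]
    return signs, scale


def dequantize_group(signs: List[int], scale: float) -> List[float]:
    """Rebuild approximate weights: every magnitude becomes the group scale."""
    return [s * scale for s in signs]


# Hypothetical group of four weights, including the 0.0342 example above.
group = [0.0342, -0.0105, 0.0218, -0.0467]
signs, scale = quantize_group(group)
approx = dequantize_group(signs, scale)

print(signs)            # the sign of each weight
print(round(scale, 4))  # the one magnitude shared by the whole group
print([round(a, 4) for a in approx])
```

In this toy group, 0.0342 comes back as roughly 0.0283: the sign survives, but the exact magnitude is replaced by the group average. That is the trade being made, and real implementations would pack the signs into bits rather than Python integers.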
The result is a model that PrismML claims is 14x smaller than its full-precision equivalent, runs 8x faster on edge hardware (phones, laptops, tablets), and uses 5x less energy.
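The size claim can be sanity-checked with simple arithmetic. The sketch below assumes exactly 8 billion parameters and 1 GB = 10^9 bytes, neither of which the article confirms, and uses a hypothetical helper named footprint_gb. It compares the 1.15 GB figure against 16-bit and 32-bit storage and backs out the effective number of bits per weight.

```python
PARAMS = 8e9        # assumption: exactly 8 billion parameters
BONSAI_GB = 1.15    # footprint reported for Bonsai 8B


def footprint_gb(params: float, bits_per_weight: float) -> float:
    """Storage needed for the weights alone, in gigabytes (10^9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


fp16_gb = footprint_gb(PARAMS, 16)  # 16-bit storage
fp32_gb = footprint_gb(PARAMS, 32)  # 32-bit storage

# How many times smaller the reported footprint is than 16-bit storage.
ratio_vs_fp16 = fp16_gb / BONSAI_GB

# Average bits actually spent per weight, including any shared scales.
effective_bits = BONSAI_GB * 1e9 * 8 / PARAMS

print(fp16_gb, fp32_gb)
print(round(ratio_vs_fp16, 1))
print(round(effective_bits, 2))
```

Under these assumptions the ratio against 16-bit storage works out to about 13.9, in line with the 14x claim, and the effective cost is about 1.15 bits per weight. The small overhead above one bit is consistent with storing a shared scaling factor per group, though the exact group size and scale format are not stated here.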
Do the Benchmarks Hold Up?
The obvious question: does crushing a model down to 1-bit weights destroy its quality? PrismML says Bonsai 8B remains competitive with standard 8B-class models on common benchmarks including MMLU Redux (general knowledge), MuSR (reasoning), and GSM8K (math).
The company highlights what it calls "intelligence density": benchmark performance divided by model size in gigabytes. Bonsai 8B scores 1.06 per GB on this metric versus 0.10 per GB for Qwen3 8B. That's roughly a 10x difference in how much capability you get per gigabyte of memory.
Those are PrismML's own numbers, and independent benchmarks will tell the real story. But even if there's some quality loss at the margins, a model that runs natively on a phone without touching a server is solving a different problem than a cloud model chasing leaderboard scores.
Who This Is For
Bonsai runs natively on Apple devices (Mac, iPhone, iPad) through Apple's MLX framework and on Nvidia GPUs via llama.cpp. That covers a wide range of hardware most people already own.
The practical use cases are situations where you need an LLM but can't or don't want to send data to the cloud: offline use, privacy-sensitive tasks, low-latency applications, or just avoiding API costs. A 1.15 GB model that runs on-device at a claimed 8x the speed of its full-precision equivalent starts to look like a viable local assistant rather than a toy demo.
The 1-bit quantization approach has been explored in research papers for a couple of years now, but PrismML shipping production-ready models with an Apache 2.0 license makes it real for developers who want to build on-device AI products today. The gap between cloud-only and on-device LLMs just got noticeably smaller.