
Lenovo's $4,099 ThinkStation PGX Puts 128GB of AI Compute on Your Desk

March 13, 2026

What does $4,099 buy you in local AI hardware right now? A box roughly the size of a hardcover book that can run 70-billion-parameter models entirely offline. Lenovo's ThinkStation PGX is the corporate-packaged version of Nvidia's GB10 Grace Blackwell chip, and it fills a gap that no consumer GPU currently can.

The pitch is simple: 128GB of unified memory shared between an Arm CPU and a Blackwell GPU, all in a 1-liter chassis that draws under 200 watts at peak load. For context, an RTX 5090 tops out at 32GB of video memory. That means large language models like Llama 3.2 90B or DeepSeek R1 70B cannot fit entirely in a single consumer GPU's memory. The ThinkStation PGX holds them entirely in memory.
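A rough way to see why these models overflow a 32GB card is to multiply parameter count by bytes per weight. The sketch below assumes weights dominate memory use and ignores the KV cache, activations, and runtime overhead, so real requirements run somewhat higher. The quantization widths it tries are illustrative choices, not the settings any particular benchmark used.

```python
# Rough weight-memory estimate: parameters x bits per weight / 8.
# Assumption: weights dominate; KV cache, activations, and runtime
# overhead are ignored, so real usage is somewhat higher.

def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in gigabytes (10^9 bytes)."""
    return params_billion * bits_per_weight / 8

def fits(params_billion: float, bits_per_weight: float, capacity_gb: float) -> bool:
    """True if the weights alone fit in the given memory capacity."""
    return weight_gb(params_billion, bits_per_weight) <= capacity_gb

RTX_5090_GB = 32   # VRAM on an RTX 5090
PGX_GB = 128       # unified memory on the ThinkStation PGX

if __name__ == "__main__":
    for name, params in [("70B", 70), ("90B", 90)]:
        for bits in (4, 8, 16):
            size = weight_gb(params, bits)
            print(f"{name} @ {bits}-bit: {size:.0f} GB | "
                  f"5090: {fits(params, bits, RTX_5090_GB)} | "
                  f"PGX: {fits(params, bits, PGX_GB)}")
```

Even at 4 bits per weight, a 70B model needs about 35GB for its weights alone, more than the 5090's 32GB. At 16 bits it would exceed even the PGX's 128GB, so running these models on the PGX implies a quantized build.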

The Numbers That Matter

Nvidia rates the GB10 at 1 petaflop of compute at FP4 precision (a 4-bit floating-point format used mainly for inference). In practical terms, ServeTheHome's benchmarks show the PGX pushing around 60 tokens per second on 120-billion-parameter models using llama.cpp, and 44 tokens per second on Qwen3 Coder 30B. For smaller models in the 8-14B range, prompt processing hits 5,000-10,000 tokens per second.
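To show how little a 4-bit float can represent, here is a minimal decoder assuming the common E2M1 layout (1 sign bit, 2 exponent bits, 1 mantissa bit, exponent bias of 1, subnormals when the exponent field is zero, no infinity or NaN), as defined for FP4 elements in the OCP Microscaling specification. The per-block scale factors that production FP4 formats attach to groups of values are deliberately left out, so this covers only the raw element encoding.

```python
# Decode a 4-bit E2M1 float: 1 sign bit, 2 exponent bits, 1 mantissa bit.
# Assumption: exponent bias 1, subnormals when exponent == 0, no inf/NaN
# (the E2M1 element layout from the OCP Microscaling spec). Block scale
# factors used by real FP4 formats are not modeled here.

def decode_e2m1(nibble: int) -> float:
    if not 0 <= nibble <= 0xF:
        raise ValueError("expected a 4-bit value")
    sign = -1.0 if nibble & 0b1000 else 1.0
    exponent = (nibble >> 1) & 0b11
    mantissa = nibble & 0b1
    if exponent == 0:
        # Subnormal: 2^(1 - bias) * (0 + m/2), with bias = 1
        magnitude = mantissa * 0.5
    else:
        # Normal: 2^(e - bias) * (1 + m/2)
        magnitude = 2.0 ** (exponent - 1) * (1 + mantissa * 0.5)
    return sign * magnitude

if __name__ == "__main__":
    # Only eight magnitudes exist; zero appears as both +0 and -0.
    print(sorted({abs(decode_e2m1(n)) for n in range(16)}))
    # → [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

With only eight magnitudes per value, FP4 trades precision for a quarter of the memory and bandwidth of 16-bit weights, which is why it suits inference rather than training.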

Image generation is respectable too: 23 images per minute from Flux.1 at 1K resolution, or roughly one image every 2.6 seconds.

The catch is memory bandwidth. The PGX's LPDDR5X runs at 273 GB/s. Apple's M4 Ultra in the Mac Studio pushes 800 GB/s. That bandwidth gap shows up directly in token generation speed for large models. With Llama 3.2 90B, you wait over two minutes before the first token appears, and DeepSeek R1 70B takes three minutes to start responding. Once generation starts, output trickles at 4-5 tokens per second on those huge models. Usable, but not fast.
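A common back-of-envelope model explains why bandwidth dominates generation speed: for a dense model, each new token requires streaming roughly all of the weights from memory once, so tokens per second cannot exceed bandwidth divided by weight size. The sketch applies that rule of thumb with an assumed 4-bit quantization. It ignores KV-cache reads and caching effects, and it does not apply to mixture-of-experts models, which touch only a fraction of their weights per token, so treat the results as upper bounds rather than predictions.

```python
# Upper bound on decode speed for a dense model:
# tokens/s <= memory bandwidth / bytes of weights read per token.
# Assumptions: every weight is read once per token, 4-bit weights,
# KV-cache traffic ignored. Mixture-of-experts models read less per token.

def decode_ceiling_tps(bandwidth_gb_s: float, params_billion: float,
                       bits_per_weight: float = 4) -> float:
    weights_gb = params_billion * bits_per_weight / 8
    return bandwidth_gb_s / weights_gb

PGX_BW = 273          # GB/s, LPDDR5X on the PGX
MAC_STUDIO_BW = 800   # GB/s, figure quoted for the Mac Studio

if __name__ == "__main__":
    for params in (70, 90):
        pgx = decode_ceiling_tps(PGX_BW, params)
        mac = decode_ceiling_tps(MAC_STUDIO_BW, params)
        print(f"{params}B @ 4-bit: PGX <= {pgx:.1f} tok/s, "
              f"Mac Studio <= {mac:.1f} tok/s")
```

At 4 bits, a 70B model caps out near 7.8 tokens per second on 273 GB/s, so the observed 4-5 tokens per second sits plausibly below that ceiling, while the same arithmetic gives the 800 GB/s machine roughly three times the headroom.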

Where It Wins and Where It Doesn't

Against the Mac Studio, the PGX has a clear advantage in raw AI throughput. On a 1-million-token benchmark, it finished in 6.7 minutes versus the Mac Studio's 26 minutes, using 58% less energy in the process. It also supports CUDA, which most AI frameworks are built around. The Mac Studio does not.
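Converting the benchmark times into rates makes the gap concrete. This sketch only reworks the figures quoted above (1 million tokens, 6.7 versus 26 minutes, 58% less energy) and assumes each run is timed end to end as a single job.

```python
# Rework the 1M-token benchmark figures into throughput and ratios.
TOKENS = 1_000_000
PGX_MIN = 6.7          # minutes on the PGX
MAC_MIN = 26.0         # minutes on the Mac Studio
ENERGY_SAVING = 0.58   # PGX used 58% less energy

def tokens_per_second(tokens: int, minutes: float) -> float:
    return tokens / (minutes * 60)

if __name__ == "__main__":
    pgx = tokens_per_second(TOKENS, PGX_MIN)
    mac = tokens_per_second(TOKENS, MAC_MIN)
    energy_ratio = 1 - ENERGY_SAVING       # PGX energy / Mac energy
    # Average power = energy / time, so the power ratio is
    # energy ratio * (Mac time / PGX time).
    power_ratio = energy_ratio * (MAC_MIN / PGX_MIN)
    print(f"PGX: {pgx:,.0f} tok/s, Mac Studio: {mac:,.0f} tok/s")
    print(f"Speedup: {MAC_MIN / PGX_MIN:.1f}x")
    print(f"Energy ratio: {energy_ratio:.2f}, implied power ratio: {power_ratio:.2f}")
```

The PGX works through roughly 2,500 tokens per second against about 640, a 3.9x speedup. Because it finishes so much sooner, it can draw more average power (about 1.6x, by this arithmetic) and still use less total energy.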

Against an RTX 5090 desktop, the story flips depending on model size. For small models (7-14B parameters) that fit in 32GB, the 5090 is dramatically faster. Qwen 2.5 7B runs at 220+ tokens per second on the 5090 versus about 47 on the PGX. But the moment you need to run anything above 32GB, the 5090 is out of the conversation entirely. Fine-tuning a 120B model took the PGX about 18 minutes. The 5090 could not even attempt it.

Two PGX units can also be linked via their 200Gbps QSFP ports to run models with up to 235 billion parameters, though that doubles your cost to around $8,200.
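The sizing logic in the last two paragraphs reduces to picking the smallest memory tier that holds the model. The sketch below encodes that with the capacities and prices quoted here: 32GB for an RTX 5090, 128GB for one PGX, and 256GB across two linked units. The tier names, the strict "must fit entirely in memory" rule, and the assumption that a model can be split cleanly across two linked units are illustrative simplifications. Listing the 5090 first reflects the point that it is the faster choice whenever a model fits.

```python
# Pick the smallest hardware tier whose memory holds the model.
# Assumptions: the model must fit entirely in memory (no offloading),
# and two linked PGX units can split a model across their combined 256GB.
from typing import NamedTuple, Optional

class Tier(NamedTuple):
    name: str
    memory_gb: int
    price_usd: Optional[int]  # None where the article gives no price

PGX_PRICE = 4_099

TIERS = [
    Tier("RTX 5090 desktop", 32, None),
    Tier("1x ThinkStation PGX", 128, PGX_PRICE),
    Tier("2x ThinkStation PGX (linked)", 256, 2 * PGX_PRICE),
]

def pick_tier(required_gb: float) -> Optional[Tier]:
    """Return the first (smallest) tier with enough memory, or None."""
    for tier in TIERS:
        if required_gb <= tier.memory_gb:
            return tier
    return None

if __name__ == "__main__":
    for need in (20, 45, 180, 300):
        tier = pick_tier(need)
        print(f"{need} GB -> {tier.name if tier else 'needs a bigger setup'}")
```

The linked pair comes to 2 x $4,099 = $8,198, which is where the "around $8,200" figure comes from.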

The Enterprise Angle

Lenovo's version undercuts Nvidia's DGX Spark (the same chip in Nvidia's own chassis) by roughly $500 on the 1TB storage model. More importantly for IT departments, it comes through Lenovo's existing procurement channels. If your company already buys ThinkPads, ordering a PGX follows the same process.

The machine runs DGX OS, a Linux-based operating system. No Windows support exists yet. That makes it a non-starter for anyone who needs Windows-only tools alongside their AI work.

The 128GB memory is also soldered. There is no upgrade path. What you buy is what you get.

For AI developers and researchers who regularly work with models too large for consumer GPUs, the ThinkStation PGX fills a real gap at a price point that previously required multi-GPU server setups. For everyone else running models that fit in 24-32GB of VRAM, a high-end consumer GPU remains faster and cheaper.
