218x Speedup: Developer Repurposes Idle GPU Ray Tracing Cores for AI Model Routing

218x. That's the speedup one developer got on the expert routing step of a Mixture of Experts (MoE) AI model - by repurposing hardware designed for video game ray tracing.

To understand why this matters, you need to know what MoE models are. Most large language models are "dense" - every part of the network processes every piece of text you send it. MoE models work differently: they contain multiple specialized sub-networks called "experts," and a routing layer decides which expert or experts handle each token (word fragment) you send. Models like DeepSeek and Mixtral use this architecture. The advantage is that a model can have a very large total parameter count while activating only a small fraction of the network for any given token, which cuts the compute needed per token and makes generation faster. All of the experts still have to be stored, so the saving is in compute rather than in memory.
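As a rough illustration of what a routing layer does, here is a minimal sketch in plain Python. It assumes one common scheme, top-k gating with a softmax over the selected scores; the function name `route_token`, the weight values, and the token vector are all hypothetical and are not taken from any particular model.

```python
import math

def route_token(token, router_weights, top_k=2):
    """Return [(expert_index, gate_weight), ...] for one token.

    token: list of floats (the token's hidden vector)
    router_weights: one weight vector per expert (hypothetical values)
    """
    # One score (logit) per expert: dot product of the token with that expert's weights.
    logits = [sum(t * w for t, w in zip(token, weights)) for weights in router_weights]
    # Keep the top_k highest-scoring experts.
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Softmax over the chosen logits only, so the gate weights sum to 1.
    peak = max(logits[i] for i in chosen)
    exps = [math.exp(logits[i] - peak) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

token = [0.5, -1.0, 2.0]
router_weights = [
    [1.0, 0.0, 0.0],  # expert 0
    [0.0, 1.0, 0.0],  # expert 1
    [0.0, 0.0, 1.0],  # expert 2
    [1.0, 1.0, 1.0],  # expert 3
]
result = route_token(token, router_weights)
print(result)  # experts 2 and 3 are selected for this token
```

Only the selected experts run for that token, which is where the compute saving comes from.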

The routing step - deciding which expert gets which token - normally runs on the GPU's standard CUDA cores. That works, but it leaves other GPU hardware idle during inference. Specifically, it leaves the RT Cores unused.

What RT Cores Are

RT Cores are dedicated silicon blocks built into Nvidia RTX graphics cards, starting with the RTX 20 series in 2018. Their purpose in games is ray tracing: calculating how light rays bounce off surfaces to produce realistic shadows and reflections. Mathematically, this is a spatial search: the hardware walks a tree of bounding boxes (a bounding volume hierarchy) to find which objects in 3D space a ray hits first, a problem closely related to nearest-neighbor search. During AI inference (the process of generating a response from a model), these cores have nothing to do. They just sit there.

The developer's insight: MoE routing is also a nearest-neighbor problem. Each token is represented as a point in high-dimensional mathematical space, and the model needs to find which expert is "closest" to it. Project that into 3D coordinates, and RT Cores can handle the search. Running this on an RTX 5070 Ti - a consumer card retailing around $700-800 with 16GB of video memory - produced a 218x speedup versus a standard CUDA routing implementation.
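The idea of projecting a token into 3D and picking the closest expert can be sketched on the CPU. This is only an illustration under stated assumptions: the article does not describe the developer's actual projection or the RT Core API used, so the projection matrix, the expert positions, and the helper names `project_to_3d` and `nearest_expert` below are hypothetical, and the brute-force loop merely stands in for the search that RT Cores would perform in hardware.

```python
import math

def project_to_3d(vector, projection):
    """Multiply a d-dimensional vector by a 3 x d matrix (hypothetical, fixed)."""
    return tuple(sum(p * v for p, v in zip(row, vector)) for row in projection)

def nearest_expert(point, expert_points):
    """Brute-force nearest neighbor in 3D by Euclidean distance."""
    return min(range(len(expert_points)),
               key=lambda i: math.dist(point, expert_points[i]))

# Hypothetical 3 x 4 projection: keeps the first three coordinates, drops the fourth.
projection = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
# Hypothetical expert positions in the projected 3D space.
expert_points = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (-1.0, 2.0, 0.5)]

token = [0.9, 1.2, 0.8, 5.0]
point = project_to_3d(token, projection)
print(point, nearest_expert(point, expert_points))  # expert 1 is closest
```

Squeezing a high-dimensional vector into three coordinates discards information, so how well the nearest 3D expert matches the expert a standard router would choose depends entirely on the projection used.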

What This Changes for Consumer GPU Users

A 218x routing speedup doesn't translate to 218x faster text generation. Routing is one step in a multi-stage pipeline; attention and feedforward layers remain the main bottlenecks in MoE inference. But two things make this significant.

First, RT Cores have sat idle during AI workloads in millions of Nvidia GPUs since 2018. Anyone running models on an RTX card already owns this hardware - it just wasn't being put to work for AI.

Second, the entire operation runs on a single consumer GPU with no multi-card setup. People running local models like Mixtral 8x7B or DeepSeek-V2 Lite on their own hardware stand to get faster routing while freeing CUDA cores to focus on the heavier compute steps.
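The point that a 218x routing speedup does not mean 218x faster generation can be made concrete with Amdahl's law: if routing takes a fraction f of total runtime and becomes s times faster, the overall speedup is 1 / ((1 - f) + f / s). The fractions below are hypothetical placeholders, not measurements; the article does not say what share of inference time routing accounts for.

```python
def overall_speedup(fraction, step_speedup):
    """Amdahl's law: whole-pipeline speedup when one step gets faster.

    fraction: share of total runtime spent in the accelerated step (0 to 1)
    step_speedup: how many times faster that step becomes
    """
    return 1.0 / ((1.0 - fraction) + fraction / step_speedup)

# Hypothetical shares of runtime spent on routing.
for fraction in (0.01, 0.05, 0.20):
    speedup = overall_speedup(fraction, 218)
    print(f"routing = {fraction:.0%} of runtime -> {speedup:.3f}x overall")
```

If routing were 5% of runtime, for instance, end-to-end generation would be only about 1.05x faster, which is why attention and feedforward layers still dominate.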

Whether this gets integrated into mainstream local inference tools like llama.cpp or vLLM depends on engineering effort and how well the approach generalizes across GPU generations. But using RT Cores for nearest-neighbor searches in AI workloads is a novel enough idea that it's likely to get serious attention from the local model development community.