Qwen3.6-35B Beats Gemini 2.5 Pro on Terminal-Bench 2.0, and Edges Out a Qwen Model Nearly 14 Times Its Size

May 16, 2026 2 min read

24.6%. That's where Qwen3.6-35B-A3B landed on Terminal-Bench 2.0, a public leaderboard that tests how well AI models handle real shell and command-line tasks, where commands have to actually execute correctly, not just look like working code.

The score puts it above Gemini 2.5 Pro's 19.6% on the same benchmark, and above Qwen3-Coder-480B at 23.9%. That second comparison is the more striking one, even though the margin is only 0.7 percentage points: Qwen3.6-35B-A3B has 35 billion parameters (the numerical values that define a model's behavior and knowledge), while Qwen3-Coder-480B has 480 billion. Matching or slightly beating a model roughly fourteen times larger on terminal tasks is a meaningful result for developers running models locally on their own hardware, where a 480B model simply isn't practical.
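The arithmetic behind those comparisons is easy to check. The short sketch below uses only the figures quoted in this article; the dictionary and function names are illustrative, not part of any benchmark tooling.

```python
# Terminal-Bench 2.0 scores in percent, as reported in this article.
scores = {
    "Qwen3.6-35B-A3B": 24.6,
    "Qwen3-Coder-480B": 23.9,
    "Gemini 2.5 Pro": 19.6,
}

# Parameter counts in billions, as stated in this article.
params_b = {
    "Qwen3.6-35B-A3B": 35,
    "Qwen3-Coder-480B": 480,
}


def size_ratio(large: str, small: str) -> float:
    """How many times more parameters `large` has than `small`."""
    return params_b[large] / params_b[small]


def gap_points(a: str, b: str) -> float:
    """Score difference between `a` and `b`, in percentage points."""
    return round(scores[a] - scores[b], 1)


ratio = size_ratio("Qwen3-Coder-480B", "Qwen3.6-35B-A3B")
print(f"Size ratio: {ratio:.1f}x")
print(f"Gap vs Qwen3-Coder-480B: {gap_points('Qwen3.6-35B-A3B', 'Qwen3-Coder-480B')} points")
print(f"Gap vs Gemini 2.5 Pro: {gap_points('Qwen3.6-35B-A3B', 'Gemini 2.5 Pro')} points")
```

The ratio comes out at about 13.7, which is where the "roughly one-fourteenth" figure comes from.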

Both Qwen models were paired with a scaffold called little-coder, which manages how the model interacts with the terminal environment. Scaffold choice matters a great deal on this kind of benchmark: the same model often scores very differently under different scaffolds. The Gemini result used Gemini CLI as its scaffold, so it reflects the full official stack rather than the raw model alone, and the comparison is between model-plus-scaffold combinations, not models in isolation.

Qwen3.5-9B with the same scaffold scored a more modest 9.2%. Even so, a 9-billion-parameter model handling a non-trivial share of real command-line tasks correctly was not the norm 12 months ago.

Terminal-Bench 2.0 is harder than standard coding evaluations because models have to produce commands that actually run and return the right output. There's no partial credit for plausible-looking answers. Gemini 2.5 Pro is one of the stronger commercially available models on coding tasks, which makes the five-point gap between its 19.6% and Qwen3.6-35B-A3B's 24.6% a useful reference point.
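To make the "no partial credit" idea concrete, here is a minimal sketch of all-or-nothing scoring. It is not Terminal-Bench's actual harness: the `run_task` and `score` helpers and the toy tasks are hypothetical, and the sketch assumes a POSIX shell is available. A task passes only if the command exits cleanly and prints exactly the expected output.

```python
import subprocess


def run_task(command: str, expected_stdout: str, timeout_s: float = 10.0) -> bool:
    """Run a shell command; pass only on exit code 0 and matching output."""
    try:
        result = subprocess.run(
            command,
            shell=True,           # hypothetical setup: commands run in the local shell
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0 and result.stdout.strip() == expected_stdout.strip()


def score(attempts: list[tuple[str, str]]) -> float:
    """Percentage of tasks passed; each task counts as 0 or 1, nothing in between."""
    passed = sum(run_task(cmd, expected) for cmd, expected in attempts)
    return 100.0 * passed / len(attempts)


# Toy (command, expected output) pairs standing in for a model's answers.
attempts = [
    ("echo hello", "hello"),                  # correct
    ("echo hello | tr a-z A-Z", "HELLO"),     # correct
    ("echo helo", "hello"),                   # looks plausible, wrong output
    ("exit 1", ""),                           # command fails outright
]

print(f"Score: {score(attempts):.1f}%")
```

The third attempt is the important case: it runs without error and looks nearly right, but it still scores zero because the output does not match.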

For developers building local AI coding setups with tools like Aider, Qwen3.6-35B-A3B is now worth adding to the test list, particularly for anyone who wants strong terminal performance without the hardware demands of running a 400B+ parameter model.
