Qwen3.6-35B Beats Gemini 2.5 Pro on Terminal-Bench 2.0, and a Qwen Model Roughly 14 Times Its Size

24.6%. That's where Qwen3.6-35B-A3B landed on Terminal-Bench 2.0, a public leaderboard that tests how well AI models handle real shell and command-line tasks - commands that have to actually execute correctly, not just look like working code.

The score puts it above Gemini 2.5 Pro's 19.6% on the same benchmark, and above Qwen3-Coder-480B at 23.9%. That second comparison is the more striking one: Qwen3.6-35B-A3B has 35 billion parameters (the numerical values that define a model's behavior and knowledge) while Qwen3-Coder-480B has 480 billion. Getting better terminal-task results from a model roughly one-fourteenth the size is a meaningful result for developers running models locally on their own hardware, where a 480B model simply isn't practical.
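As a quick check on the arithmetic behind those comparisons, the short Python sketch below recomputes the size ratio and the score gaps from the figures quoted above. The variable names are illustrative only, and the gaps are in percentage points.

```python
# Figures as quoted in the article; names here are illustrative, not from any tool.
scores = {
    "Qwen3.6-35B-A3B": 24.6,
    "Qwen3-Coder-480B": 23.9,
    "Gemini 2.5 Pro": 19.6,
}
params_billion = {"Qwen3.6-35B-A3B": 35, "Qwen3-Coder-480B": 480}

# How many times larger the 480B model is than the 35B model.
size_ratio = params_billion["Qwen3-Coder-480B"] / params_billion["Qwen3.6-35B-A3B"]

# Score gaps in percentage points.
gap_vs_coder = scores["Qwen3.6-35B-A3B"] - scores["Qwen3-Coder-480B"]
gap_vs_gemini = scores["Qwen3.6-35B-A3B"] - scores["Gemini 2.5 Pro"]

print(f"Size ratio: {size_ratio:.1f}x")                      # Size ratio: 13.7x
print(f"Lead over Qwen3-Coder-480B: {gap_vs_coder:.1f} pts")  # 0.7 pts
print(f"Lead over Gemini 2.5 Pro: {gap_vs_gemini:.1f} pts")   # 5.0 pts
```

The 13.7x ratio is where "roughly one-fourteenth the size" comes from. The lead over the 480B model is under one point, so the size difference is the headline, not the margin.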

Both Qwen models were paired with a scaffold called little-coder, which manages how the model interacts with the terminal environment. Scaffold choice does real work on this kind of benchmark - the same model with different scaffolding often scores very differently. The Gemini comparison used Gemini CLI as its scaffold, so it measures the full official stack rather than the raw model alone.

Qwen3.5-9B with the same scaffold scored a more modest 9.2%. Even so, a 9-billion-parameter model handling a non-trivial share of real command-line tasks correctly was not the norm 12 months ago.

Terminal-Bench 2.0 is harder than standard coding evaluations because models have to produce commands that actually run and return the right output. There's no partial credit for plausible-looking answers. Gemini 2.5 Pro is one of the stronger commercially available models on coding tasks, which makes its five-point deficit against Qwen3.6-35B-A3B's 24.6% a useful reference point.
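To make the all-or-nothing idea concrete, here is a minimal Python sketch of pass/fail scoring. It is not Terminal-Bench's actual harness: the function names `task_passes` and `pass_rate` and the toy tasks are hypothetical. It assumes a task counts only when the command exits cleanly and prints exactly the expected output.

```python
import subprocess
import sys


def task_passes(command, expected_stdout):
    """True only if the command exits with status 0 and prints the expected text."""
    result = subprocess.run(command, capture_output=True, text=True, timeout=30)
    return result.returncode == 0 and result.stdout.strip() == expected_stdout.strip()


def pass_rate(attempts):
    """Percentage of attempts that pass; each attempt scores 1 or 0, nothing in between."""
    passed = sum(task_passes(command, expected) for command, expected in attempts)
    return 100.0 * passed / len(attempts)


# Hypothetical toy tasks: (command to run, expected standard output).
attempts = [
    ([sys.executable, "-c", "print(2 + 2)"], "4"),             # runs, correct output
    ([sys.executable, "-c", "print(2 + 3)"], "4"),             # runs, wrong output
    ([sys.executable, "-c", "import sys; sys.exit(1)"], ""),   # non-zero exit status
    ([sys.executable, "-c", "print('ok')"], "ok"),             # runs, correct output
]

print(f"{pass_rate(attempts):.1f}%")  # 50.0%
```

The second attempt looks nearly right but still scores zero. Under a scheme like this, a command that merely resembles working code earns nothing.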

For developers building local AI coding setups with tools like Aider, Qwen3.6-35B-A3B is now worth adding to the test list - particularly for anyone who wants strong terminal performance without the hardware demands of running a 400B+ parameter model.