2026-06-02

Coder-model evaluation: Qwen3-Coder-Next

evaluation
coder
aider
executor

A candidate coder model for the squad's executor agent was evaluated against the incumbent on the Aider Polyglot Python subset — 34 exercises, execution-graded, two attempts with test feedback after the first. The harness was held fixed so the model was the only variable.

Method

Each exercise was run up to twice: an initial attempt, then a second attempt with failing test output passed back as context. The score recorded is pass@2 — whether either attempt produced a passing solution. Both models ran on the same hardware, served locally over an OpenAI-compatible endpoint at the same temperature and context window:

MODEL_ENDPOINT=http://127.0.0.1:8780/v1/chat/completions
# same temperature, same context window for both models; no prompt changes
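The two-attempt scoring described above can be sketched as follows. This is a minimal illustration, not the harness's actual code: `solve`, `run_tests`, `passes_within_two`, and `pass_at_2` are hypothetical names, and the model call and test runner are passed in as plain callables so the scoring logic stands alone.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces (assumptions, not the real harness API):
#   solve(exercise, feedback) -> candidate solution text from the model
#   run_tests(exercise, solution) -> (passed, test_output)
Solve = Callable[[str, Optional[str]], str]
RunTests = Callable[[str, str], Tuple[bool, str]]


def passes_within_two(exercise: str, solve: Solve, run_tests: RunTests) -> bool:
    """Return True if either of at most two attempts passes the tests."""
    feedback: Optional[str] = None
    for _attempt in range(2):
        solution = solve(exercise, feedback)
        passed, output = run_tests(exercise, solution)
        if passed:
            return True
        # Failing test output becomes the context for the second attempt.
        feedback = output
    return False


def pass_at_2(exercises: List[str], solve: Solve, run_tests: RunTests) -> float:
    """Fraction of exercises solved within two attempts."""
    solved = sum(passes_within_two(ex, solve, run_tests) for ex in exercises)
    return solved / len(exercises)
```

Keeping the model call behind a callable mirrors the fixed-harness idea: swapping the model changes only `solve`, while the attempt loop and grading stay identical for both runs.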

Results

| Model | pass@2 (34 exercises) | Median latency |
| --- | --- | --- |
| Qwen3-Coder-Next (candidate) | 76% | comparable |
| incumbent | 29% | comparable |

The candidate scored roughly 2.6 times the incumbent's pass@2 (76% versus 29%) with no change to the harness, and was adopted as the executor agent's model.

The harness bug behind the first numbers

The earliest runs were wrong for a reason that had nothing to do with the model: the benchmark assumed a small context window (~4–8K tokens) for local models and drove them into context-exhaustion loops — one exercise (paasio) ground for 42 minutes. The fix was a one-line model-metadata entry declaring the model's real window:

{ "qwen3-coder-next": { "max_input_tokens": 262144 } }

With the true window declared, the same exercise finished in 84 seconds, down from 42 minutes, about 30 times faster. The lesson generalizes: a local-model benchmark that inherits a hosted model's context assumptions measures the harness, not the model. We fixed the harness, re-ran, and only then trusted the numbers.
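One way to keep this failure from recurring is a pre-flight check that refuses to start a run when the model has no declared window, rather than silently falling back to a small default. The sketch below assumes the metadata is available as a JSON string in the shape shown above; `declared_window` and `require_window` are hypothetical helper names, not part of any benchmark tool.

```python
import json
from typing import Optional


def declared_window(metadata_json: str, model: str) -> Optional[int]:
    """Return the declared max_input_tokens for a model, or None if absent."""
    entry = json.loads(metadata_json).get(model, {})
    return entry.get("max_input_tokens")


def require_window(metadata_json: str, model: str) -> int:
    """Fail loudly when the window is undeclared instead of guessing a default."""
    window = declared_window(metadata_json, model)
    if window is None:
        raise ValueError(
            f"no max_input_tokens declared for {model!r}; "
            "refusing to benchmark with an assumed context window"
        )
    return window


if __name__ == "__main__":
    metadata = '{ "qwen3-coder-next": { "max_input_tokens": 262144 } }'
    print(require_window(metadata, "qwen3-coder-next"))
```

Raising an error here is a deliberate choice: a missing entry then stops the run up front instead of surfacing later as a slow, misleading result.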