2026-06-18

The researcher that wouldn't stop

mlx
local-inference
research-agent
harness
rl
apple

Scout — the squad's read-only researcher, the one we fan out wide so the expensive judgment step is fed by cheap local search instead of frontier tokens — swapped brains. Out: Gemma-4-26B, the fast autoregressive workhorse. In: QUEST-35B-RL, an open, RL-trained deep-research agent built on Qwen3.5-35B-A3B.

The swap cost nothing to try. QUEST's architecture is qwen3_5_moe, the same hybrid-attention family our planner and verifier already run, so it loaded into our MLX serving stack with zero porting: 18 GB at 4-bit, ~137 tok/s on the laptop. A model trained specifically to do research, dropping into the research seat for free. The only question was whether trained-to-research actually beat fast-and-general at Scout's real job.
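As a rough sanity check on the 18 GB figure, the weight footprint can be estimated from parameter count and bit width. This is a back-of-envelope sketch, not a measurement: it assumes the "35B" in the model name is the total parameter count, and the helper name is made up for illustration.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the raw weights in decimal gigabytes."""
    bytes_total = n_params * bits_per_weight / 8  # bits -> bytes
    return bytes_total / 1e9


# Assumption: 35e9 total parameters, read off the model name.
print(weight_footprint_gb(35e9, 4))  # 17.5
```

That lands just under the reported 18 GB. The remainder is plausibly quantization scales and any layers kept at higher precision, though the numbers above don't break it down.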

The bench

We pointed both brains at the same research brief and read the outputs side by side. QUEST won where it counts: it pulled exact figures — mid-training token counts, SFT and RL trajectory numbers — that Gemma rounded past or left as "a large dataset." Its RL training shows in the reflex: it reaches for the primary number, not the gist. For a researcher that feeds a judgment step, the difference between "about 30B tokens" and the real figure is the difference between useful and decorative.

But it only won on the second try.

The part nobody ships: it had to be taught to stop

On the first run QUEST never produced an answer at all. It searched, and searched, and hit the iteration cap with nothing synthesized. A deep-research agent trained to gather has exactly one bad instinct on a budget: it keeps gathering. The capability that makes it good, relentless pursuit of sources, is the same thing that runs it straight into the iteration cap.

Two harness changes, not a single weight touched, turned it from impressive to usable:

A stop rule baked into Scout's persona: consult a handful of authoritative sources — typically three to five, read the primary one — then stop searching and write the answer. The model has the reach; the harness gives it the off switch.
A wrapper strip: QUEST emits its final reply inside <answer> / <system> tags from its RL training format. We peel them before the answer leaves the seat, so a research result reads like a research result and not like a training artifact.
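The wrapper strip can be sketched in a few lines. This is a minimal version, not our actual harness code: the tag names come from the description above, while the function name and the choice to remove only the tags (keeping whatever text they enclose) are assumptions.

```python
import re

# Tag names as described above; the exact markup QUEST emits is an assumption.
_WRAPPER_TAGS = ("answer", "system")
_TAG_RE = re.compile(r"</?(?:%s)\s*>" % "|".join(_WRAPPER_TAGS), re.IGNORECASE)


def strip_wrappers(reply: str) -> str:
    """Remove opening and closing wrapper tags, keeping the enclosed text."""
    return _TAG_RE.sub("", reply).strip()


print(strip_wrappers("<answer>\nThe figure is in the primary source.\n</answer>"))
# The figure is in the primary source.
```

Only the tags themselves are removed here. If text inside the system tag should never reach the reader, the pattern would need to drop that whole span instead.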

With those in place and a realistic budget, QUEST converged — and beat Gemma on the same brief.
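How the stop rule and the budget fit together can be sketched as a small loop. Everything here is hypothetical scaffolding: the persona text paraphrases the stop rule, `next_step` and `search` stand in for the model call and the search tool, and the default cap of 8 is illustrative, not our real setting.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical persona text paraphrasing the stop rule.
STOP_RULE = (
    "Consult a handful of authoritative sources, typically three to five, "
    "and read the primary one. Then stop searching and write the answer."
)


@dataclass
class Step:
    kind: str  # "search" or "answer"
    text: str  # the query, or the final answer


def run_scout(
    brief: str,
    next_step: Callable[[str, list[str]], Step],  # stand-in for the model call
    search: Callable[[str], str],                 # stand-in for the search tool
    max_iters: int = 8,                           # illustrative iteration cap
) -> Optional[str]:
    """Run search steps until the model answers or the cap is hit."""
    prompt = STOP_RULE + "\n\n" + brief
    notes: list[str] = []
    for _ in range(max_iters):
        step = next_step(prompt, notes)
        if step.kind == "answer":
            return step.text
        notes.append(search(step.text))
    return None  # cap hit with nothing synthesized
```

A model that never emits an answer step returns `None`, which is the first-run failure. The loop only bounds the cost; getting the model to answer within the budget is the persona's job.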

What it's worth

The honest read: the weights gave Scout a sharper reach, but it was the harness — the stop rule, the budget, the tag strip — that made the reach usable. The same lesson the squad keeps relearning: a model is a capability; a harness is what turns a capability into reliable work. Gemma stays a fallback, and the block-diffusion variant stays a speed dial for bounded fan-out; QUEST is the new default for research that needs to be right about the number.

And it stays exactly what Scout always was — a free, local, read-only recon role, fanned out wide so the frontier model is spent only on judgment. The reach got deeper. The off switch is ours.