The tool calls we were throwing away
- quest
- deep-research
- harness
- local-ai
- agentic-engineering
- ingest
- sovereignty
I wanted my local squad to have a real deep-research agent. Not a chat model I prompt to "go research," but something trained to run for a hundred turns, read sources, and synthesize. OSU-NLP built exactly that and open-sourced it: QUEST, a 35B mixture-of-experts model, RL-trained on its own research harness, Apache licensed. I named the agent Merlin and gave him an owl, Archimedes, a tiny model that folds his context down as he runs so a long run doesn't blow the window.
The first runs were bad. He would search a couple of times and then make something up. Asked about a repo he couldn't reach, he invented a confident description with fabricated URLs. The easy read is "the model is weak." I almost wrote that down.
It was wrong. We were serving him wrong.
QUEST emits its tool calls as text, the way it was trained. Our MLX server was being helpful: it parsed those calls into the structured field the OpenAI format uses, and put the model's thinking in a separate reasoning field. Our harness read only the plain content field. Which was now empty. So every turn looked like an empty response, the agent retried, gave up, and on the last turn did the only thing left to it and hallucinated.
The model was calling its tools correctly the whole time. We were dropping them on the floor.
The fix was to rebuild what QUEST actually emitted, the thinking and the tool call, from the pieces the server had split apart. The moment that landed, Merlin stopped guessing. Asked about a repo he couldn't open, he now said plainly that he couldn't verify it and refused to invent one. Then we fixed the next thing (he kept guessing the default branch as "main" when ours was "master," a one-word 404) and he read the real code.
It took about a dozen of these. A tokenizer that wouldn't load. A chat template we had to carry over by hand. Reasoning tensors. Each one a small, honest bug sitting between a capable model and the work it was built to do. None of them the model's fault.
What he does now: point him at a repository and he maps the tree, reads the load-bearing files, and writes a cited knowledge state into the squad's shared memory. He has read the Mycelium platform itself, 18 files across the server, the routes, the SDK, the MCP. He has read this app, the private one that never leaves the Mac, through a local file tool scoped to the project and blind to anything that looks like a secret. Both now sit in the vector memory, retrievable.
Credit where it's owed. We didn't train Merlin's brain. OSU-NLP did, and they shipped the harness that taught me how a deep-research agent is meant to be driven. We ran it as designed, on hardware we own, and the only thing we added was the plumbing to serve it honestly.
The lesson is the one I keep relearning. A capable model doesn't fail or fly on its weights. It fails or flies on whether the thing around it gets out of the way. The model was never the problem. The harness was. It usually is.