References

This work stands on prior research, an open tabletop tradition, and the open-source local-AI stack it runs on. Everything we use is linked and credited here. Citations are verified; if one is wrong, that is a bug to fix — open an issue.

Benchmark methodology

RPGBench — arXiv:2502.00595: Evaluates a language model as a role-playing-game engine — the Dungeon-Master role. The direct basis for our objective Game-Simulation metrics (MEC / ECE / VUE). Our one addition is to compute those metrics against an independent rules engine rather than the model's self-report — stricter, and reproducible.
Minions — Narayan, Biderman, Eyuboglu, May, Linderman, Zou, Ré (Hazy Research, Stanford) — arXiv:2502.15964: The frontier↔local collaboration framework and the local-offload idea — how much work the local model carries instead of the frontier. This is the prior-art anchor for our cost-savings metric and the persona-meld thesis underneath the squad. Code: github.com/HazyResearch/minions. Lead author Avanika Narayan frames it on X as local by default, hybrid by design — the stance this work builds on. We credit the people, not just the papers.
Intelligence per Watt — Saad-Falcon, Narayan, … Ré (Stanford / Hazy Research / Scaling Intelligence Lab) — arXiv:2511.07885: Proposes intelligence per watt — task accuracy per unit of power — as the metric for local-inference viability, from a study across 20+ local models, 8 accelerators, and roughly a million real-world queries. The direct anchor for the per-token energy numbers on our agent pages. Our extension is to measure coverage and efficiency on sustained, multi-step software work rather than single-turn queries. The authors overlap with those of Minions and OpenJarvis, both credited on this page.
Jericho — github.com/microsoft/jericho: A learning environment for text adventure games (Microsoft Research). A reference for the player side of evaluation — whether a model can play — and for the Gym-style, contamination-aware harness discipline a benchmark needs to stay out of training sets.

Local-AI evaluation & self-improvement

Cross-Domain Mission Autonomy (Autonomy-Necessity-Score) — de Curtò et al. — arXiv:2603.28926: Scores LLMs on physics-grounded space-mission decision scenarios; its best 70B-class model reaches ~80% via cloud API within a 2 s budget. The methodology our Low Power Edge benchmark mirrors — our extension is to run that task family on small models at the lowest-power edge (an 8 GB Jetson, ≤17 W) and report tokens/sec per watt, the dimension the original leaves open.
HumanEval — Chen et al. (OpenAI) — arXiv:2107.03374: The code-synthesis-from-docstring benchmark; the base of our coder right-sizing eval. We use it honestly — and report that it is now saturated for modern coders (every contender ~97–100%), which is exactly why the next reference matters.
EvalPlus / HumanEval+ — Liu, Xia, Wang, Zhang — arXiv:2305.01210: Extends HumanEval ~80× with augmented test inputs scored against an oracle, catching code that only passes the three toy tests. Our discriminating measure when the base benchmark saturates — our harness uses its base-vs-plus method directly.
Self-Harness: Harnesses That Improve Themselves — Zhang et al. — arXiv:2606.09498: Agents that improve their own operating harness: mine model-specific weaknesses from execution traces, propose harness changes, validate before accepting. On Terminal-Bench-2.0, frozen weights gain +15–21 points from the harness alone. The direct evidence for the thesis underneath our work — harness over weights, sovereignty over the learning loop — and a blueprint for the squad's own self-improvement loop.

The tabletop tradition

Dungeons & Dragons — 5e System Reference Document: The rules lineage our engine implements, used under the SRD's open license (Wizards of the Coast). The mechanics MOTU resolves deterministically — checks, combat, conditions — descend from the 5e SRD.
Genre traditions: The worlds our cartridges draw on lean on the wider tabletop canon (Shadowrun, Cyberpunk RED, Starfinder, Eberron) as proof that one ruleset can carry many settings. The Canopy Heart cartridge is solarpunk by way of that lineage.

Open source we run on

Qwen3.6 / Qwen3-Coder — the Qwen team, Alibaba — github.com/QwenLM/Qwen3-Coder: The local model weights the squad runs on — the planner, coder, and critic brains behind every result published here.
MLX — Apple (ml-explore) — github.com/ml-explore/mlx: The Apple-Silicon array framework underneath the serving stack — what lets these models run on hardware we own.
oMLX — jundot — github.com/jundot/omlx: The OpenAI-compatible MLX server that runs the models locally — and makes the per-token throughput and energy measurements on the agent pages possible.
OpenJarvis — Jon Saad-Falcon (lead) and contributors, Stanford — github.com/open-jarvis/OpenJarvis: “Personal AI, on personal devices.” The on-device agent lineage our research line builds on — the Apple-FM engine path and model routing. We use their work, and we want them to win.

Collaboration

Anthropic / Claude — the frontier collaborator: The squad exists to do the bulk of the work locally; Claude is the orchestration, the specs, and the taste. We say so plainly: this was built with a frontier model, deliberately reserved for judgment so the rest could run on hardware we own.

Influenced this and not listed? That is an omission to fix, not a judgment — open an issue or tell us.