References
This work stands on prior research, an open tabletop tradition, and the open-source local-AI stack it runs on. Everything we use is linked and credited here. Citations are verified; if one is wrong, that is a bug to fix — open an issue.
Benchmark methodology
- RPGBench — arXiv:2502.00595
- Evaluates a language model as a role-playing-game engine, that is, in the Dungeon-Master role. The direct basis for our objective Game-Simulation metrics (MEC / ECE / VUE). Our one addition is to compute those metrics against an independent rules engine rather than the model's self-report, which is stricter and reproducible.
- Minions — Narayan, Biderman, Eyuboglu, May, Linderman, Zou, Ré (Hazy Research, Stanford) — arXiv:2502.15964
- The frontier↔local collaboration framework and the local-offload idea — how much work the local model carries instead of the frontier. This is the prior-art anchor for our cost-savings metric and the persona-meld thesis underneath the squad. Code: github.com/HazyResearch/Minions. Lead author Avanika Narayan and OpenJarvis's Jon Saad-Falcon publicly share the framing this work builds on — “AI inference should be local by default, hybrid by design.” We credit the people, not just the papers.
- Jericho — github.com/microsoft/jericho
- A learning environment for text adventure games (Microsoft Research). A reference for the player side of evaluation — whether a model can play — and for the Gym-style, contamination-aware harness discipline a benchmark needs to stay out of training sets.
The tabletop tradition
- Dungeons & Dragons — 5e System Reference Document
- The rules lineage our engine implements, used under the SRD's open license (Wizards of the Coast). The mechanics MOTU resolves deterministically — checks, combat, conditions — descend from the 5e SRD.
- Genre traditions
- Our cartridges' worlds draw on the wider tabletop canon (Shadowrun, Cyberpunk RED, Starfinder, Eberron) as evidence that one ruleset can carry many settings. The Canopy Heart cartridge is solarpunk by way of that lineage.
Open source we run on
- Qwen3.6 / Qwen3-Coder — the Qwen team, Alibaba — github.com/QwenLM/Qwen3-Coder
- The local model weights the squad runs on — the planner, coder, and critic brains behind every result published here.
- MLX — Apple (ml-explore) — github.com/ml-explore/mlx
- The Apple-Silicon array framework underneath the serving stack — what lets these models run on hardware we own.
- oMLX — jundot — github.com/jundot/omlx
- The OpenAI-compatible MLX server that runs the models locally — and makes the per-token throughput and energy measurements on the agent pages possible.
- OpenJarvis — Jon Saad-Falcon (lead) and contributors, Stanford — github.com/open-jarvis/OpenJarvis
- “Personal AI, on personal devices.” The on-device agent lineage our research line builds on — the Apple-FM engine path and model routing. We use their work, and we want them to win.
Collaboration
- Anthropic / Claude — the frontier collaborator
- The squad exists to do the bulk of the work locally; Claude is the orchestration, the specs, and the taste. We say so plainly: this was built with a frontier model, deliberately reserved for judgment so the rest could run on hardware we own.
Influenced this and not listed? That is an omission to fix, not a judgment — open an issue or tell us.