2026-06-04

An engine-grounded RPG benchmark

benchmark
rpgbench
motu
reproducibility

Can a small, local model run a tabletop game as the Dungeon Master — coherently, and reproducibly? We built a benchmark that answers it with numbers instead of impressions.

The framing is RPGBench's (arXiv:2502.00595): evaluate a language model as an RPG engine and score it on objective errors. Our one change is where the ground truth comes from. RPGBench trusts the model's self-reported state; we hold the state in an independent rules engine and check the model's narration and effects against it. Stricter, and reproducible.
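The check described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's code: the `State` alias, the function name, and the variable names in the example are hypothetical, and it assumes state can be flattened to integer-valued variables.

```swift
// Hypothetical flattened state: variable name -> value.
// The real engine's state is richer than this.
typealias State = [String: Int]

// Every variable the model narrates is checked against the engine's copy.
// A variable the engine does not know counts as a divergence too
// (an assumption of this sketch). Output is sorted so it is stable.
func divergentVariables(engine: State, narrated: State) -> [String] {
    narrated.keys.filter { engine[$0] != narrated[$0] }.sorted()
}

let engineState: State = ["hp": 14, "torches": 2]
let narratedState: State = ["hp": 14, "torches": 3]
let diverged = divergentVariables(engine: engineState, narrated: narratedState)
print(diverged)
```

The engine's copy is the only ground truth here; the model's own report is never consulted.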

The metrics

Three objective rates, computed against the engine, never the model's own report:

MEC: mechanically clean rounds (no illegal action) over total rounds. Higher is better.
ECE: event-condition errors over events, where the model fired an outcome whose precondition the engine says was unmet. Lower is better.
VUE: variable-update errors over variables, where the model's narrated state diverged from the engine's. Lower is better.
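As a worked example of the three rates, here is a small sketch. The `Tally` type and its field names are hypothetical, not the benchmark's API; the counts are the ones this post reports for the first run (11 rounds with 2 non-clean, 11 events, 6 variables, no ECE or VUE errors).

```swift
// Hypothetical tally of one run; field names are illustrative only.
struct Tally {
    var rounds: Int
    var cleanRounds: Int
    var events: Int
    var eventConditionErrors: Int
    var variables: Int
    var variableUpdateErrors: Int
}

// Returns nil when the denominator is zero, so an empty sample
// is never reported as a perfect score.
func rate(_ numerator: Int, over denominator: Int) -> Double? {
    guard denominator > 0 else { return nil }
    return Double(numerator) / Double(denominator)
}

// Converts a fraction to a percentage rounded to one decimal place.
func percent(_ fraction: Double) -> Double {
    (fraction * 1000).rounded() / 10
}

let tally = Tally(rounds: 11, cleanRounds: 9,
                  events: 11, eventConditionErrors: 0,
                  variables: 6, variableUpdateErrors: 0)

let mec = rate(tally.cleanRounds, over: tally.rounds)
let ece = rate(tally.eventConditionErrors, over: tally.events)
let vue = rate(tally.variableUpdateErrors, over: tally.variables)
print(mec.map(percent) as Any, ece.map(percent) as Any, vue.map(percent) as Any)
```

Returning `nil` for a zero denominator is a design choice of this sketch: a 0% error rate over zero events would look clean while measuring nothing.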

Running it

The benchmark is a CLI that points at any OpenAI-compatible endpoint:

MOTU_EVAL_ENDPOINT=http://127.0.0.1:8780/v1/chat/completions \
MOTU_EVAL_MODEL=Qwen3-8B \
swift run motu-bench --mode rpgbench-sim --cartridge canopy-heart --seed 42

A cartridge is a self-contained adventure: zones, scenes, entities, and the affordances a player can act on. Every customizable value is defined by the cartridge, and the engine supplies a default when the cartridge omits one:

"affordances": [
  { "id": "tap-conduit", "type": "check", "dc": 12,
    "outcomes": { "success": "reveal:relay-core",
                  "failure": "narrate:the conduit stays dark" } }
]
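To show how a fragment like this could be read, here is a hedged sketch using Foundation's `JSONDecoder`. The `Affordance` and `CartridgeFragment` types and both helper functions are hypothetical mirrors of the JSON above, not the engine's real types. The sketch wraps the fragment in braces to make it a complete JSON object, and treating `dc` as optional is an assumption.

```swift
import Foundation

// Hypothetical mirror of the affordance fragment; not the engine's types.
struct Affordance: Decodable {
    let id: String
    let type: String
    let dc: Int?                     // assumed optional, with an engine default
    let outcomes: [String: String]
}

struct CartridgeFragment: Decodable {
    let affordances: [Affordance]
}

func decodeAffordances(from json: String) throws -> [Affordance] {
    try JSONDecoder().decode(CartridgeFragment.self, from: Data(json.utf8)).affordances
}

// Splits an outcome such as "reveal:relay-core" at the first colon.
func parseOutcome(_ raw: String) -> (verb: String, argument: String)? {
    let parts = raw.split(separator: ":", maxSplits: 1)
    guard parts.count == 2 else { return nil }
    return (String(parts[0]), String(parts[1]))
}

let sample = """
{ "affordances": [
  { "id": "tap-conduit", "type": "check", "dc": 12,
    "outcomes": { "success": "reveal:relay-core",
                  "failure": "narrate:the conduit stays dark" } }
] }
"""

// Fall back to an empty list if the sample fails to decode.
let affordances = (try? decodeAffordances(from: sample)) ?? []
for affordance in affordances {
    print(affordance.id, affordance.dc as Any, affordance.outcomes["success"] ?? "")
}
```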

Validate a cartridge before scoring against it — the validator runs a set of semantic playability rules and reports errors with a rule id and location:

swift run motu-bench --mode validate --cartridge canopy-heart
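For a sense of what a semantic playability rule might look like, here is an illustrative sketch. The rule itself, its id `reveal-target-exists`, and the `Finding` and `AffordanceSpec` types are all assumptions made for this example; the validator's actual rules and report format are not shown in this post.

```swift
// Hypothetical finding shape: a rule id plus a location.
struct Finding: Equatable {
    let ruleID: String
    let location: String
}

// Hypothetical, trimmed-down affordance for this example only.
struct AffordanceSpec {
    let id: String
    let outcomes: [String: String]
}

// Illustrative rule (an assumption, not one of the validator's own):
// every "reveal:<entity>" outcome must name an entity the cartridge defines.
func checkRevealTargets(_ affordances: [AffordanceSpec],
                        entityIDs: Set<String>) -> [Finding] {
    var findings: [Finding] = []
    for affordance in affordances {
        // Sort keys so findings come out in a stable order.
        for key in affordance.outcomes.keys.sorted() {
            guard let effect = affordance.outcomes[key] else { continue }
            let parts = effect.split(separator: ":", maxSplits: 1)
            guard parts.count == 2, parts[0] == "reveal" else { continue }
            if !entityIDs.contains(String(parts[1])) {
                findings.append(Finding(
                    ruleID: "reveal-target-exists",
                    location: "affordances[\(affordance.id)].outcomes.\(key)"))
            }
        }
    }
    return findings
}

let spec = AffordanceSpec(id: "tap-conduit",
                          outcomes: ["success": "reveal:relay-core",
                                     "failure": "narrate:the conduit stays dark"])
let missing = checkRevealTargets([spec], entityIDs: [])
let present = checkRevealTargets([spec], entityIDs: ["relay-core"])
print(missing, present)
```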

Reproducibility takes the whole stack

The first live run was not reproducible — same seed, different score. The engine was seeded, but the model was still sampling. Both have to be pinned:

// greedy decoding for the model...
let classifier = OMLXClassifier(endpoint: endpoint, model: model, temperature: 0)
// ...and a seeded RNG injected into the engine's combat rolls
let rng = { SeededRNG(seed: 42) }

Greedy decoding (temperature: 0) plus a seeded RNG yields the identical scorecard every run. A benchmark you cannot reproduce is not a benchmark.
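The engine half of that can be illustrated with a seeded generator. The sketch below is a stand-in, not our `SeededRNG` source: it assumes SplitMix64 as the algorithm and conforms to the standard library's `RandomNumberGenerator` protocol, and `rollD20s` is a hypothetical helper.

```swift
// Stand-in implementation. SplitMix64 is an assumption of this sketch;
// the post does not specify which algorithm SeededRNG uses.
struct SeededRNG: RandomNumberGenerator {
    private var state: UInt64

    init(seed: UInt64) {
        state = seed
    }

    mutating func next() -> UInt64 {
        state &+= 0x9E3779B97F4A7C15
        var z = state
        z = (z ^ (z >> 30)) &* 0xBF58476D1CE4E5B9
        z = (z ^ (z >> 27)) &* 0x94D049BB133111EB
        return z ^ (z >> 31)
    }
}

// Hypothetical helper: rolls `count` d20s from a fresh generator.
func rollD20s(count: Int, seed: UInt64) -> [Int] {
    var rng = SeededRNG(seed: seed)
    var rolls: [Int] = []
    for _ in 0..<count {
        rolls.append(Int.random(in: 1...20, using: &rng))
    }
    return rolls
}

let firstRun = rollD20s(count: 8, seed: 42)
let secondRun = rollD20s(count: 8, seed: 42)
print(firstRun == secondRun)
```

Because each call builds a fresh generator from the seed, two runs draw the same sequence. The mapping from raw bits to a range inside `Int.random(in:using:)` belongs to the standard library, so roll values are only guaranteed to match on the same toolchain.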

First score, and what it caught

Qwen3-8B on Canopy Heart (a solarpunk cartridge we wrote): MEC 81.8%, ECE 0%, VUE 0% over 11 rounds — identical on a re-run at seed 42, with non-zero denominators (11 events, 6 variables) so the ECE and VUE rates actually mean something.

The two non-clean rounds were not random. The engine flagged rounds 7–8, where the model emitted a __say__ bucket in combat instead of attacking, and the engine rejected it as an illegal action. The break-list points straight at the incoherent turn — which is the entire reason to score against an engine rather than a transcript.
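A break-list of this kind can be sketched as a filter over rounds. Everything here is illustrative: the `Round` type, the legality rule (only `attack` is legal in combat), and the transcript are assumptions. The transcript is synthetic, shaped to match the reported result, and is not the run's actual log.

```swift
enum Phase {
    case exploration
    case combat
}

// Hypothetical record of one round: the action bucket the model emitted.
struct Round {
    let number: Int
    let phase: Phase
    let bucket: String
}

// Assumed legality rule for this sketch: in combat only "attack" is legal;
// outside combat every bucket is accepted.
func isLegal(_ bucket: String, in phase: Phase) -> Bool {
    switch phase {
    case .combat:
        return bucket == "attack"
    case .exploration:
        return true
    }
}

// The break-list: round numbers whose action the engine rejects.
func breakList(_ rounds: [Round]) -> [Int] {
    rounds.filter { !isLegal($0.bucket, in: $0.phase) }.map { $0.number }
}

// Synthetic transcript, not the real log.
var transcript: [Round] = []
for n in 1...6 {
    transcript.append(Round(number: n, phase: .exploration, bucket: "check"))
}
for n in 7...8 {
    transcript.append(Round(number: n, phase: .combat, bucket: "__say__"))
}
for n in 9...11 {
    transcript.append(Round(number: n, phase: .combat, bucket: "attack"))
}

let flagged = breakList(transcript)
let cleanRounds = transcript.count - flagged.count
print(flagged, cleanRounds)
```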