Intelligence per watt: what we measure, and what we don't
- energy
- intelligence-per-watt
- method
- local-inference
We publish two numbers about the local squad that the field rarely reports together: how much of the work runs locally, and the energy each token costs. We deliberately do not publish a third — dollar cost — because it is deployment-specific and proves less than it appears to. Energy per token is a property of the model and the hardware; anyone with the same parts can reproduce it.
What we measure
For each agent's brain, on the machine that actually serves it:
- Throughput: tokens per second, from the server's own usage accounting.
- Energy per token: joules per token, computed as combined CPU+GPU+ANE power (Apple's powermetrics) sampled over a sustained generation, divided by throughput.
- Power: the average watts drawn while generating.
The numbers on the agent pages come from this measurement; nothing is modeled:
# prewarm so the measured window is pure inference, not model load,
# then sample Combined Power over a window that overlaps a long generation
sudo powermetrics --samplers cpu_power,gpu_power -n 60 -i 500 | grep "Combined Power"
# joules/token = average inference watts ÷ tokens-per-second
A > 2 W filter drops idle samples so the average reflects generation, not the gaps around it. Measured this way, the squad's brains span roughly 0.39 J/token (an 8-bit coder at ~77 tok/s) to 1.27 J/token (a 27B planner at ~25 tok/s) on the same Mac — the cost of carrying a token locally, in joules you can check.
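The arithmetic behind those figures can be sketched in a few lines of Python. This is a minimal illustration, not the script behind the published numbers. It assumes the captured powermetrics lines look like `Combined Power (CPU + GPU + ANE): 31000 mW`, and the function names and sample values are hypothetical.

```python
import re

# Assumed line shape: "Combined Power (CPU + GPU + ANE): 31000 mW".
# Adjust the pattern if your powermetrics build prints a different label.
COMBINED_RE = re.compile(r"Combined Power.*?:\s*([\d.]+)\s*mW")

IDLE_CUTOFF_W = 2.0  # samples at or below this are treated as idle and dropped


def average_inference_watts(powermetrics_output, cutoff_w=IDLE_CUTOFF_W):
    """Average the Combined Power samples that exceed the idle cutoff, in watts."""
    watts = [float(m.group(1)) / 1000.0
             for m in COMBINED_RE.finditer(powermetrics_output)]
    active = [w for w in watts if w > cutoff_w]
    if not active:
        raise ValueError("no samples above the idle cutoff")
    return sum(active) / len(active)


def joules_per_token(avg_watts, tokens_per_second):
    """Energy per token: watts (joules per second) divided by tokens per second."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return avg_watts / tokens_per_second


if __name__ == "__main__":
    # Synthetic samples for illustration only: one idle reading, two active ones.
    sample = (
        "Combined Power (CPU + GPU + ANE): 800 mW\n"
        "Combined Power (CPU + GPU + ANE): 29000 mW\n"
        "Combined Power (CPU + GPU + ANE): 31000 mW\n"
    )
    avg = average_inference_watts(sample)
    print(f"{avg:.1f} W, {joules_per_token(avg, 77):.2f} J/token")
```

Dropping idle samples before averaging matters because the sampling window only overlaps the generation. Any idle time inside the window would pull the mean down and understate joules per token.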
Standing on intelligence-per-watt
The framing is not ours. Saad-Falcon, Narayan, and colleagues (Stanford / Hazy Research) propose intelligence per watt — task accuracy per unit of power — as the metric for local-inference viability, across more than twenty local models and eight accelerators (arXiv:2511.07885). They report local models already answering 88.7% of single-turn queries, with locally-serviceable coverage rising from 23.2% to 71.3% in two years.
Our work sits one axis over. Their study measures single-turn chat and reasoning; we measure sustained, multi-step software work — plan, code, verify, repeat — carried by a roster of local models with a frontier model reserved for judgment. The question is not whether a local model can answer a prompt, but how much of a real engineering loop it can carry, and at what energy.
The part we have not measured
There is an obvious next claim, and we want to be precise that it is a direction, not a result. As the local roster is trained on the supervisor's own outputs, does the share of work it can carry rise over time — at held quality? We have the instrument: a frozen verifier suite that adjudicates squad output the same way every run, and a transcript record we stopped garbage-collecting so the series is continuous from here. We do not have the time-series yet, so we are not going to draw the curve. If quality drops as local share rises, distillation is not recovering capability; if the local share never moves, the traces are not transferring. Those are the falsifiers. We will report the lines when we have run them.
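The two falsifiers can be written down as checks before any data exists. The sketch below is an assumption about how such a check might look, not a description of the verifier suite. The `Run` record, the thresholds, and the choice to compare only the first and last runs are all hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass
class Run:
    local_share: float  # fraction of work carried by local models, 0..1
    quality: float      # pass rate from the frozen verifier suite, 0..1


def check_falsifiers(runs, share_eps=0.02, quality_eps=0.02):
    """Compare the first and last runs against the two stated falsifiers.

    share_eps and quality_eps are hypothetical tolerances for "did not move".
    """
    if len(runs) < 2:
        raise ValueError("need at least two runs to compare")
    share_delta = runs[-1].local_share - runs[0].local_share
    quality_delta = runs[-1].quality - runs[0].quality

    if abs(share_delta) <= share_eps:
        return "local share flat: traces are not transferring"
    if share_delta > 0 and quality_delta < -quality_eps:
        return "quality fell as local share rose: distillation is not recovering capability"
    return "no falsifier triggered"
```

A real analysis would fit a trend across the whole series rather than compare two endpoints. The point here is only that both failure conditions are stated before the data arrives.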