Benchmark
The evaluation harness is open. Clone the repository, run it against your own model (a local model or any OpenAI-compatible endpoint), and submit the results; entries are added to the leaderboard. Methods and per-case results are published with each run.
Leaderboard
Leaderboard — forthcoming.