2026-06-01

Verifier reliability on adversarial cases

evaluation
verifier
reliability
adversarial

The verifier agent was evaluated on a frozen nine-case suite designed to include adversarial variants alongside standard cases. The suite is scored by true-positive rate (correctly flagging failures) and true-negative rate (correctly passing correct implementations), with an adversarial pass to probe for verdict flips:

python3 evals/verifier/run_verifier_eval.py --adversarial
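The two rates can be computed from per-case records roughly as follows. This is a minimal sketch, not the contents of run_verifier_eval.py: the `Case` record, its field names, and the `score` function are hypothetical, and "positive" here means the implementation under review is actually faulty.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Case:
    # Hypothetical record; the field names are assumptions for illustration.
    name: str
    expected: str            # ground truth: "FAIL" if the implementation is faulty, else "PASS"
    verdict: Optional[str]   # verifier output; None when no verdict could be parsed


def score(cases: List[Case]) -> Tuple[float, float]:
    """Return (true-positive rate, true-negative rate) for a list of cases."""
    faulty = [c for c in cases if c.expected == "FAIL"]
    correct = [c for c in cases if c.expected == "PASS"]
    if not faulty or not correct:
        raise ValueError("need at least one faulty and one correct case")
    # A missing (None) verdict counts against whichever rate the case belongs to.
    tpr = sum(c.verdict == "FAIL" for c in faulty) / len(faulty)
    tnr = sum(c.verdict == "PASS" for c in correct) / len(correct)
    return tpr, tnr
```

Counting a missing verdict as a miss, rather than dropping the case, keeps parsing problems visible in the rates.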

Suite composition

The nine cases include standard passing and failing implementations, plus adversarial variants: implementations whose tests check only a signature and assert nothing about behavior, and implementations that produce correct output on the covered inputs but fail on cases the suite does not exercise.

The hollow test it caught

The probe case was an implementation whose test asserts only that the function runs and returns a non-None value, with no claim about behavior:

def test_it():
    result = process(data)
    assert result is not None   # checks only that something was returned, not what

A signature-checking verifier passes this. A behavioral one rejects it. Ours flagged it — distinguishing structural from behavioral correctness, which is the property the case was built to test.
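For contrast, a behavioral test pins the output to an expected value. The sketch below is illustrative only: the source does not say what `process` does, so a hypothetical stand-in that doubles each element is assumed.

```python
def process(data):
    # Hypothetical stand-in for the function under test: doubles each element.
    return [2 * x for x in data]


def test_hollow():
    # Passes for any implementation that returns something other than None.
    assert process([1, 2, 3]) is not None


def test_behavioral():
    # Pins the actual output, so an implementation with wrong behavior fails.
    assert process([1, 2, 3]) == [2, 4, 6]
    assert process([]) == []
```

An implementation that returned its input unchanged would still pass `test_hollow` but would fail `test_behavioral`, which is the gap a behavioral verifier is expected to notice.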

A parser bug masking a correct verdict

One failure that surfaced was not in the model but in the harness reading its output. The verifier sometimes prefaced its verdict with reasoning, so a naive startswith("FAIL") check turned a correct FAIL into a null verdict, a silent miss. The fix was a tolerant parser that scans for the verdict token anywhere in the output rather than requiring it at the start:

verdict = parse_verdict(raw)   # finds PASS/FAIL anywhere, not just at the start
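One way such a parser could look is sketched below. The body of `parse_verdict` is not shown in the source, so this is an assumed implementation. In particular, returning None when both tokens appear is a design choice made for this sketch, not something the source states.

```python
import re
from typing import Optional

# Matches PASS or FAIL as a standalone uppercase token (so "FAILED" does not match).
_VERDICT_RE = re.compile(r"\b(PASS|FAIL)\b")


def parse_verdict(raw: str) -> Optional[str]:
    """Return "PASS" or "FAIL" if exactly one distinct verdict appears, else None."""
    found = set(_VERDICT_RE.findall(raw))
    if len(found) == 1:
        return found.pop()
    # Either no verdict token was found or the output names both verdicts.
    return None


# Reasoning before the verdict no longer hides it.
print(parse_verdict("The test only checks for None.\nVerdict: FAIL"))
```

To avoid reintroducing a silent miss, the caller would likely want to treat a None result as a harness error to investigate, not as a verdict.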

Results

Across the full nine-case suite, the verifier produced zero verdict flips on repeated runs under identical conditions — the same case received the same verdict each time. True-positive rate and true-negative rate were both 100% on this suite.
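A verdict flip, as used here, means that the same case received different verdicts across repeated runs. A minimal sketch of that check follows. The `count_flips` helper and the shape of `runs` (one dict per run, mapping case name to verdict) are assumptions for illustration, and the sample data is invented, not taken from the evaluation.

```python
from collections import defaultdict


def count_flips(runs):
    """Count cases that received more than one distinct verdict across runs.

    `runs` is a list of dicts, one per run, mapping case name -> verdict.
    """
    seen = defaultdict(set)
    for run in runs:
        for case, verdict in run.items():
            seen[case].add(verdict)
    return sum(1 for verdicts in seen.values() if len(verdicts) > 1)


# Invented sample data: two identical runs, so no case flips.
stable = [{"case_a": "FAIL", "case_b": "PASS"}, {"case_a": "FAIL", "case_b": "PASS"}]
print(count_flips(stable))
```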

The suite is intentionally small; these figures do not generalize beyond the cases tested. It is frozen, and results from future runs are compared against this baseline.