Evaluation harness

The public assay-harness concepts behind reproducible scenario scoring and Agentsia Labs benchmarks.

The public evaluation harness provides a reproducible way to score language models against structured scenarios. It is the clean-room evaluation layer used for Agentsia Labs benchmarks; Modelsmith uses the same benchmark discipline inside governed specialisation runs.

Core concepts

Scenario

A scenario is one realistic task with inputs, metadata, expected evidence, and rubric criteria. Good scenarios read like real operator work, not textbook questions.
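As a rough illustration, a scenario could be modelled as a small immutable record. This is a sketch only: the class name, every field name, and the `validate` helper are assumptions made for this example, not the harness's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """One realistic task. All field names are hypothetical."""
    scenario_id: str
    prompt: str                          # the task, phrased as operator work
    inputs: dict[str, str]               # supporting material for the task
    metadata: dict[str, str]             # e.g. capability and outcome type
    expected_evidence: tuple[str, ...]   # what a correct answer should cite
    rubric_criteria: tuple[str, ...]     # what the rubric will check


def validate(scenario: Scenario) -> list[str]:
    """Return a list of problems; an empty list means the scenario is usable."""
    problems = []
    if not scenario.scenario_id.strip():
        problems.append("scenario_id is empty")
    if not scenario.expected_evidence:
        problems.append("no expected evidence listed")
    if not scenario.rubric_criteria:
        problems.append("no rubric criteria listed")
    return problems
```

A check like `validate` is one way to catch scenarios that cannot be scored before they reach a runner.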

Runner

A runner sends the scenario to a model through a configured provider or local serving target. The harness records enough run metadata to compare results without depending on a single vendor.
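One way to keep results comparable across vendors is to put every provider behind the same callable signature and record the configuration next to the output. The sketch below assumes hypothetical names throughout (`RunnerConfig`, `RunOutput`, `run_scenario`, `echo_backend`); it is not the harness's real interface.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RunnerConfig:
    """Hypothetical settings needed to repeat a run."""
    provider: str
    model_label: str
    temperature: float
    max_tokens: int


@dataclass(frozen=True)
class RunOutput:
    scenario_id: str
    model_label: str
    provider: str
    response: str
    latency_s: float


# A backend is any callable mapping (prompt, config) to response text,
# so a hosted API and a local server can be swapped without other changes.
Backend = Callable[[str, RunnerConfig], str]


def run_scenario(scenario_id: str, prompt: str,
                 config: RunnerConfig, backend: Backend) -> RunOutput:
    """Send one prompt to the backend and record timing and run metadata."""
    start = time.perf_counter()
    response = backend(prompt, config)
    latency = time.perf_counter() - start
    return RunOutput(scenario_id, config.model_label, config.provider,
                     response, latency)


def echo_backend(prompt: str, config: RunnerConfig) -> str:
    """Stand-in backend for local testing; it calls no model."""
    return f"[{config.model_label}] {prompt}"
```

Swapping `echo_backend` for a function that calls a real provider leaves `run_scenario` and the recorded metadata unchanged.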

Rubric

A rubric turns the response into a score. It should name what a correct answer must include, what common wrong answers look like, and which failure modes are critical.
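The three parts of a rubric named above can be sketched as a scoring function. This example uses case-insensitive substring matching purely as a stand-in; a real rubric may rely on human or model graders. The class, the field names, and the 0.25 penalty per wrong answer are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rubric:
    """Hypothetical rubric: phrases stand in for richer criteria."""
    must_include: tuple[str, ...]        # what a correct answer must include
    known_wrong: tuple[str, ...]         # common wrong answers
    critical_failures: tuple[str, ...]   # any hit forces a score of 0


def score(response: str, rubric: Rubric) -> dict:
    """Score a response in [0, 1] and report which checks fired."""
    text = response.lower()
    hits = [p for p in rubric.must_include if p.lower() in text]
    wrong = [p for p in rubric.known_wrong if p.lower() in text]
    critical = [p for p in rubric.critical_failures if p.lower() in text]

    if critical:
        value = 0.0  # a critical failure overrides everything else
    else:
        total = len(rubric.must_include)
        coverage = len(hits) / total if total else 0.0
        value = max(0.0, coverage - 0.25 * len(wrong))  # assumed penalty

    missing = [p for p in rubric.must_include if p not in hits]
    return {"score": value, "missing": missing,
            "wrong": wrong, "critical": critical}
```

Treating critical failures as an override rather than a penalty keeps a response that is mostly right but dangerous from earning partial credit.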

Result record

Each run produces a result record containing the scenario ID, model label, score, latency, and rubric outcome. Published benchmarks should use redacted records that are safe to share.
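A minimal sketch of a result record and its redacted form follows. The five public fields mirror the list above; the `raw_response` field and the allow-list approach to redaction are assumptions made for this example, not the harness's documented behaviour.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ResultRecord:
    scenario_id: str
    model_label: str
    score: float
    latency_s: float
    rubric_outcome: str   # e.g. "pass" or "fail"
    raw_response: str     # hypothetical private field, kept out of releases


# Allow-list of fields considered safe to publish (an assumption).
PUBLIC_FIELDS = ("scenario_id", "model_label", "score",
                 "latency_s", "rubric_outcome")


def redact(record: ResultRecord) -> dict:
    """Return only the allow-listed fields of a record."""
    full = asdict(record)
    return {name: full[name] for name in PUBLIC_FIELDS}
```

An allow-list fails closed: a field added to the record later stays private until someone decides it is safe to share.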

Relationship to Modelsmith

The harness measures. Modelsmith acts on the measurement: it evaluates the current model, diagnoses failure clusters, specialises on the failures, records the iterate ledger, and gates promotion on the held-out benchmark.

Reproducibility standard

A benchmark release should state:

  • benchmark version and scenario count
  • scenario coverage by capability and outcome type
  • runner configuration at the level needed to repeat the score
  • scoring rubric and pass threshold
  • redaction boundary for any shared result records
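The checklist above could be enforced with a small manifest check before a release is published. This is a sketch under stated assumptions: the key names and the `missing_fields` helper are hypothetical, and the harness may use a different release format.

```python
# Hypothetical manifest keys, one or more per item in the checklist above.
REQUIRED_FIELDS = (
    "benchmark_version",
    "scenario_count",
    "scenario_coverage",    # coverage by capability and outcome type
    "runner_config",        # enough detail to repeat the score
    "rubric",
    "pass_threshold",
    "redaction_boundary",
)


def missing_fields(manifest: dict) -> list[str]:
    """List required keys that are absent or empty in the manifest."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = manifest.get(name)
        if value is None or value == "" or value == {} or value == []:
            missing.append(name)
    return missing
```

A release script could refuse to publish whenever `missing_fields` returns a non-empty list.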