Client operating surfaces
Every surface is auditable.
Modelsmith remains agent-operated under the hood, but the public surface is built around customer trust: human reviewers set rubrics, approve promotions, and inspect evidence bundles. Everything between those gates runs with a recorded state trail.
Domain owners author scenarios, pass criteria, failure criteria, and held-out coverage. The benchmark stays readable enough for a non-engineering expert to challenge.
- scenario coverage
- rubric review
- held-out balance
Operators select the base model, cluster, benchmark version, hardware boundary, and promotion threshold before a run starts.
- base model
- target cluster
- promotion threshold
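As a rough illustration, the operator choices listed above could be captured in an immutable record that is validated once, before the run starts. This is a minimal sketch, not Modelsmith's actual schema: the `RunConfig` class, its field names, and the assumption that the promotion threshold is a score between 0 and 1 are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    """Operator choices fixed before a run starts (hypothetical schema)."""

    base_model: str
    target_cluster: str
    benchmark_version: str
    hardware_boundary: str
    promotion_threshold: float  # assumption: a score in the range [0, 1]

    def __post_init__(self) -> None:
        # Reject empty selections so a run cannot start half-configured.
        for name in ("base_model", "target_cluster",
                     "benchmark_version", "hardware_boundary"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} must be set before a run starts")
        if not 0.0 <= self.promotion_threshold <= 1.0:
            raise ValueError("promotion_threshold must be within [0, 1]")


# Placeholder values for illustration only.
config = RunConfig(
    base_model="base-model-a",
    target_cluster="cluster-1",
    benchmark_version="bench-v1",
    hardware_boundary="single-node",
    promotion_threshold=0.85,
)
```

Freezing the record mirrors the idea that these selections are made before the run and are not changed while it is in flight.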
Every evaluation, training step, assessment, and blocked state is recorded with the evidence that led to the next decision.
- phase state
- score movement
- failure clusters
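One way to picture the recorded trail is as an append-only log in which each entry carries its phase state, an optional score, any failure clusters, and a pointer to the supporting evidence. The sketch below is an assumption about shape only: `TrailEntry`, `StateTrail`, and the `evidence` field are hypothetical names, not part of any documented Modelsmith interface.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class TrailEntry:
    """One recorded state (hypothetical shape)."""

    phase: str                         # e.g. "evaluation", "training", "blocked"
    score: Optional[float]             # None when the phase yields no score
    failure_clusters: Tuple[str, ...]  # labels for grouped failures, if any
    evidence: str                      # reference to the evidence behind the next decision


@dataclass
class StateTrail:
    """Append-only list of entries; nothing is edited after it is recorded."""

    entries: List[TrailEntry] = field(default_factory=list)

    def record(self, entry: TrailEntry) -> None:
        # Assumption: an entry without evidence is not allowed into the trail.
        if not entry.evidence:
            raise ValueError("every recorded state needs evidence")
        self.entries.append(entry)

    def score_movement(self) -> Optional[float]:
        # Difference between the latest and the earliest recorded score.
        scores = [e.score for e in self.entries if e.score is not None]
        if len(scores) < 2:
            return None
        return scores[-1] - scores[0]


trail = StateTrail()
trail.record(TrailEntry("evaluation", 0.5, ("long-context",), "eval-report-1"))
trail.record(TrailEntry("training", None, (), "training-log-1"))
trail.record(TrailEntry("evaluation", 0.75, (), "eval-report-2"))
```

Here `trail.score_movement()` compares the first and last scored entries, which is one simple reading of "score movement"; a real system might track movement per benchmark slice instead.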
Training and inference run on the approved customer-controlled target, while shared evidence stays limited to summaries and scorecards.
- serving target
- capacity notes
- redacted evidence
A candidate advances only when the benchmark threshold, operational checks, rollback posture, and approver sign-off are satisfied.
- quality threshold
- approver chain
- rollback posture
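The promotion rule above is a conjunction: all four conditions must hold, and any one of them can block a candidate. A minimal sketch of that gate follows, assuming that "satisfied" means the benchmark score is at or above the threshold and that every approver in the chain has signed off. The `Candidate` type and `promotion_decision` function are hypothetical names used for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence, Tuple


@dataclass(frozen=True)
class Candidate:
    """Facts about a candidate at review time (hypothetical shape)."""

    benchmark_score: float
    operational_checks_passed: bool
    rollback_ready: bool
    approvals: FrozenSet[str]  # approvers who have already signed off


def promotion_decision(
    candidate: Candidate,
    threshold: float,
    approver_chain: Sequence[str],
) -> Tuple[bool, List[str]]:
    """Return (advance, blockers); the candidate advances only with no blockers."""
    blockers: List[str] = []
    # Assumption: meeting the threshold exactly counts as passing.
    if candidate.benchmark_score < threshold:
        blockers.append("quality threshold")
    if not candidate.operational_checks_passed:
        blockers.append("operational checks")
    if not candidate.rollback_ready:
        blockers.append("rollback posture")
    missing = [a for a in approver_chain if a not in candidate.approvals]
    if missing:
        blockers.append("approver sign-off: " + ", ".join(missing))
    return (not blockers, blockers)


chain = ["domain-owner", "operator"]
ready = Candidate(0.9, True, True, frozenset(chain))
pending = Candidate(0.9, True, True, frozenset({"domain-owner"}))
```

Returning the list of blockers alongside the decision keeps a blocked state explainable, in line with recording the evidence that led to each decision.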
The review packet captures benchmark version, score movement, lineage, operational measurements, and remaining risk before approval.
- benchmark version
- lineage
- operational risk