Eval
Run the governed scenario set.
The agent calls modelsmith eval against your cluster. Modelsmith runs the held-out and core scenario sets, scores each against the rubric, and writes the composite to the iterate ledger. The agent reads a pass or fail exit code; no log scraping required.
$ modelsmith eval --quick --cluster rtb-pricing

Modelsmith Fleet Status · 2026-04-18 09:12:04 BST

cluster      rtb-pricing
model        qwen3-32b-awq
iteration    17
scenarios    3 (held-out, representative)
phase        3 Robustness eval
composite    0.894  PASS (threshold 0.880)
held_out     0.872  PASS (threshold 0.860)
regressions  0 / 248
duration     4m 12s
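As a rough sketch of what "reads a pass or fail exit code" could look like on the agent side, the Python helper below runs a command and decides purely from its exit status, discarding the printed report. The helper name `eval_passed`, the `EVAL_CMD` constant, the timeout value, and the convention that exit code 0 means PASS and any non-zero code means FAIL are all assumptions made for illustration; the text above does not specify the actual codes.

```python
import subprocess
import sys
from typing import Sequence

# The command shown in the transcript above, split into argv form.
EVAL_CMD = ["modelsmith", "eval", "--quick", "--cluster", "rtb-pricing"]


def eval_passed(argv: Sequence[str], timeout_s: float = 900.0) -> bool:
    """Run an eval command and report pass/fail from its exit code alone.

    Hypothetical helper. Assumption: exit code 0 means PASS and any
    non-zero exit code means FAIL.
    """
    result = subprocess.run(
        list(argv),
        stdout=subprocess.DEVNULL,  # the report is for humans; it is not parsed
        stderr=subprocess.DEVNULL,
        timeout=timeout_s,
        check=False,  # inspect the return code ourselves instead of raising
    )
    return result.returncode == 0
```

An agent would then branch on `eval_passed(EVAL_CMD)`, for example continuing to the next iteration on `True` and stopping on `False`. Because the decision never touches stdout, changes to the report layout would not affect it.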