Getting Started
How a customer team should prepare a first governed Modelsmith specialisation run.
Welcome to the Agentsia Modelsmith documentation. This guide explains the preparation work for a first governed specialisation run: define the benchmark, choose the execution boundary, and decide what evidence will be required before promotion.
Prerequisites
- Domain owner: someone who can judge whether scenarios represent real work.
- Benchmark material: representative tasks, pass criteria, and failure criteria.
- Execution boundary: the customer-controlled environment where training and inference will run.
- Promotion owner: the person or group that can approve a candidate model.
1. Define the benchmark
Start with 20 to 40 realistic scenarios. Each scenario should include the input the model sees, what a correct answer must contain, common wrong answers, and the capability being tested.
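The benchmark becomes the contract for what "better" means. Keep a held-out set separate from any training material, so that held-out scores measure generalisation rather than recall of examples the model has already seen.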
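One way to keep scenarios consistent is to give each one a fixed shape and to split off the held-out set deterministically. The sketch below is illustrative only: the `Scenario` class, its field names, the `split_held_out` helper, and the 25% held-out fraction are assumptions made for this example, not part of Modelsmith.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Scenario:
    """Hypothetical record for one benchmark scenario (field names are assumptions)."""
    scenario_id: str
    model_input: str                       # the input the model sees
    must_contain: Tuple[str, ...]          # what a correct answer must contain
    common_wrong_answers: Tuple[str, ...]  # known failure patterns
    capability: str                        # the capability being tested


def split_held_out(
    scenarios: Sequence[Scenario],
    held_out_fraction: float = 0.25,  # assumed fraction, not a Modelsmith default
    seed: int = 0,
) -> Tuple[List[Scenario], List[Scenario]]:
    """Return (working, held_out) with no scenario in both lists.

    Scenarios are sorted by ID before a seeded shuffle, so the split is the
    same regardless of the order in which they were supplied.
    """
    if not 0 < held_out_fraction < 1:
        raise ValueError("held_out_fraction must be between 0 and 1")
    ordered = sorted(scenarios, key=lambda s: s.scenario_id)
    random.Random(seed).shuffle(ordered)
    n_held = max(1, round(len(ordered) * held_out_fraction))
    return ordered[n_held:], ordered[:n_held]
```

Fixing the seed means the same scenarios stay held out from one run to the next, which keeps held-out scores comparable across iterations.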
2. Select the model and boundary
Choose the base model, target cluster, benchmark version, serving target, and promotion threshold. These choices should be recorded before the first run so later score movement has a stable reference point.
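A minimal sketch of recording these choices before the first run follows. The `RunConfig` class, its field names, and the `fingerprint` helper are hypothetical and only illustrate the idea of a frozen, comparable reference point; they are not a Modelsmith API.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical record of the choices fixed before the first run."""
    base_model: str
    target_cluster: str
    benchmark_version: str
    serving_target: str
    promotion_threshold: float


def fingerprint(config: RunConfig) -> str:
    """Return a stable SHA-256 digest of the configuration.

    Keys are sorted before hashing so the digest depends only on the values,
    which lets two runs be compared for "same reference point" at a glance.
    """
    payload = json.dumps(asdict(config), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Because the dataclass is frozen, a recorded configuration cannot be edited in place; changing any choice means creating a new record with a different fingerprint, so score movement can always be traced to the configuration it was measured against.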
3. Start the iterate loop
The iterate loop evaluates the current model, diagnoses failure clusters, specialises on those failures, and returns to evaluation. The customer-visible state should show the current phase, the latest score, any blocked state, and whether the candidate is approaching the promotion gate.
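The loop and its visible state can be pictured as a small state machine. The sketch below is an illustration under stated assumptions: the `Phase` and `LoopState` names, the `advance` method, and the `margin` used to decide what "approaching" means are all hypothetical and do not describe how Modelsmith implements the loop.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Phase(Enum):
    EVALUATE = "evaluate"
    DIAGNOSE = "diagnose"
    SPECIALISE = "specialise"


# Evaluate -> diagnose -> specialise -> back to evaluate.
_NEXT_PHASE = {
    Phase.EVALUATE: Phase.DIAGNOSE,
    Phase.DIAGNOSE: Phase.SPECIALISE,
    Phase.SPECIALISE: Phase.EVALUATE,
}


@dataclass
class LoopState:
    """Hypothetical customer-visible state of the iterate loop."""
    phase: Phase = Phase.EVALUATE
    latest_score: Optional[float] = None
    blocked_reason: Optional[str] = None  # None means the loop is not blocked

    def advance(self) -> None:
        """Move to the next phase, refusing to proceed while blocked."""
        if self.blocked_reason is not None:
            raise RuntimeError(f"loop is blocked: {self.blocked_reason}")
        self.phase = _NEXT_PHASE[self.phase]

    def approaching_gate(self, threshold: float, margin: float = 0.05) -> bool:
        """True once the latest score is within `margin` of the threshold.

        The margin is an assumed value chosen for this example.
        """
        if self.latest_score is None:
            return False
        return self.latest_score >= threshold - margin
```

Keeping the blocked reason as an explicit field, rather than a separate phase, means the state always shows which phase the loop stopped in as well as why.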
4. Review before promotion
Before a candidate ships, review the evidence bundle: benchmark version, score movement, held-out performance, latency, cost-per-decision, lineage, rollback posture, and approver sign-off.
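A reviewer can treat the evidence bundle as a checklist. The sketch below assumes the bundle is represented as a plain mapping whose keys mirror the items listed above; the key names and the `missing_evidence` helper are hypothetical, chosen for this example.

```python
from typing import Any, List, Mapping

# Keys mirror the evidence items named in this guide; the exact names are assumptions.
REQUIRED_EVIDENCE = (
    "benchmark_version",
    "score_movement",
    "held_out_performance",
    "latency",
    "cost_per_decision",
    "lineage",
    "rollback_posture",
    "approver_sign_off",
)


def _is_empty(value: Any) -> bool:
    """Treat None and empty strings or containers as missing evidence."""
    if value is None:
        return True
    if isinstance(value, (str, list, tuple, dict)):
        return len(value) == 0
    return False


def missing_evidence(bundle: Mapping[str, Any]) -> List[str]:
    """Return the required evidence items that are absent or empty, in checklist order."""
    return [key for key in REQUIRED_EVIDENCE if _is_empty(bundle.get(key))]
```

An empty result means every item is present; it does not say whether the evidence is good enough, which remains the promotion owner's decision.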