Getting Started

How a customer team should prepare a first governed Modelsmith specialisation run.

Welcome to the Agentsia Modelsmith documentation. This guide explains the preparation work for a first governed specialisation run: define the benchmark, choose the execution boundary, and decide what evidence will be required before promotion.

Prerequisites

  • Domain owner: someone who can judge whether scenarios represent real work.
  • Benchmark material: representative tasks, pass criteria, and failure criteria.
  • Execution boundary: the customer-controlled environment where training and inference will run.
  • Promotion owner: the person or group that can approve a candidate model.

1. Define the benchmark

Start with 20 to 40 realistic scenarios. Each scenario should include the input the model sees, what a correct answer must contain, common wrong answers, and the capability being tested.
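The four parts of a scenario can be captured in a small record so that incomplete entries are caught before the benchmark is frozen. The following is a minimal sketch, not a Modelsmith schema: the `Scenario` class, its field names, and the `validate` helper are all hypothetical names chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """One benchmark scenario. Field names are illustrative assumptions."""

    scenario_id: str
    model_input: str                        # the input the model sees
    required_content: tuple[str, ...]       # what a correct answer must contain
    common_wrong_answers: tuple[str, ...]   # known wrong answers to watch for
    capability: str                         # the capability being tested


def validate(scenarios: list[Scenario]) -> list[str]:
    """Return a list of problems; an empty list means the set looks complete."""
    problems: list[str] = []
    if not 20 <= len(scenarios) <= 40:
        problems.append(f"expected 20 to 40 scenarios, got {len(scenarios)}")
    seen: set[str] = set()
    for s in scenarios:
        if s.scenario_id in seen:
            problems.append(f"{s.scenario_id}: duplicate id")
        seen.add(s.scenario_id)
        if not s.model_input.strip():
            problems.append(f"{s.scenario_id}: empty input")
        if not s.required_content:
            problems.append(f"{s.scenario_id}: no pass criteria")
        if not s.common_wrong_answers:
            problems.append(f"{s.scenario_id}: no common wrong answers")
        if not s.capability.strip():
            problems.append(f"{s.scenario_id}: no capability named")
    return problems
```

A domain owner could run `validate` over the draft set and resolve every reported problem before the benchmark version is recorded.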

The benchmark becomes the contract for what "better" means. Keep a held-out set separate from any training material, so that held-out performance reflects generalisation rather than memorised examples.
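One way to keep the held-out set stable is to assign each scenario by a hash of its ID, so that adding scenarios later never moves an existing one across the boundary. This is a sketch under stated assumptions: the 25% held-out fraction and the function names are illustrative, not values or APIs taken from Modelsmith.

```python
import hashlib


def is_held_out(scenario_id: str, held_out_fraction: float = 0.25) -> bool:
    """Deterministically assign a scenario to the held-out set by its ID.

    The 0.25 default is an assumed fraction, not a documented one.
    """
    digest = hashlib.sha256(scenario_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < held_out_fraction


def split(scenario_ids):
    """Return (training_eligible, held_out) lists, preserving input order."""
    held_out = [i for i in scenario_ids if is_held_out(i)]
    training_eligible = [i for i in scenario_ids if not is_held_out(i)]
    return training_eligible, held_out


def assert_no_leakage(training_ids, held_out_ids) -> None:
    """Raise if any held-out scenario also appears in training material."""
    overlap = set(training_ids) & set(held_out_ids)
    if overlap:
        raise ValueError(f"held-out scenarios found in training: {sorted(overlap)}")
```

Because the assignment depends only on the scenario ID, the same split can be reproduced in any environment without sharing a random seed.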

2. Select the model and boundary

Choose the base model, target cluster, benchmark version, serving target, and promotion threshold. These choices should be recorded before the first run so later score movement has a stable reference point.
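The five choices above can be recorded as an immutable record with a short fingerprint, so that every later score can be tied back to the exact configuration it was measured under. This is an illustrative sketch: `RunConfig`, its field names, and the example values are assumptions, not part of a Modelsmith API.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunConfig:
    """Choices fixed before the first run. Names are illustrative assumptions."""

    base_model: str
    target_cluster: str
    benchmark_version: str
    serving_target: str
    promotion_threshold: float

    def fingerprint(self) -> str:
        """Short, stable identifier derived from the recorded choices."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


# Placeholder values for illustration only.
config = RunConfig(
    base_model="example-base-model",
    target_cluster="example-cluster",
    benchmark_version="v1",
    serving_target="example-endpoint",
    promotion_threshold=0.9,
)
print(config.fingerprint())
```

Storing the fingerprint next to each reported score makes it obvious when two scores are not comparable because one of the recorded choices changed between them.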

3. Start the iterate loop

The iterate loop evaluates the current model, diagnoses failure clusters, specialises on those failures, and returns to evaluation. The customer-visible state should show the current phase, the latest score, any blocked state, and whether the candidate is approaching the promotion gate.
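The control flow of the loop can be sketched as follows. This is a simplified illustration, not Modelsmith's implementation: the `evaluate`, `diagnose`, and `specialise` callables are supplied by the caller, and the phase names, the `max_rounds` budget, and the `gate_margin` used to decide "approaching" are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoopStatus:
    """Customer-visible state. Field and phase names are illustrative."""

    phase: str = "evaluate"
    latest_score: Optional[float] = None
    blocked_reason: Optional[str] = None
    approaching_gate: bool = False


def iterate(model, evaluate, diagnose, specialise,
            threshold, max_rounds=5, gate_margin=0.05):
    """Run evaluate -> diagnose -> specialise until the gate is met or blocked.

    evaluate(model) returns (score, failures); diagnose(failures) returns
    failure clusters; specialise(model, clusters) returns a new model.
    """
    status = LoopStatus()
    rounds_used = 0
    while True:
        status.phase = "evaluate"
        score, failures = evaluate(model)
        status.latest_score = score
        status.approaching_gate = score >= threshold - gate_margin
        if score >= threshold:
            status.phase = "ready_for_review"  # hypothetical phase name
            return model, status
        if rounds_used == max_rounds:
            status.blocked_reason = "round budget exhausted"
            return model, status

        status.phase = "diagnose"
        clusters = diagnose(failures)
        if not clusters:
            status.blocked_reason = "no failure clusters to specialise on"
            return model, status

        status.phase = "specialise"
        model = specialise(model, clusters)
        rounds_used += 1
```

Note that reaching the threshold only moves the candidate to review; it does not promote it. Promotion remains a separate, human-approved step.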

4. Review before promotion

Before a candidate ships, review the evidence bundle: benchmark version, score movement, held-out performance, latency, cost-per-decision, lineage, rollback posture, and approver sign-off.
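A simple completeness check can confirm that every item in the evidence bundle is present before the promotion owner is asked to review it. The sketch below assumes the bundle is a plain dictionary; the key names mirror the list above but are hypothetical, as is the `missing_evidence` helper.

```python
# Keys mirror the evidence items listed above; the exact names are assumptions.
REQUIRED_EVIDENCE = (
    "benchmark_version",
    "score_movement",
    "held_out_performance",
    "latency",
    "cost_per_decision",
    "lineage",
    "rollback_posture",
    "approver_sign_off",
)


def is_missing(value) -> bool:
    """Treat None and empty strings or containers as missing; 0 is a valid value."""
    if value is None:
        return True
    if isinstance(value, str):
        return not value.strip()
    if isinstance(value, (list, tuple, dict)):
        return len(value) == 0
    return False


def missing_evidence(bundle: dict) -> list:
    """Return the required evidence items that are absent or empty, in order."""
    return [key for key in REQUIRED_EVIDENCE if is_missing(bundle.get(key))]
```

An empty result means the bundle is complete enough to review. It does not mean the evidence is good enough to promote, which remains the approver's judgment.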