How it works

From baseline to production-validated specialist in four steps.

Modelsmith runs a closed loop: evaluate the current model against your governed scenario set, diagnose failures, post-train on the failure cases, and promote if the composite score clears the threshold. An agent can drive every step. Humans appear only at explicit promotion gates.
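The loop can be sketched in a few lines of Python. This is an illustrative sketch only: the four callables and the function name `iterate_once` are hypothetical stand-ins for the surfaces described below, not Modelsmith's API; the 0.880 threshold is taken from the walkthrough.

```python
# Hypothetical sketch of one pass through the closed loop. The four
# callables stand in for the eval, diagnose, post-train, and promote
# surfaces described in this document; none of these names are real API.
THRESHOLD = 0.880  # composite threshold from the walkthrough below

def iterate_once(run_eval, diagnose, post_train, promote):
    result = run_eval()  # composite score vs threshold
    if result["composite"] >= THRESHOLD:
        promote(result)  # human reviews the evidence bundle at this gate
        return "promoted"
    failures = diagnose(result)  # structured failure patterns
    post_train(failures)         # GRPO on the failure cases
    return "trained"             # loop back to eval on the next pass
```

The agent drives every branch; the only human touchpoint is the promotion gate.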

Eval

Run the governed scenario set. Score the composite against the threshold.

Diagnose

Identify failure patterns. Surface them to the agent or operator.

Post-train

Fine-tune on failure cases with GRPO. Monitor convergence.

Promote

Write a signed promotion record. Human reviews the evidence bundle.

The iterate loop

A complete cycle, step by step.

The walkthrough below follows a single adtech cluster through one full iteration. The same loop applies to any domain. CLI commands and MCP tool calls are interchangeable; Claude Code, Cursor, and Codex drive both surfaces.

Eval

Run the governed scenario set.

The agent calls modelsmith eval against your cluster. Modelsmith runs the held-out and core scenario sets, scores each against the rubric, and writes the composite to the iterate ledger. The agent reads a pass or fail exit code; no log scraping required.

$ modelsmith eval --quick --cluster rtb-pricing

  Modelsmith Fleet Status  ·  2026-04-18 09:12:04 BST

  cluster        rtb-pricing
  model          qwen3-32b-awq
  iteration      17
  scenarios      3 (held-out, representative)
  phase          3 Robustness eval

  composite      0.894   PASS  (threshold 0.880)
  held_out       0.872   PASS  (threshold 0.860)
  regressions    0 / 248
  duration       4m 12s
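Because the result is signalled through the exit code, an agent can gate on the return status alone. A minimal wrapper sketch, assuming exit 0 means PASS and non-zero means FAIL as the text describes; the `cmd` parameter exists only to make the sketch testable without the real binary.

```python
import subprocess

def eval_passed(cluster: str, cmd: str = "modelsmith") -> bool:
    # Assumes `modelsmith eval` exits 0 on PASS and non-zero on FAIL,
    # as stated above; no stdout parsing or log scraping needed.
    proc = subprocess.run([cmd, "eval", "--quick", "--cluster", cluster])
    return proc.returncode == 0
```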

Diagnose

Surface failure patterns to the agent.

When a run fails, the agent calls modelsmith_eval_results or modelsmith_compare to fetch the structured failure log. Each failed scenario carries a rubric score, the model's response, and a ground-truth reference. The agent uses these to generate a targeted post-training dataset or to escalate to a human for rubric review. The call below shows the modelsmith_eval_run invocation that produced the failing tag in the first place.

// agent -> modelsmith_eval_run

{
  "tool": "modelsmith_eval_run",
  "args": {
    "tag": "rtb-pricing-iter-17-core",
    "cluster": "rtb-pricing",
    "workers": 2,
    "type": "core",
    "domain": "adtech"
  }
}

// response

{
  "jobId": "a3f82c41",
  "status": "started",
  "tag": "rtb-pricing-iter-17-core",
  "log": ".iterate/rtb-pricing/logs/eval-a3f82c41.log"
}
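From the fetched failure log, the agent can assemble the targeted dataset directly. A sketch under an assumed record shape: each scenario dict carries the rubric score, model response, and ground-truth reference the text describes, but the key names here are illustrative, not Modelsmith's schema.

```python
def failures_to_dataset(scenarios, passing_score=1.0):
    # Keep only scenarios that scored below the rubric bar, pairing each
    # model response with its ground-truth reference for post-training.
    # The key names are assumptions, not Modelsmith's actual schema.
    return [
        {"response": s["response"], "reference": s["reference"]}
        for s in scenarios
        if s["rubric_score"] < passing_score
    ]
```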

Post-train

Fine-tune on failure cases.

The agent starts a GRPO training job via modelsmith_train_start, targeting the failure cases from the previous eval run. Training progress is queryable via modelsmith_train_status. When the job completes, the agent loops back to eval and measures whether the composite improved.

// agent -> modelsmith_train_start

{
  "tool": "modelsmith_train_start",
  "args": {
    "cluster": "rtb-pricing",
    "eval_tag": "rtb-pricing-iter-17-core",
    "method": "grpo",
    "target": "failures"
  }
}

// response

{
  "jobId": "t9c14e82",
  "status": "queued",
  "cluster": "rtb-pricing",
  "estimated_duration_minutes": 38
}
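Since the job starts queued with an estimated duration, the agent polls until it settles. A generic poller sketch: `get_status` stands in for the modelsmith_train_status call, "queued" comes from the example response above, and the rest of the status vocabulary is an assumption.

```python
import time

def wait_for_job(get_status, job_id, poll_seconds=60, sleep=time.sleep):
    # Poll until the job leaves its in-flight states. "queued" is taken
    # from the example response; "running" and the terminal states are
    # assumptions about the status vocabulary.
    while True:
        status = get_status(job_id)["status"]
        if status not in ("queued", "running"):
            return status  # e.g. "completed" or "failed"
        sleep(poll_seconds)
```

On completion the agent loops back to eval and checks whether the composite moved.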

Promote

Write a signed promotion record.

When the composite clears the threshold, the agent calls modelsmith promote. Modelsmith writes an immutable promotion record to the cluster ledger: iteration, composite, eval tag, and adapter path. A human reviewer reads the evidence bundle and approves or rejects the promotion in the state machine.

$ modelsmith promote rtb-pricing

  Promotion request
    Cluster:   rtb-pricing
    Model:     qwen3-32b-awq
    Iteration: 17
    Composite: 0.894
    Eval tag:  qwen3-32b-tq-rtb-pricing-iter17-core
    Adapter:   .iterate/rtb-pricing/adapters/iter-17.safetensors

  Promotion record written.
    Directory: .iterate/rtb-pricing/promotions/
    Latest:    2026-04-18-manual-promote.json
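The record itself is small. A sketch of the fields the text says it carries (iteration, composite, eval tag, adapter path, plus the cluster); the JSON key names are illustrative, not the on-disk schema.

```python
import json

def promotion_record(cluster, iteration, composite, eval_tag, adapter):
    # Serialize the fields named in the text. Key names are assumptions;
    # the real schema is whatever Modelsmith writes under
    # .iterate/<cluster>/promotions/.
    return json.dumps({
        "cluster": cluster,
        "iteration": iteration,
        "composite": composite,
        "eval_tag": eval_tag,
        "adapter": adapter,
    }, indent=2)
```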

Agent-native surfaces

Every surface is typed and scriptable.

Modelsmith is designed to be operated by agents. The CLI and MCP server both produce machine-readable output. Human reviewers set rubrics, approve promotions, and inspect evidence bundles. Everything between those gates runs without intervention.

CLI

Available

The modelsmith bash wrapper and Python CLI expose typed flags and JSON-shaped output with deterministic exit codes. Safe to call from a Claude Code, Cursor, or Codex session without a subprocess wrapper.

  • modelsmith eval
  • modelsmith promote
  • modelsmith status

MCP server

Available

Nine tools available today across status, evaluation, and training groups. Each tool accepts a JSON arguments object and returns a schema-validated JSON response. Unknown tools are refused; malformed arguments return a typed error envelope.

  • modelsmith_eval_run
  • modelsmith_train_start
  • modelsmith_fleet_status
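On the agent side, every response is either a schema-valid result or a typed error envelope. A sketch of the branch, assuming the envelope nests a code and message under an "error" key; the document promises a typed envelope but not this exact shape.

```python
class ToolError(Exception):
    """Raised when an MCP response is a typed error envelope."""

def unwrap(response):
    # Assumed envelope shape: {"error": {"code": ..., "message": ...}}.
    # Any other response is treated as a schema-valid result.
    if "error" in response:
        err = response["error"]
        raise ToolError(f"{err['code']}: {err['message']}")
    return response
```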

Promotion and governance tools

Coming soon

Promote, rollback, evidence-bundle export, and policy-gate enforcement. These tools exist in the codebase as planned control-plane modules and are being wired into the canonical MCP settings.

  • promote
  • evidence_bundle
  • policy_gate

See the receipts

Read how Modelsmith-built specialists stack up against frontier APIs in your vertical.

Every Agentsia Labs benchmark is a real end-to-end run through Modelsmith: synthetic scenarios, eval harness, post-trained specialist, promotion record. Open methodology, published datasets, reproducible numbers.