How it works

From baseline to production-validated specialist in four steps.

Modelsmith runs a closed loop: evaluate the current model against your governed scenario set, diagnose failures, post-train on the failure cases, and promote if the composite score clears the threshold. An agent can drive every step. Humans appear only at explicit promotion gates.

Eval

Run the governed scenario set. Composite score vs threshold.

Diagnose

Identify failure patterns. Surface them to the agent or operator.

Post-train

Specialise on failure cases. Monitor convergence.

Promote

Write a signed promotion record. A human reviews the evidence bundle.
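
As a rough illustration of the four steps above, the loop can be sketched in a few lines of Python. This is not Modelsmith's API: the function `run_loop`, its callback parameters, the `max_iterations` cap, and the status strings are all hypothetical names chosen for the example, and the scores fed in are stand-ins.

```python
from typing import Callable, List, Optional, Tuple


def run_loop(
    evaluate: Callable[[], Tuple[float, List[str]]],
    diagnose: Callable[[List[str]], List[str]],
    post_train: Callable[[List[str]], None],
    threshold: float,
    max_iterations: int = 5,
) -> dict:
    """Eval -> diagnose -> post-train, repeated until the composite clears the threshold.

    All callbacks are hypothetical stand-ins for the real steps.
    """
    composite: Optional[float] = None
    for iteration in range(1, max_iterations + 1):
        composite, failures = evaluate()
        if composite >= threshold:
            # The loop never promotes on its own: it stops at the human gate.
            return {"iteration": iteration, "composite": composite,
                    "status": "awaiting_approval"}
        targets = diagnose(failures)   # failure patterns to train on
        post_train(targets)            # specialise on those cases only
    return {"iteration": max_iterations, "composite": composite,
            "status": "blocked"}


# Stand-in scores for three evaluation rounds (illustrative values only).
scores = iter([0.84, 0.87, 0.894])
trained_on: list = []
result = run_loop(
    evaluate=lambda: (next(scores), ["case-1"]),
    diagnose=lambda failures: failures,
    post_train=trained_on.append,
    threshold=0.880,
)
```

In this sketch the loop ends in an `awaiting_approval` state rather than a promoted one, which mirrors the point that a human decides at the promotion gate.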

The iterate loop

A complete cycle, step by step.

The walkthrough below follows a single adtech cluster through one full iteration. The same loop applies to any domain. The examples show the customer-visible state transitions rather than the internal tooling used to execute them.

Eval

Run the governed scenario set.

Modelsmith runs the held-out and core scenario sets, scores each answer against the rubric, and writes the composite result to the iterate ledger. The customer sees pass/fail status, threshold movement, and regression count without needing raw logs.

Evaluation status

  Cluster        rtb-pricing
  Model          qwen3-32b-awq
  Iteration      17
  Scenario set   held-out and representative
  Phase          robustness evaluation

  Composite      0.894   pass  (threshold 0.880)
  Held-out       0.872   pass  (threshold 0.860)
  Regressions    0 / 248
  Duration       4m 12s
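
The pass/fail lines in the status panel can be read as a simple comparison of each score against its threshold. The sketch below assumes that a score passes when it is greater than or equal to its threshold and that any regression fails the run; neither rule is stated on this page, and `ScoreLine` and `evaluation_passes` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ScoreLine:
    name: str
    score: float
    threshold: float

    @property
    def passed(self) -> bool:
        # Assumption: "clears the threshold" means score >= threshold.
        return self.score >= self.threshold


def evaluation_passes(lines: Iterable[ScoreLine], regressions: int) -> bool:
    """Pass only if every score clears its threshold and nothing regressed (assumed rule)."""
    return all(line.passed for line in lines) and regressions == 0


# Values taken from the evaluation status panel above.
iteration_17 = [
    ScoreLine("composite", 0.894, 0.880),
    ScoreLine("held-out", 0.872, 0.860),
]
status = "pass" if evaluation_passes(iteration_17, regressions=0) else "fail"
```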

Diagnose

Surface failure patterns.

When a run fails, Modelsmith groups failed scenarios by capability, rubric miss, and operator lens. Each failure has a score, an explanation, and the reference criteria needed for review. The next run targets that evidence instead of starting a broad retraining pass.

Failure cluster

  Capability       bid-floor optimisation
  Pattern          confuses pacing throttle with bid shading
  Failed cases     7
  Severity         high
  Reviewer action  rubric is sound; continue targeted training
  Next target      causal disambiguation under first-price auctions
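
One way to picture the grouping step is a dictionary keyed on capability, rubric miss, and operator lens. The record fields, scenario IDs, and scores below are invented for illustration and do not reflect Modelsmith's actual schema.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

ClusterKey = Tuple[str, str, str]


def cluster_failures(failed_cases: List[dict]) -> List[Tuple[ClusterKey, List[dict]]]:
    """Group failed scenarios by (capability, rubric miss, operator lens), largest first."""
    clusters: Dict[ClusterKey, List[dict]] = defaultdict(list)
    for case in failed_cases:
        key = (case["capability"], case["rubric_miss"], case["operator_lens"])
        clusters[key].append(case)
    return sorted(clusters.items(), key=lambda item: len(item[1]), reverse=True)


# Hypothetical failed cases; every value here is made up for the example.
failed = [
    {"id": "s-101", "capability": "bid-floor optimisation",
     "rubric_miss": "confuses pacing throttle with bid shading",
     "operator_lens": "pricing", "score": 0.41},
    {"id": "s-102", "capability": "bid-floor optimisation",
     "rubric_miss": "confuses pacing throttle with bid shading",
     "operator_lens": "pricing", "score": 0.38},
    {"id": "s-207", "capability": "budget pacing",
     "rubric_miss": "ignores daily cap",
     "operator_lens": "delivery", "score": 0.52},
]
ranked = cluster_failures(failed)
top_key, top_cases = ranked[0]
```

Sorting by cluster size puts the most common pattern first, which is one plausible way to choose the next training target.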

Post-train

Fine-tune on failure cases.

The training job targets the failure cases from the previous evaluation. Progress is reported as a governed run state: queued, active, assessing, promoted, blocked, or rolled back. When the job completes, the loop returns to evaluation and measures whether the composite improved.

Training progress

  Cluster          rtb-pricing
  Target           failed causal-disambiguation scenarios
  Method           governed post-training
  Status           active
  Estimated time   38m
  Next phase       evaluation against held-out set
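
The six run states named above lend themselves to a small state machine. The state names come from this page, but the transition table is an assumption made for the sketch, as are the names `RunState`, `ALLOWED`, and `advance`.

```python
from enum import Enum


class RunState(Enum):
    QUEUED = "queued"
    ACTIVE = "active"
    ASSESSING = "assessing"
    PROMOTED = "promoted"
    BLOCKED = "blocked"
    ROLLED_BACK = "rolled back"


# Assumed transitions; the real rules are not documented on this page.
ALLOWED = {
    RunState.QUEUED: {RunState.ACTIVE, RunState.BLOCKED},
    RunState.ACTIVE: {RunState.ASSESSING, RunState.BLOCKED},
    RunState.ASSESSING: {RunState.PROMOTED, RunState.BLOCKED, RunState.QUEUED},
    RunState.PROMOTED: {RunState.ROLLED_BACK},
    RunState.BLOCKED: {RunState.QUEUED},
    RunState.ROLLED_BACK: set(),
}


def advance(current: RunState, target: RunState) -> RunState:
    """Return the target state, or raise if the move is not in the assumed table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value!r} to {target.value!r}")
    return target


state = RunState.QUEUED
for nxt in (RunState.ACTIVE, RunState.ASSESSING):
    state = advance(state, nxt)
```

Rejecting undeclared transitions means a run cannot jump from queued to promoted without passing through assessment.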

Promote

Write a signed promotion record.

When the composite clears the threshold, Modelsmith stages a promotion record: iteration, score, benchmark version, lineage, approver chain, and rollback posture. A human reviewer reads the evidence bundle and approves or rejects the promotion in the state machine.

Promotion request

  Cluster          rtb-pricing
  Model            qwen3-32b-awq
  Iteration        17
  Composite        0.894
  Benchmark        adtech benchmark v2
  Evidence         ready for reviewer sign-off
  Rollback         contract attached
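
This page does not say how promotion records are signed. Purely as an illustration, the sketch below signs a record with HMAC-SHA256 over canonical JSON using Python's standard library. The signing scheme, the field names, the demo key, and the lineage and approver values are all assumptions.

```python
import hashlib
import hmac
import json


def sign_record(record: dict, key: bytes) -> str:
    """Return a hex HMAC-SHA256 over a canonical JSON encoding of the record."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_record(record: dict, signature: str, key: bytes) -> bool:
    """Constant-time comparison of the expected and supplied signatures."""
    return hmac.compare_digest(sign_record(record, key), signature)


# Panel values from above; lineage and approver_chain are hypothetical placeholders.
record = {
    "cluster": "rtb-pricing",
    "model": "qwen3-32b-awq",
    "iteration": 17,
    "composite": 0.894,
    "benchmark": "adtech benchmark v2",
    "lineage": "hypothetical-lineage-id",
    "approver_chain": ["reviewer-a"],
    "rollback": "contract attached",
}
key = b"demo-key-not-a-real-secret"
signature = sign_record(record, key)
```

Because the JSON is serialised with sorted keys, the same record always produces the same signature, and any change to a field after signing makes verification fail.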

Client operating surfaces

Every surface is auditable.

Modelsmith remains agent-operated under the hood, but the public surface is framed around customer trust. Human reviewers set rubrics, approve promotions, and inspect evidence bundles. Everything between those gates runs with a recorded state trail.

Benchmark workbench

Domain owners author scenarios, pass criteria, failure criteria, and held-out coverage. The benchmark stays readable enough for a non-engineering expert to challenge.

  • scenario coverage
  • rubric review
  • held-out balance

Model intake

Operators select the base model, cluster, benchmark version, hardware boundary, and promotion threshold before a run starts.

  • base model
  • target cluster
  • promotion threshold

Iterate ledger

Every evaluation, training step, assessment, and blocked state is recorded with the evidence that led to the next decision.

  • phase state
  • score movement
  • failure clusters
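
A minimal sketch of such a ledger, assuming an append-only list of entries, shows how score movement could be derived from recorded evaluations. `LedgerEntry` and `IterateLedger` are hypothetical names, and the iteration 16 score is invented for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class LedgerEntry:
    iteration: int
    phase: str                  # e.g. "evaluation", "training", "blocked"
    composite: Optional[float]  # None for phases that produce no score
    evidence: str               # pointer to what justified the next decision


@dataclass
class IterateLedger:
    entries: List[LedgerEntry] = field(default_factory=list)

    def record(self, entry: LedgerEntry) -> None:
        # Append-only by convention: entries are frozen and never rewritten.
        self.entries.append(entry)

    def score_movement(self) -> List[float]:
        """Deltas between consecutive scored entries, rounded to 3 places."""
        scores = [e.composite for e in self.entries if e.composite is not None]
        return [round(b - a, 3) for a, b in zip(scores, scores[1:])]


ledger = IterateLedger()
ledger.record(LedgerEntry(16, "evaluation", 0.861, "hypothetical earlier run"))
ledger.record(LedgerEntry(16, "training", None, "failure cluster: bid-floor optimisation"))
ledger.record(LedgerEntry(17, "evaluation", 0.894, "held-out and core sets"))
movement = ledger.score_movement()
```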

Fleet boundary

Training and inference run on the approved customer-controlled target, while shared evidence stays limited to summaries and scorecards.

  • serving target
  • capacity notes
  • redacted evidence

Promotion gate

A candidate advances only when the benchmark threshold, operational checks, rollback posture, and approver sign-off are satisfied.

  • quality threshold
  • approver chain
  • rollback posture
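
The gate described above is a conjunction of four conditions. A hedged sketch, with hypothetical field names and reviewer identifiers, returns the conditions that are still unmet so a blocked candidate can explain why it is blocked:

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class GateInputs:
    composite: float
    threshold: float
    operational_checks_passed: bool
    rollback_contract_attached: bool
    approvals: Tuple[str, ...]
    required_approvers: Tuple[str, ...]


def unmet_conditions(g: GateInputs) -> List[str]:
    """List every gate condition that is not yet satisfied (empty list means advance)."""
    unmet = []
    if g.composite < g.threshold:
        unmet.append("quality threshold")
    if not g.operational_checks_passed:
        unmet.append("operational checks")
    if not g.rollback_contract_attached:
        unmet.append("rollback posture")
    if not set(g.required_approvers) <= set(g.approvals):
        unmet.append("approver chain")
    return unmet


# Candidate that clears quality and operations but still awaits sign-off.
candidate = GateInputs(
    composite=0.894,
    threshold=0.880,
    operational_checks_passed=True,
    rollback_contract_attached=True,
    approvals=(),
    required_approvers=("reviewer-a",),  # hypothetical approver id
)
pending = unmet_conditions(candidate)
```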

Evidence bundle

The review packet captures benchmark version, score movement, lineage, operational measurements, and remaining risk before approval.

  • benchmark version
  • lineage
  • operational risk

See the receipts

Read how Modelsmith-built specialists stack up against frontier APIs in your vertical.

Every Agentsia Labs benchmark is a real end-to-end run through Modelsmith: synthetic scenarios, eval harness, post-trained specialist, promotion record. Open methodology, published datasets, reproducible numbers.