Proof

What does a validated specialist look like?

The case for domain-specialist models rests on two structural claims: that a 0 to 122B parameter specialist can match a frontier model on a narrow domain, and that it can do so at a fraction of the inference cost and latency. This page explains how we measure those claims and what the evidence bundle looks like.

How we measure

Three scores. Three thresholds. No moving targets.

Every Modelsmith eval run produces three scores. The acceptance threshold for each is set by the operator before training begins and locked into the cluster configuration. A model is promoted only if all three scores clear their thresholds in the same run.

This structure makes the evaluation reproducible and auditable. A regulator or procurement team can read the evidence bundle and verify exactly what the model was scored against and whether the thresholds were met.

Core composite

0.894 / threshold 0.880

The primary gate. A weighted average across all scenario categories for the domain. The threshold is set by the operator before training begins and is not adjusted after the fact.

Held-out composite

0.872 / threshold 0.860

Measured against a scenario set that was never used in training. Confirms generalisation rather than memorisation. A model that passes core but fails held-out is not promoted.

Regression count

0 / 248 scenarios

The number of scenarios that passed in the previous iteration but fail in the current one. Zero regressions is a hard requirement for promotion at Tier B and above.
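Taken together, the three scores form a single conjunctive gate. A minimal sketch of that logic, with illustrative function and threshold names (none of these identifiers come from Modelsmith itself; the real thresholds live in the cluster configuration):

```python
def gate_passes(core: float, held_out: float, regressions: int,
                core_threshold: float = 0.880,
                held_out_threshold: float = 0.860) -> bool:
    """All three scores must clear their thresholds in the same run.

    Threshold defaults here are the example values shown above; in
    practice they are read from the locked cluster configuration.
    """
    return (core >= core_threshold
            and held_out >= held_out_threshold
            and regressions == 0)

# Using the example scores above: core 0.894, held-out 0.872, 0 regressions
gate_passes(0.894, 0.872, 0)  # → True
# Passing core but failing held-out does not promote:
gate_passes(0.894, 0.850, 0)  # → False
```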

Cost and latency comparison

The structural case: fixed cost, no egress, no latency ceiling.

| Approach | Latency | Inference cost | Data egress |
| --- | --- | --- | --- |
| Frontier API (pay-per-token) | 200–500 ms | Variable; scales with volume | Always |
| Specialist on owned hardware | Under auction SLA | Fixed infrastructure | Never |

Latency figures are indicative and depend on model size, hardware, and workload. Domain-specific benchmark data is available to design partners via the evidence bundle.

Measurement integrity

If the comparison can be gamed, the claim can’t be trusted.

Publishing how we measure is half of meaning what we say. A specialist model that “matches frontier” on a private benchmark, scored in ways that suit the model, is not a product — it’s a marketing line.

The principles below are the five we found we had to commit to structurally, in code and in the promotion gate, before any frontier-parity claim stood up to scrutiny. They are enforced by CI: a claim on this site that lacks a backing configuration is a broken build.

Frontier-relative promotion gating is in staged rollout. The principles are already committed in the platform’s architectural decision records; the runtime wiring lands in the same release as the first cluster gated on frontier.median.

Frontier comparisons run on scenarios the model has never seen.

Held-out only

Training-visible scenarios are used to teach, never to measure against frontier. Memorisation is not generalisation; we refuse to score partial memorisation against a frontier model reading the scenario for the first time.

Every baseline is pinned to a content hash of its scenario set.

Scenario-set pinning

Edit a scenario — add one, remove one, change a pass criterion, rotate it between visible and held-out — and every baseline that referenced the old set is invalidated the moment the change lands. Silently scoring against an easier set is not a failure mode we can ship.
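Pinning of this kind can be sketched as a content hash over a canonical serialisation of the scenario set; a baseline stores the hash it was scored against, and any mismatch invalidates it. The function names and structure here are illustrative assumptions, not Modelsmith internals:

```python
import hashlib
import json

def scenario_set_hash(scenarios: list[dict]) -> str:
    """Content hash of the scenario set. Canonical serialisation
    (sorted keys, fixed separators) ensures any edit -- adding,
    removing, or changing a scenario -- yields a different hash."""
    canonical = json.dumps(scenarios, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def baseline_is_current(baseline_pin: str, scenarios: list[dict]) -> bool:
    """A baseline pinned to an old hash is invalidated the moment
    the scenario set changes."""
    return baseline_pin == scenario_set_hash(scenarios)

scenarios = [{"id": "s1", "pass_criterion": "exact-match"}]
pin = scenario_set_hash(scenarios)
# Change one pass criterion and the pin no longer matches:
edited = [{"id": "s1", "pass_criterion": "rubric"}]
baseline_is_current(pin, edited)  # → False
```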

Candidate models clear the lower CI bound, not the median.

Confidence intervals

Frontier models are stochastic. Each baseline is run multiple times and stored with a paired-bootstrap 95% confidence interval. A one-point win that sits inside evaluator noise does not promote. Superiority must survive the uncertainty, not hide in it.
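A paired bootstrap over per-scenario score differences, in sketch form. This is a generic implementation of the technique named above, under the assumption that candidate and frontier scores are aligned per scenario; the function name and resample count are ours, not Modelsmith's:

```python
import random

def paired_bootstrap_ci(candidate: list[float], frontier: list[float],
                        n_resamples: int = 10_000,
                        seed: int = 0) -> tuple[float, float]:
    """95% CI on the mean per-scenario difference (candidate - frontier),
    resampling scenario indices with replacement."""
    rng = random.Random(seed)
    n = len(candidate)
    diffs = [c - f for c, f in zip(candidate, frontier)]
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    return means[int(0.025 * n_resamples)], means[int(0.975 * n_resamples)]

# Promotion reads the lower bound, not the point estimate:
# lo, hi = paired_bootstrap_ci(candidate_scores, frontier_scores)
# superiority_holds = lo > 0
```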

A tied average with scattered per-scenario wins is a brittle model, not a matched one.

Win rate, not just composite

Promotion requires the candidate to win the majority of scenarios head-to-head against the frontier reference, not merely match the average. Two models can score the same composite while one is consistent and the other erratic; we promote only the consistent one.
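The head-to-head check is simple to state in code. A sketch under the same alignment assumption (one candidate score and one frontier score per scenario); "majority" here means strictly more than half of all scenarios:

```python
def wins_majority(candidate: list[float], frontier: list[float]) -> bool:
    """Candidate must win strictly more than half of all scenarios
    head-to-head, not merely tie the average."""
    wins = sum(c > f for c, f in zip(candidate, frontier))
    return 2 * wins > len(candidate)

# Two models with the same composite can differ sharply here:
# a consistent candidate wins most scenarios narrowly, while an
# erratic one trades large wins for large losses and fails the check.
wins_majority([0.8, 0.8, 0.8, 0.6], [0.7, 0.7, 0.7, 0.9])  # → True
wins_majority([1.0, 1.0, 0.3, 0.3], [0.6, 0.6, 0.7, 0.7])  # → False
```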

At least two of three frontier providers must have current baselines for the cluster.

Multi-provider quorum

Promotion reads Claude, Gemini, and GPT baselines. The comparison runs against the median or lowest of the three — configured per cluster. One provider’s score cannot be the sole yardstick; a single outage or model-version drift cannot move the bar silently.
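The quorum-and-reduction step can be sketched as follows. Provider keys, the `mode` parameter, and the quorum default are illustrative assumptions; only the median-or-lowest behaviour and the two-of-three requirement come from the description above:

```python
import statistics

def frontier_reference(baselines: dict[str, float],
                       mode: str = "median",
                       quorum: int = 2) -> float:
    """Reduce per-provider baseline scores to a single reference.

    Requires current baselines from at least `quorum` providers, so
    a single outage or model-version drift cannot move the bar alone.
    """
    if len(baselines) < quorum:
        raise ValueError(f"need baselines from at least {quorum} providers")
    scores = list(baselines.values())
    return statistics.median(scores) if mode == "median" else min(scores)

frontier_reference({"claude": 0.88, "gemini": 0.86, "gpt": 0.90})  # → 0.88
frontier_reference({"claude": 0.88, "gemini": 0.86}, mode="lowest")  # → 0.86
```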

Evidence bundle

Everything a governance team needs to approve a promotion.

Every successful promotion writes a complete evidence bundle to your cluster. The bundle is self-contained: a reviewer can verify the decision without access to the Modelsmith runtime or the training data.

A redacted real evidence bundle is available to design partners after the Synthetic POC. Apply below to request one.

.iterate/rtb-pricing/promotions/2026-04-18-manual-promote/
promotion-record.json

Immutable signed record: iteration, composite scores, eval tag, adapter path, and timestamp. Written by modelsmith promote; cannot be edited after creation.

eval-transcripts/

Full scenario-level output for every scenario in the held-out and core sets. Each entry includes the model response, rubric score, and ground-truth reference.

rubric.json

The scoring rubric in effect at the time of the eval run. Locked to the eval tag; changes to the rubric create a new tag rather than modifying the existing one.

model-card.md

Domain, base model, training method, training data description (synthetic or real, de-identified), and known limitations. Follows the Hugging Face model card schema.

rollback-procedure.md

Step-by-step instructions for reverting to the previous promoted model. Includes the adapter path and cluster configuration snapshot from the prior promotion.

approval-log.json

The full approver chain: who was asked to review, when, what decision they made, and any conditions attached to the approval. Required for governance audit.

See the receipts

Read how Modelsmith-built specialists stack up against frontier APIs in your vertical.

Every Agentsia Labs benchmark is a real end-to-end run through Modelsmith: synthetic scenarios, eval harness, post-trained specialist, promotion record. Open methodology, published datasets, reproducible numbers.