The platform · how it actually works

Author a benchmark in the morning. The platform ships a model that beats Claude, Gemini, and GPT on that benchmark by the end of the week.

Modelsmith is the autonomous specialist-model factory that operators run themselves. Below is the actual machinery: what a scenario looks like, what the iterate loop does between model intake and a promotion record, and the operating surface a domain expert uses to review progress, evidence, and risk.

Book a technical walk-through · See a worked scenario

What a scenario looks like

One scenario is one structured exam question.

The example below is a real entry from the adtech benchmark: a DSP win-rate disambiguation between bid shading and pacing throttle. Every scenario has the same ten fields. A domain expert writes the first 20 to 40. From then on the platform generates new scenarios automatically on every failure.

Source: src/lib/eval/scenarios/adtech/benchmark-v2/domain-19-evolution-guards.ts

Worked scenario

ASSAY_ADTECH_BID_FLOOR_SHADING_VS_PACING_DISAMBIGUATION

id: ASSAY_ADTECH_BID_FLOOR_SHADING_VS_PACING_DISAMBIGUATION
Stable identifier. Every eval run cites this id so promotion records, regression tests, and evidence reviews stay linked across versions.
outcomeType: fn-guard
What kind of test it is. Some scenarios check that the model spots a real problem (tp); others guard against a specific error, such as missing a real problem (fn-guard) or raising a false alarm when nothing is wrong (fp-guard, tn). Mixing the types stops the model from gaming the test by flagging everything.
operatorLens: dsp
Which seat the model is answering from (DSP, SSP, publisher, advertiser, brand-safety analyst). The same data should produce a different answer depending on who is asking, and the scenario tests for that.
category: BID_FLOOR_OPTIMISATION
The capability under test. Cross-cluster aggregation, drift detection, and corpus-balance audits use this dimension.
auctionMechanic: { pricing: 'first-price', sequencing: 'unified-auction' }
The exact rules of the auction (first-price or second-price, unified or waterfall, hard or soft floor). Pins the scenario to a specific market mechanic so the right answer is not a matter of opinion.
description: DSP win-rate fell from 18% to 11% over 14 days on a single SSP. Trader suspects bid-shading drift. Engineering suspects pacing throttle. Operator must disambiguate using bid-density, clearing-price distribution, and pacing-rate signals.
The realistic operational situation handed to the model. Reads like a real ticket, not a textbook prompt.
testObjective: Tests multi-signal causal disambiguation in a first-price auction setting where two operationally distinct failure modes (shade-too-aggressive vs pace-throttle) produce overlapping symptoms.
What the scenario is actually probing. Used to generate new scenarios that probe the same capability from different angles when the model fails this one.
passCriteria: Identifies that win-rate alone is insufficient. Requests bid-density curve and clearing-price gap. Uses BID/REQUEST ratio to detect pacing-induced under-bidding. Specifies first-price auction context.
What a senior domain expert must see in a correct answer. The rubric the grader applies.
failCriteria: Concludes shading from win-rate alone. Recommends raising bid floors, which is irrelevant to buy-side. Ignores pacing throttle. Treats as second-price auction.
Common wrong answers listed explicitly so plausible wrong answers do not earn marks.
domainInput: Daily aggregates: requests_eligible, requests_bid_on, bids_won, clearing_price_p50, our_bid_p50, shaded_bid_p50, pacing_throttle_active_pct, ssp_clearing_price_index. Two weeks of data, no pre-computed conclusions.
The data the model is given at the point of being asked. Bid request, signals, publisher metadata, daily aggregates. No labelled root cause.
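The ten fields above map naturally onto a typed record. The following is a minimal sketch, not the contents of the cited source file: the type names (`Scenario`, `OutcomeType`, `OperatorLens`, `AuctionMechanic`), the literal spellings other than those shown in the worked scenario, the optional `floor` field, and the `countByOutcomeType` helper are all assumptions made for illustration. Only the field names and the worked-scenario values come from the page.

```typescript
// Hypothetical types: only the ten field names and the sample values come from the page.
type OutcomeType = 'tp' | 'fn-guard' | 'fp-guard' | 'tn';

// 'dsp' appears in the worked scenario; the other spellings are assumed.
type OperatorLens = 'dsp' | 'ssp' | 'publisher' | 'advertiser' | 'brand-safety-analyst';

interface AuctionMechanic {
  pricing: 'first-price' | 'second-price';
  sequencing: 'unified-auction' | 'waterfall'; // 'waterfall' spelling is assumed
  floor?: 'hard' | 'soft'; // mentioned in the prose, absent from the example, so optional here
}

interface Scenario {
  id: string;
  outcomeType: OutcomeType;
  operatorLens: OperatorLens;
  category: string;
  auctionMechanic: AuctionMechanic;
  description: string;
  testObjective: string;
  passCriteria: string;
  failCriteria: string;
  domainInput: string;
}

// The worked scenario from the page, expressed against the sketch type.
const workedScenario: Scenario = {
  id: 'ASSAY_ADTECH_BID_FLOOR_SHADING_VS_PACING_DISAMBIGUATION',
  outcomeType: 'fn-guard',
  operatorLens: 'dsp',
  category: 'BID_FLOOR_OPTIMISATION',
  auctionMechanic: { pricing: 'first-price', sequencing: 'unified-auction' },
  description:
    'DSP win-rate fell from 18% to 11% over 14 days on a single SSP. Trader suspects ' +
    'bid-shading drift. Engineering suspects pacing throttle. Operator must disambiguate ' +
    'using bid-density, clearing-price distribution, and pacing-rate signals.',
  testObjective:
    'Tests multi-signal causal disambiguation in a first-price auction setting where two ' +
    'operationally distinct failure modes (shade-too-aggressive vs pace-throttle) produce ' +
    'overlapping symptoms.',
  passCriteria:
    'Identifies that win-rate alone is insufficient. Requests bid-density curve and ' +
    'clearing-price gap. Uses BID/REQUEST ratio to detect pacing-induced under-bidding. ' +
    'Specifies first-price auction context.',
  failCriteria:
    'Concludes shading from win-rate alone. Recommends raising bid floors, which is ' +
    'irrelevant to buy-side. Ignores pacing throttle. Treats as second-price auction.',
  domainInput:
    'Daily aggregates: requests_eligible, requests_bid_on, bids_won, clearing_price_p50, ' +
    'our_bid_p50, shaded_bid_p50, pacing_throttle_active_pct, ssp_clearing_price_index. ' +
    'Two weeks of data, no pre-computed conclusions.',
};

// Hypothetical corpus-balance helper: counts scenarios per outcome type so a
// benchmark that is all 'tp' (and therefore gameable by flagging everything) is easy to spot.
function countByOutcomeType(scenarios: readonly Scenario[]): Record<OutcomeType, number> {
  const counts: Record<OutcomeType, number> = { tp: 0, 'fn-guard': 0, 'fp-guard': 0, tn: 0 };
  for (const scenario of scenarios) {
    counts[scenario.outcomeType] += 1;
  }
  return counts;
}
```

Keeping `outcomeType` and `operatorLens` as closed unions means a mistyped value fails at compile time rather than silently skewing the corpus balance.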

Day one

Four steps. Most of them you watch from a distance.

A single domain expert can sit with the platform for a day or two, define what good looks like, and walk away. The autonomous loop handles everything between the benchmark and the promotion record.

1. Author the benchmark

A domain expert writes 20-40 scenarios using the structured shape above. Each scenario takes 15-30 minutes once the pattern is internalised. The benchmark is the first asset. It tells the platform what better means before any model trains against it.

2. Start the run

Choose the base model, target cluster, quantisation profile, hardware boundary, and promotion threshold. No notebooks, no hand-tuned training scripts, no waiting on a data-science backlog.

3. Walk away

The iterate loop runs autonomously across the benchmark. On repeated failures, the platform generates new scenarios addressing the model's blind spots. After 30-50 iterations the model typically converges and begins outperforming the frontier baselines on your benchmark.

4. Audit and promote

Review score movement, failure clusters, latency, cost, lineage, and rollback posture. The promotion gate refuses to ship the model unless its composite score beats the Claude, Gemini, and GPT baselines on the held-out scenario set. An evidence bundle, lineage record, approver chain, and rollback contract are attached to every promotion.
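The five choices made in step 2 can be pictured as a small configuration record. This is an illustrative sketch only: the `RunConfig` shape, the `validateRunConfig` helper, the placeholder values, and the reading of the promotion threshold as a composite score between 0 and 1 are assumptions, not the platform's actual schema.

```typescript
// Hypothetical run configuration mirroring the five choices named in step 2.
interface RunConfig {
  baseModel: string;
  targetCluster: string;
  quantisationProfile: string;
  hardwareBoundary: string;
  promotionThreshold: number; // assumed to be a composite score in [0, 1]
}

// Returns a list of human-readable problems; an empty list means the config is usable.
function validateRunConfig(config: RunConfig): string[] {
  const problems: string[] = [];
  const textFields = ['baseModel', 'targetCluster', 'quantisationProfile', 'hardwareBoundary'] as const;
  for (const key of textFields) {
    if (config[key].trim() === '') {
      problems.push(`${key} must not be empty`);
    }
  }
  // The negated range check also rejects NaN.
  if (!(config.promotionThreshold >= 0 && config.promotionThreshold <= 1)) {
    problems.push('promotionThreshold must be between 0 and 1');
  }
  return problems;
}

// Placeholder values for illustration; none of these names come from the page.
const exampleRun: RunConfig = {
  baseModel: 'example-base-model',
  targetCluster: 'adtech',
  quantisationProfile: 'int8',
  hardwareBoundary: 'customer-on-prem',
  promotionThreshold: 0.8,
};

const exampleProblems = validateRunConfig(exampleRun);
```

Validating up front and returning every problem at once, rather than throwing on the first, suits a surface where a domain expert fills in the form without engineering support.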

Inside the iterate loop

What happens between model intake and a promotion record.

The iterate loop is a typed state machine. Each phase reads the previous phase's evidence, decides what to do next, and logs the decision with the evidence that informed it. Every transition is mechanically auditable. The entire run is replayable from disk.

01. Eval phase

The current model runs every scenario in the held-out set. The platform records pass/fail per scenario, composite score, and per-category aggregates. Scenarios are scored against the rubric the domain expert authored, not a free-text grade.

02. Train phase

The platform trains against the scenarios the model failed in the eval phase. Training runs on customer-owned hardware with the customer's base model of choice. No data, weights, or training signal leaves the customer boundary.

03. Scenario expansion phase

On scenarios the model keeps failing, the platform proposes new scenarios that probe the same underlying capability from a different angle. The proposed scenarios go through a reconciliation gate before joining the corpus.

04. Assess phase

The platform decides what to do next: continue iterating, declare convergence, escalate to a different training strategy, or halt due to a capacity wall or hard plateau. Every assess decision is logged with the evidence that informed it.

05. Promote phase

When the composite score clears the promotion gate (must beat all three frontier baselines on the held-out set), the platform stages a promotion record. The record is audited via the approver chain before the model goes live. A rollback contract is attached.
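One way to picture the five phases as a typed state machine with a replayable decision log is sketched below. The transition table is an assumption inferred from the phase order and the four assess outcomes listed above (for example, that an escalation returns to the train phase); the names `Phase`, `AssessDecision`, `Transition`, `nextPhase`, `step`, and `replay` are hypothetical and do not describe the platform's internals.

```typescript
type Phase = 'eval' | 'train' | 'expand' | 'assess' | 'promote' | 'halted';
type AssessDecision = 'continue' | 'converged' | 'escalate' | 'halt';

// One logged transition. Plain JSON-friendly data, so a log can be written to disk and read back.
interface Transition {
  from: Phase;
  to: Phase;
  decision: AssessDecision | null; // only the assess phase carries a decision
  evidence: string; // pointer to the evidence that informed the step
}

// Assumed transition table: phases 01-03 advance in order, assess branches.
const LINEAR_NEXT: Partial<Record<Phase, Phase>> = {
  eval: 'train',
  train: 'expand',
  expand: 'assess',
};

const ASSESS_TARGET: Record<AssessDecision, Phase> = {
  continue: 'eval',
  converged: 'promote',
  escalate: 'train', // assumption: a new training strategy re-enters at the train phase
  halt: 'halted',
};

function nextPhase(from: Phase, decision: AssessDecision | null = null): Phase {
  if (from === 'assess') {
    if (decision === null) throw new Error('assess phase requires a decision');
    return ASSESS_TARGET[decision];
  }
  const next = LINEAR_NEXT[from];
  if (next === undefined) throw new Error(`${from} is a terminal phase`);
  return next;
}

// Advances one step and appends the transition, with its evidence, to the log.
function step(log: Transition[], from: Phase, evidence: string, decision: AssessDecision | null = null): Phase {
  const to = nextPhase(from, decision);
  log.push({ from, to, decision, evidence });
  return to;
}

// Re-walks a stored log and verifies every transition is legal and contiguous.
function replay(log: readonly Transition[], start: Phase = 'eval'): Phase {
  let phase: Phase = start;
  for (const entry of log) {
    if (entry.from !== phase || nextPhase(entry.from, entry.decision) !== entry.to) {
      throw new Error(`illegal transition ${entry.from} -> ${entry.to}`);
    }
    phase = entry.to;
  }
  return phase;
}
```

Because each `Transition` is plain data, a run serialised with `JSON.stringify` can be parsed later and passed to `replay`, which rejects any log that skips a phase or records a transition the table does not allow.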

Client operating surface

Four surfaces a customer team can actually reason about.

Modelsmith keeps the customer-facing surface tied to the decisions a buyer needs to trust: what better means, where the run executes, what the loop is doing now, and whether the evidence supports promotion.

Benchmark workbench

The domain owner defines the exam.

Scenario coverage, pass criteria, failure criteria, and held-out test balance are visible before a model trains. The benchmark remains the contract for what better means.

Iterate ledger

The loop reports quality and operational state.

Current phase, latest score, strongest failure cluster, training progress, and fleet capacity are phrased in customer language. The ledger exposes a plateau or blocked state rather than hiding it behind a successful-looking dashboard.

Promotion gate

Reviewers can audit without mutating the run.

Domain owners, operators, reviewers, and approvers see the evidence relevant to their role. Promotion records include the approver chain, lineage, and rollback contract.
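The iterate ledger's promise to surface a plateau or blocked state can be sketched as a discriminated union, so that an unhealthy run cannot be rendered as a healthy one by accident. Everything here is hypothetical: the `LedgerEntry` and `LoopHealth` shapes, the field names, and the `headline` wording are illustrative assumptions based on the items the ledger is said to report.

```typescript
// Hypothetical health states; 'plateau' and 'blocked' are the two the page says must be exposed.
type LoopHealth =
  | { kind: 'running' }
  | { kind: 'plateau'; iterationsWithoutGain: number }
  | { kind: 'blocked'; reason: string };

// Hypothetical ledger entry covering the five items the ledger reports.
interface LedgerEntry {
  currentPhase: string;
  latestScore: number;
  strongestFailureCluster: string;
  trainingProgressPct: number;
  fleetCapacityPct: number;
  health: LoopHealth;
}

// Builds the top-line message. Readonly input: reviewers read the ledger, they do not mutate it.
function headline(entry: Readonly<LedgerEntry>): string {
  const score = entry.latestScore.toFixed(2);
  const health = entry.health;
  switch (health.kind) {
    case 'plateau':
      return `PLATEAU: no gain for ${health.iterationsWithoutGain} iterations (score ${score})`;
    case 'blocked':
      return `BLOCKED: ${health.reason} (score ${score})`;
    case 'running':
      return `Running ${entry.currentPhase} phase (score ${score})`;
  }
}
```

The exhaustive `switch` means adding a new health state later forces every renderer to handle it before the code compiles.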

Read the client operating surface

Promotion gate

The model ships when it beats Claude, Gemini, and GPT on your benchmark.

Promotion is gated, not negotiated. The composite score is measured against the same three frontier LLMs the council uses to author scenarios. If the customer's model does not clear the gate, the platform refuses to ship it. The only override is an operator signing the evidence bundle in writing.

Composite score

A weighted aggregate of core pass rate, robustness, and benchmark coverage. The gate refuses to promote unless this score exceeds each frontier baseline.

Evidence bundle

Eval report, latency histogram, cost-per-decision figures, lineage record, approver chain, rollback contract. Every field is auditable before the model goes live.

Approver chain

Domain owner plus engineering plus (where applicable) legal or compliance. Sign-offs land in the evidence bundle and persist with the promotion record.

Rollback contract

Every promotion specifies the prior validated specialist as its rollback target. A regression triggers an automatic revert. The catastrophic-forgetting detector fires before the regression reaches users.
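The composite score and the gate rule described above can be sketched in a few lines. The weights below are placeholders chosen for illustration, since the page does not publish the actual weighting; the names `ScoreParts`, `compositeScore`, `Baselines`, and `evaluateGate` are likewise hypothetical. The one rule taken from the page is that the candidate must strictly exceed each of the three frontier baselines.

```typescript
interface ScoreParts {
  corePassRate: number; // each component assumed to be in [0, 1]
  robustness: number;
  coverage: number;
}

// Placeholder weights, NOT the platform's real values.
const ASSUMED_WEIGHTS: ScoreParts = { corePassRate: 0.5, robustness: 0.25, coverage: 0.25 };

// Weighted aggregate, normalised by the weight total so the weights need not sum to 1.
function compositeScore(parts: ScoreParts, weights: ScoreParts = ASSUMED_WEIGHTS): number {
  const weightTotal = weights.corePassRate + weights.robustness + weights.coverage;
  if (weightTotal <= 0) throw new Error('weights must sum to a positive number');
  const weighted =
    parts.corePassRate * weights.corePassRate +
    parts.robustness * weights.robustness +
    parts.coverage * weights.coverage;
  return weighted / weightTotal;
}

type Baselines = Record<'claude' | 'gemini' | 'gpt', number>;

interface GateResult {
  promote: boolean;
  blockedBy: Array<keyof Baselines>; // baselines the candidate failed to exceed
}

// The gate passes only if the candidate strictly exceeds every baseline; a tie does not promote.
function evaluateGate(candidate: number, baselines: Baselines): GateResult {
  const names = Object.keys(baselines) as Array<keyof Baselines>;
  const blockedBy = names.filter((name) => candidate <= baselines[name]);
  return { promote: blockedBy.length === 0, blockedBy };
}
```

Returning the list of blocking baselines, not just a boolean, gives the evidence bundle something concrete to record when a promotion is refused.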

Talk to engineering

Book a technical walk-through. Bring one of your hardest scenarios.

We will run it through the platform in a shared session. You will see the iterate loop touch every phase, the council propose new scenarios, and a promotion record assembled against the frontier baselines.

Book a discovery call · Read the public methodology