Platform surface

The client-facing Modelsmith operating surface: benchmark design, controlled iteration, fleet execution, evidence bundles, and governed promotion.

Modelsmith is a governed specialisation platform for customer-owned model workflows. The public operating surface is the part a client team needs to trust before running a domain model through it: how quality is defined, where the model runs, how each iteration is measured, and which evidence is required before a specialised candidate can be promoted.

This page describes the client contract. Deployment-specific commands, repository paths, local automation, and internal operator tooling are not part of the public product surface.

Client-facing surfaces

Surface	Client question it answers	What should be visible
Benchmark workbench	What does "better" mean for this domain?	Scenario coverage, pass and fail criteria, held-out split, rubric ownership, and balance by capability, risk, and outcome type.
Model intake	What is being specialised, where, and against which gate?	Base model, target deployment boundary, benchmark version, promotion threshold, owner, and rollback expectation.
Iterate ledger	What happened in each loop and why?	Current phase, score movement, failure clusters, capacity state, training progress, blocked states, and next decision.
Evidence bundle	Can a reviewer defend the promotion decision?	Scorecards, baselines, regression count, latency, cost-per-decision, lineage, approvals, and rollback posture.
Promotion gate	Who approved the candidate and on what evidence?	Gate status, approver chain, open blockers, final decision, and immutable promotion record.

What clients operate

Benchmark workbench

The benchmark is the first controlled asset. A domain expert defines realistic scenario prompts, success criteria, failure criteria, and representative domain inputs. Modelsmith uses that benchmark as the objective function for evaluation, training, and promotion.

Clients should expect to review:

scenario coverage by capability, risk, operator lens, and outcome type
pass and fail criteria that a domain owner can defend
held-out scenarios that remain separate from training material
corpus balance reports that show whether the benchmark overweights one failure mode
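
A corpus balance check of this kind can be sketched in a few lines of Python. This is an illustration only: the scenario tags, the `balance_report` helper, and the 50% `max_share` cut-off are assumptions made for the example, not part of the Modelsmith surface.

```python
from collections import Counter


def balance_report(scenarios, key, max_share=0.5):
    """Return the share of scenarios per value of `key`, plus overweight values."""
    counts = Counter(s[key] for s in scenarios)
    total = sum(counts.values())
    shares = {value: n / total for value, n in counts.items()}
    # A value counts as "overweight" when it exceeds the assumed max_share cut-off.
    overweight = sorted(v for v, share in shares.items() if share > max_share)
    return shares, overweight


# Hypothetical scenario tags; a real benchmark would carry more dimensions.
scenarios = [
    {"id": "s1", "capability": "triage", "failure_mode": "omission"},
    {"id": "s2", "capability": "triage", "failure_mode": "omission"},
    {"id": "s3", "capability": "summary", "failure_mode": "omission"},
    {"id": "s4", "capability": "summary", "failure_mode": "fabrication"},
]

shares, overweight = balance_report(scenarios, "failure_mode")
print(shares)
print(overweight)
```

Here three of the four scenarios probe the same failure mode, so the report flags `omission` as overweight, while the same corpus is evenly split by capability.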

Model intake

Modelsmith records the base model, benchmark version, target execution boundary, serving target, promotion criteria, and run owner before a run starts. The goal is to remove notebook-driven ambiguity: the run has an owner, a target, and a measurable gate from the beginning.

Clients should expect intake to answer:

which base model is being specialised
which benchmark version defines "better"
where training and inference run
which promotion threshold must be cleared
what rollback record will exist if the model is promoted
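
One way to picture an intake record is as an immutable structure that is checked for gaps before a run may start. The sketch below is hypothetical: the `IntakeRecord` class, its field names, and the example values are invented for illustration and do not describe a Modelsmith schema.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class IntakeRecord:
    """Hypothetical intake record; field names are illustrative only."""
    base_model: str
    benchmark_version: str
    execution_target: str
    serving_target: str
    promotion_threshold: float
    owner: str
    rollback_expectation: str


def missing_fields(record):
    """List the fields left empty, so a run cannot start with an ambiguous intake."""
    return [f.name for f in fields(record) if getattr(record, f.name) in ("", None)]


# Example values are placeholders, not real model or environment names.
record = IntakeRecord(
    base_model="example-base-model",
    benchmark_version="v3",
    execution_target="customer-approved-cluster",
    serving_target="internal-endpoint",
    promotion_threshold=0.85,
    owner="",
    rollback_expectation="revert to previous accepted candidate",
)

print(missing_fields(record))
```

Freezing the dataclass reflects the idea that the intake is fixed once the run starts; any change would be a new record rather than an edit.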

Iterate ledger

The iterate loop runs through evaluation, training, assessment, and promotion staging. Every phase consumes evidence from the previous phase and records the next decision. If the model stalls, the system should surface the plateau and the recommended next action rather than hide the failure.

The client-visible status should be phrased in operational terms:

current phase and latest transition reason
current score, previous accepted score, and target threshold
strongest failure cluster and regression count
fleet capacity and training progress
whether promotion is blocked by quality, governance, or infrastructure
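
The plateau behaviour and the operational status fields can be illustrated with a small sketch. The plateau rule (less than `min_gain` improvement over the last `window` iterations), the function names, and the next-action wording are all assumptions chosen for the example; the actual loop may use different criteria.

```python
def detect_plateau(score_history, window=3, min_gain=0.01):
    """True when the last `window` iterations gained less than `min_gain` in total."""
    if len(score_history) <= window:
        return False  # not enough history to judge
    return score_history[-1] - score_history[-1 - window] < min_gain


def ledger_status(phase, score_history, threshold, blocked_by=None):
    """Build a client-visible status entry from the score history."""
    current = score_history[-1]
    previous = score_history[-2] if len(score_history) > 1 else None
    plateau = detect_plateau(score_history)
    # blocked_by is one of "quality", "governance", "infrastructure", or None.
    if blocked_by is not None:
        next_action = f"resolve {blocked_by} blocker"
    elif current >= threshold:
        next_action = "stage for promotion review"
    elif plateau:
        next_action = "review failure clusters before further training"
    else:
        next_action = "continue iterating"
    return {
        "phase": phase,
        "current_score": current,
        "previous_score": previous,
        "target_threshold": threshold,
        "plateau": plateau,
        "next_action": next_action,
    }


status = ledger_status("assessment", [0.60, 0.68, 0.71, 0.712, 0.713, 0.714], 0.80)
print(status["plateau"], status["next_action"])
```

In this example the score has moved by less than 0.01 over three iterations while still below the threshold, so the status reports the plateau and a recommended next action instead of silently continuing.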

Fleet and data boundary

Modelsmith is designed for customer-owned model and data boundaries. Training and inference run on the approved execution target. Evidence leaves the run in sanitised form: summaries, metrics, scorecards, lineage records, approvals, and review labels.

Public and shared surfaces must not expose raw datasets, prompts, completions, logs, filesystem paths, hostnames, secrets, or opaque artefact identifiers.
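
A minimal sketch of how such a boundary could be enforced is an allowlist filter applied before evidence leaves the run. The key names and the `sanitise` helper are hypothetical; they mirror the sanitised categories named above and are not a documented Modelsmith interface.

```python
# Assumed top-level keys that map to the sanitised evidence categories.
ALLOWED_EVIDENCE_KEYS = frozenset(
    {"summary", "metrics", "scorecard", "lineage", "approvals", "review_labels"}
)


def sanitise(record):
    """Keep only allowlisted top-level keys and report what was dropped.

    Nested values are not inspected here; a real filter would also need to
    check the contents of the keys it keeps.
    """
    kept = {k: v for k, v in record.items() if k in ALLOWED_EVIDENCE_KEYS}
    dropped = sorted(k for k in record if k not in ALLOWED_EVIDENCE_KEYS)
    return kept, dropped


raw = {
    "metrics": {"pass_rate": 0.82},
    "review_labels": ["accepted"],
    "raw_prompts": ["..."],
    "hostname": "placeholder-host",
}

kept, dropped = sanitise(raw)
print(sorted(kept))
print(dropped)
```

An allowlist fails closed: a new field that nobody has reviewed is dropped by default, whereas a denylist would let it through until someone remembered to block it.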

Evidence bundle

An evidence bundle is the review packet attached to a candidate. It should let a domain owner, engineering owner, and compliance reviewer inspect the same facts before promotion.

The bundle should include:

benchmark version and scenario coverage
candidate score against configured baselines
held-out performance and regression count
latency and cost-per-decision measurements
lineage from base model to candidate
approver chain and sign-off status
rollback contract and operational notes
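
As an illustration of one bundle field, the sketch below computes a regression count from per-scenario results. It assumes a regression means "passed under the baseline, fails under the candidate" and that a scenario missing from the candidate results counts as a failure; both are assumptions for the example, and the scenario ids are made up.

```python
def regression_count(baseline, candidate):
    """Count scenarios that passed under the baseline but do not pass under the candidate.

    Both arguments map scenario id -> bool (True means pass). A scenario that
    is absent from `candidate` is treated as a failure (an assumption).
    """
    return sum(
        1
        for scenario_id, passed in baseline.items()
        if passed and not candidate.get(scenario_id, False)
    )


# Hypothetical held-out results.
baseline = {"h1": True, "h2": True, "h3": False, "h4": True}
candidate = {"h1": True, "h2": False, "h3": True}

print(regression_count(baseline, candidate))
```

Counting regressions separately from the aggregate score matters because a candidate can raise its overall pass rate while still breaking scenarios that previously worked.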

For the full review packet, see Evidence bundles. For the approval sequence, see Promotion gates.

Role model

Role	Owns	Needs from the surface
Domain owner	Benchmark relevance and rubric quality.	Scenario review, weak-rubric challenges, coverage gaps, and pass/fail criteria.
Operator	Run supervision and blocked-state triage.	Phase state, capacity state, training progress, failure clusters, and next action.
Reviewer	Evidence quality before approval.	Score movement, baselines, regressions, lineage, rollback posture, and open risk.
Approver	Final promotion decision.	Gate status, evidence summary, approval record, and rollback trigger.

Day-one journey

Define what "better" means. A domain expert writes 20 to 40 scenarios with explicit pass and fail criteria.
Select the model and boundary. The team chooses the base model, benchmark version, execution target, and promotion threshold.
Run the loop. Modelsmith evaluates, trains, assesses, and records each decision until the candidate clears the promotion threshold or reaches a documented stop condition.
Review the evidence. The team inspects score movement, failure clusters, latency, cost, lineage, and rollback posture.
Promote or continue. The candidate ships only when the configured gate and approval chain are satisfied.
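
The final step can be sketched as a single decision function that combines the configured gate with the approval chain. Everything here is hypothetical: the function name, the role names, the zero-regression default, and the blocker wording are assumptions for illustration, not the product's gate logic.

```python
def gate_decision(score, threshold, regressions, required_approvers, approvals,
                  max_regressions=0):
    """Return ("promote" | "continue", blockers) for a candidate.

    `approvals` maps role -> bool; any role missing from it counts as unapproved.
    """
    blockers = []
    if score < threshold:
        blockers.append("quality: score below promotion threshold")
    if regressions > max_regressions:
        blockers.append("quality: regression count above limit")
    missing = [role for role in required_approvers if not approvals.get(role, False)]
    if missing:
        blockers.append("governance: missing approval from " + ", ".join(missing))
    return ("promote" if not blockers else "continue", blockers)


decision, blockers = gate_decision(
    score=0.87,
    threshold=0.85,
    regressions=0,
    required_approvers=["domain owner", "reviewer", "approver"],
    approvals={"domain owner": True, "reviewer": True},
)
print(decision)
print(blockers)
```

In this example the candidate clears the quality checks but one approval is outstanding, so the decision is to continue and the open blocker is named explicitly.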