Platform surface

The client-facing Modelsmith operating surface: benchmark design, controlled iteration, fleet execution, evidence bundles, and governed promotion.

Modelsmith is a governed specialisation platform for customer-owned model workflows. The public operating surface is the part a client team needs to trust before running a domain model through it: how quality is defined, where the model runs, how each iteration is measured, and which evidence is required before a specialised candidate can be promoted.

This page describes the client contract. Deployment-specific commands, repository paths, local automation, and internal operator tooling are not part of the public product surface.

Client-facing surfaces

Surface | Client question it answers | What should be visible
Benchmark workbench | What does "better" mean for this domain? | Scenario coverage, pass and fail criteria, held-out split, rubric ownership, and balance by capability, risk, and outcome type.
Model intake | What is being specialised, where, and against which gate? | Base model, target deployment boundary, benchmark version, promotion threshold, owner, and rollback expectation.
Iterate ledger | What happened in each loop and why? | Current phase, score movement, failure clusters, capacity state, training progress, blocked states, and next decision.
Evidence bundle | Can a reviewer defend the promotion decision? | Scorecards, baselines, regression count, latency, cost-per-decision, lineage, approvals, and rollback posture.
Promotion gate | Who approved the candidate and on what evidence? | Gate status, approver chain, open blockers, final decision, and immutable promotion record.

What clients operate

Benchmark workbench

The benchmark is the first controlled asset. A domain expert defines realistic scenario prompts, success criteria, failure criteria, and representative domain inputs. Modelsmith uses that benchmark as the objective function for evaluation, training, and promotion.

Clients should expect to review:

  • scenario coverage by capability, risk, operator lens, and outcome type
  • pass and fail criteria that a domain owner can defend
  • held-out scenarios that remain separate from training material
  • corpus balance reports that show whether the benchmark overweights one failure mode
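
A corpus balance report and a held-out check can be pictured with a small sketch. This is not Modelsmith's implementation or API: the Scenario fields, the function names, and the 50% overweight cut-off are all assumptions chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """One benchmark scenario. Field names are hypothetical."""
    scenario_id: str
    capability: str    # e.g. "triage"
    risk: str          # e.g. "high" or "low"
    outcome_type: str  # e.g. "approve" or "escalate"
    held_out: bool     # True if reserved for evaluation only


def balance_report(scenarios, axis, max_share=0.5):
    """Return the share of scenarios per value on one axis,
    plus the values whose share exceeds max_share (an assumed cut-off)."""
    counts = Counter(getattr(s, axis) for s in scenarios)
    total = sum(counts.values())
    shares = {value: count / total for value, count in counts.items()}
    overweight = sorted(v for v, share in shares.items() if share > max_share)
    return shares, overweight


def split_leaks(scenarios, training_ids):
    """Return ids of held-out scenarios that also appear in the training set."""
    return sorted(
        s.scenario_id
        for s in scenarios
        if s.held_out and s.scenario_id in training_ids
    )
```

Running balance_report once per axis (capability, risk, outcome_type) gives a reviewer one table per dimension, and a non-empty result from split_leaks signals that a held-out scenario has reached training material.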

Model intake

Modelsmith records the model, benchmark version, target execution boundary, serving target, and promotion criteria before a run starts. The goal is to remove notebook-driven ambiguity: the run has an owner, a target, and a measurable gate from the beginning.

Clients should expect intake to answer:

  • which base model is being specialised
  • which benchmark version defines "better"
  • where training and inference run
  • which promotion threshold must be cleared
  • what rollback record will exist if the model is promoted
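
The intake questions above map naturally onto a single record that is checked before a run starts. The sketch below is illustrative only: IntakeRecord, its field names, and the assumption that the promotion threshold is a score between 0 and 1 are not taken from the product.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class IntakeRecord:
    """Hypothetical intake record; every field name is an assumption."""
    base_model: Optional[str]
    benchmark_version: Optional[str]
    execution_target: Optional[str]      # where training and inference run
    serving_target: Optional[str]
    promotion_threshold: Optional[float]  # assumed to be a score in (0, 1]
    owner: Optional[str]
    rollback_expectation: Optional[str]


def intake_gaps(record):
    """Return the names of fields that are unanswered or out of range."""
    gaps = [
        f.name for f in fields(record)
        if getattr(record, f.name) in ("", None)
    ]
    threshold = record.promotion_threshold
    if threshold is not None and not 0 < threshold <= 1:
        gaps.append("promotion_threshold")
    return gaps
```

An empty list from intake_gaps means every intake question has an answer; anything else names exactly what is missing before the run should begin.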

Iterate ledger

The iterate loop runs through evaluation, training, assessment, and promotion staging. Every phase consumes evidence from the previous phase and records the next decision. If the model stalls, the system should surface the plateau and the recommended next action rather than hide the failure.

The client-visible status should be phrased in operational terms:

  • current phase and latest transition reason
  • current score, previous accepted score, and target threshold
  • strongest failure cluster and regression count
  • fleet capacity and training progress
  • whether promotion is blocked by quality, governance, or infrastructure
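
One way to picture a ledger entry and the plateau rule is the sketch below. The LedgerEntry fields, the three-entry window, and the 0.01 minimum gain are assumptions for illustration, not documented Modelsmith behaviour.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LedgerEntry:
    """Hypothetical ledger entry; field names are assumptions."""
    phase: str               # evaluation, training, assessment, or promotion staging
    transition_reason: str
    score: float
    accepted_score: float    # previous accepted score
    target: float            # target threshold
    regression_count: int
    blocked_by: Optional[str] = None  # "quality", "governance", "infrastructure"


def is_plateau(entries, window=3, min_gain=0.01):
    """True when the last `window` entries gained less than `min_gain` in total."""
    if len(entries) < window:
        return False
    recent = entries[-window:]
    return recent[-1].score - recent[0].score < min_gain


def next_decision(entries):
    """Derive the next decision from the latest entry and recent score movement."""
    latest = entries[-1]
    if latest.blocked_by:
        return f"blocked: {latest.blocked_by}"
    if latest.score >= latest.target:
        return "stage for promotion"
    if is_plateau(entries):
        return "plateau: review failure clusters"
    return "continue"
```

The point of the design is that a stalled run produces an explicit "plateau" decision in the ledger instead of silently looping.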

Fleet and data boundary

Modelsmith is designed for customer-owned model and data boundaries. Training and inference run on the approved execution target. Evidence leaves the run in sanitised form: summaries, metrics, scorecards, lineage records, approvals, and review labels.

Public and shared surfaces must not expose raw datasets, prompts, completions, logs, filesystem paths, hostnames, secrets, or opaque artefact identifiers.
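
The boundary rule can be sketched as an allowlist: only the sanitised evidence types are copied out, so anything unrecognised is dropped by default. The key names below are hypothetical, and the sketch inspects top-level keys only; it does not scan values for embedded paths or secrets.

```python
# Hypothetical key names mirroring the evidence types listed above.
ALLOWED_KEYS = {
    "summaries", "metrics", "scorecards",
    "lineage", "approvals", "review_labels",
}

# Hypothetical key names mirroring the material that must not be exposed.
FORBIDDEN_KEYS = {
    "datasets", "prompts", "completions", "logs",
    "filesystem_paths", "hostnames", "secrets", "artefact_ids",
}


def sanitise(record):
    """Keep only allowlisted top-level keys; everything else is dropped."""
    return {key: value for key, value in record.items() if key in ALLOWED_KEYS}


def leaked_keys(record):
    """Return forbidden top-level keys present in a record, sorted by name."""
    return sorted(set(record) & FORBIDDEN_KEYS)
```

An allowlist is used here rather than a blocklist so that a newly introduced field stays inside the boundary until someone explicitly approves it.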

Evidence bundle

An evidence bundle is the review packet attached to a candidate. It should let a domain owner, engineering owner, and compliance reviewer inspect the same facts before promotion.

The bundle should include:

  • benchmark version and scenario coverage
  • candidate score against configured baselines
  • held-out performance and regression count
  • latency and cost-per-decision measurements
  • lineage from base model to candidate
  • approver chain and sign-off status
  • rollback contract and operational notes
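
A completeness check over the list above might look like the following sketch. The section names are hypothetical labels for the listed items, not a documented bundle schema.

```python
# Hypothetical section names, one per item in the list above.
REQUIRED_SECTIONS = (
    "benchmark_version",
    "scenario_coverage",
    "baseline_scores",
    "held_out_score",
    "regression_count",
    "latency",
    "cost_per_decision",
    "lineage",
    "approver_chain",
    "rollback_contract",
)


def missing_sections(bundle):
    """Return required sections that are absent or set to None.

    The check uses `is None` so that a legitimate zero, such as a
    regression count of 0, still counts as present.
    """
    return [name for name in REQUIRED_SECTIONS if bundle.get(name) is None]
```

A reviewer-facing surface could refuse to open the approval step while missing_sections returns anything.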

For the full review packet, see Evidence bundles. For the approval sequence, see Promotion gates.

Role model

Role | Owns | Needs from the surface
Domain owner | Benchmark relevance and rubric quality. | Scenario review, weak-rubric challenges, coverage gaps, and pass/fail criteria.
Operator | Run supervision and blocked-state triage. | Phase state, capacity state, training progress, failure clusters, and next action.
Reviewer | Evidence quality before approval. | Score movement, baselines, regressions, lineage, rollback posture, and open risk.
Approver | Final promotion decision. | Gate status, evidence summary, approval record, and rollback trigger.

Day-one journey

  1. Define what "better" means. A domain expert writes 20 to 40 scenarios with explicit pass and fail criteria.
  2. Select the model and boundary. The team chooses the base model, benchmark version, execution target, and promotion threshold.
  3. Run the loop. Modelsmith evaluates, trains, assesses, and records each decision until the candidate improves or reaches a documented stop condition.
  4. Review the evidence. The team inspects score movement, failure clusters, latency, cost, lineage, and rollback posture.
  5. Promote or continue. The candidate ships only when the configured gate and approval chain are satisfied.
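
The final step can be sketched as a gate that returns a decision together with its reasons. The function name, parameters, and reason strings are assumptions for illustration; the actual gate configuration is not described on this page.

```python
def gate_status(score, threshold, required_approvers, approvals, open_blockers=()):
    """Return ("promote" | "continue", reasons).

    `approvals` maps an approver role to True once that role has signed off.
    The candidate is promoted only when no reason to hold it back remains.
    """
    reasons = []
    if score < threshold:
        reasons.append("score below threshold")
    missing = [role for role in required_approvers if not approvals.get(role, False)]
    if missing:
        reasons.append("missing approval: " + ", ".join(missing))
    reasons.extend(f"open blocker: {blocker}" for blocker in open_blockers)
    return ("promote" if not reasons else "continue", reasons)


# Example: the score clears the gate, but one approval is outstanding.
decision, reasons = gate_status(
    score=0.88,
    threshold=0.85,
    required_approvers=["domain owner", "approver"],
    approvals={"domain owner": True},
)
```

Returning the reasons alongside the decision keeps the "continue" outcome explainable: the team sees which part of the gate or approval chain is still unsatisfied.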