The platform

An agent-native harness for post-training, evaluating, and promoting domain-specialist language models.

Modelsmith is the Agentsia runtime. It is typed, machine-readable, and designed for agents to operate. Human reviewers set rubrics, approve promotions, and inspect evidence bundles. Everything between those gates runs without intervention.

Substrate

vLLM, Groq, Fireworks, on-prem GPU, Nvidia consumer-grade hardware.

Control plane

Selection, evals, post-training, promotion, rollback, lineage, routing.

Application

Your agents, your workflows, your embedded product intelligence.

Why it works

A specialist an order of magnitude smaller than a frontier model can match it on your narrow workflow.

Frontier models carry the breadth of every subject humans write about. Your workflow does not need most of it. A post-trained specialist redirects training exposure entirely to one domain, so coverage per unit of capacity is much deeper. The mechanism is domain-agnostic: it works for supply-path reasoning in adtech, fraud typology in fintech, clinical triage in health, and ADAS perception in automotive.

Frontier labs optimise for the public leaderboards that sell a general-assistant subscription: SWE-bench Verified for coding, GPQA Diamond for science, AIME for mathematics, MMLU-Pro for general reasoning, HLE for frontier knowledge. None of those represent your domain. Modelsmith's first job on any engagement is to build the benchmark that does: your governed eval set, your rubric, your golden standards, versioned and governed alongside the specialist they train.

Frontier LLM

~2.6T parameters (est.)

Pre-trained on Hindi poetry, Linux kernel internals, conversational French, high-school chemistry, legal precedent in twenty jurisdictions, fifty programming languages, and every other subject humans write about.

Coverage of your domain

A fraction of a percent of the training mix.

Frontier labs do not publicly disclose parameter counts. Analyst estimates via Epoch AI.

Domain-specialist LM

Sub-1B to 122B parameters

Post-trained on your operational data only. In adtech: RTB logs and exchange mechanics. In fintech: transaction patterns and fraud typology. In health: clinical notes and treatment protocols. In automotive: perception traces and driving telemetry. An open-weights base model (Qwen, Nemotron, Gemma) provides general language competence underneath.

Coverage of your domain

All post-training specialisation sits on your domain.

Open-weights bases range from sub-1B (Qwen3-0.6B, Gemma-4-E2B) to 122B (Qwen3.5-122B-A10B).

The three-layer stack

Application, control plane, substrate. Each layer owns its responsibility.

The boundaries matter commercially. A hyperscaler that sells substrate cannot credibly govern promotion above it. A framework that sells application-layer orchestration cannot credibly own the eval corpus beneath it. Agentsia sits between the two.

  1. Layer 1 · What your team builds

    Application

    Your product, your agents, your workflows. The control plane exposes typed interfaces so your application code calls specialists the same way it calls any other service.

    • Domain agents and orchestration logic
    • Product surfaces, UX, and customer data flow
    • Commercial workflow definitions
    Examples · LangGraph, CrewAI, AutoGen, your own services
  2. Layer 2 · The specialisation control plane

    Agentsia

    Modelsmith is the agent harness. It decides which specialist to build, trains it on your data and evals, governs the promotion, and manages fleet routing. All state is typed and machine-readable so agents can operate it without humans in the loop for routine cycles.

    • Specialist selection from your operational data
    • Post-training (SFT, DPO, GRPO, LoRA adapters)
    • Governed LLM eval set and promotion gates
    • Fleet routing, rollback, and lineage
    • MCP tools and a typed CLI for agent operators
    Examples · Modelsmith (Agentsia)
  3. Layer 3 · Where inference runs

    Substrate

    Wherever the specialist serves. That can be a commercial runtime, the same customer hardware Modelsmith trains on, or embedded silicon inside a vehicle or a mobile SoC. Agentsia is substrate-agnostic. The signed artefact pins to a base model and a runtime target; the deployment endpoint is configuration.

    • Open-weights checkpoints (Nemotron, Qwen, Gemma)
    • Inference runtime execution (vLLM, TensorRT, MLX, embedded)
    • Training compute and GPU allocation
    • Nvidia consumer-grade hardware for on-premise fleets
    • Embedded and edge silicon (automotive, mobile, on-device)
    Examples · vLLM, Groq, Fireworks, TensorRT, MLX, CoreML, custom in-vehicle runtimes

Substrate topology

Three canonical deployment patterns.

The three layers are a separation of concerns, not of hardware. A customer's control plane and substrate can share a machine, or sit in different networks, or ship apart to embedded silicon. The pattern is a deployment decision; Modelsmith governs all three the same way.

T1

Co-located training and inference

The same customer hardware trains overnight and serves inference continuously. Modelsmith and the inference runtime share the machine. Promotion is a local artefact swap, not a cross-network deploy.

Example

Adtech team on Nvidia consumer-grade hardware running vLLM 24/7 to surface revenue opportunities

T2

Trained centrally, deployed to the edge

Modelsmith trains on dedicated hardware. The signed artefact ships to embedded silicon, mobile SoCs, or a custom in-vehicle runtime. Agentsia never runs on the edge device. Only the artefact does.

Example

Autopilot perception on in-vehicle silicon, or on-device specialists on mobile

T3

Trained centrally, deployed to a commercial runtime

Modelsmith trains on customer hardware. The specialist deploys to a managed inference vendor when the customer does not want to run their own inference layer.

Example

Hosted inference on vLLM, Groq, or Fireworks behind the customer VPC

Current scope

Text-first

Modelsmith is tested today on text-input, text-output post-training and inference. Multimodal post-training (vision, audio) is in active development and ships on the same promotion state machine when released. Existing text specialists promote forward without a rebuild.

The iterate loop

A closed loop that runs without supervision between judgement gates.

Four phases. Each emits typed state the next phase consumes. The harness runs the loop on a cadence the customer sets, and humans only appear at explicit gates in the promotion state machine.

  1. 01 · Eval

    Governed eval set runs against candidate

    The harness loads the governed scenario set, runs the candidate specialist, and computes composite score, per-rubric pass rate, and held-out regression. All runs are content-addressed against the eval-set git SHA.

    Emits
    run_id, composite, per_rubric[], regressions[]
  2. 02 · Diagnose

    Failures classified, signal extracted

    Failures cluster by rubric and root-cause. Persistent patterns generate synthetic training data under the customer rubric. Novel failures stage as new scenarios and queue for human review before entering the governed set.

    Emits
    failure_clusters[], synthetic_batch, pending_scenarios[]
  3. 03 · Post-train

    Next iteration trains on combined signal

    Supervised fine-tuning, DPO, or GRPO on the customer base model with LoRA adapters. Supports Nemotron, Qwen, Gemma, or any open-weights checkpoint. Training artefacts are pinned to the run_id and stored on customer hardware.

    Emits
    artefact_uri, adapter_sha, train_config
  4. 04 · Promote

    Composite clears gate, specialist deploys

    If the candidate clears the promotion gate without regression, the state machine advances automatically. Novel failures, threshold misses, or rubric additions escalate to a human approver with the full evidence bundle attached.

    Emits
    promotion_decision, evidence_bundle, rollback_contract
Return · After Promote, the loop re-enters Eval. Production failures captured between iterations become new scenarios in the next governed set.
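The loop above can be sketched as a single pass in Python. This is a minimal illustration, not harness code: the field names follow the Emits lists, while the stubbed values, threshold, and function bodies are assumptions.

```python
from dataclasses import dataclass

# Illustrative sketch of one pass through the iterate loop. Field names
# mirror the "Emits" lists above; values and logic are stand-ins.

@dataclass
class EvalResult:
    run_id: str
    composite: float
    per_rubric: list
    regressions: list

def run_eval(candidate: str) -> EvalResult:
    # 01 Eval: governed set runs against the candidate (stubbed here).
    return EvalResult("run-17", 0.894, [("pricing-rubric", 0.91)], [])

def diagnose(result: EvalResult) -> dict:
    # 02 Diagnose: cluster failures, stage synthetic signal (stubbed).
    return {"failure_clusters": result.regressions,
            "synthetic_batch": [], "pending_scenarios": []}

def post_train(signal: dict) -> dict:
    # 03 Post-train: produce a pinned artefact for the next iteration.
    return {"artefact_uri": "artefacts/iter-18", "adapter_sha": "a3f8",
            "train_config": {"method": "dpo"}}

def gate(result: EvalResult, threshold: float = 0.880) -> str:
    # 04 Promote: advance automatically only when the gate clears cleanly.
    clears = result.composite >= threshold and not result.regressions
    return "advance" if clears else "escalate-to-human"

result = run_eval("iter-17")
decision = gate(result)  # "advance" for this stubbed run
```

A threshold miss or any regression flips the decision to escalation, which is the point: the automated path is the narrow one.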

Cadence

The loop runs continuously in production environments, typically overnight. Customers with multi-unit hardware can run several iterations in parallel each day to shorten the convergence window.

Observability

Every phase writes to a lineage ledger on customer hardware. Composite trajectory, rubric drift, and promotion history are inspectable from the local dashboard. Agentsia never sees this data.

Promotion state machine

Six states. Every transition is auditable and reversible.

The state machine is the contract between Modelsmith and your governance. Transitions that are safe to automate run without human review. Transitions that carry commercial or regulatory consequence require a typed approval.

  1. S1
    candidate
  2. S2
    eval-accepted
  3. S3
    artifact-exported
  4. S4
    customer-deployed
  5. S5
    production-validated
  6. S6
    deprecated
Any state before production-validated can roll back to the prior specialist in seconds.
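As a sketch, the six states and their gates can be modelled as a table-driven transition check. State names and gate assignments follow the cards below; the helpers themselves are illustrative, not Modelsmith APIs.

```python
from enum import Enum

class State(Enum):
    CANDIDATE = "candidate"
    EVAL_ACCEPTED = "eval-accepted"
    ARTIFACT_EXPORTED = "artifact-exported"
    CUSTOMER_DEPLOYED = "customer-deployed"
    PRODUCTION_VALIDATED = "production-validated"
    DEPRECATED = "deprecated"

# Forward transitions and the gate that authorises each, per the state
# cards below. deprecated is terminal and has no entry.
FORWARD = {
    State.CANDIDATE:            (State.EVAL_ACCEPTED,        "harness"),
    State.EVAL_ACCEPTED:        (State.ARTIFACT_EXPORTED,    "harness"),
    State.ARTIFACT_EXPORTED:    (State.CUSTOMER_DEPLOYED,    "human"),
    State.CUSTOMER_DEPLOYED:    (State.PRODUCTION_VALIDATED, "human"),
    State.PRODUCTION_VALIDATED: (State.DEPRECATED,           "human"),
}

def advance(state: State, actor: str) -> State:
    nxt, gate = FORWARD[state]
    if gate == "human" and actor != "human":
        # The harness refuses to self-approve a human-gated transition.
        raise PermissionError(f"{state.value} -> {nxt.value} needs human approval")
    return nxt

def can_rollback(state: State) -> bool:
    # Any state before production-validated can return to the incumbent.
    return state in (State.CANDIDATE, State.EVAL_ACCEPTED,
                     State.ARTIFACT_EXPORTED, State.CUSTOMER_DEPLOYED)
```

The table makes the contract inspectable: an agent can read which transitions it may drive and which will be refused before it attempts them.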
S1

candidate

Gate · Harness

A new training run has produced an adapter. No eval has been run against the governed set.

Transition

Governed eval set runs and composite clears the threshold.

S2

eval-accepted

Gate · Harness

Composite passes, no regression against previously passing scenarios, held-out pass rate above threshold.

Transition

Artefact is signed and exported to the customer artefact store.

S3

artifact-exported

Gate · Human

Signed model weights, adapter, and evidence bundle are sealed and immutable.

Transition

Customer approver reviews the evidence bundle and approves deployment.

S4

customer-deployed

Gate · Human

The specialist serves a configurable share of traffic. Live metrics compare it to the incumbent over the validation window.

Transition

Customer approver confirms production metrics held against the baseline; promotion advances to full traffic.

S5

production-validated

Gate · Human

Specialist is promoted to full traffic. Lineage ledger records the promotion, approver, and evidence bundle.

Transition

Customer approver retires the specialist once a later one supersedes it or the eval rubric is deprecated.

S6

deprecated

Gate · System

The specialist is retained for audit and rollback, but no new traffic is routed to it.

Transition

Terminal state. Artefact and evidence remain addressable by lineage ID.

Rollback contract

Every specialist ships with an explicit rollback contract: the prior production-validated artefact, the exact routing configuration, and the metric window in which rollback is automatic. A failed validation window returns traffic to the incumbent without a redeploy.
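A contract of this shape can be sketched as plain data plus one check. The field names and the 4%-in-60-minutes figures are illustrative assumptions, not a shipped schema.

```python
from dataclasses import dataclass

@dataclass
class RollbackContract:
    incumbent_artefact: str  # prior production-validated artefact
    routing_config: dict     # exact routing to restore on rollback
    window_minutes: int      # validation window length
    metric: str              # watched metric (name is illustrative)
    max_drop_pct: float      # drop beyond which rollback is automatic

def should_auto_rollback(c: RollbackContract, baseline: float,
                         observed: float, minutes_elapsed: int) -> bool:
    # Automatic only inside the validation window; outside it,
    # rollback remains a manual decision.
    within = minutes_elapsed <= c.window_minutes
    drop = (baseline - observed) / baseline * 100
    return within and drop > c.max_drop_pct

contract = RollbackContract("iter-16.safetensors", {"traffic_share": 0.1},
                            60, "auction_win_rate", 4.0)
```

A 6% metric drop thirty minutes in triggers automatic rollback; the same drop after the window closes does not.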

Agent-native by design

Every surface is typed, scriptable, and safe for an agent to operate.

Modelsmith exposes a CLI and an MCP server with typed tools. Claude Code, Cursor, and Codex drive the iterate loop through both surfaces. Humans appear only at explicit judgement gates in the promotion state machine.

CLI

Available

Run the governed eval set against a cluster.

Typed flags, JSON-shaped output, deterministic exit codes. The bash wrapper and Python CLI are safe to drive from a Claude Code, Cursor, or Codex session.

$ modelsmith eval --quick --cluster rtb-pricing

  Modelsmith Fleet Status  ·  2026-04-18 09:12:04 BST

  cluster        rtb-pricing
  model          qwen3-32b-awq
  iteration      17
  scenarios      3 (held-out, representative)
  phase          3 Robustness eval

  composite      0.894   PASS  (threshold 0.880)
  held_out       0.872   PASS  (threshold 0.860)
  regressions    0 / 248
  duration       4m 12s

MCP tool

Available

An agent launches an eval run.

The Modelsmith MCP server exposes typed tools to every agent operator. Returns schema-validated JSON so the calling agent can route decisions without a screen scrape.

// agent -> modelsmith_eval_run

  {
    "tool": "modelsmith_eval_run",
    "args": {
      "tag": "rtb-pricing-iter-17-core",
      "cluster": "rtb-pricing",
      "workers": 2,
      "type": "core",
      "domain": "adtech"
    }
  }

  // response

  {
    "jobId": "a3f82c41",
    "status": "started",
    "tag": "rtb-pricing-iter-17-core",
    "log": ".iterate/rtb-pricing/logs/eval-a3f82c41.log"
  }
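Before acting on that response, a calling agent can verify the shape. This is a minimal check assuming only the four fields shown above; the helper itself is not part of the Modelsmith surface.

```python
# Required keys and types, taken from the example response above.
REQUIRED = {"jobId": str, "status": str, "tag": str, "log": str}

def validate_eval_run_response(payload: dict) -> dict:
    # Reject anything that is not the documented shape, so downstream
    # routing never branches on a half-formed reply.
    for key, typ in REQUIRED.items():
        if not isinstance(payload.get(key), typ):
            raise ValueError(f"missing or mistyped field: {key}")
    return payload

resp = {"jobId": "a3f82c41", "status": "started",
        "tag": "rtb-pricing-iter-17-core",
        "log": ".iterate/rtb-pricing/logs/eval-a3f82c41.log"}
validated = validate_eval_run_response(resp)
```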

Promotion record

Available

A promotion writes a signed record to the cluster ledger.

Today, modelsmith promote writes an immutable JSON record with iteration, composite, eval tag, and adapter path. PR-native promotion (opening a reviewable pull request with the evidence bundle attached) is on the roadmap.

$ modelsmith promote rtb-pricing

  Promotion request
    Cluster:   rtb-pricing
    Model:     qwen3-32b-awq
    Iteration: 17
    Composite: 89.4%
    Eval tag:  qwen3-32b-tq-rtb-pricing-iter17-core
    Adapter:   .iterate/rtb-pricing/adapters/iter-17.safetensors

  Promotion record written.
    Directory: .iterate/rtb-pricing/promotions/
    Latest:    2026-04-18-manual-promote.json

Tool catalogue

Nine tools live today. Three more groups in active development.

Every tool is a typed MCP entry that accepts a JSON arguments object and returns a schema-validated JSON response. Unknown tools are refused. Malformed arguments return a typed error envelope. The Promotion, Governance, and Data groups below exist in the codebase as planned control-plane modules and are being wired into the canonical tool catalogue.

Status & fleet

Available

Cluster states, vLLM health, composite history, GPU utilisation across the training fleet.

  • modelsmith_status
  • modelsmith_fleet_status
  • modelsmith_fleet_health
  • modelsmith_composite_history

Evaluation

Available

Run governed eval scenarios, fetch results from the Turso ledger, compare two runs.

  • modelsmith_eval_run
  • modelsmith_eval_results
  • modelsmith_compare

Training

Available

Start GRPO training from eval failures; monitor job progress from an agent session.

  • modelsmith_train_start
  • modelsmith_train_status

Promotion

Coming soon

candidate → eval-accepted → artifact-exported → customer-deployed → production-validated → deprecated, with typed approval gates.

  • promote
  • rollback
  • get_candidate
  • request_approval

Governance & evidence

Coming soon

Policy-gate enforcement, evidence bundle export, approver chain, and immutable audit ledger.

  • evidence_bundle
  • policy_gate
  • approver_chain
  • lineage_query

Data & rubrics

Coming soon

Review synthetic training signal, diff rubrics, accept new scenarios into the governed eval set.

  • review_synthetic
  • diff_rubric
  • accept_scenario

Machine-readable state

Every action writes typed state to the iterate ledger and a Turso cloud DB queryable by tag. No log scraping, no free-form strings.

Runbooks as code

Operational playbooks live in config/runbooks/ as structured JSON. Agents execute them by invoking tools, not by reading prose.

Judgement gates are explicit

The promotion state machine codifies which transitions require a human approval. The harness refuses to self-approve on those paths.

Agent integrations

Coming soon

Roadmap: first-class packs for the coding agents your team already runs.

The Modelsmith MCP server and CLI work today through any MCP-capable coding agent. The polished distribution surface described below is in active development: installable packs for Claude Code, Cursor, and Codex; a slash-command alias layer; a typed event stream; and a structured AGENTS.md operator contract. Everything shown here is intended behaviour, not current behaviour.

Claude Code

Coming soon
claude.ai/code

Installable skill pack

Three named skills will ship in the pack: operator (full harness access), reviewer (read-only audit), and promoter (promotion-workflow only). Your customer IdP gates which skill an operator can load.

$ claude plugin install agentsia/modelsmith
.claude/skills/modelsmith-operator.md
---
name: modelsmith-operator
description: Drive the Modelsmith iterate loop (eval, train,
  promote, rollback). Read the lineage ledger. Never self-approve.
allowed-tools: modelsmith_*
---

You operate Modelsmith via typed MCP tools. Principles:

- Always call modelsmith_eval_results before discussing
  composite state.
- On regression, trigger rollback immediately and escalate.
- Never self-approve a transition whose gate is Human.
- All tool calls return typed JSON. Do not parse prose.

Cursor

Coming soon
cursor.com

MDC rule file + MCP server config

MDC rules will auto-apply when the agent opens files under evals/, .iterate/, or the customer promotion repo. Modelsmith registers as an MCP server so every tool in the catalogue is discoverable without configuration.

$ npx @agentsia/modelsmith-cursor init
.cursor/rules/modelsmith.mdc
---
description: Modelsmith operator rules
globs:
  - "evals/**/*.yaml"
  - ".iterate/**"
  - "promotions/**"
alwaysApply: false
---

# Modelsmith

- Use the modelsmith_eval_run tool for governed evals. Do not
  invoke model endpoints directly from scratch scripts.
- Promotion transitions go through the CLI or the MCP
  tool. Never hand-edit the state ledger.
- Evidence bundles are immutable. If one is missing, open
  a rerun. Do not reconstruct by hand.

Codex

Coming soon
OpenAI agents

Plugin manifest + tool registry

Codex will consume the same typed tool surface through a plugin manifest. Authorisation scopes will mirror the Claude Code skill split (operator, reviewer, promoter) so your RBAC model is portable across harnesses.

$ npx @agentsia/modelsmith-codex init
.codex/plugins/modelsmith.json
{
  "name": "modelsmith",
  "version": "1.0",
  "runtime": "mcp",
  "endpoint": "https://localhost:7443/mcp",
  "auth": "bearer:$MODELSMITH_TOKEN",
  "scopes": {
    "operator":  ["modelsmith_*"],
    "reviewer":  ["modelsmith_eval_results", "modelsmith_composite_history"],
    "promoter":  ["promote", "rollback", "modelsmith_fleet_status"]
  }
}

Slash commands

Coming soon

The harness the way operators type it.

Each slash command will resolve to a typed tool call under the bonnet. The same authorisation scopes apply. A reviewer who types /ms-promote gets the same typed refusal an agent would receive from the MCP surface.

/ms-eval <cluster>

Run the governed eval set against a cluster.

/ms-train <cluster> <base>

Launch a post-training run on the named base model and cluster.

/ms-promote <cluster>

Write a promotion record with evidence bundle and rollback contract.

/ms-rollback <cluster>

Return traffic to the prior production-validated incumbent within the rollback window.

/ms-fleet

Summarise fleet state, specialists in validation, and pending approvals.

/ms-audit <range>

Export the immutable audit bundle for a date range.
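The command-to-tool resolution can be sketched as a table-driven dispatcher. The command names come from the list above; the tool names and argument shapes they resolve to here are assumptions for illustration.

```python
# Hypothetical mapping from slash command to a typed MCP tool call.
# Only a subset of the commands above is shown.
COMMANDS = {
    "/ms-eval":     lambda args: {"tool": "modelsmith_eval_run",
                                  "args": {"cluster": args[0]}},
    "/ms-fleet":    lambda args: {"tool": "modelsmith_fleet_status",
                                  "args": {}},
    "/ms-rollback": lambda args: {"tool": "rollback",
                                  "args": {"cluster": args[0]}},
}

def resolve(line: str) -> dict:
    # Unknown commands are refused, mirroring the typed-refusal
    # behaviour described above.
    cmd, *args = line.split()
    if cmd not in COMMANDS:
        raise ValueError(f"unknown command: {cmd}")
    return COMMANDS[cmd](args)

call = resolve("/ms-eval rtb-pricing")
```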

Hooks and events

Coming soon

Every state transition will be a subscribable event.

Subscribe through the MCP event stream, Server-Sent Events, or a signed webhook. Supervisor agents, oncall rotations, and audit pipelines will listen to the same typed payloads.

on.eval.complete(run_id, composite, regressions[])

Fires after every governed eval run, pass or fail.

on.eval.regression(run_id, failing_scenarios[], prior_pass)

Fires only when a previously passing scenario fails.

on.training.complete(run_id, artefact_uri, adapter_sha)

Fires when a training run has written a signed artefact.

on.promotion.pending_approval(candidate, approver_chain, evidence_bundle)

Fires when a state transition needs a human gate.

on.promotion.advanced(specialist, from_state, to_state, actor)

Fires on every transition in the six-state machine.

on.rollback.triggered(specialist, reason, metric_window)

Fires on any rollback, automatic or manual.

on.fleet.drift(metric, baseline, observed, window)

Fires when a live specialist drifts outside its validation envelope.
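A supervisor agent listening to this stream might route payloads with a small dispatcher. The event names follow the catalogue above; the registration mechanism and handler logic are hypothetical.

```python
# Minimal event dispatcher sketch. Handlers register by event name;
# unhandled events fall through to None.
handlers = {}

def on(event: str):
    def register(fn):
        handlers[event] = fn
        return fn
    return register

def dispatch(event: str, payload: dict):
    handler = handlers.get(event)
    return handler(payload) if handler else None

@on("eval.regression")
def escalate(payload: dict) -> str:
    # A regression is the one event that warrants immediate action.
    count = len(payload["failing_scenarios"])
    return f"rollback {payload['run_id']}: {count} regressions"

msg = dispatch("eval.regression",
               {"run_id": "run-42", "failing_scenarios": ["s1", "s2"],
                "prior_pass": True})
```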

AGENTS.md

Coming soon

The local contract between your agents and your governance.

Modelsmith will write a structured AGENTS.md into the customer repo alongside its config. Agents read it on session start to understand local judgement-gate policies, approver chains, rollback windows, and which scopes are permitted for autonomous operation. The same file will be re-read by Claude Code, Cursor, and Codex. One contract, all harnesses.

AGENTS.md
# Modelsmith: agent operator contract

## Scopes permitted without human approval
- modelsmith_eval_*
- modelsmith_composite_history
- modelsmith_fleet_health
- review_synthetic (read-only)

## Scopes that require a typed human approval
- promote (artifact-exported -> customer-deployed)  -> head-of-trading
- promote (customer-deployed -> production-validated) -> head-of-trading
- rollback                                            -> oncall-sre
- accept_scenario                                     -> domain-lead

## Rollback policy
- Default validation window: 60 minutes
- Automatic rollback on: auction_win_rate drop > 4%
- Manual rollback preserved for 48h post-promotion

## Judgement gates
- Novel failure modes escalate; do not self-classify.
- Rubric additions always require domain-lead sign-off.
- Evidence bundles are immutable once promoted.
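An agent honouring this contract could check scope before every tool call. This is a sketch assuming shell-style pattern matching on tool names; the helper is not part of the shipped harness.

```python
from fnmatch import fnmatch

# Scopes taken from the AGENTS.md example above. The enforcement
# helper itself is an illustrative assumption.
AUTONOMOUS = ["modelsmith_eval_*", "modelsmith_composite_history",
              "modelsmith_fleet_health", "review_synthetic"]
NEEDS_APPROVAL = {"promote": "head-of-trading",
                  "rollback": "oncall-sre",
                  "accept_scenario": "domain-lead"}

def required_approver(tool: str):
    # None means the call may run without a human in the loop;
    # otherwise the named approver must sign a typed approval first.
    if any(fnmatch(tool, pattern) for pattern in AUTONOMOUS):
        return None
    return NEEDS_APPROVAL.get(tool, "unknown-gate")
```

Anything not explicitly autonomous and not in the approval table resolves to an unknown gate, which an agent should treat as a refusal.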

Two operating surfaces

The dashboard runs on your hardware. The portal runs on ours.

The split is architectural, not cosmetic. Customer data, model weights, training state, and deployment metrics never leave customer hardware. Agentsia hosts only what does not require them: licensing, downloads, documentation, and commercial workflow.

Operational

On-premise local dashboard

Runs on customer hardware

Day-to-day operation of the iterate loop and the fleet.

Where
Customer Nvidia consumer-grade hardware. Runs from the customer fork of Modelsmith.
Auth
Customer IdP (OIDC or SAML). Agentsia never authenticates end users.
Network
No outbound internet required. Licence validation is the only connectivity touchpoint.

Responsibilities

  • Real-time iterate-loop telemetry and lineage ledger
  • Eval scenario authoring, rubric diff, promotion queue
  • Fleet routing, specialist-vs-baseline live metrics
  • Rollback controls and validation-window monitoring
  • Evidence bundle download and audit export

Executive

agentsia.uk portal

Hosted by Agentsia

Licensing, documentation, downloads, and support.

Where
app.agentsia.uk. Standard public cloud hosting on the London edge.
Auth
Clerk, with SSO and MFA on Pro tier. Scoped to the buying organisation.
Network
Public internet, encrypted in transit, rate-limited.

Responsibilities

  • Licence key retrieval and rotation
  • Platform release downloads and domain starter kits
  • Docs access and design-partner gated materials
  • Support tickets and incident response
  • Billing and commercial workflow

The contract

Customer data never leaves customer hardware. Agentsia cannot see your models, training state, eval results, or deployment metrics. The air-gap is preserved by architecture, not by policy.

Anti-use cases, clarified

Four reasons teams wrongly self-disqualify.

The public discourse around fine-tuning makes it easy to rule out a specialist platform for the wrong reasons. Here are the four we hear most, with what the actual boundary condition is in each case.

01

"Our data changes weekly, so we cannot train a specialist."

What matters is whether the task is stable. Your data should change.

Every customer has data that moves. The stable surface is the task being asked of the model, the rubric for success, and the evaluation scenarios. Retraining on fresh data is precisely what Modelsmith does well. Specialists handle volatile data through retraining and RAG, not through a fixed weight snapshot.

02

"The work is not latency-sensitive, so there is no reason to leave frontier APIs."

Cost is the other reason. A local specialist runs 24/7 at zero marginal cost.

An overnight agent analysing marketplace settings for revenue opportunities costs nothing to run on owned hardware. The same workload on a frontier API runs into five figures a month. Latency-irrelevant workflows are often the strongest economic case for a local specialist, not a disqualifier.

03

"We use RAG today. A specialist would replace our retrieval stack."

RAG and specialists are complementary. You want both.

RAG handles volatile knowledge that belongs in an index. A specialist handles the stable domain judgement the retrieval layer feeds into. Combining them reduces over-reliance on RAG for reasoning tasks it cannot do well. Modelsmith integrates with your existing retrieval stack rather than replacing it.

04

"We do not have ML engineers or agent specialists on staff."

The minimum viable team is one technical reviewer plus one domain expert.

Agentsia is operated by agents. Your team supplies the judgement: a technically adept reviewer who can read evals and evidence bundles, and at least one domain expert who knows what the rubric should reward. The reviewer drives the harness through an agentic coding tool, so a Claude Code, Cursor, or Codex licence is a practical prerequisite. Professional Services supplies the rest for the first specialist. ML engineers are not a requirement.

Genuine anti-use cases

  • Tasks whose success criteria cannot be written down. If the rubric cannot be codified, there is nothing to train against.
  • Workloads where the frontier general model is already acceptable and the budget is small. Agentsia is a capital project, not a quick win.
  • Teams that want a managed API and no ownership of the model weights. Agentsia is for teams that want the artefacts.
  • One-off inference tasks with no improvement loop. If there is no iterate cadence, there is no specialisation to govern.

See the receipts

Read how Modelsmith-built specialists stack up against frontier APIs in your vertical.

Every Agentsia Labs benchmark is a real end-to-end run through Modelsmith: synthetic scenarios, eval harness, post-trained specialist, promotion record. Open methodology, published datasets, reproducible numbers.