Modelsmith · agent-native specialist model factory

Encode your deepest domain experts into language models that outperform Claude and GPT.

Your AdOps, legal, or account-management lead authors the benchmark. The platform iterates against it, calls in a frontier council of Claude, Gemini, and GPT at every blind spot, and ships a small domain-specialist language model that beats them on your benchmark.

You own the benchmark, the weights, and the hardware they run on.

How we measure this →

Book a discovery call · Read the benchmarks

Customer data, evals, prompts, weights, and promotion records remain under customer control.

The argument

If the benchmark does not exist, frontier AI cannot win it.

Provider leaderboards do not test your auctions, exchange integrations, publisher context, pacing logic, floor-price rules, or fraud patterns. Modelsmith turns that private adtech signal into the tests, training data, and release gates the public market will not produce.

01 Public frontier model

Trained on what everyone can see.

public exams · coding tasks · web text · general chat · public benchmarks · generic reasoning · common documents · broad knowledge · provider evals · + everyone else's benchmark

02 Missing benchmark

Your marketplace defines the test.

exchange rules · publisher quirks · bidstream gaps · campaign pacing · floor-price rules · auction window

03 Owned auction model

Trained on the auction you operate.

RTB logs · bid requests · publisher context · floor-price logic · fraud signals · auction feedback · pacing policy · supply-path rules · margin guardrails

Create the benchmark

The first product surface is the governed eval set the public market does not have.

Train on private signal

Failures, rubrics, logs, and golden standards become post-training signal inside the customer boundary.

Release only when it clears

A specialist only moves forward when private tests, latency, cost, and rollback checks clear the gate.
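As a rough illustration, a gate of this kind could be expressed as a set of independent checks that must all pass. The sketch below is ours for explanation only: the field names, the threshold values, and the `may_promote` helper are hypothetical and are not Modelsmith's actual interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateThresholds:
    # Hypothetical limits; a real gate would use the bar agreed with the customer.
    min_eval_score: float = 0.85
    max_p99_latency_ms: float = 100.0
    max_cost_per_decision: float = 0.001


@dataclass(frozen=True)
class CandidateReport:
    eval_score: float          # score on the private eval set, 0.0 to 1.0
    p99_latency_ms: float      # measured tail latency in milliseconds
    cost_per_decision: float   # measured serving cost per decision
    rollback_verified: bool    # True if rollback to the accepted model was rehearsed


def gate_failures(report: CandidateReport, limits: GateThresholds) -> list[str]:
    """Return the name of every failed check; an empty list means the gate clears."""
    failures = []
    if report.eval_score < limits.min_eval_score:
        failures.append("private_eval")
    if report.p99_latency_ms > limits.max_p99_latency_ms:
        failures.append("latency")
    if report.cost_per_decision > limits.max_cost_per_decision:
        failures.append("cost")
    if not report.rollback_verified:
        failures.append("rollback")
    return failures


def may_promote(report: CandidateReport, limits: GateThresholds) -> bool:
    # Promotion requires every check to pass; there is no partial credit.
    return not gate_failures(report, limits)
```

Returning the list of failed checks, rather than a bare yes or no, lets a reviewer see which requirement blocked a candidate.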

The benchmark you never had

Modelsmith's first output is your benchmark: dozens of scenarios across blocklist management, marketplace inefficiencies, supply-fill optimisation, and trust & compliance enforcement, each capturing how a senior domain expert would handle a synthetic situation. The model trains against the benchmark and generalises in ways frontier LLMs cannot replicate.

Your governed eval set, your rubric, and your golden standards are versioned alongside the specialist they train. The model is promoted only when it clears the gates.
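One way to picture "versioned alongside" is a lineage record that pins a specialist to content hashes of the exact assets it was graded against. This is a minimal sketch under our own assumptions: the record shape and the `content_hash` and `lineage_record` helpers are hypothetical, not the platform's format.

```python
import hashlib
import json


def content_hash(obj) -> str:
    """Stable SHA-256 of a JSON-serialisable object (keys sorted so ordering does not matter)."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def lineage_record(model_id: str, eval_set, rubric, golden) -> dict:
    # Hypothetical record shape: ties one specialist to the assets it was graded against.
    return {
        "model_id": model_id,
        "eval_set_sha256": content_hash(eval_set),
        "rubric_sha256": content_hash(rubric),
        "golden_sha256": content_hash(golden),
    }
```

If any scenario, rubric weight, or golden answer changes, the corresponding hash changes, so a score can always be traced to the benchmark version that produced it.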

private eval

training signal

promotion gate

The product category

Built for agents to operate and humans to govern.

Agent frameworks assemble behaviour. Inference vendors optimise execution. Between them sits the operating layer that agents can safely use to evaluate, experiment, adapt, promote, roll back, and deploy specialist models.

Agentsia is that operating layer. Modelsmith post-trains private specialist AI models from open-weight foundations, evaluates them against governed domain-specific evals, and packages promotion evidence for the customer's chosen substrate.

Modelsmith complements vLLM, Groq, Fireworks, LangGraph, and CrewAI. It governs the model lifecycle those layers do not own: evals, experiments, post-training, accepted state, rollback, and lineage.

Layer 1

Agent frameworks and applications

Assemble agent behaviour, product logic, and the workflow the customer wants to automate.

LangGraph, CrewAI-style orchestration, product workflows

Layer 2

Agentsia / Modelsmith agent harness

Runs domain-specific evals, controlled experiments, post-training, tool/runtime constraints, promotion gates, rollback, lineage, and evidence bundles.

Observe, evaluate, experiment, adapt, promote, deploy, monitor

Layer 3

Inference and training substrate

Executes training and serving on the customer-chosen or customer-controlled infrastructure.

vLLM, Groq, Fireworks, private cloud, on-premise, edge

01 observe · 02 evaluate · 03 experiment · 04 adapt · 05 promote · 06 deploy · 07 monitor

Distinct from: inference vendors

Above: agent frameworks

Separate from: fine-tuning APIs

Works with: your chosen substrate

Client operating surfaces

Benchmark workbench · Model intake · Iterate ledger · Fleet boundary · Promotion gate · Evidence bundle

Honest comparison

How Modelsmith sits next to the tools you already know.

Three classes of adjacent tooling come up in every buyer conversation. The comparison below separates direct alternatives from tools that sit before or after Modelsmith in the stack.

Serving providers are often confused with Modelsmith. Groq, Fireworks, and Modal run weights. Modelsmith produces them. Many customers use both.

Open-source framework

Determined AI, Ludwig

You bring the MLOps team and the governance practice. Modelsmith ships those packaged: eval gating, lineage, approver chain, rollback. With an open-source framework you assemble the lifecycle yourself.

Closest direct competitor

Together.ai

A hosted platform with fine-tuning and inference under one roof. Modelsmith differs on three axes: agent-native operating model (a domain expert drives the platform through agentic coding tools), eval-gated promotion (the model only ships when it beats the operator-authored benchmark), and owned weights running inside the customer's own substrate.

Complement (runs the weights)

Groq, Fireworks, Modal

Serving vendors run the weights that Modelsmith produces. Customers usually choose that layer after the specialist is trained, based on latency, cost, and sovereignty requirements.

Excluded here: fine-tuning APIs from frontier model vendors. Anthropic does not offer fine-tuning. OpenAI's fine-tuning is available only on prior model generations. Neither qualifies as a live alternative for the work Modelsmith targets.

Honest disqualification

When Modelsmith is the wrong tool.

Specialist-model factories pay off on a specific shape of work: a stable domain, a human who can author the benchmark, and a substrate the customer controls. If your job is not that shape, the simpler tool is usually the right one.

We would rather lose the call than land a customer who would be better served by a frontier API or a vector database.

You need an internal chatbot.

A general-purpose conversational assistant is exactly what frontier APIs are built for. Modelsmith would be overkill, slower to set up, and more expensive than calling Claude or ChatGPT directly.

Use instead: Anthropic, OpenAI, Google.

Your task is well-served by retrieval over your docs.

Use retrieval when a frontier API can answer from your document index. Modelsmith is for cases where private signals need to change model behaviour.

Use instead: Vector DB plus a frontier API.

You have no domain expert who can sit with the platform.

The benchmark a domain expert authors is the platform's first output and the asset every subsequent model is graded against. Without that human authorship, there is nothing for the platform to train towards.

Use instead: Hire or assign a domain expert first.

The definition of "good" changes every week.

Modelsmith trains against a stable benchmark. Domains where the success criteria churn faster than the model can retrain are a poor fit. The benchmark itself becomes the bottleneck.

Use instead: Iterate on the rubric with a frontier API until the definition stabilises.

You want one model to do everything.

Modelsmith ships specialist models, each graded against a domain-specific eval. A generalist that answers calendar questions, writes marketing copy, and prices ad inventory is the opposite of what the platform produces.

Use instead: A general-purpose frontier API.

You cannot run the model inside your own boundary.

Ownership of weights and the substrate they run on is half the value proposition. If a hosted inference API meets your compliance, latency, and cost requirements without that ownership, the simpler architecture wins.

Use instead: A hosted inference API.

Agentsia Labs · Released

Brand-safety, bid shading, and MFA classification under a 100ms auction envelope.

Assay-Ad tech is the current focus. Released corpus v1.8.0-rc.4 with current-hash production frontier baselines for Claude Opus 4.7, Gemini 3.1 Pro Preview, and GPT-5.5.

Read on labs.agentsia.uk

The evidence

Every promoted specialist ships with a private evidence bundle.

Your team can inspect the eval report, latency histogram, cost-per-decision evidence, lineage record, approver chain, and rollback contract before a specialist moves forward.
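A reviewer could check that a bundle is complete before reading it. The sketch below assumes the bundle is a simple mapping keyed by artifact name. The key names mirror the list above, but they are our own labels, not a published schema.

```python
# Artifact names taken from the list above; the exact keys are hypothetical.
REQUIRED_ARTIFACTS = (
    "eval_report",
    "latency_histogram",
    "cost_per_decision",
    "lineage_record",
    "approver_chain",
    "rollback_contract",
)


def missing_artifacts(bundle: dict) -> list[str]:
    """List the required artifacts that are absent or empty in the bundle."""
    return [name for name in REQUIRED_ARTIFACTS if not bundle.get(name)]
```

An empty result means every artifact is present. It says nothing about whether the contents are acceptable, which remains a human decision.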

Agentsia Labs publishes public methodology. Customer data, customer evals, prompts, completions, weights, adapters, and customer-specific results stay private.

See the Agentsia Labs methodology and reference leaderboards

Pricing

Published licence tiers. No contact-sales gate.

Annual licence, fixed-fee post-training programme, and an honest account of the customer-controlled infrastructure your team buys directly.

Tier A · Starter

from £60,000/yr

1-3 specialists

Your first specialist wedge. Annual licence covers Modelsmith, the iterate engine, eval framework, and support.

Tier B · Growth

from £150,000/yr

4-8 specialists

Adjacent specialists that share infrastructure and compounding operational patterns.

Tier C · Enterprise

priced on scope

8+ specialists

Multi-domain specialist fleet with deployment governance, support, and customer-controlled evidence boundaries. Pricing is negotiated per engagement.

First-year deployment shape

A typical first year starts with a bounded evaluation, then moves through specialist build, evidence handoff, and customer-controlled deployment.

Proof first. One specialist next. Expansion only after the first workflow clears the agreed bar.

01 Proof · Synthetic POC · Specialist-vs-frontier evaluation on one bounded workflow. · £25k credited
02 Design · Evaluation design · Governed scenarios, rubrics, baselines, and success criteria. · Fixed fee
03 Build · First specialist · Post-training campaign, evidence bundle, promotion handoff. · Year 1 licence
04 Deploy · Deployment boundary · Hardware, private cloud, or chosen substrate under your control. · Customer-owned
05 Scale · Expansion · Adjacent specialists sharing the same eval and governance pattern. · Tier B/C

Professional Services

Specialisation Kickstart: £100,000 bundled, £155,000 à la carte.

Five priced phases: eval design, scenario generation, connector configuration, first training campaign, production-validation handoff. The bundle is reserved for design partners.

Hardware

Customer-owned, from £3,500 per unit.

Nvidia consumer-grade hardware, owned by the customer. Extra units shorten training time for the same specialist fleet.

Synthetic POC

£25,000, 3 to 4 weeks. A specialist-vs-frontier head-to-head on your vertical, seeded from publicly available domain knowledge. Paid upfront and credited in full against your Year 1 licence on conversion.

Design Partner programme

Adtech design-partner cohort. £225,000 Year 1 (£125,000 Tier B licence + £100,000 Specialisation Kickstart). Reverts to Tier B list from Year 2.
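For readers who want to check the arithmetic, the sketch below compares the design-partner Year 1 figure with the published list figures on this page. Tier B is quoted as a "from" price, so the list total is a floor and the difference is a minimum. Whether the Synthetic POC credit also applies to the design-partner licence is not stated here, so it is left out.

```python
# Figures quoted on this page, in GBP.
TIER_B_LIST_FROM = 150_000        # Tier B "from" price, so a lower bound on list
KICKSTART_A_LA_CARTE = 155_000    # Specialisation Kickstart, à la carte
DESIGN_PARTNER_LICENCE = 125_000  # Tier B licence at the design-partner rate
KICKSTART_BUNDLED = 100_000       # Specialisation Kickstart, bundled


def year_one_totals() -> dict:
    """Year 1 cost for a design partner versus the minimum list-price equivalent."""
    design_partner = DESIGN_PARTNER_LICENCE + KICKSTART_BUNDLED
    list_floor = TIER_B_LIST_FROM + KICKSTART_A_LA_CARTE
    return {
        "design_partner": design_partner,
        "list_floor": list_floor,
        "minimum_difference": list_floor - design_partner,
    }
```

On these figures the design-partner total is £225,000 against a list floor of £305,000, a difference of at least £80,000 in Year 1.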

Trust & security

We publish what is in place and what is in progress.

Regulated buyers in adtech, fintech, and health ask the same security questions. We answer them with a living roadmap rather than a compliance badge.

In place today

Zero-trust networking. All platform services run on an authenticated mesh. No internet-exposed endpoints.
Short-lived JWT authentication with Argon2 password hashing.
Zod schema validation on every API endpoint.
HTTP hardening with Helmet headers and a strict CORS allowlist.
Dedicated SSRF-prevention module covering URL validation, internal-IP blocking, and checks on followed redirects.
UK-GDPR consent middleware on all sensitive data paths.
Gitleaks secret scanning in CI on platform repositories.
Published security.txt and a responsible-disclosure contact.
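To make the SSRF item concrete, here is a minimal sketch of the general technique: validate the URL scheme, resolve the host, reject any address that is not globally routable, and apply the same check to every hop of a redirect chain. It is an illustration using Python's standard library, not the platform's module. The function names and the injected `resolve` callable are hypothetical.

```python
import ipaddress
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}


def is_public_ip(address: str) -> bool:
    """True only for globally routable addresses (rejects loopback, private, link-local)."""
    return ipaddress.ip_address(address).is_global


def check_url(url: str, resolve) -> bool:
    """Validate one URL. `resolve` maps a hostname to a list of IP address strings."""
    try:
        parts = urlsplit(url)
        host = parts.hostname
        if parts.scheme not in ALLOWED_SCHEMES or not host:
            return False
        addresses = resolve(host)
        # Every resolved address must be public; an empty result is treated as a failure.
        return bool(addresses) and all(is_public_ip(a) for a in addresses)
    except (ValueError, OSError):
        # Malformed URL, unparseable address, or resolver error: fail closed.
        return False


def check_redirect_chain(urls, resolve, max_hops: int = 5) -> bool:
    """Every hop must pass the same check, and the chain must stay within max_hops redirects."""
    return len(urls) <= max_hops + 1 and all(check_url(u, resolve) for u in urls)
```

Passing the resolver in keeps the sketch testable without network access. In a real client the resolver would typically wrap `socket.getaddrinfo`, and the connection would be made to the validated address so that a second DNS lookup cannot return a different one.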

In progress

SOC 2. Vendor selection Q3 2026, Type I audit Q4 2026, Type II target H1 2027.
ISO 27001. Gap analysis Q4 2026, certification target H2 2027.
UK-GDPR full programme. DPIA, Article 6 lawful basis, Article 17 right-to-erasure. Target Q2-Q3 2026.
First third-party penetration test, Q4 2026.
Gitleaks and Dependabot rolled out organisation-wide by Q2 2026.

Customer data never leaves customer hardware. The monitoring dashboard runs on-premise.

Agentsia never sees your models, training state, or deployment data. Licence validation is the only connectivity touchpoint, and it only happens at promotion time.

Start here

Three ways to evaluate Agentsia.

The paths differ in commitment. The outcome is the same: one bounded workflow on which a specialist model is evaluated against a frontier baseline.

Book a discovery call

Thirty minutes. Bring one bounded workflow, one latency or cost constraint, and the rubric your team already trusts.

Book a call

Read the Agentsia Labs evals

Independent domain-specific evals for commercial workflows that public leaderboards do not measure. Open methodology, published datasets, reproducible scores.

Visit Labs

Apply for design-partner access

The adtech design-partner cohort is for teams ready to onboard Modelsmith against a real workflow.

Apply

Adtech design-partner cohort open for a small number of deployments.