Modelsmith · agent-native specialist model factory
Encode your deepest domain experts into language models that outperform Claude and GPT.
Your AdOps, legal, or account-management lead authors the benchmark. The platform iterates against it, calls in a frontier council of Claude, Gemini, and GPT at every blind spot, and ships a small domain-specialist language model that beats them on your benchmark.
You own the benchmark, the weights, and the hardware they run on.
Customer data, evals, prompts, weights, and promotion records remain under customer control.
The argument
If the benchmark does not exist, frontier AI cannot win it.
Provider leaderboards do not test your auctions, exchange integrations, publisher context, pacing logic, floor-price rules, or fraud patterns. Modelsmith turns that private adtech signal into the tests, training data, and release gates the public market will not produce.
Your marketplace defines the test.
Trained on the auction you operate.
Create the benchmark
The first product surface is the governed eval set the public market does not have.
Train on private signal
Failures, rubrics, logs, and golden standards become post-training signal inside the customer boundary.
Release only when it clears
A specialist only moves forward when private tests, latency, cost, and rollback checks clear the gate.
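As a rough illustration of how such a gate might be expressed, here is a minimal Python sketch. The class names, field names, and thresholds are all hypothetical and are not Modelsmith's actual interface. The sketch only shows that promotion is an all-checks-must-pass decision over private eval score, latency, cost, and rollback readiness.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class GateThresholds:
    # Hypothetical limits, chosen for illustration only.
    min_eval_score: float         # minimum pass rate on the private eval set
    max_p95_latency_ms: float     # latency budget
    max_cost_per_decision: float  # cost budget, in any consistent currency unit


@dataclass(frozen=True)
class CandidateReport:
    eval_score: float
    p95_latency_ms: float
    cost_per_decision: float
    rollback_verified: bool       # rollback to the previous accepted model was rehearsed


def failed_checks(report: CandidateReport, gate: GateThresholds) -> list[str]:
    """Return the names of failed checks; an empty list means the gate clears."""
    failures = []
    if report.eval_score < gate.min_eval_score:
        failures.append("private_eval")
    if report.p95_latency_ms > gate.max_p95_latency_ms:
        failures.append("latency")
    if report.cost_per_decision > gate.max_cost_per_decision:
        failures.append("cost")
    if not report.rollback_verified:
        failures.append("rollback")
    return failures


def can_promote(report: CandidateReport, gate: GateThresholds) -> bool:
    """A candidate moves forward only when every check clears."""
    return not failed_checks(report, gate)


gate = GateThresholds(min_eval_score=0.90, max_p95_latency_ms=250.0,
                      max_cost_per_decision=0.002)
candidate = CandidateReport(eval_score=0.93, p95_latency_ms=310.0,
                            cost_per_decision=0.0015, rollback_verified=True)
print(failed_checks(candidate, gate))  # ['latency']
```

Returning the list of failed checks, rather than a bare yes or no, keeps the reason for a blocked promotion available for the evidence record.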
The benchmark you never had
Modelsmith's first output is your benchmark: dozens of scenarios across blocklist management, marketplace inefficiencies, supply-fill optimisation, and trust & compliance enforcement, each capturing how a senior domain expert would handle a synthetic situation. The model trains against the benchmark and generalises in ways frontier LLMs cannot replicate.
Your governed eval set, your rubric, and your golden standards are versioned alongside the specialist they train. The model is promoted only when it clears the gates.
private eval · training signal · promotion gate
The product category
Built for agents to operate and humans to govern.
Agent frameworks assemble behaviour. Inference vendors optimise execution. Between them sits the operating layer that agents can safely use to evaluate, experiment, adapt, promote, roll back, and deploy specialist models.
Agentsia is that operating layer. Modelsmith post-trains private specialist AI models from open-weight foundations, evaluates them against governed domain-specific evals, and packages promotion evidence for the customer's chosen substrate.
Modelsmith complements vLLM, Groq, Fireworks, LangGraph, and CrewAI. It governs the model lifecycle those layers do not own: evals, experiments, post-training, accepted state, rollback, and lineage.
Agent frameworks and applications
Assemble agent behaviour, product logic, and the workflow the customer wants to automate.
LangGraph, CrewAI-style orchestration, product workflows
Agentsia / Modelsmith agent harness
Runs domain-specific evals, controlled experiments, post-training, tool/runtime constraints, promotion gates, rollback, lineage, and evidence bundles.
Observe, evaluate, experiment, adapt, promote, deploy, monitor
Inference and training substrate
Executes training and serving on the customer-chosen or customer-controlled infrastructure.
vLLM, Groq, Fireworks, private cloud, on-premise, edge
inference vendors · agent frameworks · fine-tuning APIs · your chosen substrate
Honest comparison
How Modelsmith sits next to the tools you already know.
Three classes of adjacent tooling come up in every buyer conversation. The comparison below separates direct alternatives from tools that sit before or after Modelsmith in the stack.
Serving providers are often confused with Modelsmith. Groq, Fireworks, and Modal run weights. Modelsmith produces them. Many customers use both.
Determined AI, Ludwig
With an open-source framework, you bring the MLOps team and the governance practice, and you assemble the lifecycle yourself. Modelsmith ships those packaged: eval gating, lineage, approver chain, and rollback.
Together.ai
A hosted platform with fine-tuning and inference under one roof. Modelsmith differs on three axes: agent-native operating model (a domain expert drives the platform through agentic coding tools), eval-gated promotion (the model only ships when it beats the operator-authored benchmark), and owned weights running inside the customer's own substrate.
Groq, Fireworks, Modal
Serving vendors run the weights that Modelsmith produces. Customers usually choose that layer after the specialist is trained, based on latency, cost, and sovereignty requirements.
Excluded here: fine-tuning APIs from frontier model vendors. Anthropic does not offer fine-tuning. OpenAI's fine-tuning is available only on prior model generations. Neither qualifies as a live alternative for the work Modelsmith targets.
Honest disqualification
When Modelsmith is the wrong tool.
Specialist-model factories pay off on a specific shape of work: a stable domain, a human who can author the benchmark, and a substrate the customer controls. If your job is not that shape, the simpler tool is usually the right one.
We would rather lose the call than land a customer who would be better served by a frontier API or a vector database.
You need an internal chatbot.
A general-purpose conversational assistant is exactly what frontier APIs are built for. Modelsmith would be overkill, slower to set up, and more expensive than calling Claude or GPT directly.
Use instead: Anthropic, OpenAI, Google.
Your task is well-served by retrieval over your docs.
Use retrieval when a frontier API can answer from your document index. Modelsmith is for cases where private signals need to change model behaviour.
Use instead: Vector DB plus a frontier API.
You have no domain expert who can sit with the platform.
The benchmark a domain expert authors is the platform's first output and the asset every subsequent model is graded against. Without that human authorship, there is nothing for the platform to train towards.
Use instead: Hire or assign a domain expert first.
The definition of "good" changes every week.
Modelsmith trains against a stable benchmark. Domains where the success criteria churn faster than the model can retrain are a poor fit. The benchmark itself becomes the bottleneck.
Use instead: Iterate on the rubric with a frontier API until the definition stabilises.
You want one model to do everything.
Modelsmith ships specialist models, each graded against a domain-specific eval. A generalist that answers calendar questions, writes marketing copy, and prices ad inventory is the opposite of what the platform produces.
Use instead: A general-purpose frontier API.
You cannot run the model inside your own boundary.
Ownership of weights and the substrate they run on is half the value proposition. If a hosted inference API meets your compliance, latency, and cost requirements without that ownership, the simpler architecture wins.
Use instead: A hosted inference API.
The evidence
Every promoted specialist ships with a private evidence bundle.
Your team can inspect the eval report, latency histogram, cost-per-decision evidence, lineage record, approver chain, and rollback contract before a specialist moves forward.
Agentsia Labs publishes public methodology. Customer data, customer evals, prompts, completions, weights, adapters, and customer-specific results stay private.
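One way to picture the evidence bundle is as a manifest that must be complete before review. The Python sketch below is an assumption-laden illustration: the `EvidenceBundle` class and `missing_artefacts` helper are hypothetical names, and the string values stand in for whatever file paths or record IDs a real bundle would hold. Only the six artefact names come from the text above.

```python
from __future__ import annotations

from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class EvidenceBundle:
    # One field per artefact named in the text; values are placeholder references.
    eval_report: Optional[str] = None
    latency_histogram: Optional[str] = None
    cost_per_decision: Optional[str] = None
    lineage_record: Optional[str] = None
    approver_chain: Optional[str] = None
    rollback_contract: Optional[str] = None


def missing_artefacts(bundle: EvidenceBundle) -> list[str]:
    """List the artefacts that are still absent, in declaration order."""
    return [f.name for f in fields(bundle) if not getattr(bundle, f.name)]


draft = EvidenceBundle(eval_report="eval-report.json",
                       lineage_record="lineage.json")
print(missing_artefacts(draft))
# ['latency_histogram', 'cost_per_decision', 'approver_chain', 'rollback_contract']
```

A reviewer could refuse to open a promotion request until `missing_artefacts` returns an empty list.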
Pricing
Published licence tiers. No contact-sales gate.
Annual licence, fixed-fee post-training programme, and an honest account of the customer-controlled infrastructure your team buys directly.
Your first specialist wedge. Annual licence covers Modelsmith, the iterate engine, eval framework, and support.
Adjacent specialists that share infrastructure and compounding operational patterns.
Multi-domain specialist fleet with deployment governance, support, and customer-controlled evidence boundaries. Pricing is negotiated per engagement.
First-year deployment shape
A typical first year starts with a bounded evaluation, then moves through specialist build, evidence handoff, and customer-controlled deployment.
Proof first. One specialist next. Expansion only after the first workflow clears the agreed bar.
- 01 Proof
Synthetic POC
Specialist-vs-frontier evaluation on one bounded workflow.
£25k credited
- 02 Design
Evaluation design
Governed scenarios, rubrics, baselines, and success criteria.
Fixed fee
- 03 Build
First specialist
Post-training campaign, evidence bundle, promotion handoff.
Year 1 licence
- 04 Deploy
Deployment boundary
Hardware, private cloud, or chosen substrate under your control.
Customer-owned
- 05 Scale
Expansion
Adjacent specialists sharing the same eval and governance pattern.
Tier B/C
Professional Services
Specialisation Kickstart: £100,000 bundled, £155,000 à la carte.
Five priced phases: eval design, scenario generation, connector configuration, first training campaign, production-validation handoff. The bundle is reserved for design partners.
Hardware
Customer-owned, from £3,500 per unit.
Nvidia consumer-grade hardware, owned by the customer. Extra units shorten training time for the same specialist fleet.
Synthetic POC
£25,000, 3 to 4 weeks. A specialist-vs-frontier head-to-head on your vertical, seeded from publicly available domain knowledge. Paid upfront and credited in full against your Year 1 licence on conversion.
Design Partner programme
Adtech design-partner cohort. £225,000 Year 1 (£125,000 Tier B licence + £100,000 Specialisation Kickstart). Reverts to Tier B list from Year 2.
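The published figures combine as follows. This is a worked-arithmetic sketch in Python using only the numbers stated above. The variable names are illustrative, and the final line assumes the Synthetic POC credit is applied to the design-partner licence, which the text does not state explicitly.

```python
# All amounts in GBP, taken from the pricing text.
DESIGN_PARTNER_TIER_B_LICENCE = 125_000
KICKSTART_BUNDLED = 100_000
KICKSTART_A_LA_CARTE = 155_000
SYNTHETIC_POC = 25_000

# Design Partner programme, Year 1: licence plus bundled Kickstart.
design_partner_year_1 = DESIGN_PARTNER_TIER_B_LICENCE + KICKSTART_BUNDLED

# Difference between buying the five Kickstart phases separately and bundled.
bundle_saving = KICKSTART_A_LA_CARTE - KICKSTART_BUNDLED

# Assumption: the POC fee, credited in full on conversion, offsets this licence.
licence_after_poc_credit = DESIGN_PARTNER_TIER_B_LICENCE - SYNTHETIC_POC

print(design_partner_year_1)     # 225000
print(bundle_saving)             # 55000
print(licence_after_poc_credit)  # 100000
```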
Trust & security
We publish what is in place and what is in progress.
Regulated buyers in adtech, fintech, and health ask the same security questions. We answer them with a living roadmap rather than a compliance badge.
In place today
- Zero-trust networking. All platform services run on an authenticated mesh. No internet-exposed endpoints.
- Short-lived JWT authentication with Argon2 password hashing.
- Zod schema validation on every API endpoint.
- HTTP hardening with Helmet headers and a strict CORS allowlist.
- Dedicated SSRF-prevention module covering URL validation, internal-IP blocking, and redirect following.
- UK-GDPR consent middleware on all sensitive data paths.
- Gitleaks secret scanning in CI on platform repositories.
- Published security.txt and a responsible-disclosure contact.
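For readers who want a concrete picture of the SSRF item in the list above, here is a minimal Python sketch of URL validation, internal-IP blocking, and per-hop redirect checking. It is not the platform's module: the function names, the scheme allowlist, and the `max_hops` limit are assumptions made for illustration, and only standard-library calls are used.

```python
from __future__ import annotations

import ipaddress
import socket
from urllib.parse import urlsplit

ALLOWED_SCHEMES = {"http", "https"}  # hypothetical allowlist


def _is_internal(ip_text: str) -> bool:
    """True for addresses that should never be reachable from a fetcher."""
    ip = ipaddress.ip_address(ip_text)
    return (ip.is_private or ip.is_loopback or ip.is_link_local
            or ip.is_reserved or ip.is_multicast or ip.is_unspecified)


def resolve_host(host: str) -> list[str]:
    """Default resolver: every address the hostname currently maps to."""
    return sorted({info[4][0] for info in socket.getaddrinfo(host, None)})


def is_url_allowed(url: str, resolver=resolve_host) -> bool:
    """Reject bad schemes, unparseable URLs, and hosts resolving to internal IPs."""
    try:
        parts = urlsplit(url)
        host = parts.hostname
    except ValueError:
        return False
    if parts.scheme not in ALLOWED_SCHEMES or not host:
        return False
    try:
        addresses = resolver(host)
    except OSError:
        return False  # unresolvable hosts are refused
    return bool(addresses) and not any(_is_internal(a) for a in addresses)


def is_redirect_chain_allowed(hops: list[str], resolver=resolve_host,
                              max_hops: int = 5) -> bool:
    """Every hop of a redirect chain must pass the same check."""
    return len(hops) <= max_hops and all(is_url_allowed(u, resolver) for u in hops)
```

The resolver is injectable so the check can be exercised without network access. A production fetcher would also need to connect to the exact address that was validated, since a second DNS lookup at request time could return a different answer.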
In progress
- SOC 2. Vendor selection Q3 2026, Type I audit Q4 2026, Type II target H1 2027.
- ISO 27001. Gap analysis Q4 2026, certification target H2 2027.
- UK-GDPR full programme. DPIA, lawful basis under Articles 5(1)(a) and 6, Article 17 right-to-erasure. Target Q2-Q3 2026.
- First third-party penetration test, Q4 2026.
- Gitleaks and Dependabot rolled out organisation-wide by Q2 2026.
Customer data never leaves customer hardware. The monitoring dashboard runs on-premise.
Agentsia never sees your models, training state, or deployment data. Licence validation is the only connectivity touchpoint, and it only happens at promotion time.
Start here
Three ways to evaluate Agentsia.
The paths differ in commitment. The outcome is the same: a specialist model evaluated against a frontier baseline on one bounded workflow.
Book a discovery call
Thirty minutes. Bring one bounded workflow, one latency or cost constraint, and the rubric your team already trusts.
Read the Agentsia Labs evals
Independent domain-specific evals for commercial workflows that public leaderboards do not measure. Open methodology, published datasets, reproducible scores.
Apply for design-partner access
The adtech design-partner cohort is for teams ready to onboard Modelsmith against a real workflow.
Adtech design-partner cohort open for a small number of deployments.