Opening Agentsia Labs
Independent benchmarks for the commercial verticals no leaderboard tests. Why we are building a research surface, what it publishes, and the cadence we are promising.
SWE-bench Verified, MMLU-Pro, GPQA Diamond, HLE, HumanEval, AIME. Every frontier release arrives with scores on some subset of those six. The scores are how the field discusses progress, how research groups compare generations, and how model selection gets negotiated inside sophisticated buying organisations. They are not, however, how most commercial work gets done.
None of those six benchmarks tests adtech decisioning. None tests credit risk, clinical triage, contract interpretation, or in-vehicle perception. A CTO making a build-versus-buy call on a narrow vertical workflow has no neutral public reference to consult. They end up working from vendor case studies, an internal trial that goes stale the week a new frontier version ships, and instinct.
| Frontier leaderboards measure | Commercial workflows need |
|---|---|
| SWE-bench Verified (agentic coding) | Brand-safety decisions inside the RTB auction window |
| MMLU-Pro (graduate multi-subject knowledge) | Credit-risk classification with auditable reasoning |
| GPQA Diamond (graduate biology / physics / chemistry) | Clinical triage against a hospital's own protocol |
| HLE (Humanity's Last Exam, cross-domain expert questions) | Statutory interpretation in a specific jurisdiction |
| HumanEval (isolated Python tasks) | ADAS edge cases on a specific vehicle platform |
| AIME (competition mathematics) | Fraud detection on a specific customer-base pattern |
Agentsia Labs publishes what is missing: open, reproducible benchmarks on commercial verticals, refreshed quarterly, retested when frontier models ship, and produced end-to-end by the same platform we license to regulated enterprises.
What Labs publishes
Each Agentsia Labs release is an Assay: a named, versioned benchmark for a specific commercial vertical. The inaugural release is Assay-Adtech v1, targeted for Q2 2026.
| Assay | Target | Status |
|---|---|---|
| Assay-Adtech v1 | Q2 2026 | Inaugural release; panel recruitment in progress |
| Assay-Fintech v1 | Q3 2026 | Committed; rubric design starts once adtech ships |
| Assay-Legaltech v1 | 2027 | In design; rubric requires jurisdictional scoping |
| Assay-Health v1 | 2027 | In design; panel recruitment needs clinical reviewers |
| Assay-Auto v1 | 2027 | In design; multimodal evaluator work to scope |
Every Assay release ships with five artefacts:
- A methodology in prose, co-signed by at least two invited volunteer reviewers who are practising operators in that vertical.
- The dataset under Apache 2.0, with versioned git tags, downloadable in full.
- The scoring harness as a public repository under Apache 2.0, runnable against any provider.
- The raw per-model outputs in JSON.
- The leaderboard: composite plus per-axis scores with variance reported across sampling seeds.
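To make the last two artefacts concrete, here is a minimal sketch of how per-axis scores and their variance across sampling seeds could be aggregated from the raw per-model JSON. The file layout, the `scores` and `seed` field names, and the unweighted composite are illustrative assumptions, not the published Assay schema.

```python
import json
import statistics
from pathlib import Path

def summarise(path: Path) -> dict:
    """Aggregate one model's raw outputs into per-axis means, variance
    across sampling seeds, and a composite. Schema is hypothetical."""
    records = json.loads(path.read_text())  # one record per (scenario, seed)
    by_axis: dict = {}
    for rec in records:
        for axis, score in rec["scores"].items():
            by_axis.setdefault(axis, {}).setdefault(rec["seed"], []).append(score)

    summary = {}
    for axis, by_seed in by_axis.items():
        # Mean per seed first, then variance across seed means: the
        # "variance across sampling seeds" the leaderboard reports.
        seed_means = [statistics.fmean(scores) for scores in by_seed.values()]
        summary[axis] = {
            "mean": statistics.fmean(seed_means),
            "seed_variance": statistics.variance(seed_means) if len(seed_means) > 1 else 0.0,
        }
    # Unweighted mean of axis means; a real Assay defines its own weighting.
    summary["composite"] = statistics.fmean(v["mean"] for v in summary.values())
    return summary

print(summarise(Path("outputs/model.json")))
```

Because the dataset, harness, and outputs are all public under Apache 2.0, anyone can run this kind of aggregation themselves and check the leaderboard arithmetic.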
The model set per release is deliberate. Current-generation frontier APIs (Claude, Gemini, GPT) at pinned versions. Open-weights families (Nemotron, Qwen, Gemma, and where applicable Llama, DeepSeek, Mistral) at best-of-breed sizes. Agentsia post-trained specialists, labelled distinctly. Readers see the delta between an out-of-the-box open-weights model and one that has been post-trained through Modelsmith on the vertical's workflow. That delta is the moat claim Agentsia stakes its commercial posture on, and it should be measurable in public.
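As a sketch of what "pinned" and "labelled distinctly" could mean in practice, here is a hypothetical model-set manifest and the delta computation it implies. Every identifier, version string, and number below is a placeholder, not the actual release configuration or a real result.

```python
# Hypothetical model set for one Assay release; all names are placeholders.
MODEL_SET = {
    # Frontier APIs at pinned versions, so a rerun months later
    # evaluates exactly the same model.
    "frontier": ["claude-<pinned>", "gemini-<pinned>", "gpt-<pinned>"],
    "open_weights": ["nemotron-<size>", "qwen-<size>", "gemma-<size>"],
    # Post-trained specialists are labelled distinctly and paired with
    # the open-weights base they started from.
    "specialists": {"agentsia-adtech-<v>": {"base": "qwen-<size>"}},
}

def post_training_delta(composites: dict, specialist: str, base: str) -> float:
    """The delta the leaderboard makes public: specialist composite
    minus the out-of-the-box composite of its open-weights base."""
    return composites[specialist] - composites[base]

# Illustrative composites, not real results.
composites = {"agentsia-adtech-<v>": 0.78, "qwen-<size>": 0.61}
print(post_training_delta(composites, "agentsia-adtech-<v>", "qwen-<size>"))
```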
What Labs is not
Labs is not a dashboard demo. It is not a marketing microsite dressed up as research. It is not a playground for vendor-flavoured prose.
It is also not a claim to peer-reviewed authority yet. The panels are invited volunteer practitioners, not formal academic reviewers. The methodology is versioned and transparent, but it is not peer-reviewed in the NeurIPS sense. We will submit flagship datasets to Datasets and Benchmarks tracks when they are mature enough, and update the language on the surface when that happens.
The cadence
Every active vertical is refreshed once per quarter with the current frontier versions, new open-weights releases in scope, and rubric corrections from the prior release. Quarterly dates get published on the vertical page three weeks in advance.
Frontier-release retest SLA
When a new frontier model ships with a material capability delta, we retest the affected vertical within two weeks and publish a delta note. We do not wait for the next quarterly slot.
The announced cadence creates external accountability. It also sets a minimum pace the underlying infrastructure has to sustain, which keeps the harness honest.
Why we are building this
The narrow commercial reason: Agentsia sells a platform for post-training domain-specialist language models on customer hardware. The harder a buyer looks, the more value they derive from independent, reproducible numbers they can show their own board. Labs is the receipts. Every release is a real end-to-end run through Modelsmith: synthetic scenarios out of the data-generation agent, evaluations through the same harness practitioners run on their own workflows, post-trained specialists produced through the same promotion state machine. The platform is sold by its published outputs.
The broader reason: public discourse on AI progress gets shaped by the benchmarks that are published. When those benchmarks do not include commercial reality, the discourse drifts. Adtech, fintech, legaltech, clinical workflows, and automotive agents are collectively responsible for a large share of what enterprise AI actually does. They deserve a reference.
What happens next
Over the next six to eight weeks we convene the volunteer panel for Assay-Adtech v1, finalise the rubric, run the first round of evaluations, and publish. The announcement goes out as Assay-Adtech v1 lands, with the full leaderboard, the dataset, the harness release, and a companion essay.
Between now and then, the Labs surface itself is live. The methodology primer is readable. The public roadmap is posted. The harness repository is open on GitHub. If you want to be on the list for the release, or have a vertical you think deserves an Assay and is missing from the roadmap, write to ammar@agentsia.uk.