Post-training at 3 a.m.: inside a closed-loop agent harness for open-weights models
Manual fine-tuning is a dead end for enterprise AI. We built Modelsmith as a closed-loop agent harness where evals, diagnosis, and post-training happen autonomously while humans stay at the gates.
The most expensive hour in any AI project is 3 a.m.
It is the hour when a lead engineer is staring at a terminal, comparing two evaluation runs that should be identical but aren't. They've spent the last six hours manually cleaning a dataset, relaunching a fine-tuning job, and waiting for the loss curves to converge, only to find that the new model has fixed a brand-safety bug but introduced a massive regression in bid-shading accuracy.
This manual "slog" of guess-and-check fine-tuning is the primary reason open-weights models fail to reach production in most enterprises. The operational burden of evaluating, tuning, and redeploying those weights is too high for a human-only team to sustain.
At Agentsia, we decided the human was the bottleneck. We built Modelsmith not just as a training platform, but as a closed-loop agent harness.
Autonomous iteration rate
>85%
Percentage of post-training cycles in Modelsmith that require zero human intervention until the final promotion gate.
From training to orchestration
In a traditional workflow, "training" is a verb. It is something a human does to a model. In the Modelsmith vision, training is a side effect of a governed orchestration loop. The goal is to move from manual experimentation to a system where agents handle the heavy lifting while humans remain "at the gates."
When we talk about a "closed-loop harness," we mean a system that can observe its own failures and take corrective action without waiting for a Jira ticket. This is the difference between a static model and a living specialist.
The loop: Eval, Diagnose, Post-train
The Modelsmith harness operates on a simple but rigorous closed-loop architecture. It starts with the assumption that no model is ever "finished." Instead, every model is in a constant state of being challenged by new data and evolving policies.
1. Eval: Pinning the scenario
Before a single weight is changed, we pin the model against private benchmarks. In the context of programmatic advertising, this means RTB (real-time bidding) scenario pinning. We don't just ask whether the model is "safe" in a general sense; we replay 10,000 real-world auction requests and measure the delta between its decisions and the ground-truth policy.
Scenario pinning is our version of unit testing for weights. If a 1B parameter model cannot correctly identify a "Made-for-Advertising" (MFA) site in under 10ms, it doesn't matter how high its MMLU score is. It has failed the pin and its Composite Score will reflect the regression.
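To make the idea concrete, here is a minimal sketch of what a scenario-pinning check might look like. The request log, the budget constants, and the `classify_slot` call are illustrative placeholders, not the Modelsmith API.

```python
import time

LATENCY_BUDGET_MS = 10       # per-request budget from the pin definition
MAX_DECISION_DELTA = 0.01    # tolerated disagreement with the ground-truth policy

def run_pin(model, pinned_requests):
    """Replay pinned auction requests and compare decisions to the ground-truth policy.

    `pinned_requests` is assumed to be a list of dicts holding the raw bid request
    and the decision the governed policy says is correct.
    """
    disagreements, slow_calls = 0, 0
    for req in pinned_requests:
        start = time.perf_counter()
        decision = model.classify_slot(req["bid_request"])  # hypothetical model call
        elapsed_ms = (time.perf_counter() - start) * 1000
        if decision != req["ground_truth_decision"]:
            disagreements += 1
        if elapsed_ms > LATENCY_BUDGET_MS:
            slow_calls += 1

    delta = disagreements / len(pinned_requests)
    passed = delta <= MAX_DECISION_DELTA and slow_calls == 0
    return {"decision_delta": delta, "slow_calls": slow_calls, "passed": passed}
```

A pin fails on either axis: too many disagreements with the policy, or any call that blows the latency budget.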
2. Diagnose: Agentic failure analysis
When a model fails a benchmark, we don't just dump a CSV of errors for a human to sift through. A diagnosis agent — usually a larger frontier model acting as a "critic" — ingests the failure cases and the model's internal reasoning traces.
The agentic critic might conclude: "The model is correctly identifying brand-safe content in standard news articles, but it is failing to reason about layout signals in the 'Verticals: Gaming' segment, leading to false negatives on high-utility ad slots." This level of granular diagnosis is what allows the loop to close effectively.
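A minimal sketch of how that critic pass might be framed is below. The `critic` client and the prompt shape are assumptions for illustration; the essential part is that the critic sees both the failing inputs and the specialist's own reasoning traces, and must return a structured hypothesis the harness can act on.

```python
import json

DIAGNOSIS_PROMPT = """You are a failure-analysis critic for a bid-path specialist model.
For each failing case you are given the input, the model's reasoning trace, its answer,
and the expected answer. Return a JSON object with:
  "failure_mode": one-sentence description of the common root cause,
  "affected_segment": the traffic segment the failures cluster in,
  "suggested_fix": what the post-training dataset should target.

Failing cases:
{cases}
"""

def diagnose(critic, failures):
    """Ask a larger frontier model (the critic) to explain a batch of pin failures."""
    cases = json.dumps(
        [
            {
                "input": f["bid_request"],
                "reasoning_trace": f["trace"],
                "model_answer": f["decision"],
                "expected_answer": f["ground_truth_decision"],
            }
            for f in failures
        ],
        indent=2,
    )
    # `critic.complete` is a stand-in for whatever client the frontier model exposes.
    raw = critic.complete(DIAGNOSIS_PROMPT.format(cases=cases))
    return json.loads(raw)
```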
3. Post-train: Targeted adapter generation
Instead of retraining the whole model from scratch, a process that is slow, expensive, and prone to catastrophic forgetting, Modelsmith triggers a post-training run using adapter-based techniques like LoRA or QLoRA.
The harness automatically synthesizes a high-precision dataset specifically designed to fix the diagnosed failure mode, often utilizing GRPO (Group Relative Policy Optimization) for alignment. It then trains a lightweight adapter that "patches" the model's behavior. Because each adapter is only a few megabytes, we can version, test, and swap them with surgical precision across multiple Clusters in parallel.
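As a rough sketch of the adapter step, here is how a lightweight LoRA patch can be attached with Hugging Face `peft`. The base checkpoint, ranks, target modules, and adapter path are illustrative, and the GRPO alignment pass (via a library such as `trl`) and the training loop itself are omitted for brevity.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load an open-weights base model (Llama 3, Mistral, Qwen, ...).
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

# A small adapter targeting the attention projections; ranks and targets are illustrative.
adapter_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, adapter_cfg)
model.print_trainable_parameters()  # typically well under 1% of the base weights

# ... fine-tune `model` on the synthesized failure-mode dataset here ...

# Saving writes only the adapter weights: a few megabytes that can be
# versioned, tested, and swapped independently of the frozen base model.
model.save_pretrained("adapters/mfa-layout-signals-v2")
```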
The Human at the Gates: Evidence-Bundle Generation
If the inner loop is autonomous, where does the human fit? The answer is in governance.
Modelsmith does not automatically deploy models to production. That would be irresponsible in a regulated enterprise environment. Instead, it generates an evidence bundle. This is a comprehensive package containing everything a human needs to make an informed decision (a code sketch of the bundle follows the list):
- The Failure Trace: The exact inputs that caused the previous version to fail.
- The Diagnosis Report: The critic agent's analysis of the root cause.
- Dataset Lineage: Cryptographic proof of which data was used to generate the new adapter.
- Benchmark Delta: A "Before vs. After" comparison showing the accuracy gain on the failed segment and the stability of unrelated segments.
- Integrity Proof: Verification that the training run occurred inside a secure, audited environment.
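As a minimal sketch, the bundle can be thought of as a single signed record. The field names and hashing scheme below are illustrative, not the Modelsmith schema.

```python
from dataclasses import dataclass, asdict
import hashlib
import json

@dataclass
class EvidenceBundle:
    """Everything a reviewer needs to approve or reject a promotion."""
    model_id: str
    adapter_version: str
    failure_trace: list       # exact inputs that broke the previous version
    diagnosis_report: dict    # critic agent's root-cause analysis
    dataset_lineage: dict     # content hashes of the synthesized training data
    benchmark_delta: dict     # before/after scores per pinned segment
    integrity_proof: str      # attestation from the audited training environment
    reviewer_signature: str = ""  # filled in only at the promotion gate

    def digest(self) -> str:
        """Content hash of the bundle, so a sign-off binds to exactly this evidence."""
        payload = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(payload).hexdigest()
```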
A human operator reviews the evidence bundle and signs off on the promotion. This preserves the "Human-in-the-Loop" (HITL) requirement while removing the "Human-is-the-Engine" bottleneck.
Why Open-Weights are Mandatory
The reason we focus this harness on open-weights models like Llama 3, Mistral, and Qwen is simple: you cannot modelsmith what you do not own.
To run a closed-loop harness, you need the ability to:
- Inspect Activations: Understand why a model reached a decision at the neuron level (see the sketch after this list).
- Attach Adapters: Hot-swap behaviors in real-time.
- Localize Inference: Run the model on your own hardware to hit the 100ms latency targets required by real-time auctions.
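As a small illustration of the first two points, open weights let you attach a plain PyTorch forward hook to read activations and swap LoRA adapters at serve time with `peft`. The layer path and adapter directories below are assumptions for illustration.

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# The checkpoint and adapter paths are illustrative.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

# 1. Inspect activations: register a forward hook on any layer you can name.
captured = {}

def capture(name):
    def hook(module, inputs, output):
        captured[name] = output.detach()
    return hook

base.model.layers[12].mlp.register_forward_hook(capture("layer12.mlp"))
# After a forward pass, captured["layer12.mlp"] holds that layer's activation tensor.

# 2. Attach adapters: load candidate patches and switch between them per request.
model = PeftModel.from_pretrained(base, "adapters/mfa-layout-signals-v1")
model.load_adapter("adapters/mfa-layout-signals-v2", adapter_name="v2")
model.set_adapter("v2")  # hot-swap behaviour without touching the base weights
```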
Proprietary APIs are black boxes. They are fine for prototyping, but they are a liability for core business logic. By bringing the modelsmithing process in-house, enterprises transform from "AI consumers" into "AI manufacturers."
Technical Deep Dive: Scenario Pinning for Adtech
We often compare scenario pinning to a "flight simulator" for LLMs. If you are building a private specialist for the adtech bid-path, the simulator must replicate the exact constraints of the auction: 10ms latency budgets, truncated context, and high-entropy input data.
By pinning these scenarios, the harness can detect "behavioral drift" that standard benchmarks like MMLU would miss. If a new post-training run makes the model 1% more accurate at math but 5% slower at classifying URLs, the harness flags it as a regression. In the bid-stream, 5ms of latency is often more expensive than a 1% accuracy drop.
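A minimal sketch of that trade-off expressed as a promotion gate is below. The relative weights on accuracy and latency are illustrative; the point is that, on the pinned bid-path, a millisecond of regression is priced more heavily than a point of accuracy gained elsewhere.

```python
# Illustrative pricing of the trade-off on the pinned bid-path.
COST_PER_MS_REGRESSION = 5.0     # relative cost of +1 ms p99 latency
VALUE_PER_ACCURACY_POINT = 1.0   # relative value of +1 pt accuracy

def promotion_gate(before, after):
    """Flag a candidate as a regression when latency losses outweigh accuracy gains.

    `before` and `after` are dicts like {"accuracy_pct": 94.0, "p99_latency_ms": 8.0},
    both measured on the same pinned RTB scenarios.
    """
    accuracy_gain = after["accuracy_pct"] - before["accuracy_pct"]
    latency_loss = max(after["p99_latency_ms"] - before["p99_latency_ms"], 0)
    score = accuracy_gain * VALUE_PER_ACCURACY_POINT - latency_loss * COST_PER_MS_REGRESSION
    return {"score": score, "regression": score < 0}

# Example: +1 pt accuracy but +5 ms p99 latency is flagged as a regression.
print(promotion_gate({"accuracy_pct": 94.0, "p99_latency_ms": 8.0},
                     {"accuracy_pct": 95.0, "p99_latency_ms": 13.0}))
```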
| Metric | Manual Slog | Modelsmith Harness |
|---|---|---|
| Time to detect regression | Days | Minutes |
| Iteration cadence | Weekly | Multi-daily |
| Human cost per model | High (PhD-heavy) | Low (Orchestrator-heavy) |
| Traceability | Fragmented | Full Evidence Bundle |
Conclusion: The end of the 3 a.m. hero
The "hero culture" of AI engineering — where success depends on one person staying up late to fix a model — is a sign of an immature industry. It is a fragile way to build critical infrastructure.
The future belongs to the teams that stop training models and start building harnesses. When the harness is doing the heavy lifting at 3 a.m., the humans can finally sleep. They aren't needed to run the forge; they are needed to judge the blade.
Ready to transition from manual fine-tuning to a closed-loop harness? Apply to the design-partner programme to start building your own fleet of private specialists.