System · Accepted state
Approach · 04 · docs/modelsmith-product-strategy.md

Seven pillars, in compounding order.

What is built, what is designed, and what remains. Honest status, updated with the last commit.

Modelsmith is strongest when all seven pillars operate together. Any one can be copied in part; the interaction compounds. These are the load-bearing claims of the product, written with enough precision that you can hold us to them.

Status key

Strong · built and load-bearing
Good · built, with open gaps
Designed · ADR written
Not started

Pillar I
Strong

Fully autonomous specialisation

Manual fixes are architecture bugs. Every routine improvement cycle runs without intervention.

The iterate loop runs a closed cycle — evaluate, diagnose, train, re-evaluate — with zero manual intervention for routine improvement. You set targets and review novel failure modes. Modelsmith handles the rest.
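
The closed cycle described above can be sketched in a few lines. This is a minimal illustration, not Modelsmith's implementation: `iterate_once`, `CycleResult`, and the stub callables are hypothetical names, and the rollback rule (keep the previous adapter whenever the re-evaluated score drops) is an assumed policy.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CycleResult:
    adapter: str        # adapter kept after the cycle
    score: float        # eval score of the kept adapter
    rolled_back: bool   # True if the trained candidate regressed


def iterate_once(
    adapter: str,
    evaluate: Callable[[str], float],
    diagnose: Callable[[str], List[str]],
    train: Callable[[str, List[str]], str],
) -> CycleResult:
    """Run one evaluate -> diagnose -> train -> re-evaluate cycle."""
    baseline = evaluate(adapter)
    failures = diagnose(adapter)
    candidate = train(adapter, failures)
    new_score = evaluate(candidate)
    if new_score < baseline:
        # Assumed policy: on regression, keep the previous adapter.
        return CycleResult(adapter, baseline, rolled_back=True)
    return CycleResult(candidate, new_score, rolled_back=False)


# Stub callables standing in for the real eval and training stages.
scores = {"v0": 0.60, "v1": 0.72}
result = iterate_once(
    "v0",
    evaluate=lambda a: scores[a],
    diagnose=lambda a: ["tool-call formatting"],
    train=lambda a, failures: "v1",
)
```

In this toy run the candidate improves on the baseline, so it is kept; a candidate that scored lower would leave the original adapter in place.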

For genuinely novel failures Modelsmith cannot classify, it packages everything needed for diagnosis into a structured escalation artefact. Your agentic coding tool is the first responder. You approve the proposed fix.
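
One way to picture a structured escalation artefact is as a small, serialisable record. The sketch below is illustrative only: every field name is an assumption made for the example, and the real artefact format may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class EscalationArtefact:
    """Hypothetical escalation record; field names are illustrative."""
    model: str
    iteration: int
    failing_scenarios: List[str]
    classifier_verdict: str = "unclassified"
    log_excerpts: List[str] = field(default_factory=list)
    proposed_fix: Optional[str] = None  # filled in by the agentic coding tool
    approved: bool = False              # set only by a human reviewer


def to_json(artefact: EscalationArtefact) -> str:
    """Serialise the artefact so an agent can consume it."""
    return json.dumps(asdict(artefact), indent=2, sort_keys=True)


artefact = EscalationArtefact(
    model="example-specialist",
    iteration=12,
    failing_scenarios=["scenario-031"],
    log_excerpts=["reward collapsed after step 400"],
)
payload = to_json(artefact)
```

Keeping `proposed_fix` and `approved` as separate fields mirrors the split in the text: the agent proposes, you approve.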

Capabilities

  • Eval → train → re-eval engine
  • Failure classification + auto-recovery
  • SFT warmup for cold-start models
  • vLLM self-managing lifecycle
  • Automatic adapter rollback on regression
  • Adaptive hyperparameter scheduling
  • Automatic held-out set rotation
  • Automatic scenario proposal from failures
  • Structured escalation artefact format

Pillar II
Strong

Agent-first operating model

Two surfaces, one truth. Machine-readable state and an executive view that shows the same accepted state.

Every serious operation is expressible as a stable, inspectable, automatable workflow your agent can run and your team can govern.

The operational surface is the system of record: machine-readable state, structured logs, CLI, MCP tools, REST API, and promotion workflows. The executive surface is a dashboard that makes model health, ROI, and fleet status legible to non-technical reviewers. Both resolve to the same accepted state.
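
A minimal sketch of "two surfaces, one truth": both views are pure functions of a single state object, so they cannot disagree. The state layout, model names, and function names are assumptions for illustration.

```python
import json

# Hypothetical accepted state; the real schema is not shown in this document.
ACCEPTED_STATE = {
    "models": [
        {"name": "specialist-a", "state": "production-accepted", "eval_score": 0.91},
        {"name": "specialist-b", "state": "canary", "eval_score": 0.84},
    ]
}


def operational_view(state: dict) -> str:
    """Machine-readable surface: the state itself, serialised as JSON."""
    return json.dumps(state, sort_keys=True)


def executive_view(state: dict) -> str:
    """Human-readable surface: a summary derived from the same state."""
    models = state["models"]
    in_prod = sum(m["state"] == "production-accepted" for m in models)
    return f"{in_prod} of {len(models)} models in production"
```

Because neither view stores anything of its own, a change to the accepted state shows up in both at once.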

Capabilities

  • Machine-readable fleet state (JSON)
  • REST API for training operations
  • Config-driven model onboarding
  • Pool-based training host selection
  • Agent-executable runbooks
  • MCP server for agent tool access
  • Watchdog REST API
  • Executive ROI view

Pillar III
Strong

Domain specialist factory

Continuously produce low-latency specialist models that beat frontier models on your commercial workflows.

The proof is concrete: generate a synthetic eval suite from publicly available domain knowledge, run it against frontier models and a Modelsmith specialist, and demonstrate five compounding advantages.

RAG and trained weights are complementary. The specialist handles stable domain knowledge — decision logic, taxonomy, workflow patterns. RAG handles live or volatile knowledge. We reduce your reliance on RAG for the stable layer. We do not replace it.

Capabilities

  • Health domain eval suite (77 scenarios)
  • Adtech domain eval suite (117 scenarios)
  • Expert-per-context LoRA training
  • Head-to-head frontier benchmarking
  • GRPO with partial rubric credit
  • Production latency benchmarks
  • Synthetic proof-of-value demo tooling
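
To make "GRPO with partial rubric credit" concrete, here is a hedged sketch. The reward is the weighted share of rubric criteria a response satisfies, and the advantage is the group-relative normalisation commonly used in GRPO (reward minus group mean, divided by group standard deviation). The rubric names, weights, and the choice of population standard deviation are assumptions; this is not a description of Modelsmith's training code.

```python
from statistics import mean, pstdev
from typing import Dict, List


def rubric_reward(passed: Dict[str, bool], weights: Dict[str, float]) -> float:
    """Partial credit: weighted fraction of rubric criteria satisfied."""
    total = sum(weights.values())
    earned = sum(w for name, w in weights.items() if passed.get(name, False))
    return earned / total


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalise each reward against its sampling group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rubric: the first criterion counts three times as much.
weights = {"correct_decision": 3.0, "cites_policy": 1.0}
reward = rubric_reward({"correct_decision": True, "cites_policy": False}, weights)
advantages = group_advantages([1.0, 0.0])
```

A response that meets only some criteria still earns a graded reward, which gives the policy a learning signal before it can pass a rubric in full.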

Pillar IV
Good

Promotion and governance control plane

Every promotion produces an evidence bundle. You approve judgement gates; agents execute mechanical transitions.

Full lifecycle governance with explicit, auditable, reversible state transitions. Candidate → eval-accepted → shadow → canary → production-accepted → deprecated.
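
The six-state lifecycle above can be sketched as a small validator. The state names come from the text; the transition rules are assumptions made for the example (one step forward at a time, one step back as rollback, and human approval required to enter canary or production-accepted).

```python
from enum import Enum


class State(Enum):
    CANDIDATE = "candidate"
    EVAL_ACCEPTED = "eval-accepted"
    SHADOW = "shadow"
    CANARY = "canary"
    PRODUCTION_ACCEPTED = "production-accepted"
    DEPRECATED = "deprecated"


ORDER = list(State)  # Enum iteration preserves definition order

# Assumed judgement gates: entering these states needs human approval.
JUDGEMENT_GATES = {State.CANARY, State.PRODUCTION_ACCEPTED}


def is_allowed(current: State, target: State, approved: bool = False) -> bool:
    """Validate a single transition under the assumed rules."""
    step = ORDER.index(target) - ORDER.index(current)
    if step == -1:
        return True  # rollback by one state is always permitted
    if step == 1:
        return approved or target not in JUDGEMENT_GATES
    return False     # no skipping states, no no-op transitions
```

Mechanical transitions (for example candidate to eval-accepted) pass without approval here, while gated ones return `False` until `approved=True` is supplied.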

Modelsmith governs state and supplies the deployment artefact. You control shadow and canary in your own infrastructure. Modelsmith provides the specialist package, the evidence bundle, and the rollback contract.

Capabilities

  • Promotion state machine (6 states)
  • Transition validation + approval gates
  • Formal promotion records with evidence
  • Automatic rollback on regression
  • Eval suite versioning
  • Packaged deployment artefact format
  • Human approval UI in dashboard

Pillar V
Strong

Config-driven model lifecycle

Adding a new model requires zero new files. A single JSON entry is the complete specification.

A single JSON entry in clusters.json is the complete model specification: inference settings, training hyperparameters, LoRA targets, quantisation strategy, cluster assignments. Every downstream artefact is derived from it at runtime.
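
As an illustration of "derived at runtime", the sketch below parses one hypothetical entry and builds a vLLM launch command from it. The key names and values are invented for the example and are not the real `clusters.json` schema; only the `vllm serve` flags `--max-model-len` and `--quantization` are taken from vLLM itself.

```python
import json
from typing import List

# Hypothetical entry; key names are assumptions, not the real schema.
ENTRY = json.loads("""
{
  "name": "example-specialist",
  "base_model": "example-org/example-8b",
  "inference": {"max_model_len": 8192},
  "training": {"learning_rate": 1e-5, "epochs": 1},
  "lora_targets": ["q_proj", "v_proj"],
  "quantisation": "awq",
  "clusters": ["health"]
}
""")

REQUIRED_KEYS = {
    "name", "base_model", "inference", "training",
    "lora_targets", "quantisation", "clusters",
}


def validate(entry: dict) -> None:
    """Reject entries that omit any section of the specification."""
    missing = REQUIRED_KEYS - entry.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")


def vllm_command(entry: dict) -> List[str]:
    """Derive the inference launch command from the entry alone."""
    validate(entry)
    return [
        "vllm", "serve", entry["base_model"],
        "--max-model-len", str(entry["inference"]["max_model_len"]),
        "--quantization", entry["quantisation"],
    ]
```

The point of the design is that nothing downstream is hand-written per model: change the entry and the derived command changes with it.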

All vLLM start scripts delegate to a unified start-model.sh. All iterate wrappers delegate to a single CLI. Shell scripts become a compatibility layer, not the product surface.

Capabilities

  • Model profiles in clusters.json
  • Unified vLLM start script (17/17 delegating)
  • Config-driven iterate wrappers
  • Docker Compose base + extends
  • Schema validation in CI
  • Smoke test on onboard
  • `modelsmith model add` CLI command

Pillar VI
Good

Compounding knowledge moat

Eval scenarios, safety nets, rubrics, and golden standards compound over time. Federated opt-in across deployments.

Each iteration compounds the platform's advantage. The accumulated domain corpus — evals, safety nets, rubrics, golden standards — takes years to build and cannot be replicated cheaply.

The eval set is governed through a two-layer architecture: the governed layer (immutable without your approval) and the expansion layer (scenario variants auto-proposed from persistent failures, staged for review). Cross-deployment federated patterns let opted-in deployments benefit from aggregate platform knowledge.
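
The two-layer split can be sketched as two collections with one controlled path between them. Class, method, and scenario names are hypothetical; the only behaviour taken from the text is that proposals land in the expansion layer and reach the governed layer only with approval.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class EvalSet:
    governed: Dict[str, str] = field(default_factory=dict)   # approved scenarios
    expansion: Dict[str, str] = field(default_factory=dict)  # staged proposals

    def propose(self, scenario_id: str, text: str) -> None:
        """Auto-proposed variants are staged, never written to governed."""
        self.expansion[scenario_id] = text

    def approve(self, scenario_id: str, approver: str) -> None:
        """Move a staged scenario into the governed layer on approval."""
        if not approver:
            raise PermissionError("governed layer changes need an approver")
        self.governed[scenario_id] = self.expansion.pop(scenario_id)


evals = EvalSet(governed={"s-001": "baseline triage scenario"})
evals.propose("s-001-v2", "variant derived from a persistent failure")
staged_only = "s-001-v2" not in evals.governed
evals.approve("s-001-v2", approver="reviewer@example.com")
```

Until `approve` is called, the governed layer is unchanged, so automated proposal cannot alter the scores you are held to.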

Capabilities

  • Eval scenario accumulation (194+ scenarios)
  • Safety nets (58 deterministic checks)
  • Golden standard generation
  • Cross-iteration trend analysis
  • Automatic scenario proposal
  • Two-layer eval governance UI
  • Adtech domain KB pipeline
  • Federated opt-in pattern sharing

Pillar VII
Strong

Fleet intelligence and infrastructure

A fleet of compute, added by config. Training fails over transparently. Infrastructure self-heals.

Training hosts are treated as a pool, not as static cluster-to-host assignments. The iterate loop probes hosts, scores them by affinity and availability, and selects the best candidate. If a host goes offline, its clusters fail over transparently.
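
Pool-based selection can be sketched as "filter to reachable hosts, then rank". The host fields, names, and the ranking rule (affinity first, free memory as the tiebreaker) are assumptions for illustration; the real probe and scoring logic is not described here.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Host:
    name: str
    online: bool                 # result of a probe
    free_memory_gb: float        # availability signal (assumed)
    affinity: FrozenSet[str]     # clusters this host prefers (assumed)


def select_host(pool: List[Host], cluster: str) -> Host:
    """Pick the best online host for a cluster; raise if none is reachable."""
    candidates = [h for h in pool if h.online]
    if not candidates:
        raise RuntimeError("no training host available")
    # Rank by affinity first, then by free memory.
    return max(candidates, key=lambda h: (cluster in h.affinity, h.free_memory_gb))


pool = [
    Host("spark-1", True, 40.0, frozenset({"health"})),
    Host("spark-2", True, 90.0, frozenset({"adtech"})),
]
primary = select_host(pool, "health")

# Simulate spark-1 going offline: the same call now returns another host.
degraded = [Host("spark-1", False, 40.0, frozenset({"health"})), pool[1]]
fallback = select_host(degraded, "health")
```

Failover needs no special code path in this sketch: an offline host simply drops out of the candidate list on the next selection.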

A deterministic routing layer dispatches queries to the appropriate specialist based on classification rules. Routing is not the moat; the specialists are. That is why we keep the router stable, predictable, auditable, and debuggable.
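
A deterministic, config-driven router can be as simple as an ordered rule list where the first match wins. The rule format, labels, and model names below are hypothetical.

```python
from typing import Dict, List

# Hypothetical routing rules, as they might appear in config.
RULES: List[Dict[str, str]] = [
    {"if_label": "health", "route_to": "health-specialist"},
    {"if_label": "adtech", "route_to": "adtech-specialist"},
]
DEFAULT_ROUTE = "general-model"


def route(label: str, rules: List[Dict[str, str]] = RULES,
          default: str = DEFAULT_ROUTE) -> str:
    """Return the target for a classified query; first matching rule wins."""
    for rule in rules:
        if rule["if_label"] == label:
            return rule["route_to"]
    return default
```

The same label always yields the same target, and every decision can be traced to one rule, which is what keeps a router like this auditable.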

Capabilities

  • 4-Spark fleet with QSFP mesh
  • Pool-based host failover
  • Unified memory isolation
  • Per-Spark self-contained iterate
  • Scheduled DB maintenance cron
  • Idempotent bootstrap script
  • Deterministic routing layer (config rules)
  • `modelsmith fleet add` CLI command

On honesty

We publish the gaps.

Most enterprise AI pages promise the target state. We publish what is built, what is designed, and what is missing. This list is updated on every merged pull request. If a pillar drifts, you see it here before you see it in our sales deck.