Fine-tuning vs RAG: combine them

The framing that a team must pick between fine-tuning an open-weights model and building a RAG pipeline is wrong. The two techniques do different jobs. Here is when you need each, and how they fit together in a real production workflow.

Agentsia · 19 April 2026 · 6 min read

Every few months a team asks us the same question. They have a workflow that needs a language model. They can afford to build exactly one thing: should they fine-tune an open-weights model on their data, or set up a RAG pipeline against a frontier API? The vendor that has sold them on one option has usually told them the other is legacy.

Both halves of that framing are wrong. Fine-tuning and retrieval-augmented generation solve different problems. The right answer is almost always some combination of the two, and the specific combination depends on what the workflow actually looks like.

What each technique changes

| | Fine-tuning | RAG |
|---|---|---|
| Changes | Model weights | Prompt context |
| When applied | Training time | Inference time |
| Teaches | Skills, formats, decision patterns, domain vocabulary | Nothing |
| Supplies | Nothing new at query time | Facts, sources, current data |
| Update cadence | Quarterly retrain cycle | Index updates in minutes |
| Best at | Making the model reason in your institution's voice | Keeping the model's knowledge current |
| Blind spot | New facts after training cutoff | Reasoning about retrieved facts |
Fine-tuning changes behaviour; RAG changes context. The two are orthogonal and compose cleanly.

Fine-tuning changes model behaviour. Given enough well-labelled examples, a base model learns to produce outputs in a different shape: answering in a specific format, reasoning through a specific domain vocabulary, reaching decisions by a specific policy. What fine-tuning does not change is what the model knows. A model fine-tuned on last year's contracts still has last year's contract knowledge next year.

RAG changes model context. At inference time you retrieve the most relevant passages from a knowledge base and paste them into the prompt. The model answers with those passages in hand. The behaviour of the model is whatever the base model's behaviour is; RAG is pure context engineering. What RAG cannot do is teach the model how to reason about the retrieved passages. If the base model is poor at synthesising contract clauses into a recommendation, retrieving more clauses does not help.
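The "pure context engineering" point can be made concrete with a toy sketch. Everything here is illustrative: the `search` function is a deliberately naive word-overlap retriever standing in for a real vector store, and the knowledge-base passages are invented.

```python
def search(knowledge_base: list[str], query: str, k: int = 3) -> list[str]:
    """Toy lexical retriever: rank passages by query-word overlap."""
    terms = set(query.lower().split())
    return sorted(
        knowledge_base,
        key=lambda p: len(terms & set(p.lower().split())),
        reverse=True,
    )[:k]

def build_prompt(question: str, passages: list[str]) -> str:
    """RAG leaves the model alone; only the prompt context changes."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer using only the context above."
    )

kb = [
    "Clause 14 caps liability at twelve months of fees.",
    "Renewal notices must be sent 60 days before expiry.",
    "The 2026 price list supersedes all earlier quotes.",
]
prompt = build_prompt("What is the liability cap?", search(kb, "liability cap"))
```

The model sees current facts it was never trained on, but how well it synthesises them is entirely down to the base model's behaviour.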

Fine-tuning teaches skills. RAG supplies facts. They compose.

Three workflows, three recipes

Brand-safety classifier for programmatic advertising. The decision is narrow: given a URL and its surrounding context, should the bid proceed? The vocabulary is specific (IAB categories, brand blocklists, contextual signals). The correct answer on a given URL does not change week to week once the policy is set. Fine-tuning is the right tool. RAG is not. A 4-billion-parameter model post-trained on 5,000 labelled URLs and the relevant IAB taxonomy beats a frontier API doing the same task at a fraction of the cost per decision, inside the RTB auction window.

Support assistant for a complex enterprise product. The questions span hundreds of features; documentation updates weekly; customers care about source citations. Fine-tuning does not help here. The relevant behaviour (read a question, find the right doc, answer with citations) is what a frontier model already does well out of the box. What you need is fast, accurate retrieval over the current product docs. RAG is the right tool. Fine-tuning would waste the effort and freeze the knowledge.

Clinical decision support tool at a hospital network. The behaviour is specific (use the hospital's own triage protocol, respect local formulary, cite the relevant patient record). The facts are specific (the patient's history, current lab values, the formulary as of today). Neither fine-tuning nor RAG alone is sufficient. You fine-tune on the hospital's protocol and documentation norms to make the model reason in the institution's voice. You retrieve the patient's record and the current formulary at inference to keep the facts current. Both, layered.

The integration pattern in practice

A specialist model that combines both is straightforward enough to sketch. The fine-tuned model lives on the customer's hardware. Queries go through a thin retrieval step first: a vector store over the live knowledge base returns the three to five most relevant passages. Those passages are assembled into the prompt under a <context> block using the exact schema the model was fine-tuned to expect. The model answers.
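The retrieve-then-assemble step above can be sketched as follows. This is a hedged outline, not a specific API: the `vector_store.search` and `model.generate` calls and the `<passage>` schema are assumptions standing in for whatever the deployment actually uses.

```python
def assemble_prompt(question: str, passages: list[str]) -> str:
    """Wrap retrieved passages in the <context> schema the fine-tuned
    model was trained to expect."""
    body = "\n".join(
        f'<passage id="{i}">{p}</passage>' for i, p in enumerate(passages, 1)
    )
    return f"<context>\n{body}\n</context>\n\n{question}"

def answer(question: str, vector_store, model, k: int = 4) -> str:
    """Thin retrieval step, then a single call to the specialist model."""
    passages = vector_store.search(question, top_k=k)  # 3-5 most relevant
    return model.generate(assemble_prompt(question, passages))
```

The important property is that `assemble_prompt` is the single place the schema lives, so training data and inference traffic cannot drift apart.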

Two details matter. First, the fine-tuning dataset should include examples in the same <context> shape you will feed at inference time. A model that has never seen the retrieval format at training time gets confused at inference time; a model that has seen 500 examples of the format learns to use it as a source of facts rather than as general chatter. Second, the retrieval index updates independently of the model. Documentation changes weekly; the model retrains quarterly. The split keeps the facts current without paying the post-training cost every time a doc changes.
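What that first detail looks like in a training file: each supervised example carries the same <context> wrapper the model will see at inference. A minimal sketch, assuming a JSONL format with hypothetical `prompt`/`completion` field names and an invented passage.

```python
import json

# One supervised fine-tuning record that mirrors the inference-time
# <context> schema, so the model learns to treat the block as a source
# of facts rather than general chatter.
passages = ["Renewal notices must be sent 60 days before expiry."]
context = "\n".join(
    f'<passage id="{i}">{p}</passage>' for i, p in enumerate(passages, 1)
)
record = {
    "prompt": f"<context>\n{context}\n</context>\n\nWhen must we send the renewal notice?",
    "completion": "At least 60 days before expiry (passage 1).",
}
line = json.dumps(record)  # one line of the JSONL training file
```

A few hundred records in this shape are usually enough for the model to ground its answers in the block instead of its weights.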

When the mix is wrong

There are two common failure modes.

The first is fine-tuning on a problem that needs retrieval. Teams try to bake product knowledge into the weights and then discover that every documentation update requires a retrain. They would have been better off leaving the behaviour alone and retrieving the docs.

The second is retrieving for a problem that needs fine-tuning. Teams push longer and longer context windows against a frontier API and find the outputs drift in tone or format from what the workflow needs. They would have been better off fine-tuning on a few thousand examples of the desired output shape.

Where this lands for us

Our bet is that most commercial workflows are "both" workflows. Modelsmith treats post-training and retrieval as first-class citizens of the same pipeline. The same harness that evaluates a model's bid-shading judgement also evaluates whether it correctly cites a retrieved passage. The same promotion state machine that gates a post-training run gates changes to the retrieval index. A specialist model and its retrieval context ship together or not at all.

Buyers in regulated industries sometimes ask us to separate these, reasoning that they want to keep their retrieval pipeline in-house and only use Agentsia for post-training. That arrangement works. We think it leaves value on the table, but the integration is clean either way: post-train for behaviour, retrieve for facts, and keep them versioned together so that when either changes the evaluation runs again.

The only wrong answer is picking one because a vendor told you to.