Every RAG or LLM-adjacent project I run eventually asks the same question: does this call need a frontier model, or does it need to be cheap enough to run ten thousand times without thinking about it? I've settled on a two-tier answer: a self-hosted, multi-node Ollama cluster ("the fleet") for private, high-volume, non-user-facing work, and the hosted Anthropic Claude API for anything user-facing where output quality directly matters. This post is about the reasoning behind that split, not a claim that one tier replaces the other.
The setup
The fleet is a handful of Windows and Linux machines on my home network, each running Ollama. A small dispatcher sends a request to multiple hosts at once and takes whichever one responds first — first-response-wins, so a slow or busy box doesn't become the bottleneck for the whole job. This runs code review passes, RAG retrieval evals, and multimodal OCR. It does not run anything user-facing — that's a deliberate line, not an oversight. There's also a cloud fallback path for hitting the fleet from off my home network, gated behind signed tokens rather than exposing Ollama directly to the internet.
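The signed-token gate can be pictured with a small sketch. This is not the fleet's actual scheme: the token format (an expiry timestamp plus an HMAC-SHA256 signature), the shared secret, and the names signToken and verifyToken are all assumptions made for illustration, using Node's built-in crypto module.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical token format for this sketch: "<expiryEpochMs>.<hex HMAC-SHA256 of the expiry>".
function signToken(secret: string, expiresAtMs: number): string {
  const payload = String(expiresAtMs);
  const sig = createHmac("sha256", secret).update(payload).digest("hex");
  return `${payload}.${sig}`;
}

// Returns true only if the signature matches and the token has not expired.
function verifyToken(secret: string, token: string, nowMs: number = Date.now()): boolean {
  const parts = token.split(".");
  if (parts.length !== 2) return false;
  const [payload, sig] = parts as [string, string];

  const expected = createHmac("sha256", secret).update(payload).digest("hex");
  const given = Buffer.from(sig, "hex");
  const wanted = Buffer.from(expected, "hex");

  // timingSafeEqual throws on a length mismatch, so compare lengths first.
  if (given.length !== wanted.length || !timingSafeEqual(given, wanted)) return false;

  const expiresAtMs = Number(payload);
  return Number.isFinite(expiresAtMs) && nowMs < expiresAtMs;
}
```

A proxy in front of Ollama would call verifyToken on each incoming request and reject anything that fails, so the Ollama port itself never has to be reachable from the internet.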
The separate insurance-restoration platform (insurtech-beta.com) uses a proper RAG knowledge base: retrieval over a corpus of estimates, building codes, and supplement arguments, not just a chat wrapper. That is where the local/cloud split matters most, because retrieval and eval passes over that corpus are exactly the high-volume, non-user-facing work the fleet is for.
Why local wins on cost, why cloud wins on quality
The qualitative shape of the tradeoff is straightforward: a hosted frontier model charges per token and produces better output, consistently, on harder tasks. A self-hosted open-weight model has near-zero marginal cost per call once the hardware exists, but variable quality depending on task difficulty. If a task runs thousands of times and "good enough" is actually good enough, that arithmetic strongly favors local. If a task runs rarely but the output quality has a real consequence (a claim write-up going to an insurance carrier, a customer-facing answer), the arithmetic favors cloud regardless of per-call cost.
[PLACEHOLDER: needs real numbers — $/1M tokens for the specific Claude model in use vs. estimated $/call for local inference, and a rough break-even volume]
[PLACEHOLDER: needs real number — measured latency delta between fleet dispatch and hosted API call for a comparable task]
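The break-even arithmetic behind this section can be sketched without committing to any numbers. Everything below is an input you would measure or look up yourself; the type and function names are hypothetical, and localFixedPerMonth stands in for whatever you count as the fleet's recurring cost (power, amortized hardware).

```typescript
// All prices and volumes are inputs; nothing here is a measured number.
type TokenPrices = { inputPerMTok: number; outputPerMTok: number }; // dollars per 1M tokens
type CallShape = { inputTokens: number; outputTokens: number };

// Dollar cost of one hosted call at per-million-token pricing.
function hostedCostPerCall(call: CallShape, prices: TokenPrices): number {
  const inputCost = call.inputTokens * prices.inputPerMTok;
  const outputCost = call.outputTokens * prices.outputPerMTok;
  return (inputCost + outputCost) / 1_000_000;
}

// Monthly call volume above which the local tier is cheaper than the hosted one.
// Returns Infinity when a local call is not cheaper than a hosted call.
function breakEvenCallsPerMonth(
  localFixedPerMonth: number,
  hostedPerCall: number,
  localPerCall: number
): number {
  const savingPerCall = hostedPerCall - localPerCall;
  if (savingPerCall <= 0) return Infinity;
  return localFixedPerMonth / savingPerCall;
}
```

Below the break-even volume, the hosted API is cheaper on cost alone; above it, the fleet is, provided "good enough" output really is good enough for the task.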
Failure modes of each tier
Local (the fleet): the obvious failure mode is quality drift on nuanced tasks — an open-weight model can produce a confident, well-formatted, wrong answer on anything that requires real judgment, and it's harder to catch than an outright error because nothing about the response looks broken. [PLACEHOLDER: specific failure rate or example categories where local models underperformed, if measured]. There's also a fleet-availability failure mode: if every host is busy or offline, the dispatcher has nothing to fall back to unless a cloud path is explicitly wired in.
Cloud (Anthropic API): the main failure mode is cost blowup at scale — a task that's fine to run 50 times a day becomes a real line item at 50,000 times a day, and unlike local inference the cost doesn't disappear just because you're not thinking about it. [PLACEHOLDER: real monthly cloud API spend, if this becomes relevant to disclose]. There's also an external-dependency failure mode: hosted API outages or rate limits are outside your control in a way a local box on your own network isn't.
The dispatcher, roughly
This is illustrative, not the production code — it shows the shape of the "race across hosts, first response wins" pattern:
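type Host = { url: string; name: string };

async function dispatchToFleet(
  hosts: Host[],
  model: string,
  prompt: string,
  timeoutMs = 8000
): Promise<{ text: string; wonBy: string }> {
  // One controller shared by every request, so a single abort() cancels them all.
  const controller = new AbortController();

  const attempts = hosts.map(async (host) => {
    const res = await fetch(`${host.url}/api/generate`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      // Ollama requires a model name; stream: false returns a single JSON
      // object instead of newline-delimited chunks.
      body: JSON.stringify({ model, prompt, stream: false }),
      signal: controller.signal,
    });
    if (!res.ok) throw new Error(`${host.name} failed: ${res.status}`);
    const data = (await res.json()) as { response: string };
    return { text: data.response, wonBy: host.name };
  });

  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("fleet dispatch timed out")), timeoutMs);
  });

  try {
    // Promise.any resolves with the first host that succeeds and rejects only
    // once every host has failed. A bare Promise.race over the attempts would
    // reject as soon as the fastest host errored.
    return await Promise.race([Promise.any(attempts), timeout]);
  } finally {
    // On success this cancels the losers; on timeout it cancels everything.
    // Errors propagate, so the caller decides whether to fall back to a
    // hosted API here.
    clearTimeout(timer);
    controller.abort();
  }
}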
The important part isn't the code; it's the fallback decision at the bottom. What happens when every local host fails is a deliberate choice, not an afterthought. In my setup that's where a call to the hosted Claude API can slot in as a safety net for tasks that can tolerate the cost occasionally but shouldn't rely on it by default.
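One way to encode "occasionally, but not by default" is to cap how many times the hosted fallback may fire. The sketch below is an assumption about how that could look, not the production logic: withCappedFallback and maxFallbacks are hypothetical names, and local and hosted are any functions that take a prompt and return text. The counter never resets here; a real version would likely reset it per day or per job.

```typescript
type Generate = (prompt: string) => Promise<string>;
type Result = { text: string; tier: "local" | "hosted" };

// Wraps a local call with a hosted fallback that may be used at most maxFallbacks times.
function withCappedFallback(local: Generate, hosted: Generate, maxFallbacks: number) {
  let fallbacksUsed = 0;

  return async function generate(prompt: string): Promise<Result> {
    try {
      return { text: await local(prompt), tier: "local" };
    } catch (localErr) {
      // Budget spent: surface the local failure instead of silently paying for cloud calls.
      if (fallbacksUsed >= maxFallbacks) throw localErr;
      fallbacksUsed += 1;
      return { text: await hosted(prompt), tier: "hosted" };
    }
  };
}
```

Returning the tier alongside the text makes it easy to log how often the safety net is actually used, which is the signal that the fleet is failing more often than it should.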
Replicating this with one machine
You don't need a fleet to get the core benefit. The pattern that actually matters is the tiering, not the racing: run one Ollama instance locally, route high-volume/low-stakes calls to it, and route low-volume/high-stakes calls to a hosted API. Drop the Promise.race entirely — with one host there's nothing to race — and just call it directly with a timeout and a fallback to the hosted API on failure or timeout. The dispatcher above collapses to a single fetch with a try/catch around it. The fleet only pays for itself once you have enough concurrent volume that a single box becomes the bottleneck; below that, one machine plus a hosted fallback gets you the same cost/quality split with a fraction of the operational complexity.
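A minimal sketch of that single-machine version, under a few assumptions: the highStakes flag, the names makeOllamaGenerate and runTask, and the idea of passing the hosted call in as a plain function are all mine, chosen so the routing logic stays independent of any particular SDK. The timeout uses the standard AbortSignal.timeout helper.

```typescript
type GenerateFn = (prompt: string) => Promise<string>;
type Task = { prompt: string; highStakes: boolean };

// One Ollama host: a single fetch with a timeout, no race.
function makeOllamaGenerate(baseUrl: string, model: string, timeoutMs = 8000): GenerateFn {
  return async (prompt) => {
    const res = await fetch(`${baseUrl}/api/generate`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, prompt, stream: false }),
      signal: AbortSignal.timeout(timeoutMs), // rejects the fetch once the timeout elapses
    });
    if (!res.ok) throw new Error(`ollama failed: ${res.status}`);
    const data = (await res.json()) as { response: string };
    return data.response;
  };
}

// High-stakes tasks go straight to the hosted API; everything else tries the
// local box first and falls back to hosted on failure or timeout.
async function runTask(
  task: Task,
  local: GenerateFn,
  hosted: GenerateFn
): Promise<{ text: string; tier: "local" | "hosted" }> {
  if (task.highStakes) return { text: await hosted(task.prompt), tier: "hosted" };
  try {
    return { text: await local(task.prompt), tier: "local" };
  } catch {
    return { text: await hosted(task.prompt), tier: "hosted" };
  }
}
```

The hosted function is whatever wrapper you already have around the Claude API. Keeping it as a parameter also makes the routing trivial to test with stubs.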