Ward Tech Systems — Writing

RAG at real cost: a self-hosted Ollama fleet next to the Anthropic API

Why bulk retrieval and eval work runs on a home-network Ollama cluster, why user-facing generation stays on Claude, and a dispatcher that races both.

Every RAG or LLM-adjacent project I run eventually asks the same question: does this call need a frontier model, or does it need to be cheap enough to run ten thousand times without thinking about it? I've settled on a two-tier answer: a self-hosted, multi-node Ollama cluster ("the fleet") for private, high-volume, non-user-facing work, and the hosted Anthropic Claude API for anything user-facing where output quality directly matters. This post is about the reasoning behind that split, not a claim that one tier replaces the other.

The setup

The fleet is a handful of Windows and Linux machines on my home network, each running Ollama. A small dispatcher sends a request to multiple hosts at once and takes whichever one responds first — first-response-wins, so a slow or busy box doesn't become the bottleneck for the whole job. This runs code review passes, RAG retrieval evals, and multimodal OCR. It does not run anything user-facing — that's a deliberate line, not an oversight. There's also a cloud fallback path for hitting the fleet from off my home network, gated behind signed tokens rather than exposing Ollama directly to the internet.
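
The token format isn't described here, so the following is only a sketch of what "gated behind signed tokens" could look like. It assumes an HMAC-SHA256 signature over an expiry timestamp with a shared secret. The token layout and the names `signToken` and `verifyToken` are hypothetical, not the production gateway.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Hypothetical token layout: "<expiryUnixMs>.<hex HMAC-SHA256 of expiry>".
// The real gateway's format is not described in this post.
function signToken(secret: string, expiresAtMs: number): string {
  const payload = String(expiresAtMs);
  const sig = createHmac("sha256", secret).update(payload).digest("hex");
  return `${payload}.${sig}`;
}

function verifyToken(secret: string, token: string, nowMs = Date.now()): boolean {
  const parts = token.split(".");
  if (parts.length !== 2) return false;
  const [payload, sig] = parts;

  const expected = createHmac("sha256", secret).update(payload).digest("hex");
  const given = Buffer.from(sig, "hex");
  const wanted = Buffer.from(expected, "hex");

  // timingSafeEqual throws on a length mismatch, so check lengths first
  if (given.length !== wanted.length) return false;
  if (!timingSafeEqual(given, wanted)) return false;

  // Reject expired tokens even when the signature is valid
  const expiresAtMs = Number(payload);
  return Number.isFinite(expiresAtMs) && nowMs < expiresAtMs;
}
```

A gateway in front of the fleet would run a check like `verifyToken` on every off-network request and forward to Ollama only when it returns true.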

The separate insurance-restoration platform (insurtech-beta.com) uses a proper RAG knowledge base: retrieval over a corpus of estimates, building codes, and supplement arguments, not just a chat wrapper. That is the kind of workload where the local/cloud split matters most, because retrieval and eval passes over that corpus are exactly the high-volume, non-user-facing category the fleet is for.

Why local wins on cost, why cloud wins on quality

The qualitative shape of the tradeoff is straightforward: a hosted frontier model charges per token and produces better output, consistently, on harder tasks. A self-hosted open-weight model has near-zero marginal cost per call once the hardware exists, but variable quality depending on task difficulty. If a task runs thousands of times and "good enough" is actually good enough, that arithmetic strongly favors local. If a task runs rarely but the output quality has a real consequence (a claim write-up going to an insurance carrier, a customer-facing answer), the arithmetic favors cloud regardless of per-call cost.

[PLACEHOLDER: needs real numbers — $/1M tokens for the specific Claude model in use vs. estimated $/call for local inference, and a rough break-even volume] [PLACEHOLDER: needs real number — measured latency delta between fleet dispatch and hosted API call for a comparable task]
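
Until the real figures are filled in, the shape of the arithmetic can still be written down. This is a minimal sketch, assuming a flat per-token hosted price, a fixed monthly cost for the local hardware (power, amortization), and a near-zero marginal cost per local call. The type and function names are hypothetical, and every number in the example is a made-up placeholder, not a measured or published price.

```typescript
// All inputs are assumptions supplied by the caller; none are real prices.
type CostInputs = {
  hostedUsdPerMillionTokens: number; // blended input+output price (placeholder)
  tokensPerCall: number;             // average tokens billed per call
  localFixedUsdPerMonth: number;     // power plus hardware amortization
  localUsdPerCall: number;           // marginal local cost, often close to 0
};

function hostedUsdPerCall(c: CostInputs): number {
  return (c.tokensPerCall * c.hostedUsdPerMillionTokens) / 1_000_000;
}

// Monthly call volume above which local is cheaper than hosted.
// Returns Infinity if hosted is never more expensive per call.
function breakEvenCallsPerMonth(c: CostInputs): number {
  const savingPerCall = hostedUsdPerCall(c) - c.localUsdPerCall;
  if (savingPerCall <= 0) return Infinity;
  return c.localFixedUsdPerMonth / savingPerCall;
}

// Placeholder numbers only, chosen to make the arithmetic easy to follow:
// $0.02 per hosted call against $40/month of fixed local cost.
const example: CostInputs = {
  hostedUsdPerMillionTokens: 10,
  tokensPerCall: 2_000,
  localFixedUsdPerMonth: 40,
  localUsdPerCall: 0,
};
console.log(breakEvenCallsPerMonth(example)); // roughly 2000 calls per month
```

Below the break-even volume the hosted API is the cheaper option on cost alone. Above it, each additional local call widens the gap.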

Failure modes of each tier

Local (the fleet): the obvious failure mode is quality drift on nuanced tasks — an open-weight model can produce a confident, well-formatted, wrong answer on anything that requires real judgment, and it's harder to catch than an outright error because nothing about the response looks broken. [PLACEHOLDER: specific failure rate or example categories where local models underperformed, if measured]. There's also a fleet-availability failure mode: if every host is busy or offline, the dispatcher has nothing to fall back to unless a cloud path is explicitly wired in.

Cloud (Anthropic API): the main failure mode is cost blowup at scale — a task that's fine to run 50 times a day becomes a real line item at 50,000 times a day, and unlike local inference the cost doesn't disappear just because you're not thinking about it. [PLACEHOLDER: real monthly cloud API spend, if this becomes relevant to disclose]. There's also an external-dependency failure mode: hosted API outages or rate limits are outside your control in a way a local box on your own network isn't.

The dispatcher, roughly

This is illustrative, not the production code — it shows the shape of the "race across hosts, first response wins" pattern:

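type Host = { url: string; name: string };

async function dispatchToFleet(
  hosts: Host[],
  prompt: string,
  model: string, // Ollama model tag; must already be pulled on every host
  timeoutMs = 8000
): Promise<{ text: string; wonBy: string }> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  const attempts = hosts.map(async (host) => {
    const res = await fetch(`${host.url}/api/generate`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      // stream: false returns one JSON object instead of streamed chunks
      body: JSON.stringify({ model, prompt, stream: false }),
      signal: controller.signal,
    });
    if (!res.ok) throw new Error(`${host.name} failed: ${res.status}`);
    const data = await res.json();
    return { text: data.response as string, wonBy: host.name };
  });

  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error("fleet dispatch timed out")), timeoutMs);
  });

  try {
    // Promise.any resolves with the first host to succeed and ignores earlier
    // failures, so one offline box cannot sink the whole dispatch
    return await Promise.race([Promise.any(attempts), timeout]);
  } catch (err) {
    // all hosts failed or timed out: caller decides whether to
    // fall back to a hosted API here
    throw err;
  } finally {
    clearTimeout(timer);
    controller.abort(); // cancel the losers (or every request, on timeout)
  }
}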

The important part isn't the code, it's the fallback decision in the catch block. What happens when every local host fails is a deliberate choice, not an afterthought. In my setup that's where a call to the hosted Claude API can slot in as a safety net for tasks that can tolerate the cost occasionally but shouldn't rely on it by default.
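
One way to make that choice explicit is to pass it in per task instead of burying it in error handling. This is a minimal sketch, assuming the local and hosted calls are supplied as plain functions. `withFallback`, `allowHostedFallback`, and the `tier` field are hypothetical names, not the production dispatcher.

```typescript
type Generate = (prompt: string) => Promise<string>;
type Result = { text: string; tier: "local" | "hosted" };

// Try the local tier first; only spend money on the hosted tier when the
// task has explicitly opted in to fallback.
async function withFallback(
  prompt: string,
  local: Generate,
  hosted: Generate,
  allowHostedFallback: boolean
): Promise<Result> {
  try {
    return { text: await local(prompt), tier: "local" };
  } catch (localErr) {
    if (!allowHostedFallback) throw localErr; // fail loudly, no surprise spend
    return { text: await hosted(prompt), tier: "hosted" };
  }
}
```

Returning the tier alongside the text makes it easy to log how often the safety net is actually used, which is the number to watch if hosted spend starts creeping up.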

Replicating this with one machine

You don't need a fleet to get the core benefit. The pattern that actually matters is the tiering, not the racing: run one Ollama instance locally, route high-volume/low-stakes calls to it, and route low-volume/high-stakes calls to a hosted API. Drop the Promise.race entirely — with one host there's nothing to race — and just call it directly with a timeout and a fallback to the hosted API on failure or timeout. The dispatcher above collapses to a single fetch with a try/catch around it. The fleet only pays for itself once you have enough concurrent volume that a single box becomes the bottleneck; below that, one machine plus a hosted fallback gets you the same cost/quality split with a fraction of the operational complexity.
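
As a rough sketch of that collapsed version, assume a single Ollama instance on its default port and a hosted call supplied by the caller. `generateLocal`, `routeCall`, and the `highStakes` flag are illustrative names, and the model tag is whatever happens to be pulled locally. The request uses Ollama's `/api/generate` endpoint with `stream: false` so the reply arrives as one JSON object.

```typescript
type Task = { prompt: string; highStakes: boolean };
type Generate = (prompt: string) => Promise<string>;

// Single local call with a hard timeout; with one host there is nothing to race.
async function generateLocal(
  prompt: string,
  model: string,                      // placeholder tag for a locally pulled model
  baseUrl = "http://localhost:11434", // Ollama's default listen address
  timeoutMs = 8000,
  fetchImpl: typeof fetch = fetch     // injectable so this can be tested offline
): Promise<string> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetchImpl(`${baseUrl}/api/generate`, {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ model, prompt, stream: false }),
      signal: controller.signal,
    });
    if (!res.ok) throw new Error(`local model failed: ${res.status}`);
    const data = (await res.json()) as { response: string };
    return data.response;
  } finally {
    clearTimeout(timer);
  }
}

// High-stakes tasks go straight to the hosted tier; everything else tries
// local first and falls back to hosted on error or timeout.
async function routeCall(task: Task, local: Generate, hosted: Generate): Promise<string> {
  if (task.highStakes) return hosted(task.prompt);
  try {
    return await local(task.prompt);
  } catch {
    return hosted(task.prompt);
  }
}
```

Keeping the hosted call behind a plain function means the routing logic never needs to know which vendor SDK sits on the other side.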