How we cut Claude inference cost 60% with model routing

Haiku for the tool loop, Sonnet only for the final synthesis. Aggressive prompt caching. Real numbers from what changed in Zechim's demo cost.

by Jonatas Zechim

Version 1 of our Acme Store demo used Sonnet 4.5 for everything. Every conversation was a series of Sonnet calls: to reason, to pick a tool, to interpret the result, to synthesize the final answer.

It cost a lot. Not in absolute dollars (the demo is small), but as a signal: if we ran like this in production for a client with volume, the bill would climb fast.

We refactored the agent to route between models. Today most traffic uses Haiku 4.5; only the final synthesis uses Sonnet 4.5 when it's worth it. Cost per conversation dropped ~60%. Perceived UX stayed THE SAME or slightly better (Haiku is faster).

This post is about how.

What changes when you route

A typical agent conversation about data has 3-5 model calls:

  1. Initial decision: "Which tool do I call?" - usually trivial
  2. SQL generation: the model writes the SELECT
  3. Result interpretation: looking at the 10 rows that came back
  4. Final answer synthesis: explaining to the user

Steps 1, 2, and 3 work great on Haiku. They're relatively mechanical tasks. In our experience, Haiku 4.5 keeps up with Sonnet at writing a SELECT, as long as the system prompt spells out the schema and the rules.

Step 4, the synthesis, is where Sonnet makes a real difference: elegant paragraphs, tone adapted to the user, comparisons with context. That is where it pays off.

The rule: Haiku in the tool loop, Sonnet only when you need quality prose. Which is usually the last step.

How to decide which model to use

The rough method we use:

Use Haiku when:
- The task is classification / structured extraction
- The task is generating code that another system will execute
- The output will be PARSED, not READ
- The context fits in under 8k tokens

Use Sonnet when:
- The output goes straight to a human reader
- You need multi-step reasoning within the SAME turn
- You need prose quality
- Long context (>20k tokens) with cross-references

In practice, in an agent with tool use, this means:

  • Tool loop (decide tool + assemble arguments + interpret return) = Haiku
  • Final synthesis to the user = Sonnet
  • Simple cases (one-line answers, no heavy data) = Haiku does everything
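The checklist above can be sketched as a small routing function. This is a minimal illustration, not our production router: the type, field, and function names are hypothetical, and sending the 8k-20k token range (which the checklist leaves open) to Haiku is an assumption.

```typescript
// Sketch only: names are hypothetical and not part of the AI SDK.
type ModelId = 'claude-haiku-4-5' | 'claude-sonnet-4-5';

interface TaskProfile {
  needsProseQuality: boolean;       // output is read by a human and style matters
  needsMultiStepReasoning: boolean; // several reasoning steps within the same turn
  contextTokens: number;            // estimated prompt size, in tokens
}

// "Long context" threshold taken from the checklist.
const LONG_CONTEXT_TOKENS = 20_000;

function pickModel(task: TaskProfile): ModelId {
  // Long context with cross-references goes to Sonnet.
  if (task.contextTokens > LONG_CONTEXT_TOKENS) return 'claude-sonnet-4-5';
  // Prose for a human reader or same-turn multi-step reasoning goes to Sonnet.
  if (task.needsProseQuality || task.needsMultiStepReasoning) return 'claude-sonnet-4-5';
  // Classification, extraction, generated SQL, parsed output: Haiku.
  // Assumption: anything between 8k and 20k tokens also defaults to Haiku.
  return 'claude-haiku-4-5';
}

// A tool-loop step: small context, output parsed by code.
console.log(pickModel({ needsProseQuality: false, needsMultiStepReasoning: false, contextTokens: 3_000 }));
// The final synthesis for a human reader.
console.log(pickModel({ needsProseQuality: true, needsMultiStepReasoning: false, contextTokens: 6_000 }));
```

Keeping the decision in one pure function makes it easy to log which branch fired and to tune the thresholds against real traffic.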

Implementation

Vercel's AI SDK doesn't have a native "model router," but you can do it manually. Pseudo-code:

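import { generateText, streamText } from 'ai'
import { anthropic } from '@ai-sdk/anthropic'

// `messages` (the conversation so far) and `tools` (your tool definitions)
// come from the surrounding app.

// Pass 1: Haiku decides + runs tools, capped at 6 steps
const toolResult = await generateText({
  model: anthropic('claude-haiku-4-5'),
  messages,
  tools,
  stopWhen: ({ steps }) => steps.length >= 6,
})

// Pass 2: Sonnet writes the final response from Haiku's draft.
// toolResult.text is only Haiku's final text, so Haiku must restate the
// relevant tool output there; raw tool results are not forwarded.
// streamText returns its result object right away, so there is no await.
const finalResult = streamText({
  model: anthropic('claude-sonnet-4-5'),
  messages: [
    ...messages,
    { role: 'assistant', content: toolResult.text },
    { role: 'user', content: 'Summarize this for the user in clear, friendly prose.' }
  ],
})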

In practice, we don't always run the second pass. For simple questions, Haiku's output is already fine. Sonnet kicks in only when the tool call count was high, or the result has multiple rows that need narrative explanation.
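The "run the second pass or not" decision can be sketched as a predicate over what happened in the tool loop. The interface, the function name, and the cutoff of 3 tool calls are all hypothetical; the post only says "high", so pick a value from your own traces.

```typescript
// Sketch only: names and the cutoff below are illustrative assumptions.
interface ToolLoopSummary {
  toolCallCount: number;   // tool calls Haiku made during pass 1
  maxRowsReturned: number; // largest row count among the tool results
}

// Hypothetical threshold for "the tool call count was high".
const HIGH_TOOL_CALL_COUNT = 3;

function needsSonnetSynthesis(summary: ToolLoopSummary): boolean {
  const manyToolCalls = summary.toolCallCount >= HIGH_TOOL_CALL_COUNT;
  const multipleRows = summary.maxRowsReturned > 1;
  // Either signal suggests the answer needs narrative explanation.
  return manyToolCalls || multipleRows;
}

// One lookup returning a single row: Haiku's own answer is sent as-is.
console.log(needsSonnetSynthesis({ toolCallCount: 1, maxRowsReturned: 1 }));
// One query returning ten rows: hand the draft to Sonnet.
console.log(needsSonnetSynthesis({ toolCallCount: 1, maxRowsReturned: 10 }));
```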

Prompt caching: the second win

Another cost cut: Anthropic's prompt caching. Our agent's system prompt is ~3000 tokens (schema description, behavior rules, examples). Without caching, those 3000 tokens hit FULL PRICE on every call.

With caching enabled:

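{
  role: 'system',
  content: SYSTEM_PROMPT,
  // AI SDK shape: provider-specific options ride along on the message.
  // On the raw Anthropic API, the same marker is
  // `cache_control: { type: 'ephemeral' }` on a text block in the
  // top-level `system` array.
  providerOptions: {
    anthropic: { cacheControl: { type: 'ephemeral' } },
  },
}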

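Anthropic charges ~10% of the normal input price for tokens read from the cache. The call that writes the cache pays a premium instead (about 25% over the normal input price with the default 5-minute cache). In multi-turn conversations, the system prompt is read multiple times; on every call after the first it costs ~10%, as long as the calls land within the cache lifetime, which is refreshed on each hit. In long conversations, this dominates the math.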
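To make the caching math concrete, here is a rough sketch that counts how many "full-price token equivalents" the system prompt costs over a conversation. The function name is hypothetical. The multipliers (1.25x for the first call that writes the cache, 0.1x for later reads) are assumptions based on Anthropic's published pricing for the default 5-minute cache, and the sketch assumes every call hits the same warm cache on the same model.

```typescript
// Assumed multipliers relative to the normal input price.
const CACHE_WRITE_MULTIPLIER = 1.25; // first call writes the cache
const CACHE_READ_MULTIPLIER = 0.1;   // later calls read from it

// Returns the system prompt's cost expressed in full-price input tokens.
function billedSystemPromptTokens(promptTokens: number, calls: number, cached: boolean): number {
  if (calls <= 0) return 0;
  if (!cached) return promptTokens * calls;
  const writeCost = promptTokens * CACHE_WRITE_MULTIPLIER;
  const readCost = promptTokens * CACHE_READ_MULTIPLIER * (calls - 1);
  return writeCost + readCost;
}

// A 3000-token system prompt over a 3-call conversation.
const withoutCache = billedSystemPromptTokens(3000, 3, false);
const withCache = billedSystemPromptTokens(3000, 3, true);
console.log({ withoutCache, withCache, ratio: withCache / withoutCache });
```

Under these assumptions a single-call conversation is slightly more expensive with caching on, because it pays the write premium and never reads. The win comes from the second call onward.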

Combined with Haiku/Sonnet routing, the bill dropped from ~$0.012 per conversation to ~$0.005. That is a reduction of about 58%, call it sixty percent, with the same UX (actually slightly faster).

When NOT to route

Routing adds complexity. There are cases where it's not worth it:

  • Low volume: if you run 100 conversations a month, the absolute savings are small. Use Sonnet for everything, save engineering time.
  • Latency-critical: every model switch is one more HTTP call. If the UX needs a response in under 500ms, stick to one model.
  • Response must be perfect every time: financial, medical, or legal products. There, "Haiku is almost as good" doesn't cut it. Use Sonnet always.

Most medium-volume B2B cases land in the routing sweet spot.

Demo numbers

Real stats from the public demo (Acme Store) over the last month:

  • Conversations: ~400
  • Average model calls per conversation: 3.2
  • Cost per conversation before routing (Sonnet everywhere): ~$0.012
  • Cost per conversation today (Haiku in the loop, Sonnet only on synthesis when triggered): ~$0.005
  • Perceived latency: slightly lower (Haiku streams faster)
  • Perceived quality: equal on tested cases, better in cases where Haiku is more concise

In client production the savings compound hard. The same ~60% cut per conversation, at 10x the volume, every month for 12 months, becomes a real line in the year's AI budget.
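A quick way to check whether routing is worth it for your volume is to plug your own numbers into the arithmetic. The sketch below uses the demo's per-conversation costs. The function name is hypothetical, and the 100,000 conversations per month figure is an illustrative volume, not a client number.

```typescript
interface SavingsProjection {
  reduction: number;      // fraction of per-conversation cost removed
  monthlySavings: number; // dollars saved per month
  annualSavings: number;  // dollars saved over 12 months
}

function projectSavings(conversationsPerMonth: number, costBefore: number, costAfter: number): SavingsProjection {
  const savedPerConversation = costBefore - costAfter;
  const monthlySavings = savedPerConversation * conversationsPerMonth;
  return {
    reduction: savedPerConversation / costBefore,
    monthlySavings,
    annualSavings: monthlySavings * 12,
  };
}

// The public demo: ~400 conversations a month.
console.log(projectSavings(400, 0.012, 0.005));
// Hypothetical production volume, same per-conversation costs.
console.log(projectSavings(100_000, 0.012, 0.005));
```

At demo volume the absolute savings are a few dollars a month, which matches the "low volume" caveat above. The same ~58% reduction only becomes budget-relevant once conversation counts grow by orders of magnitude.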


If you're running agents in production and never looked at the cost breakdown per model, it's worth a conversation. Inference cost optimization is one of the first places we step into for teams already running AI.