How fast can you start?

A Discovery sprint runs 2 weeks and ends with architecture + execution plan. A pilot in real production lands in 4 to 8 weeks. Full systems in 3 to 6 months. We discuss timelines case by case before committing.

How does pricing work?

Discovery is fixed price (clean way to validate fit). Pilot and Production are scoped. Fractional CTO is a monthly retainer. Engagements in BRL for Brazil-based clients or USD for international. No open-ended hourly billing, no open-ended scope.

Does my data leave my infrastructure?

Not unless you want it to. We do on-prem deployments, dedicated instances in your VPC, or sovereign models running in Brazil. LGPD / GDPR aware by default. We frequently run with models via Bedrock or Vertex AI inside your own cloud account.

What stack do you work with?

Anthropic Claude and OpenAI for LLMs. Vector stores (pgvector, Pinecone, Turbopuffer). Orchestration via LangGraph, Inngest, or custom agentic loops. Next.js / React for frontends. Postgres, MongoDB, or whatever you already run for data. MCP for client integration (Claude Desktop, Cursor, etc). No stack religion.

And shipping to production?

Vercel, Cloudflare Workers, AWS, GCP, Railway, Fly, or bare metal. We handle infrastructure as code (Terraform), CI/CD, observability (logs, metrics, evals), and cost. The agent you see at /demo is a worked example: public stack, predictable cost, nightly reset via cron.

What happens after delivery?

Your call: full handoff (documented and we walk away), monthly support retainer (SLA + on-call), or Fractional CTO continuing as a senior virtual team. No lock-in.

How we cut Claude inference cost 60% with model routing

Version 1 of our Acme Store demo used Sonnet 4.5 for everything. Every conversation was a series of Sonnet calls: to reason, to pick a tool, to interpret the result, to synthesize the final answer.

It cost a lot. Not in absolute dollars (the demo is small), but as a signal: if we ran like this in production for a client with volume, the bill would climb fast.

We refactored the agent to route between models. Today most traffic uses Haiku 4.5; only the final synthesis uses Sonnet 4.5 when it's worth it. Cost per conversation dropped ~60%. Perceived UX stayed THE SAME or slightly better (Haiku is faster).

This post is about how.

What changes when you route

A typical agent conversation about data has 3-5 model calls:

Initial decision: "Which tool do I call?" - usually trivial
SQL generation: the model writes the SELECT
Result interpretation: looking at the 10 rows that came back
Final answer synthesis: explaining to the user

Steps 1, 2, and 3 work great on Haiku. They're relatively mechanical tasks. Haiku 4.5 doesn't fall behind Sonnet at a SELECT when properly instructed via system prompt.

Step 4 - the synthesis - is where Sonnet makes a real difference. When you want elegant paragraphs, tone adapted to the user, comparisons with context. There it pays off.

The rule: Haiku in the tool loop, Sonnet only when you need quality prose. Which is usually the last step.

How to decide which model to use

The rough method we use:

Use Haiku when:
- The task is classification / structured extraction
- The task is generating code that another system will execute
- The output will be PARSED, not READ
- The context fits in under 8k tokens

Use Sonnet when:
- The output goes straight to a human reader
- You need multi-step reasoning within the SAME turn
- You need prose quality
- Long context (>20k tokens) with cross-references

In practice, in an agent with tool use, this means:

Tool loop (decide tool + assemble arguments + interpret return) = Haiku
Final synthesis to the user = Sonnet
Simple cases (one-line answers, no heavy data) = Haiku does everything

Implementation

Vercel's AI SDK doesn't have a native "model router," but you can do it manually. Pseudo-code:

// Pass 1: Haiku decides + runs tools
const toolResult = await generateText({
  model: anthropic('claude-haiku-4-5'),
  messages,
  tools,
  stopWhen: ({ steps }) => steps.length >= 6,
})

// Pass 2: Sonnet writes the final response based on tool outputs
const finalResult = await streamText({
  model: anthropic('claude-sonnet-4-5'),
  messages: [
    ...messages,
    { role: 'assistant', content: toolResult.text },
    { role: 'user', content: 'Summarize this for the user in clear, friendly prose.' }
  ],
})

In practice, we don't always run the second pass. For simple questions, Haiku's output is already fine. Sonnet kicks in only when the tool call count was high, or the result has multiple rows that need narrative explanation.

Prompt caching: the second win

Another cost cut: Anthropic's prompt caching. Our agent's system prompt is ~3000 tokens (schema description, behavior rules, examples). Without caching, those 3000 tokens hit FULL PRICE on every call.

With caching enabled:

{
  role: 'system',
  content: [
    {
      type: 'text',
      text: SYSTEM_PROMPT,
      cache_control: { type: 'ephemeral' }
    }
  ]
}

Anthropic charges ~10% of normal price for cached tokens. In multi-turn conversations, the system prompt is read multiple times; in all turns except the first, it costs 10%. In long conversations, this dominates the math.

Combined with Haiku/Sonnet routing, the bill dropped from ~$0.012 per conversation to ~$0.005. Sixty percent reduction, same UX (actually slightly faster).

When NOT to route

Routing adds complexity. There are cases where it's not worth it:

Low volume: if you run 100 conversations a month, the absolute savings are small. Use Sonnet for everything, save engineering time.
Latency-critical: every model switch is one more HTTP call. For UX under 500ms, stick to one model.
Response must be perfect every time: financial, medical, legal product. There "Haiku is almost as good" doesn't cut it. Sonnet always.

Most medium-volume B2B cases land in the routing sweet spot.

Demo numbers

Real stats from the public demo (Acme Store) over the last month:

Conversations: ~400
Average model calls per conversation: 3.2
Cost per conversation before routing (Sonnet everywhere): ~$0.012
Cost per conversation today (Haiku in the loop, Sonnet only on synthesis when triggered): ~$0.005
Perceived latency: slightly lower (Haiku streams faster)
Perceived quality: equal on tested cases, better in cases where Haiku is more concise

In client production the savings compound hard. 60% × 10x the volume × 12 months = the year's AI budget.

If you're running agents in production and never looked at the cost breakdown per model, it's worth a conversation. Inference cost optimization is one of the first places we step into for teams already running AI.