4 min read · manifesto · data-engineering · ai-agents

Why we don't do AI without a data foundation first

Most AI projects don't fail because of the model. They fail because of the data. Why we insist on building the foundation before the agent.

by Jonatas Zechim

A company reaches out because they heard about AI. They want an agent. They want a dashboard that answers questions on its own. They want ChatGPT to "talk to our data."

The first question we ask is boring: where is the data?

The answer is almost always the same. CRM in one tool, ERP in another. Finance spreadsheet on a shared drive. Order history in a legacy table nobody wants to touch. Leads land in HubSpot, conversations live in WhatsApp, contracts in Google Drive, payments in Stripe or Asaas. Each one speaks its own language. Each one has its own notion of customer, product, order.

When we refuse to start with the agent, it's because we've seen what happens when this is ignored.

The AI POC graveyard

There's a pattern. Four steps:

Week 1. Team excited. Runs Claude on top of a CSV exported by hand. Works beautifully. Pretty demo for the executives.

Month 2. Decision to "ship to production." That's where the nightmare begins. Which source is canonical? The ERP, which lags a day? The CRM, where the same customer has a different name? The spreadsheet the sales team updates manually?

Month 4. The solution becomes a Frankenstein. A script that syncs X to Y at 3 AM. A manual view in Postgres. A "temporary" connector nobody understands. Now any change to the CRM schema breaks three things. Nobody wants to touch it.

Month 6. Project dies. "We don't have the bandwidth to maintain this." Polite version of: "we built a castle on quicksand."

The AI model, through all of this, was the easiest component. Swapping Sonnet for GPT is close to one line of code. Swapping the source of truth in a database is six months of migration.

The inversion we preach

Before plugging in an agent, you need an honest warehouse.

This doesn't mean a pompous data warehouse with Snowflake and four dedicated engineers. It means: a place where all the relevant data lives. Up to date. Modeled. With one name per entity. With tests that fail when a field that must never be null arrives empty.
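
A minimal sketch of what such tests can look like, in Python, with an in-memory SQLite database standing in for the warehouse. In a dbt project these would normally be declared as `not_null` and `unique` tests on the model; the table, the columns, and the helper names below are hypothetical, picked only for illustration.

```python
import sqlite3


def failing_rows(conn, table, column, check):
    """Count violations of one check, in the spirit of a dbt data test."""
    # Identifiers come from our own trusted config, never from user input.
    if check == "not_null":
        sql = f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
    elif check == "unique":
        # Counts the distinct values that appear more than once.
        sql = (
            f"SELECT COUNT(*) FROM ("
            f"SELECT {column} FROM {table} "
            f"WHERE {column} IS NOT NULL "
            f"GROUP BY {column} HAVING COUNT(*) > 1)"
        )
    else:
        raise ValueError(f"unknown check: {check}")
    return conn.execute(sql).fetchone()[0]


def run_checks(conn, checks):
    """Run every (table, column, check) triple and collect the failures."""
    failures = []
    for table, column, check in checks:
        count = failing_rows(conn, table, column, check)
        if count > 0:
            failures.append((table, column, check, count))
    return failures


# Hypothetical sample data: one missing email, one duplicated customer_id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, email TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [(1, "ana@example.com"), (2, None), (2, "bia@example.com")],
)

CHECKS = [
    ("customers", "customer_id", "not_null"),
    ("customers", "customer_id", "unique"),
    ("customers", "email", "not_null"),
]

failures = run_checks(conn, CHECKS)
for table, column, check, count in failures:
    print(f"FAIL {table}.{column} {check}: {count} offending value(s)")
```

With this sample data, the duplicated `customer_id` and the missing `email` both surface as failures. The point is less the mechanism than the habit: the checks live next to the models, run on every load, and stop the pipeline before a bad row reaches anything downstream.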

It can be Postgres + dbt. Or BigQuery + dbt. Or Snowflake + dbt. Or ClickHouse + dbt. The database matters less than the fact that a single store with versioned modeling exists.

Once that's in place, plugging in an agent becomes trivial. You give Claude access to a documented schema, with queries it can run via MCP, with business rules made explicit. The agent becomes the easy part.
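
To make that concrete, here is a rough sketch in Python of the two things the agent is handed: schema documentation with the business rules written down, and a narrow function for running read-only queries. It deliberately leaves out the MCP wiring and the model call. SQLite stands in for the warehouse, and the `orders` table, its columns, the rule, and both function names are assumptions made up for the example.

```python
import sqlite3

# Hypothetical documentation for one warehouse model; names are illustrative.
SCHEMA_DOCS = {
    "orders": {
        "description": "One row per confirmed order.",
        "columns": {
            "order_id": "Primary key.",
            "customer_id": "Canonical customer identifier.",
            "amount": "Order total in the billing currency.",
            "status": "One of: paid, refunded, cancelled.",
        },
        "rules": ["Revenue counts only orders with status = 'paid'."],
    }
}


def render_schema_context(docs):
    """Turn the documented schema into plain text an agent can be given."""
    lines = []
    for table, meta in docs.items():
        lines.append(f"table {table}: {meta['description']}")
        for column, description in meta["columns"].items():
            lines.append(f"  {column}: {description}")
        for rule in meta["rules"]:
            lines.append(f"  rule: {rule}")
    return "\n".join(lines)


def run_readonly_query(conn, sql, max_rows=100):
    """Run a single SELECT statement and return at most max_rows rows."""
    statement = sql.strip().rstrip(";")
    if not statement.lower().startswith("select") or ";" in statement:
        raise ValueError("only a single SELECT statement is allowed")
    return conn.execute(statement).fetchmany(max_rows)


# Stand-in warehouse with a few hypothetical rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders "
    "(order_id INTEGER, customer_id INTEGER, amount INTEGER, status TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?)",
    [(1, 10, 100, "paid"), (2, 11, 50, "paid"), (3, 10, 30, "refunded")],
)
conn.commit()
# Second layer of defense: SQLite refuses writes on this connection from here on.
conn.execute("PRAGMA query_only = ON")

print(render_schema_context(SCHEMA_DOCS))
rows = run_readonly_query(
    conn, "SELECT SUM(amount) FROM orders WHERE status = 'paid'"
)
print(rows)
```

The string check is a crude guard, not a security boundary. In a real warehouse the equivalent of the `PRAGMA` line is a database role with read-only grants on the modeled tables. What matters for the argument is that the revenue rule sits in the documentation, not in someone's head, so the agent and the analyst answer the same question the same way.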

"But we want to start small"

We understand. We agree. We're not proposing a six-month modeling project before you see a demo.

The pragmatic version:

Weeks 1-2. Discovery. We map where the data lives, which use case has the highest priority, and what needs to be joined. Output: architecture + plan.

Weeks 3-6. Pilot. We pick ONE use case (not ten). We build the slice of warehouse that case requires (not the universe). We model only what we need. We put an agent on top of that slice. It runs in production at small scale.

Month 3+. Expansion. Now you have (1) trust in the team, (2) proof the approach works, (3) the beginnings of a warehouse other use cases can build on. We add the second use case, then the third. The foundation grows incrementally.

We never go out of scope. We never promise AI on top of data chaos.

What we refuse

We refuse: "you just do the agent, we'll handle the data side internally."

Not out of arrogance. We've seen the ending. The client owns the data side for months, never finishes, the agent sits waiting, the relationship sours, nobody ships.

When we own both sides - ingestion / warehouse / modeling on our side, agent on our side - success depends on one engineering discipline. When split, success depends on two teams perfectly coordinating on something neither has done before.

That's why Zechim's offering is deliberately integrated. Airbyte / Fivetran / custom connectors. Postgres or BigQuery or Snowflake. dbt always. Terraform to keep everything reproducible. Then the agent.

The technical signal

When a candidate partner lists "AI consulting" and their deepest competence is prompt engineering, run.

When they list "AI consulting" and their deepest competence is "we've run Airbyte + dbt + BigQuery + Terraform in production on senior teams for years, and now we put agents on top," stay.

The good AI you see in the wild is built by data engineers who learned to plug in agents, not by AI consultants who discovered data exists.


If your company needs AI but the data is scattered, it's worth a conversation. We build both sides with the same engineering. Book a call.