6 min read · case-study, agents, whatsapp, multimodal

Zeobra: a WhatsApp-native agent for managing construction sites

How we built a multimodal AI agent inside WhatsApp that organizes payments, contracts, and daily site logs. Stack, decisions, and the pitfalls we avoided.

by Jonatas Zechim

Most of an architect's work during a construction project isn't design. It's logistics. Photo of an invoice, voice note from the contractor explaining what was paid, schedule spreadsheet, supplier contract, daily site log. All of it lands in their WhatsApp and ends up scattered across chats, photo albums, and phone folders.

Zeobra solves this by turning WhatsApp itself into the management tool. No new app to install. You send a photo, voice note, or text to "Zé" and he files everything into the right tables: payments, contracts, daily logs, documents. When you want to see the full picture, you open the web dashboard.

This post is about how it works under the hood: architecture decisions, the stack, and what we learned shipping it to production.

Why WhatsApp, not a new app

Architects and site owners already live in WhatsApp. Asking them to install a dedicated app, create an account, remember a password - that's where 90% of construction-management tools die. Adoption collapses.

The flip: the "new app" is the agent on the other side of the conversation. Onboarding is sending a message. There's nothing to learn. The thing the user does most on their phone - sending photos and voice notes in a chat - is the primary interaction.

Tradeoff: the interface is text + media, not grids and filters. For that we have a complementary web dashboard. But day-to-day capture and operations stay where the user already is.

High-level architecture

WhatsApp (user message)
  ↓
WhatsApp Business API (webhook)
  ↓
Zé Obra agent (orchestrator)
  ├─ Vision (OCR + image classification)
  ├─ Audio (transcription + intent)
  ├─ Text (LLM with tools)
  └─ Human confirmation loop
  ↓
Database (projects, payments, contracts, log, docs)
  ↓
WhatsApp reply + web dashboard updated

The agent receives the message, decides what it is (an invoice? a voice note narrating a payment? a site photo? a status question?), extracts the structured data, confirms with the user before writing, and replies back in WhatsApp.
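
As a rough sketch of the "decides what it is" step, the four example message kinds above could map to a next action like this. The type names, the table mapping, and the rule that read-only questions skip confirmation are assumptions made for illustration, not Zeobra's actual code.

```typescript
// Hypothetical message kinds, taken from the examples in the text.
type MessageKind =
  | "invoice"
  | "payment_voice_note"
  | "site_photo"
  | "status_question";

// What the orchestrator does next. Names are illustrative only.
type NextStep =
  | { action: "confirm_then_write"; table: "payments" | "documents" }
  | { action: "answer_directly" };

// Assumption: anything that would write to the database is confirmed first,
// while a read-only status question is answered right away.
function planNextStep(kind: MessageKind): NextStep {
  switch (kind) {
    case "invoice":
    case "payment_voice_note":
      return { action: "confirm_then_write", table: "payments" };
    case "site_photo":
      return { action: "confirm_then_write", table: "documents" };
    case "status_question":
      return { action: "answer_directly" };
  }
}
```

Keeping the routing decision in a small pure function like this makes it easy to test separately from the model calls.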

The multimodal input

Every message can be text, image, or audio. The agent normalizes all three into a common shape before reasoning:

Images go through a vision model that extracts structured content. For an invoice: supplier, CNPJ (Brazilian tax ID), amount, date, and line items where possible. For a site photo: a visual description ("brick wall under construction, with workers on the second floor"). Each image type has its own prompt and schema.

Audio goes through automatic transcription and then an intent layer. A typical contractor voice note, "I just paid R$ 250 to Casa do Construtor for cement and sand," becomes: {type: payment, supplier: "Casa do Construtor", amount_cents: 25000, description: "cement and sand", confidence: 0.85}. The amount is stored as integer cents, so R$ 250 is 25000.

Text goes straight to the LLM with recent conversation history as context.
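
One way to picture the "common shape" is a discriminated union of raw inputs that collapses into a single record. This is a minimal sketch: it assumes the vision and transcription steps have already run, and every field name here is hypothetical.

```typescript
// Hypothetical raw inputs, after vision / transcription have produced text.
type RawInput =
  | { kind: "text"; body: string }
  | { kind: "image"; extractedText: string; mediaUrl: string }
  | { kind: "audio"; transcript: string; mediaUrl: string };

// The common shape the reasoning step sees, whatever the original modality.
interface NormalizedMessage {
  from: string;               // sender phone number
  modality: RawInput["kind"]; // where the text came from
  text: string;               // body, extracted image content, or transcript
  mediaUrl?: string;          // kept so the original file can serve as proof
}

function normalize(from: string, input: RawInput): NormalizedMessage {
  switch (input.kind) {
    case "text":
      return { from, modality: "text", text: input.body.trim() };
    case "image":
      return {
        from,
        modality: "image",
        text: input.extractedText.trim(),
        mediaUrl: input.mediaUrl,
      };
    case "audio":
      return {
        from,
        modality: "audio",
        text: input.transcript.trim(),
        mediaUrl: input.mediaUrl,
      };
  }
}
```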

The important bit: the agent never fully trusts a single extraction. Each extraction carries a confidence score, and that score flows into the confirmation message the agent sends back.
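
A hedged sketch of how a confidence score might shape the wording of the confirmation. The 0.9 cutoff, the field names, and the low-confidence phrasing are placeholders, not the values or copy used in production.

```typescript
interface PaymentExtraction {
  supplier: string;
  amountCents: number; // non-negative integer cents
  description: string;
  confidence: number;  // 0..1, as reported by the extraction step
}

// Format integer cents in the style used in this post, e.g. "R$ 1,250.00".
// Assumes a non-negative integer input.
function formatBRL(cents: number): string {
  const whole = Math.floor(cents / 100)
    .toString()
    .replace(/\B(?=(\d{3})+(?!\d))/g, ",");
  const frac = cents % 100;
  return `R$ ${whole}.${frac < 10 ? "0" : ""}${frac}`;
}

// Assumption: 0.9 is an arbitrary illustrative cutoff.
function confirmationMessage(p: PaymentExtraction, highConfidence = 0.9): string {
  const summary = `${formatBRL(p.amountCents)} → ${p.supplier} (${p.description})`;
  return p.confidence >= highConfidence
    ? `${summary}. Record as a payment? Reply yes or no.`
    : `I read this as ${summary}, but I'm not fully sure. Is that right? Reply yes or no.`;
}
```

With the voice-note example above (confidence 0.85), this sketch would take the more cautious branch.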

The human confirmation loop

Zé never writes silently. Every mutation passes through confirmation. The standard flow:

User: [photo of invoice]
Zé:   📎 Invoice filed in "House Renovation".
      R$ 1,250.00 → Casa do Construtor (Materials), paid today.
      Record as a payment? Reply yes or no.
User: yes
Zé:   Done, saved ✅ R$ 1,250.00 marked as paid.

Confirmation is asynchronous - the user might take minutes or hours to reply. The agent keeps the pending context tied to the phone number until the confirmation arrives (or expires).
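
The pending-context idea can be sketched as a small store keyed by phone number with an expiry. This is an in-memory stand-in for illustration only: in a serverless deployment the pending state would presumably live in the database, and the class and status names are hypothetical.

```typescript
type Resolution<T> =
  | { status: "confirmed"; payload: T }
  | { status: "rejected" }
  | { status: "expired" }
  | { status: "none" }
  | { status: "unrecognized" };

class PendingConfirmations<T> {
  private readonly byPhone = new Map<string, { payload: T; expiresAt: number }>();
  private readonly ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Store (or replace) the proposed write for this phone number.
  propose(phone: string, payload: T, now: number): void {
    this.byPhone.set(phone, { payload, expiresAt: now + this.ttlMs });
  }

  // Interpret the user's reply. Only "yes" hands the payload back for writing.
  // An unrecognized reply leaves the proposal pending.
  resolve(phone: string, reply: string, now: number): Resolution<T> {
    const pending = this.byPhone.get(phone);
    if (pending === undefined) return { status: "none" };
    if (now > pending.expiresAt) {
      this.byPhone.delete(phone);
      return { status: "expired" };
    }
    const answer = reply.trim().toLowerCase();
    if (answer === "yes") {
      this.byPhone.delete(phone);
      return { status: "confirmed", payload: pending.payload };
    }
    if (answer === "no") {
      this.byPhone.delete(phone);
      return { status: "rejected" };
    }
    return { status: "unrecognized" };
  }
}
```

Passing `now` in explicitly, instead of reading the clock inside the class, keeps expiry behavior deterministic in tests.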

This loop is what makes the system safe to use in production. When the model gets something wrong (it always gets some things wrong), the error dies at confirmation. No bad data slipping silently into the database.

Domain modeling

Construction has its own vocabulary. The central entities settled into:

  • Project (obra). A renovation, a new build, ongoing maintenance. Has name, address, scope, deadline.
  • Payment. Amount, supplier, category (labor, materials, equipment, fee), date, status (paid, scheduled, late), proof.
  • Contract. Contractor or supplier, scope, total amount, installments, dates.
  • Daily log (diário de obra). Dated entries: what was done today, how many people worked, open items.
  • Document. Invoices, regulatory forms, blueprints, labeled photos.

The agent knows this model deeply - it's part of the system prompt. When the user says "I bought paint at Leroy", the agent knows that's a Payment with category=Materials and supplier=Leroy Merlin, even when the sentence isn't structured.
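
For illustration, the entities above could be typed roughly as follows. The field names, the ID fields, and how installments and dates are represented are guesses; only the entity list, the categories, and the statuses come from the text.

```typescript
const PAYMENT_CATEGORIES = ["labor", "materials", "equipment", "fee"] as const;
type PaymentCategory = (typeof PAYMENT_CATEGORIES)[number];
type PaymentStatus = "paid" | "scheduled" | "late";

interface Project {
  id: string;
  name: string;
  address: string;
  scope: string;
  deadline: string; // ISO date
}

interface Payment {
  projectId: string;
  amountCents: number;
  supplier: string;
  category: PaymentCategory;
  date: string; // ISO date
  status: PaymentStatus;
  proofUrl?: string;
}

interface Contract {
  projectId: string;
  party: string; // contractor or supplier
  scope: string;
  totalCents: number;
  installments: number;
  startDate: string;
  endDate: string;
}

interface DailyLogEntry {
  projectId: string;
  date: string;
  workDone: string;
  workersOnSite: number;
  openItems: string[];
}

interface SiteDocument {
  projectId: string;
  kind: "invoice" | "regulatory_form" | "blueprint" | "photo";
  label: string;
  url: string;
}

// Guard for category strings coming back from the model.
function isPaymentCategory(value: string): value is PaymentCategory {
  return (PAYMENT_CATEGORIES as readonly string[]).indexOf(value) !== -1;
}

// "I bought paint at Leroy" would resolve to something like:
const paintPurchase: Pick<Payment, "supplier" | "category"> = {
  supplier: "Leroy Merlin",
  category: "materials",
};
```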

Where the AI does nothing

A few things we deliberately do NOT automate:

  1. Contract approval. The model extracts the contract data from the PDF and pre-fills the form, but approval is manual in the web app. Contractual risk is too high.
  2. "Borderline" payment categorization. If the extraction confidence falls below a threshold, we do NOT guess - we ask. "Is this R$ 800 payment labor or materials?" Costs one extra exchange, avoids reclassifying later.
  3. Automatic notifications to suppliers. Reaching out to third parties stays in the human owner's hands.

The general rule: the agent is great at structuring and finding, and we keep it away from deciding things where errors are expensive.
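
The three rules above can be read as one small decision function. This is a sketch under assumptions: the 0.7 threshold and all type and label names are placeholders, not the production values.

```typescript
type Extraction =
  | { kind: "payment"; categoryConfidence: number } // 0..1
  | { kind: "contract" }
  | { kind: "supplier_notification" };

type Decision =
  | "confirm_in_chat"          // normal yes/no confirmation
  | "ask_clarifying_question"  // e.g. "labor or materials?"
  | "manual_in_web_app"        // contracts are approved by a human in the dashboard
  | "leave_to_owner";          // the agent never contacts third parties

// Assumption: 0.7 is an arbitrary illustrative threshold.
function decide(e: Extraction, categoryThreshold = 0.7): Decision {
  switch (e.kind) {
    case "contract":
      return "manual_in_web_app";
    case "supplier_notification":
      return "leave_to_owner";
    case "payment":
      return e.categoryConfidence < categoryThreshold
        ? "ask_clarifying_question"
        : "confirm_in_chat";
  }
}
```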

Stack

  • WhatsApp Business API via a provider (alternatives: WhatsApp Cloud API directly, Z-API, Twilio). A webhook receives incoming messages; the send API delivers replies.
  • Language model: Anthropic Claude for the main agent. Sonnet for complex structured extraction, Haiku for simple classification and quick replies.
  • Vision: Claude with multimodal input - one call extracts invoice text and classifies the photo.
  • Audio: Whisper (via API) for transcription. The transcript then goes through Haiku to extract intent.
  • Database: Postgres (Supabase). Schema with row-level security per project owner. Mutations go through SECURITY DEFINER functions.
  • Frontend: Next.js App Router + React. The web dashboard shows the consolidated state and provides manual editing when needed.
  • Infra: Vercel for the frontend and webhooks. Supabase storage for media (invoice and site photos). Daily cron jobs for reminders and rollups.
  • Observability: every extraction logs the input, the model's output, the confirmation result, and the cost. Useful for evals and understanding where the model fails most.
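
As a sketch of how the observability records could feed evals, the function below computes how often users answer "no" per input type. The record shape, the cost unit, and the decision to ignore expired confirmations are all assumptions for illustration.

```typescript
interface ExtractionLog {
  inputKind: "image" | "audio" | "text";
  model: string;                           // whichever model handled the call
  output: string;                          // serialized model output
  confirmation: "yes" | "no" | "expired";  // what the user answered
  cost: number;                            // cost of the call; the unit is an assumption
}

// Share of answered confirmations that were rejected, per input kind.
// Expired confirmations are ignored because the user never judged them.
function rejectionRateByKind(logs: ExtractionLog[]): Record<string, number> {
  const counts = new Map<string, { answered: number; rejected: number }>();
  for (const log of logs) {
    if (log.confirmation === "expired") continue;
    const c = counts.get(log.inputKind) ?? { answered: 0, rejected: 0 };
    c.answered += 1;
    if (log.confirmation === "no") c.rejected += 1;
    counts.set(log.inputKind, c);
  }
  const rates: Record<string, number> = {};
  counts.forEach((c, kind) => {
    rates[kind] = c.rejected / c.answered;
  });
  return rates;
}

function totalCost(logs: ExtractionLog[]): number {
  return logs.reduce((sum, log) => sum + log.cost, 0);
}
```

A high rejection rate for one input kind would point at the prompt or schema for that kind as the first thing to review.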

What we learned

Human confirmation >>> "auto mode". Version zero of Zeobra tried to be autonomous: extract and write directly. Users started distrusting the system within the first week. Putting the confirmation loop in the default flow gave the trust back and, to our surprise, didn't slow anyone down. People reply "yes" in seconds.

Multimodal is where the ROI is. Most construction management tools still require forms. When the input is "send photo + voice note + text" and the agent does the rest, friction disappears. That's the jump.

Domain before model. The agent's system prompt has more detail about construction (categories, supplier vocabulary, Brazilian tax rules) than about how to use tools. That's where the quality lift came from.

WhatsApp as a channel is viable and still underused. In Brazil, SaaS tools that ignore WhatsApp are missing most of the SMB market. Zeobra shows you can build a rich experience using the channel people already use.

What's next

The next phase is closing the loop: consolidated per-project reports, accountant integrations (Brazilian fiscal documents), export to the client's ERP. The conversational agent stays as the input, but the structured outputs (project closeout report, statement of accounts, balance sheet) get sharper.

If your company has a similar problem - heavy WhatsApp usage, scattered data, friction in structuring it - it's worth a conversation. We build systems like this with the same engineering practice we used running Airbyte, dbt, and BigQuery on senior teams, now with the agent layer on top.