We built an AI agent that actually works in production. Here's every line of code.

Let me tell you about the moment I stopped being skeptical about AI agents.

It was 11pm on a Tuesday. Our team had just deployed a customer support agent for a Series B fintech client. Not a chatbot. Not a glorified FAQ lookup. A real, multi-step reasoning agent that could pull a customer's transaction history from Snowflake, cross-reference it against compliance rules stored in Unity Catalog, search through 40,000 pages of policy documentation via Mosaic AI Vector Search, and compose a response that actually solved the problem. The first ticket it resolved autonomously would have taken a human analyst 25 minutes. The agent did it in 4.2 seconds.

That was 6 months ago. Since then, we have built and deployed 9 production agents on Databricks for clients across financial services, healthcare, and logistics. Some of them are simple retrieval agents. Some are full multi-agent orchestrations that make decisions, call tools, and self-correct when they get it wrong.

This post is the guide I wish existed when we started. I am going to walk you through building a production-grade AI agent on Databricks from scratch: the architecture, the code, the evaluation framework, the deployment pipeline, and the hard lessons we learned along the way. Every code snippet here is adapted from real production systems. Nothing is hypothetical.

Grab some coffee. This is going to be fun.

The problem: why most AI agents never leave the notebook

Here is the uncomfortable truth about AI agents in 2026: most of them are demos. They work in notebooks. They impress stakeholders in a conference room. Then they die quietly because nobody figured out how to evaluate them, govern the data they access, or monitor them after deployment.

The gap between "cool notebook demo" and "production system that handles 10,000 requests per day without hallucinating" is enormous. It involves answering questions that most tutorials skip entirely. How do you version an agent's tools separately from its model? How do you run automated quality checks before every deployment? How do you catch a hallucination before it reaches a customer? How do you do all of this while keeping your data governance team happy?

This is where Databricks and the Mosaic AI Agent Framework earn their keep. Unity Catalog gives you governed tool access. MLflow gives you versioned evaluation and deployment. Mosaic AI Vector Search gives you retrieval that actually scales. Model Serving gives you a production endpoint with monitoring baked in.

Let me show you what this looks like in practice.

The use case: an intelligent document agent for financial compliance

Our client is a mid-market financial services company processing 2,000+ compliance inquiries per week. Each inquiry requires an analyst to search through regulatory documents, cross-reference customer account data, apply business rules, and compose a response. Average handling time: 22 minutes. Error rate on manual review: 8%.

We built an agent that handles 70% of these inquiries autonomously, escalating the rest to human analysts with a pre-drafted response and supporting evidence. Here is the full architecture.

Architecture overview

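At a high level, the pieces connect like this (a simplified sketch; the component names follow the steps in this post):

```
customer inquiry
      │
      ▼
Model Serving endpoint (MLflow ChatAgent)
      │
      ▼
LangGraph reasoning loop ──────────────► escalate to human analyst
      │                                  (low confidence / high-risk account)
      ├── Tool 1: document retriever ──► Mosaic AI Vector Search (40,000 pages)
      ├── Tool 2: account lookup ──────► customer account data
      └── Tool 3: business rules ──────► compliance rules in Unity Catalog
      │
      ▼
response + citations ──► inference tables (Delta) ──► monitoring & evaluation
```

All three tools are Unity Catalog functions, and MLflow tracks evaluation, versions, and deployment end to end.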

Now let's build it, piece by piece.

Step 1: Setting up the data foundation with Vector Search

Before your agent can reason about anything, it needs access to knowledge. In our case, that means 40,000 pages of regulatory documentation, internal policy guides, and compliance procedures.

Most teams reach for a third-party vector database at this point. We did not. Mosaic AI Vector Search is built directly into the Databricks platform, which means it inherits Unity Catalog governance automatically. Every query is auditable. Every document is access-controlled. Your compliance team will thank you.

First, we prepare and chunk the documents using a Delta table:

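A minimal sketch of that step. It assumes a parsed-documents Delta table with a `text` column; the table names (`compliance.docs.parsed_documents`, `compliance.docs.chunks`) are illustrative, not the client's real schema. The word-based splitter is a rough stand-in for a proper token-based chunker.

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into overlapping word-based chunks (a simple stand-in
    for token-based chunking with overlap)."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap
    return chunks


if __name__ == "__main__":
    # Inside Databricks: `spark` is the ambient SparkSession in a notebook.
    from pyspark.sql import functions as F, types as T

    chunk_udf = F.udf(chunk_text, T.ArrayType(T.StringType()))

    (spark.table("compliance.docs.parsed_documents")
        .withColumn("chunk", F.explode(chunk_udf(F.col("text"))))
        .withColumn("chunk_id", F.monotonically_increasing_id())
        .select("chunk_id", "source_uri", "chunk")
        .write.mode("overwrite")
        .saveAsTable("compliance.docs.chunks"))

    # Delta Sync indexes require Change Data Feed on the source table.
    spark.sql("ALTER TABLE compliance.docs.chunks "
              "SET TBLPROPERTIES (delta.enableChangeDataFeed = true)")
```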

Next, create a Vector Search endpoint and sync the index:

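A sketch using the `databricks-vectorsearch` client. The endpoint name, index name, and embedding endpoint are illustrative; swap in your own.

```python
from databricks.vector_search.client import VectorSearchClient

client = VectorSearchClient()

# One endpoint can serve many indexes.
client.create_endpoint(
    name="compliance-vs-endpoint",
    endpoint_type="STANDARD",
)

# A Delta Sync index tracks the source table; Databricks computes the
# embeddings from the chunk column using the model endpoint we name here.
index = client.create_delta_sync_index(
    endpoint_name="compliance-vs-endpoint",
    index_name="compliance.docs.chunks_index",
    source_table_name="compliance.docs.chunks",
    pipeline_type="CONTINUOUS",   # re-sync automatically as the table changes
    primary_key="chunk_id",
    embedding_source_column="chunk",
    embedding_model_endpoint_name="databricks-gte-large-en",
)
```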

The beautiful thing here: Databricks handles embedding generation automatically. You point it at a text column, specify an embedding model endpoint, and the platform manages the rest. When your source Delta table updates, the index syncs. No Airflow DAG. No cron job. No orphaned embeddings.

Step 2: Defining agent tools in Unity Catalog

Here is where things get interesting. In the Mosaic AI Agent Framework, tools are first-class citizens registered in Unity Catalog. This means they are versioned, governed, discoverable, and auditable, just like your tables and models.

Our agent needs three tools. Let's build them.

Tool 1: Regulatory document retriever

We wrap our Vector Search index in a Unity Catalog function so the agent can call it as a tool:

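A sketch of the shape this takes, assuming the SQL `vector_search` table function; the catalog and schema names are illustrative:

```sql
-- Illustrative catalog/schema; adjust to your workspace.
CREATE OR REPLACE FUNCTION compliance.agent.search_regulatory_docs(
  query STRING COMMENT 'A natural-language question about a regulation, policy, or compliance procedure'
)
RETURNS TABLE (chunk STRING, source_uri STRING)
COMMENT 'Searches regulatory documentation and internal policy guides. Use this whenever the answer depends on what a regulation or policy actually says. Returns the most relevant passages with their source documents.'
RETURN
  SELECT chunk, source_uri
  FROM VECTOR_SEARCH(
    index => 'compliance.docs.chunks_index',
    query_text => query,
    num_results => 5
  );
```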

Notice the COMMENT on the function and each parameter. These are not decoration. The agent's LLM reads these comments to decide when and how to invoke the tool. Good tool descriptions are the difference between an agent that works and an agent that flails.

Tool 2: Customer account lookup

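Same pattern, this time wrapping a governed table. The table, columns, and account-ID format below are illustrative placeholders:

```sql
CREATE OR REPLACE FUNCTION compliance.agent.lookup_customer_account(
  account_id STRING COMMENT 'The customer account identifier'
)
RETURNS TABLE (account_id STRING, risk_tier STRING, account_status STRING, last_review_date DATE)
COMMENT 'Looks up a customer account: risk tier, current status, and date of last compliance review. Use this before applying any account-specific rule.'
RETURN
  SELECT a.account_id, a.risk_tier, a.account_status, a.last_review_date
  FROM compliance.accounts.account_summary AS a
  WHERE a.account_id = lookup_customer_account.account_id;
```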

Tool 3: Business rules engine

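The rules engine follows the same template: a governed lookup keyed on inquiry type and risk tier. Again, the schema is an illustrative stand-in:

```sql
CREATE OR REPLACE FUNCTION compliance.agent.evaluate_business_rules(
  inquiry_type STRING COMMENT 'Category of the inquiry, e.g. wire_transfer or account_opening',
  risk_tier STRING COMMENT 'Customer risk tier returned by lookup_customer_account'
)
RETURNS TABLE (rule_id STRING, rule_text STRING, action_required STRING)
COMMENT 'Returns the compliance rules that apply to a given inquiry type and customer risk tier, with the action each rule requires. Call this after the account lookup.'
RETURN
  SELECT r.rule_id, r.rule_text, r.action_required
  FROM compliance.rules.active_rules AS r
  WHERE r.inquiry_type = evaluate_business_rules.inquiry_type
    AND r.risk_tier = evaluate_business_rules.risk_tier;
```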

Three tools. All governed by Unity Catalog. All with explicit comments that guide the LLM's tool selection. All queryable, auditable, and version-controlled.

Step 3: Building the agent with LangGraph and ChatAgent

Now we wire everything together. We are using LangGraph for the orchestration layer because it gives us explicit control over the agent's reasoning loop. We wrap the whole thing in an MLflow ChatAgent so it plugs directly into Databricks Model Serving.

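Here is a condensed sketch, assuming the `databricks-langchain` and `langgraph` packages. The thresholds, the `LOW_CONFIDENCE` marker, and the config keys are illustrative choices, not the production values:

```python
import uuid

import mlflow
from mlflow.pyfunc import ChatAgent
from mlflow.types.agent import ChatAgentMessage, ChatAgentResponse


def should_escalate(answer: str, steps: int, risk_tier: str, max_steps: int = 6) -> bool:
    """Escalation policy: reasoning-depth cap, self-reported low confidence,
    and a hard rule for high-risk accounts."""
    if steps > max_steps:
        return True
    if "LOW_CONFIDENCE" in answer:  # the system prompt asks the agent to emit this marker
        return True
    return risk_tier == "high"


class ComplianceAgent(ChatAgent):
    def __init__(self, config_path: str = "agent_config.yaml"):
        self.config = mlflow.models.ModelConfig(development_config=config_path)
        self.graph = self._build_graph()

    def _build_graph(self):
        # Deferred imports: these packages live in the Databricks runtime.
        from databricks_langchain import ChatDatabricks, UCFunctionToolkit
        from langgraph.prebuilt import create_react_agent

        llm = ChatDatabricks(endpoint=self.config.get("llm_endpoint"))
        tools = UCFunctionToolkit(function_names=self.config.get("uc_tool_names")).tools
        return create_react_agent(llm, tools, prompt=self.config.get("system_prompt"))

    def predict(self, messages, context=None, custom_inputs=None):
        state = self.graph.invoke(
            {"messages": [m.model_dump(exclude_none=True) for m in messages]}
        )
        answer = state["messages"][-1].content
        steps = len(state["messages"])
        risk_tier = (custom_inputs or {}).get("risk_tier", "unknown")

        if should_escalate(answer, steps, risk_tier,
                           self.config.get("max_reasoning_steps")):
            answer = "[ESCALATED TO HUMAN ANALYST]\n\nDraft response:\n" + answer

        return ChatAgentResponse(
            messages=[ChatAgentMessage(role="assistant", content=answer,
                                       id=str(uuid.uuid4()))]
        )


# Models-from-code: Model Serving imports this module and serves AGENT.
AGENT = ComplianceAgent()
mlflow.models.set_model(AGENT)
```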

A few things worth highlighting in this code.

The ModelConfig pattern. We externalize all configuration into a YAML file. Swap the LLM from Llama 3.3 70B to DBRX or GPT-4o by changing one line of config, then re-evaluate. No code changes required.

The escalation logic. This is where most agent tutorials stop and real production systems start. Our agent tracks reasoning depth, checks for self-reported low confidence, and escalates automatically for high-risk accounts. In production, 30% of queries get escalated. That is by design. A good agent knows what it does not know.

The ChatAgent wrapper. By subclassing ChatAgent and implementing predict and predict_stream, your agent automatically gets a REST API endpoint, request logging, trace capture, and integration with the Agent Evaluation framework. One class, and you are production-ready.

Here is the agent configuration YAML:

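The values below are illustrative stand-ins for the production config; the structure is what matters:

```yaml
# agent_config.yaml -- illustrative values
llm_endpoint: databricks-meta-llama-3-3-70b-instruct
temperature: 0.0
max_reasoning_steps: 6
uc_tool_names:
  - compliance.agent.search_regulatory_docs
  - compliance.agent.lookup_customer_account
  - compliance.agent.evaluate_business_rules
system_prompt: |
  You are a compliance support agent for a financial services company.
  Answer only from retrieved documents and account data, and cite your
  sources for every claim. If you are not confident in your answer,
  include the marker LOW_CONFIDENCE so the inquiry is escalated.
```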

Step 4: Evaluating the agent before deployment

This is the step that separates production agents from demo agents. You would not deploy a machine learning model without evaluating it on a test set. Why would you deploy an agent without doing the same?

Databricks Agent Evaluation lets you run structured quality assessments using LLM judges, custom metrics, and human feedback, all tracked in MLflow.

First, build an evaluation dataset:

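The dataset is just a table of requests and expected responses. The three examples below are invented stand-ins, not real client policy; the real dataset grew from 30 to 340 cases:

```python
import pandas as pd

# Illustrative examples only -- the questions and answers are invented.
eval_examples = [
    {
        "request": "What documentation is required before raising the wire limit on a high-risk account?",
        "expected_response": "Enhanced due diligence documentation, including source-of-funds verification, is required before any limit change on a high-risk account.",
    },
    {
        "request": "Is a retroactive filing permitted if the reporting deadline was missed?",
        "expected_response": "Yes, but the filing must be flagged as late and escalated to the compliance officer for review.",
    },
    {
        "request": "Which rules apply when a dormant account suddenly shows activity?",
        "expected_response": "Dormant-account reactivation rules apply: identity re-verification and a transaction-pattern review before processing.",
    },
]

eval_df = pd.DataFrame(eval_examples)
```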

Now run the evaluation:

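A sketch of the evaluation call, passing a registered agent version and the evaluation DataFrame built above (the model URI is illustrative):

```python
import mlflow

with mlflow.start_run(run_name="agent-eval-v3"):
    results = mlflow.evaluate(
        model="models:/compliance.agent.support_agent/3",  # illustrative URI
        data=eval_df,
        model_type="databricks-agent",  # turns on the Mosaic AI LLM judges
    )
    print(results.metrics)  # groundedness, relevance, correctness, safety, ...
```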

We ran this loop through three agent versions before deployment, and the judge metrics (groundedness, relevance, correctness) climbed with each iteration.

The jump from V1 to V2 came almost entirely from better tool descriptions and a more explicit system prompt. The jump from V2 to V3 came from switching chunking from 1024-token to 512-token chunks with overlap, and adding the "cite your sources" instruction to the system prompt.

The ability to swap configurations and re-evaluate in minutes is what makes this iterative loop practical. Without it, you are guessing.

Step 5: Deploying to production with Model Serving

Deployment is where the Databricks platform really shines. Because our agent is wrapped in a ChatAgent, deploying it to a production endpoint is straightforward:

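A sketch of the deploy step, assuming the `databricks-agents` package and the models-from-code pattern from Step 3 (model and file names are illustrative):

```python
import mlflow
from databricks import agents

# Register models in Unity Catalog rather than the workspace registry.
mlflow.set_registry_uri("databricks-uc")

with mlflow.start_run():
    logged = mlflow.pyfunc.log_model(
        artifact_path="agent",
        python_model="agent.py",            # the ChatAgent module from Step 3
        model_config="agent_config.yaml",
        registered_model_name="compliance.agent.support_agent",
    )

# agents.deploy creates the Model Serving endpoint, wires up the AI Gateway,
# and enables inference table logging for the audit trail.
deployment = agents.deploy(
    model_name="compliance.agent.support_agent",
    model_version=logged.registered_model_version,
)
```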

The AI Gateway configuration enables usage tracking and inference table logging: a complete audit trail of every decision your agent makes, written to a Delta table in Unity Catalog. Not optional in financial services.

Step 6: Production monitoring that actually catches problems

Deploying the agent is not the finish line. It is the starting line. Here is how we set up production monitoring:

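A sketch of the scheduled monitoring job: pull recent traffic from the endpoint's inference table, re-score it with the same judges used in development, and alert on regressions. The table name, metric key, and threshold are illustrative:

```python
ALERT_THRESHOLD = 0.85


def check_groundedness(metrics: dict, threshold: float = ALERT_THRESHOLD):
    """Return an alert message if the groundedness score dips below threshold,
    else None."""
    score = metrics.get("groundedness")
    if score is not None and score < threshold:
        return f"ALERT: groundedness dropped to {score:.2f}"
    return None


if __name__ == "__main__":
    # Inside Databricks: `spark` is the ambient session.
    import mlflow

    recent = (
        spark.table("compliance.agent.support_agent_payload")  # inference table
        .where("timestamp >= current_timestamp() - interval 24 hours")
        .selectExpr("request", "response")
        .toPandas()
    )

    # Judge-only evaluation of logged production traffic.
    results = mlflow.evaluate(data=recent, model_type="databricks-agent")

    # Exact metric key names vary by judge version; normalize before checking.
    metrics = {"groundedness": results.metrics.get(
        "response/llm_judged/groundedness/rating/percentage")}

    message = check_groundedness(metrics)
    if message:
        print(message)  # in production this posts to the on-call channel
```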

The same evaluation logic you used in development now runs against production traffic. When groundedness drops below 0.85 or a safety issue is detected, alerts fire immediately. We caught a retrieval degradation issue within 45 minutes of it starting. Without monitoring, it would have been days.

Results and what they mean for your stack

After 6 months in production, here are the numbers:

- 70% of inquiries resolved autonomously; the remaining 30% escalated with a pre-drafted response and supporting evidence
- 2.1% error rate on agent-handled inquiries, versus the 8% human-only baseline
- Average handling time down from 22 minutes to seconds for autonomously resolved inquiries

The 2.1% error rate is lower than the human-only baseline because the agent is consistent. It does not forget to check a rule. It does not skip a step because it is the end of the day and it is tired. And when it is not confident, it escalates. Every time.

The 5 things we learned the hard way

After 9 production agents, here are the lessons that no documentation will teach you.

1. Tool descriptions are your most important prompt engineering. We spent 3x more time writing and iterating on Unity Catalog function COMMENT fields than on the system prompt. The LLM reads these to decide which tool to call and how to call it. Vague descriptions produce vague tool usage.

2. Start with aggressive escalation, then relax. Our V1 agent escalated 60% of queries. That is fine. We gradually tuned the confidence thresholds down as we built trust. Launching with a 5% escalation rate and hoping for the best is how you end up on the front page for the wrong reasons.

3. Chunk size matters more than model size. Switching from 1024-token to 512-token chunks improved our retrieval relevance by 26%. Switching from Llama 3.1 70B to Llama 3.3 70B improved response quality by 8%. Fix your retrieval before you upgrade your model.

4. Evaluation datasets are living documents. Every production failure becomes a new test case. Our evaluation dataset started at 30 examples and is now at 340. The dataset is the institutional memory of everything your agent has gotten wrong.

5. Inference tables are gold. The production logs that Model Serving writes to Delta tables are the richest source of insight you have. We run weekly analyses on them to find edge cases, measure tool usage patterns, and identify queries the agent struggles with. This feeds directly back into the evaluation dataset.

Where this is heading

We are currently building multi-agent orchestrations where a supervisor agent routes inquiries to specialized sub-agents (one for KYC, one for AML, one for transaction monitoring), each with their own tool sets and evaluation criteria. Databricks recently introduced Agent Bricks, which automates much of the agent construction process. We are watching this closely.

The infrastructure for production AI agents is finally mature enough that the bottleneck is no longer technology. It is knowing what to build and how to evaluate it. The code patterns in this post are our answer to the "how." The "what" is up to you.

If your team is sitting on a pile of unstructured documents, a set of business rules, and a manual process that is eating your analysts' time, you have all the ingredients. The Databricks platform gives you the tools. Unity Catalog gives you the governance. MLflow gives you the evaluation loop. The rest is engineering.

And engineering is what we do.

Rizwan Yousuf is VP of Data and AI at Blue Orange Digital, where he leads a team of data engineers and ML practitioners building production data platforms and AI systems. He has shipped 200+ production pipelines and 9 production AI agents across financial services, healthcare, and logistics.

Rizwan Yousuf
Vice President of Data and AI

25+ years leading large-scale enterprise data and AI projects, expert in AWS, Databricks, and GenAI architectures.