Building AI Agents That Actually Work: Lessons from PE Portfolio Companies

At Blue Orange, we’ve spent the last year and a half implementing AI agent systems across portfolio companies in logistics, healthcare, financial services, and manufacturing. Some worked brilliantly. Many failed expensively. Here’s what we learned.

Architecture: Simpler Is Better

The Pattern: Single orchestrator model + rich tool ecosystem

The multi-agent systems that look impressive in architecture reviews tend to fail in production. We learned this deploying a system for a financial advisory client where critical context disappeared after three agent handoffs, causing cascading failures.

What works: One capable LLM making decisions, coordinating a comprehensive set of specialized tools. Think of it as one brain with many hands, not a committee of agents trying to reach consensus.

For one client, we paired a 7B specialist model with a 34B planner. It outperformed a single 70B model on their specific tasks while cutting token consumption by 40%. The engineering focus shifts from managing agent communication to building robust APIs and error handling.
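
In code, the shape is a single decision loop over a tool registry. Here is a minimal sketch, assuming a generic call_llm inference client and a stubbed lookup_invoice tool (both hypothetical placeholders, not a specific framework's API):

    def call_llm(history):  # stub: swap in your real inference client
        return {"action": "final", "answer": "done"}

    def lookup_invoice(invoice_id: str) -> str:
        return f"invoice {invoice_id}: $1,200, net 30"  # stub tool

    TOOLS = {"lookup_invoice": lookup_invoice}

    def run_agent(task: str, max_steps: int = 8) -> str:
        history = [{"role": "user", "content": task}]
        for _ in range(max_steps):            # hard cap on reasoning loops
            decision = call_llm(history)
            if decision["action"] == "final":
                return decision["answer"]
            tool = TOOLS[decision["action"]]  # the model picks the tool...
            result = tool(**decision.get("args", {}))
            history.append({"role": "tool", "content": result})  # ...and sees the result
        return "escalating to a human"        # graceful degradation, not a crash

The loop stays dumb on purpose: all the intelligence lives in the model and the tool implementations, not in inter-agent plumbing.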

Key principle: Resist the urge to anthropomorphize. Agents don’t need org charts to be effective.

Reliability: Compound Failures Will Break You

The Problem: Five steps at 90% reliability each = 59% system reliability

We discovered this working with a logistics company. Each component tested perfectly in isolation. The integrated system failed more than half the time. This wasn’t a model quality issue—it was a systems architecture issue.
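
The arithmetic is unforgiving, and retries only partially rescue it. A few lines make the point, assuming failures are independent and transient:

    p_step, steps = 0.90, 5
    print(round(p_step ** steps, 3))    # 0.59: the pipeline fails 41% of the time

    # One retry per step helps only against independent, transient failures:
    p_retried = 1 - (1 - p_step) ** 2   # 0.99 effective per-step reliability
    print(round(p_retried ** steps, 3)) # 0.951: better, but still far from "reliable"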

The Solution:

  • Circuit breakers at every integration point (a minimal sketch follows this list)
  • Intelligent retry logic that understands semantic failures
  • Human-in-the-loop validation for critical decisions
  • For high-stakes workflows: multi-model quorum voting (expensive but necessary)
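
As referenced above, here is a minimal circuit-breaker sketch. The failure threshold and cooldown are illustrative, and a real version would also count semantic failures (an HTTP 200 carrying an empty or nonsensical answer) as failures:

    import time

    class CircuitBreaker:
        """Stop calling a failing dependency; probe again after a cooldown."""

        def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0):
            self.max_failures = max_failures
            self.cooldown_s = cooldown_s
            self.failures = 0
            self.opened_at = None

        def call(self, fn, *args, **kwargs):
            if self.opened_at is not None:
                if time.time() - self.opened_at < self.cooldown_s:
                    raise RuntimeError("circuit open: failing fast")
                self.opened_at = None         # half-open: allow one probe
            try:
                result = fn(*args, **kwargs)
            except Exception:
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = time.time()
                raise
            self.failures = 0                 # success closes the circuit
            return result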

Most of our production systems now operate with graduated autonomy: full automation for routine tasks (70-90% of volume), human oversight for exceptions. Clients who insist on 100% automation from day one usually find themselves stuck in extended pilots.

Cost: Token Economics at Scale

Real example: A chatbot prototype cost $0.03 per conversation. In production with real users? $4-5 per conversation. At projected volume: $400K monthly vs. a $300K total customer service budget.

Token consumption in production bears no resemblance to development. A single user request can trigger hundreds of LLM calls and dozens of reasoning loops.

Our standard approach now:

  1. Thinking budgets: Hard limits on reasoning loops with graceful degradation
  2. Hierarchical caching: In-memory (sessions), persistent store (common queries), cached reasoning chains (frequent workflows)
  3. Dynamic routing: Simple questions go to smaller, cheaper models (sketched below)
  4. User visibility: Cost indicators before expensive operations

Result for that client: average cost dropped to $0.08 per conversation. Still roughly 3x the prototype, but sustainable.
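
A sketch of the caching-plus-routing idea; the model names, the in-memory cache, and the complexity heuristic are all placeholders:

    import hashlib

    CACHE = {}  # in-memory tier; production adds a persistent store behind it

    def looks_simple(query: str) -> bool:
        # Crude placeholder heuristic; a cheap classifier does this better.
        return len(query.split()) < 20

    def call_model(model: str, query: str) -> str:
        return f"[{model}] answer to: {query}"  # stub inference call

    def answer(query: str) -> str:
        key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
        if key in CACHE:                        # cache hit: zero tokens spent
            return CACHE[key]
        model = "small-model" if looks_simple(query) else "large-model"
        CACHE[key] = call_model(model, query)
        return CACHE[key]

    print(answer("What are your hours?"))       # routed to the small model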

Security: Intent Matters as Much as Permissions

The scenario: Healthcare portfolio company agent system. Penetration tester prompts: “analyze patient satisfaction trends.” The agent pulls data from any database it has credentials for—including financial records it has no business touching.

The permissions were correct. The behavior was wrong.

The solution: Semantic governance layer that evaluates intent before granting access. Not just “can it” but “should it” for this specific task.
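
One way to sketch that layer is an explicit policy table keyed by declared intent, checked before any credentialed access. The intents and data domains here are illustrative:

    # Map declared task intents to the data domains that task legitimately needs.
    # Credentials may reach further; this layer answers "should it", not "can it".
    INTENT_POLICY = {
        "patient_satisfaction": {"surveys", "appointments"},
        "billing_review": {"financial", "appointments"},
    }

    def authorize(intent: str, domain: str) -> bool:
        return domain in INTENT_POLICY.get(intent, set())  # deny by default

    # "Analyze patient satisfaction trends" never needs financial records:
    assert authorize("patient_satisfaction", "surveys")
    assert not authorize("patient_satisfaction", "financial")

Deny-by-default matters here: an intent nobody wrote a policy for gets no data at all.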

Every tool you give an agent is an attack vector. Our baseline security now includes:

  • Runtime sandboxing for all tool execution
  • Narrowly scoped, task-specific permissions
  • Encrypted inter-component communication
  • Immutable audit trails

Security cannot be retrofitted. It must be designed in from day one.

The Production Chasm: Months, Not Weeks

Weekend prototype → 4-6 month production deployment. Every time.

The gap isn’t about scale. It’s about:

  • Integration with legacy systems (some decades old)
  • Handling the 147 edge cases that never appear in testing
  • Building observability for non-deterministic systems
  • Governance, audit trails, and fallback procedures

What works: Incremental deployment (see the sketch after this list)

  1. Shadow mode (agent suggests, human decides)
  2. Conditional autonomy (routine, low-risk tasks)
  3. Autonomous with human-managed exceptions
  4. Full autonomy with periodic review
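
Those levels can become an explicit gate in the dispatch path rather than a slide in a deck. A sketch, with illustrative thresholds:

    from enum import IntEnum

    class Autonomy(IntEnum):
        SHADOW = 1        # agent suggests, human decides
        CONDITIONAL = 2   # routine, low-risk tasks only
        EXCEPTIONS = 3    # autonomous; humans handle flagged exceptions
        FULL = 4          # autonomous with periodic review

    def route(level: Autonomy, risk: float, confidence: float) -> str:
        """Return who acts on this task. Thresholds are illustrative."""
        if level == Autonomy.SHADOW:
            return "human"                    # agent output stays advisory
        if level == Autonomy.CONDITIONAL and risk > 0.2:
            return "human"
        if level == Autonomy.EXCEPTIONS and (risk > 0.5 or confidence < 0.7):
            return "human"                    # the exception path
        return "agent"

    print(route(Autonomy.CONDITIONAL, risk=0.4, confidence=0.9))  # -> human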

Most clients never move beyond level 3—and that’s fine. 70-90% automation often delivers the full business value.

Observability: Debug the Invisible

Traditional monitoring is insufficient. When an agent makes a wrong decision, you need to see its complete reasoning chain: what tools it considered, what data it retrieved, what confidence scores it assigned.

We now build full reasoning-chain capture into every system. The ability to replay and inspect agent decision-making cuts incident resolution time by 60-70% and rebuilds user trust after failures.
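
A bare-bones version of that capture, as a sketch; a production system would ship these records to a trace store rather than print them, and the field names here are ours, not a standard:

    import json, time, uuid

    def record(trace, kind, **fields):
        """One structured entry per decision, tool call, or retrieval."""
        trace.append({"id": str(uuid.uuid4()), "ts": time.time(), "kind": kind, **fields})

    trace = []
    record(trace, "tool_considered", candidates=["search", "sql"], chosen="sql")
    record(trace, "retrieval", source="orders_db", rows=42)
    record(trace, "decision", answer="reorder SKU-113", confidence=0.82)

    print(json.dumps(trace, indent=2))  # a replayable record of the whole chain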

Beyond logs: We track tool success rates, token consumption per task, latency at scale, and confidence score distributions. These production metrics matter far more than benchmark performance.

What We Measure

We’ve moved from abstract benchmarks to production-oriented metrics:

  • Latency at scale (not lab conditions)
  • Token consumption per task (real users, real workflows)
  • Tool success rates (API reliability, timeout handling)
  • Exception escalation rates (how often humans intervene)

We run nightly regression tests using “golden suites” built from production logs. These catch performance degradation before users notice.
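
In pytest terms, a golden suite can be this plain. The run_agent stub, the scorer, and the inline case are stand-ins for your deployed agent, your matching logic, and cases mined from production logs:

    import pytest

    def run_agent(prompt: str) -> str:          # stand-in for the deployed agent
        return prompt.upper()

    def score(answer: str, expected: str) -> float:
        # Placeholder scorer; real suites use semantic similarity or an LLM judge.
        return 1.0 if answer == expected else 0.0

    # Each case pairs a real production request with a human-approved answer.
    GOLDEN = [{"input": "ping", "expected": "PING"}]

    @pytest.mark.parametrize("case", GOLDEN)
    def test_agent_matches_golden(case):
        assert score(run_agent(case["input"]), case["expected"]) >= 0.9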

The Data Infrastructure Foundation

The best agents are built on solid data foundations:

  • Semantic layer for structured data (so agents understand what they’re querying)
  • Enterprise search for unstructured content (so agents can find relevant information)
  • Provenance tracking (so you know where information came from)

If your data infrastructure is chaos, your agents will be chaos. We’ve learned to assess data maturity before starting agent development.

Organizational Change: The Real Blocker

Technical barriers are solvable. Organizational barriers kill projects.

Every successful deployment required:

  • New roles (who supervises the agents?)
  • Redesigned workflows (what do humans do now?)
  • Training programs (how do people work with agents?)
  • Cultural shift (from doing work to managing AI that does work)

Success pattern: Treat it as organizational change with a technical component, not an IT project. Invest in change management alongside engineering.

Failure pattern: Assume technology alone will drive adoption.

What to Build (And What to Avoid)

Start here:

  • Automated invoice processing
  • Customer inquiry routing and responses
  • Compliance document review
  • Data entry and validation

These are boring. They’re also funded quickly and build organizational confidence.

Avoid starting here:

  • Complex multi-agent coordination
  • High-stakes decisions without human oversight
  • Processes that lack clear success metrics
  • Tasks requiring deep, unstructured reasoning

Anti-Patterns We See Repeatedly

  1. Monolithic model thinking: Using a bigger model when better tools would solve the problem
  2. Premature multi-agent complexity: Building agent hierarchies before proving single-agent value
  3. Security as afterthought: Trying to retrofit governance instead of designing it in
  4. Ignoring data movement: Optimizing compute while data transfer is the real bottleneck
  5. Mega-prompts: Cramming everything into system instructions instead of building modular components

Where This Is Going

We’re still figuring out:

  • Long-term memory: Agents that learn from experience (unsolved)
  • Multi-agent coordination: Works in demos, rare in production (complexity usually not justified)
  • Formal verification: We lack ways to specify and verify AI component behavior (fundamental limitation)

But the capability expansion is real. We’ve built systems that:

  • Operate 24/7 across all time zones
  • Handle volume spikes that would otherwise require hundreds of human hours
  • Provide personalized interactions at previously impossible scale

The Bottom Line

The path from demo to production is longer, harder, and more expensive than anyone admits publicly. Plan for 4-6 months minimum. Budget for architectural redundancy. Invest in observability from day one. Treat it as organizational change, not just technology deployment.

The hype is overblown. The capability is real. Success requires honest assessment of both.