An uncomfortable number
Multiple research studies tracking multi-agent system deployments paint a consistent picture: failure rates in production environments range from 41% to as high as 87%, depending on system complexity and the definition of “failure.” A 2025 survey of enterprise AI deployments found that 79% of multi-agent system failures traced back to specification and coordination problems — not model capability, not hallucination, not even wrong answers.
The agents were smart enough. The infrastructure connecting them wasn’t.
This is a counterintuitive finding in an industry obsessed with model benchmarks. Teams pour resources into choosing the right model, fine-tuning prompts, and optimizing inference — then watch their multi-agent system fail because Agent A couldn’t reliably communicate intent to Agent B, or because a task decomposition error in the orchestrator cascaded into system-wide failure.
The problem isn’t intelligence. It’s plumbing.
Failure mode 1: The semantic gap
Multi-agent systems can pass messages. What they can’t reliably do is share meaning.
Consider a simple scenario: a Planning Agent decomposes a customer request and delegates sub-tasks to a Research Agent and a Writing Agent. The Planning Agent sends a message: “Research competitor pricing for enterprise SaaS in the CRM category.” Seems clear to a human. But:
- Does “competitor” mean the customer’s competitors or the customer’s product’s competitors?
- Does “pricing” mean list prices, negotiated prices, or total cost of ownership?
- Does “enterprise SaaS” have a revenue threshold? Employee count?
- Does “CRM category” include marketing automation platforms with CRM features?
Humans resolve these ambiguities through shared context, follow-up questions, and organizational knowledge. Agents operating in a multi-agent system typically have none of these. Each agent has its own context window, its own system prompt, its own interpretation of terms.
This isn’t a new problem. FIPA (Foundation for Intelligent Physical Agents) spent roughly two decades, from the late 1990s through the 2010s, trying to solve agent interoperability through formal ontologies and standardized speech acts. The lesson from FIPA’s partial success and eventual reduced relevance: formal specification of meaning doesn’t scale. You can’t pre-define every possible concept and relationship that agents might need to share.
Current protocols like A2A solve the transport problem — messages get delivered reliably. But the semantic gap persists. Two agents can exchange perfectly formatted JSON and still fundamentally misunderstand each other, because there’s no shared schema for intent.
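One way to narrow the gap is to turn the ambiguous parts of a request into explicit fields instead of leaving them in free text. The sketch below is a minimal illustration, not part of A2A or any other protocol: the `ResearchIntent` class, its field names, and the `accept` function are all hypothetical, chosen to mirror the four questions raised above.

```python
from dataclasses import dataclass, fields
from enum import Enum
from typing import List, Optional


class PricingKind(Enum):
    LIST = "list"
    NEGOTIATED = "negotiated"
    TOTAL_COST_OF_OWNERSHIP = "tco"


@dataclass(frozen=True)
class ResearchIntent:
    """Hypothetical structured form of the Planning Agent's request."""
    competitors_of: Optional[str] = None                 # whose competitors are meant
    pricing_kind: Optional[PricingKind] = None           # list, negotiated, or TCO
    min_employees: Optional[int] = None                  # what "enterprise" means here
    include_adjacent_categories: Optional[bool] = None   # e.g. marketing tools with CRM features

    def unresolved(self) -> List[str]:
        """Names of the fields the sender left ambiguous (still None)."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


def accept(intent: ResearchIntent) -> str:
    """Receiving agent refuses to guess: it asks for clarification instead."""
    missing = intent.unresolved()
    if missing:
        return "clarify: " + ", ".join(missing)
    return "accepted"
```

The design choice is that an unresolved field produces a clarification request rather than a silent default, which is the follow-up question a human colleague would have asked.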
Failure mode 2: Task decomposition cascade
Task decomposition — breaking a complex goal into sub-tasks assigned to different agents — is the foundation of multi-agent architectures. It’s also the single most fragile operation in the entire system.
Here’s why: task decomposition is a decision tree where early choices constrain all downstream work. If the orchestrator decomposes a task incorrectly at the top level, every sub-agent receives a flawed mandate. And unlike human teams, where a team member might push back (“I don’t think this is the right approach”), agents typically execute their assigned sub-task without questioning the decomposition.
A well-documented pattern from production systems:
- Orchestrator receives: “Prepare a competitive analysis report for the Q2 board meeting.”
- Orchestrator decomposes into: (a) gather market data, (b) analyze competitor financials, (c) draft recommendations.
- Problem: the orchestrator didn’t include “review our own Q1 performance” as a sub-task, because the board meeting context wasn’t in its prompt.
- Result: three agents produce excellent work on their sub-tasks, and the final report is fundamentally incomplete.
The failure isn’t in any individual agent’s capability. It’s in the decomposition. And decomposition failures are hard to detect because each sub-agent’s output looks correct in isolation.
Research on multi-agent coordination consistently identifies this as the “cascade” problem: a single decomposition error propagates through the entire agent team, compounding at each level. In systems with three or more levels of decomposition, the chance of at least one meaningful decomposition error climbs quickly on complex tasks, because every additional decomposition decision is another opportunity to get it wrong.
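To make the compounding concrete, here is a back-of-the-envelope sketch. It assumes, as a simplification, that every decomposition decision fails independently with the same probability, so the chance of at least one error among n decisions is 1 - (1 - p)^n. The function names and the example numbers are illustrative assumptions, not measurements from any study.

```python
def p_any_error(p_single: float, decisions: int) -> float:
    """Probability of at least one error among `decisions` independent decisions."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if decisions < 0:
        raise ValueError("decisions must be non-negative")
    return 1.0 - (1.0 - p_single) ** decisions


def decisions_in_tree(branching: int, levels: int) -> int:
    """Count decomposition decisions when every task at each of `levels` levels
    is split into `branching` sub-tasks: 1 + b + b^2 + ... + b^(levels - 1)."""
    return sum(branching ** level for level in range(levels))


if __name__ == "__main__":
    n = decisions_in_tree(branching=3, levels=3)  # 1 + 3 + 9 decisions
    for p in (0.02, 0.05, 0.10):
        print(f"p={p:.2f}, decisions={n}: P(any error)={p_any_error(p, n):.2f}")
```

With three levels and a branching factor of three there are 13 decomposition decisions, so even an illustrative 5% per-decision error rate gives close to even odds that something in the plan is wrong. Real errors are unlikely to be independent, so treat this as intuition rather than a prediction.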
Failure mode 3: Context erosion
Every time a message passes from one agent to another, context is lost. The receiving agent gets the message content but not the sender’s full reasoning, not the broader task context, not the history of decisions that led to this particular message.
In a linear agent chain (A → B → C → D), the context available to each agent decreases monotonically (the percentages below are illustrative, not measured):
Agent A ████████████████████ 100% context
│ (lossy summary)
Agent B ██████████████ 70% context
│ (lossy summary)
Agent C █████████ 45% context
│ (lossy summary)
Agent D ████ 20% context ← intent barely recognizable
- Agent A has full context: the original user request, system state, relevant history.
- Agent B has Agent A’s output, which is a compressed representation of Agent A’s context.
- Agent C has Agent B’s output, which is a compressed representation of Agent B’s compression of Agent A’s context.
By the end of the chain, the original intent may be barely recognizable. This is the “telephone game” problem, and it’s structural, not a bug. Each agent’s context window is finite, each agent’s output is a lossy summary, and there’s typically no mechanism for downstream agents to query upstream context.
Some systems attempt to solve this with a shared context store — a database or document that all agents can read and write. But shared state introduces its own coordination problems: race conditions, stale reads, conflicting writes. Without a formal consistency model, shared context stores often make the problem worse by giving agents the illusion of shared understanding while actually serving stale or contradictory data.
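One common way to give a shared store a minimal consistency model is optimistic concurrency: every write declares which version it read, and the store rejects writes that were based on a stale read. The sketch below is an in-memory illustration only; `VersionedContextStore`, `StaleWriteError`, and the method names are hypothetical and not taken from any particular framework.

```python
import threading
from typing import Any, Dict, Tuple


class StaleWriteError(Exception):
    """Raised when a writer's view of a key is out of date."""


class VersionedContextStore:
    """Hypothetical in-memory context store with optimistic concurrency control."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._data: Dict[str, Tuple[int, Any]] = {}

    def read(self, key: str) -> Tuple[int, Any]:
        """Return (version, value); version 0 means the key was never written."""
        with self._lock:
            return self._data.get(key, (0, None))

    def write(self, key: str, value: Any, expected_version: int) -> int:
        """Store `value` only if nobody has written since `expected_version` was read."""
        with self._lock:
            current_version, _ = self._data.get(key, (0, None))
            if current_version != expected_version:
                raise StaleWriteError(
                    f"{key}: expected version {expected_version}, found {current_version}"
                )
            new_version = current_version + 1
            self._data[key] = (new_version, value)
            return new_version
```

An agent that hits `StaleWriteError` has to re-read and reconcile before retrying, so a conflicting write becomes a visible event instead of a silent overwrite.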
The absence of a standardized context model — a common schema for representing task state, decision history, and agent-specific context — is one of the most significant infrastructure gaps in the multi-agent ecosystem.
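As a rough idea of what such a schema might contain, the sketch below models a handoff envelope in which the original request travels verbatim and the decision history only ever grows. It is an assumption-laden illustration: `TaskContext`, `Decision`, and every field name are hypothetical, not an existing standard.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class Decision:
    agent: str
    choice: str
    rationale: str


@dataclass(frozen=True)
class TaskContext:
    """Hypothetical handoff envelope passed along with every inter-agent message."""
    original_request: str                    # never summarized, always verbatim
    task_state: str                          # current status of the overall task
    decisions: Tuple[Decision, ...] = ()     # append-only decision history
    agent_notes: Dict[str, str] = field(default_factory=dict)  # agent-specific context

    def hand_off(self, agent: str, choice: str, rationale: str,
                 new_state: str, note: str = "") -> "TaskContext":
        """Return the context for the next agent, keeping everything upstream."""
        notes = dict(self.agent_notes)  # copy so earlier envelopes stay unchanged
        if note:
            notes[agent] = note
        return TaskContext(
            original_request=self.original_request,
            task_state=new_state,
            decisions=self.decisions + (Decision(agent, choice, rationale),),
            agent_notes=notes,
        )
```

The point of the design is that a downstream agent can always read the untouched original request and the reasons behind earlier choices, rather than a summary of a summary.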
It’s not the agents. It’s the infrastructure.
A pattern emerges across all three failure modes:
| Failure mode | What works | What’s missing |
|---|---|---|
| Semantic gap | Messages delivered reliably | Shared understanding of intent |
| Task decomposition cascade | Sub-tasks executed correctly | Validation of decomposition quality |
| Context erosion | Information passed between agents | Standardized context model that limits loss at each handoff |
The agents are increasingly capable, but the infrastructure connecting them is thin.
This is reminiscent of a well-known anti-pattern in software engineering: the “Big Ball of Mud.” Components are individually functional but poorly connected, with ad-hoc communication, no clear interfaces, and no centralized observability. The system “works” in demos and simple cases, then collapses under the complexity of real-world tasks.
Model capability is advancing rapidly. GPT-4o, Claude, Gemini — each generation is meaningfully smarter than the last. But coordination infrastructure is advancing much more slowly. We have better agents running on the same thin coordination layer, which means failure rates stay high even as individual agent quality improves.
Three principles for multi-agent systems that work
Teams that have successfully deployed multi-agent systems in production tend to follow three principles:
1. Standardize communication protocols
Don’t build custom inter-agent communication. Use A2A (Agent2Agent) for agent-to-agent interaction and MCP (Model Context Protocol) for agent-to-tool interaction. These protocols aren’t just a convenience; they’re reliability infrastructure. Standard protocols mean standard error handling, standard discovery, and standard lifecycle management.
The cost of custom protocols isn’t just development time. It’s the debugging time when Agent A’s custom JSON format doesn’t quite match Agent B’s expectations at 2 AM on a Saturday.
2. Centralize observability and audit
Every message between agents, every task decomposition decision, every context handoff should be logged in a central, immutable store. Not in each agent’s individual logs — in a dedicated observability layer that provides a system-wide view.
This isn’t just for debugging (though it transforms debugging from “impossible” to “tedious”). It’s for understanding failure patterns, validating decomposition quality, and meeting compliance requirements. You cannot improve what you cannot observe.
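A central log can be made tamper-evident with a simple hash chain, where each record commits to the one before it. The sketch below shows the idea under stated assumptions: `AuditLog` and its record layout are hypothetical, the log lives in memory, and timestamps are left out to keep the example deterministic. A production system would sit on write-once storage or an existing log service.

```python
import hashlib
import json
from typing import Any, Dict, List


class AuditLog:
    """Hypothetical append-only log; each record's hash covers the previous hash."""

    GENESIS = "0" * 64  # placeholder "previous hash" for the first record

    def __init__(self) -> None:
        self._records: List[Dict[str, Any]] = []

    @staticmethod
    def _digest(prev_hash: str, event: Dict[str, Any]) -> str:
        payload = json.dumps({"prev": prev_hash, "event": event}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def append(self, sender: str, receiver: str, kind: str, body: str) -> str:
        """Record one inter-agent event and return its hash."""
        event = {"sender": sender, "receiver": receiver, "kind": kind, "body": body}
        prev_hash = self._records[-1]["hash"] if self._records else self.GENESIS
        record_hash = self._digest(prev_hash, event)
        self._records.append({"event": event, "prev": prev_hash, "hash": record_hash})
        return record_hash

    def verify(self) -> bool:
        """Recompute the chain; any edited or reordered record breaks it."""
        prev_hash = self.GENESIS
        for record in self._records:
            if record["prev"] != prev_hash:
                return False
            if self._digest(prev_hash, record["event"]) != record["hash"]:
                return False
            prev_hash = record["hash"]
        return True
```

Logging decomposition decisions and context handoffs as their own `kind` values, alongside ordinary messages, is what makes it possible to reconstruct later why a sub-task existed at all.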
3. Enforce governance at the architecture level
Don’t rely on individual agents to self-govern. Agents will make mistakes, get confused, and occasionally behave in unexpected ways. Governance — access control, content filtering, action budgets, human approval gates — must be enforced by infrastructure that the agents cannot override.
This means governance in the message pipeline (who can talk to whom, what content is allowed) and governance in the execution layer (what tools can be called, how many actions are allowed). Two independent layers, neither controlled by the agents they govern.
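The two layers can be sketched as two small policy objects that sit outside the agents. Everything here is a hypothetical illustration: the class names, the route and tool allowlists, and the per-agent action budget are assumptions about how such a gate might look, not a reference to any existing framework.

```python
from dataclasses import dataclass, field
from typing import Dict, FrozenSet, Tuple


class PolicyViolation(Exception):
    """Raised by the infrastructure, never by the agent being governed."""


@dataclass
class MessagePolicy:
    """Layer 1 (message pipeline): who may talk to whom."""
    allowed_routes: FrozenSet[Tuple[str, str]]

    def check(self, sender: str, receiver: str) -> None:
        if (sender, receiver) not in self.allowed_routes:
            raise PolicyViolation(f"route {sender} -> {receiver} is not allowed")


@dataclass
class ExecutionPolicy:
    """Layer 2 (execution): which tools an agent may call, and how many times."""
    allowed_tools: Dict[str, FrozenSet[str]]
    action_budget: Dict[str, int]
    _used: Dict[str, int] = field(default_factory=dict)

    def authorize(self, agent: str, tool: str) -> None:
        if tool not in self.allowed_tools.get(agent, frozenset()):
            raise PolicyViolation(f"{agent} may not call {tool}")
        used = self._used.get(agent, 0)
        if used >= self.action_budget.get(agent, 0):
            raise PolicyViolation(f"{agent} has exhausted its action budget")
        self._used[agent] = used + 1  # count only authorized actions
```

Both checks default to deny: an agent or route missing from the configuration gets no access and no budget, so forgetting to configure something fails closed.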
A failure rate as high as 87% isn’t a condemnation of multi-agent architecture. It’s a signal that the industry is building applications on incomplete infrastructure. The agents are ready. The infrastructure needs to catch up.
The teams that invest in coordination infrastructure (standard protocols, centralized observability, architectural governance) are the ones finding their way into the successful minority. The rest keep building smarter agents on thinner foundations, and wondering why production keeps failing.