A red team exercise gone wrong
In a 2024 red team exercise documented by researchers studying multi-agent system security, a test agent was able to exfiltrate sensitive internal data within two hours — despite having guardrails installed on its tool use and reasoning chain. The guardrails worked exactly as designed: they constrained the agent’s actions. But the attack didn’t come through the agent’s actions. It came through the messages.
The attacker crafted inputs that manipulated the agent’s context window, injected instructions into what appeared to be user data, and extracted information through the agent’s responses. The tool-call guardrails never fired because the exploit never involved unauthorized tool calls. It operated entirely in the message layer — the layer that had no governance at all.
This pattern — secure execution, insecure message flow — is far more common than most teams realize. It reveals a structural gap in how the industry thinks about agent governance.
Layer 1: What everyone knows — agent-execution governance
Agent-execution governance is the well-understood layer. It controls what an agent does after it receives a message and starts processing:
- Tool-call approval — Does this agent have permission to call this tool? Does this specific call need human approval?
- Reasoning chain monitoring — Is the agent’s chain of thought drifting into unsafe territory? Is it attempting prompt injection through self-generated content?
- Action budgets — How many tool calls can the agent make per task? How much compute can it consume?
- Output validation — Does the agent’s final response contain PII, hallucinated facts, or policy-violating content?
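Two of the controls above, tool-call approval and action budgets, can be sketched in a few lines. This is a minimal illustration, not how any of the tools named in this article implement them; `ExecutionPolicy`, `ExecutionGuard`, and the `"allow"` / `"escalate"` return values are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when the agent has used up its tool-call budget for a task."""


class ToolNotAllowed(Exception):
    """Raised when the agent requests a tool outside its allow-list."""


@dataclass
class ExecutionPolicy:
    # Hypothetical policy fields; real frameworks expose different knobs.
    allowed_tools: set[str]
    needs_human_approval: set[str] = field(default_factory=set)
    max_tool_calls: int = 5


@dataclass
class ExecutionGuard:
    policy: ExecutionPolicy
    calls_made: int = 0

    def check_tool_call(self, tool_name: str) -> str:
        """Return 'allow' or 'escalate'; raise if the call must be blocked."""
        if tool_name not in self.policy.allowed_tools:
            raise ToolNotAllowed(tool_name)
        if self.calls_made >= self.policy.max_tool_calls:
            raise BudgetExceeded(f"limit of {self.policy.max_tool_calls} reached")
        self.calls_made += 1
        if tool_name in self.policy.needs_human_approval:
            return "escalate"  # caller should pause and ask a human
        return "allow"
```

The guard only sees tool calls. A message that never triggers a tool call passes through it untouched, which is the gap the rest of this article is about.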
This layer has a mature and growing ecosystem. Guardrails AI, NeMo Guardrails (NVIDIA), Patronus AI, and framework-level features in LangGraph, AutoGen, and others all operate here. When people say “agent safety,” this is usually what they mean.
It’s necessary. It’s also only half the picture.
Layer 2: What most teams miss — message-flow governance
Message-flow governance operates at a different level: it controls the pipeline through which messages flow to and from agents, independent of the agent’s own behavior.
Inbound governance (before the agent sees the message)
- Sender authentication — Is this message actually from who it claims to be? Is the user’s identity verified through the chat platform’s auth, not just a string in the payload?
- Authorization — Is this user allowed to talk to this specific agent? At this time? About this topic?
- Content filtering — Does the inbound message contain injection attempts? PII that should be redacted before reaching the agent? Content that violates policy?
- Rate limiting — Is this user or this channel sending messages at a rate that suggests abuse?
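Three of the inbound checks above can be sketched with the Python standard library. The sketch assumes the chat platform signs each request body with HMAC-SHA256 using a shared secret, which is a common webhook pattern but not universal, so check your platform’s documentation. `verify_sender`, `SlidingWindowLimiter`, and `is_authorized` are hypothetical helper names. Content filtering is left out here because it is far harder to do well than a short example can show.

```python
import hashlib
import hmac
import time
from collections import defaultdict, deque
from typing import Optional


def verify_sender(body: bytes, signature_hex: str, shared_secret: bytes) -> bool:
    """Check an HMAC-SHA256 signature computed over the raw request body."""
    expected = hmac.new(shared_secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the mismatch position through timing.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())


class SlidingWindowLimiter:
    """Allow at most `limit` messages per sender within `window_s` seconds."""

    def __init__(self, limit: int, window_s: float):
        self.limit = limit
        self.window_s = window_s
        self._seen = defaultdict(deque)  # sender id -> recent timestamps

    def allow(self, sender_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        stamps = self._seen[sender_id]
        # Drop timestamps that have aged out of the window.
        while stamps and now - stamps[0] >= self.window_s:
            stamps.popleft()
        if len(stamps) >= self.limit:
            return False
        stamps.append(now)
        return True


def is_authorized(user_id: str, agent_id: str, acl: dict[str, set[str]]) -> bool:
    """The ACL maps an agent id to the set of user ids allowed to message it."""
    return user_id in acl.get(agent_id, set())
```

All three functions run before the agent is invoked and take no input from it, so nothing the agent produces can change their verdict.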
Outbound governance (before the response reaches the user)
- Response filtering — Does the agent’s response contain information the user shouldn’t see? Internal system details? Other users’ data?
- Compliance checks — Does the response meet regulatory requirements for the jurisdiction, industry, and use case?
- Watermarking and attribution — Is the response clearly marked as AI-generated where required?
- Audit capture — Is the complete message pair (input + output) logged in an immutable, tamper-evident audit trail?
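As a rough illustration of response filtering and attribution, the sketch below redacts two kinds of content and prefixes an AI-generated notice. The patterns, the `internal.example.com` domain, and the notice text are assumptions made up for the example; real redaction needs much broader coverage than two regular expressions.

```python
import re

# Illustrative patterns only; production redaction needs far broader coverage.
REDACTION_PATTERNS = {
    "email": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "internal_host": re.compile(r"\b[a-z0-9-]+\.internal\.example\.com\b"),
}
AI_NOTICE = "[AI-generated response]"


def filter_outbound(response: str) -> tuple[str, list[str]]:
    """Redact matches and return the cleaned text plus the labels that fired."""
    fired = []
    for label, pattern in REDACTION_PATTERNS.items():
        response, count = pattern.subn(f"[REDACTED:{label}]", response)
        if count:
            fired.append(label)
    return f"{AI_NOTICE} {response}", fired
```

Returning the list of labels that fired lets the caller record every redaction in the audit trail, so a filter that triggers often becomes visible rather than silent.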
The critical design principle: message-flow governance must be external to the agent. If the agent itself is responsible for filtering its own inputs and validating its own outputs, a compromised or hallucinating agent can bypass those controls. Governance of the message flow must operate in a layer the agent cannot influence.
This is not a theoretical concern. Prompt injection attacks, jailbreaks, and context manipulation all exploit the assumption that the agent can be trusted to police its own inputs. An external message-flow governance layer makes these attacks significantly harder because the filtering happens before the agent ever sees the adversarial content.
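One way to picture "external to the agent" is a wrapper that owns the checks and treats the agent as an opaque function. The sketch below assumes the agent can be modeled as text in, text out; `govern`, `Blocked`, and the two example checks are hypothetical. The substring match in particular is a placeholder and would not stop a real injection attempt.

```python
from typing import Callable

Agent = Callable[[str], str]  # the agent sees text in, text out
Check = Callable[[str], str]  # returns (possibly rewritten) text, or raises


class Blocked(Exception):
    """Raised by a check to stop a message from crossing the boundary."""


def govern(agent: Agent, inbound: list[Check], outbound: list[Check]) -> Agent:
    """Wrap an agent so every message passes checks the agent cannot reach."""

    def governed(message: str) -> str:
        for check in inbound:
            message = check(message)
        reply = agent(message)
        for check in outbound:
            reply = check(reply)
        return reply

    return governed


# Placeholder checks, for illustration only.
def reject_override_phrases(message: str) -> str:
    if "ignore previous instructions" in message.lower():
        raise Blocked("possible injection")
    return message


def strip_secret(reply: str) -> str:
    return reply.replace("SECRET-123", "[REDACTED]")
```

Because the checks live in the wrapper’s closure, an agent that has been manipulated into leaking `SECRET-123` still has its reply rewritten on the way out. In a real deployment the same separation would usually be enforced by a process or network boundary, not only by a function call.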
How the two layers work together
A well-governed agent system has both layers operating independently:
User Message
↓
[Message-Flow Governance: Inbound]
- Authenticate sender
- Check authorization
- Filter/redact content
- Log inbound message
↓
Agent Runtime
↓
[Agent-Execution Governance]
- Monitor reasoning
- Approve tool calls
- Enforce action budgets
↓
Agent Response
↓
[Message-Flow Governance: Outbound]
- Filter response content
- Check compliance
- Log outbound message
↓
User receives response
Each layer catches different failure modes:
| Failure scenario | Which layer prevents it? |
|---|---|
| Unauthorized user talks to sensitive agent | Message-flow (inbound auth) |
| Prompt injection in user message | Message-flow (inbound content filter) |
| Agent calls unauthorized tool | Agent-execution (tool approval) |
| Agent response leaks confidential data | Message-flow (outbound filter) |
| Agent enters infinite reasoning loop | Agent-execution (action budget) |
| No audit trail of agent interactions | Message-flow (audit logging) |
| Agent’s safety prompt is overridden | Both layers together |
If only execution governance exists, prompt injections reach the agent unchecked, and agent responses reach users with no filter outside the agent’s own runtime. If only message-flow governance exists, the agent can misuse tools and hallucinate freely. You need both, and they must operate independently.
Current tooling landscape
| Layer | What it governs | Example tools & approaches |
|---|---|---|
| Execution governance | Agent behavior after receiving a message | Guardrails AI, NeMo Guardrails, Patronus AI, LangGraph checkpoints, human-in-the-loop approval |
| Message-flow governance | Pipeline before/after the agent | API gateways, message middleware, WAF rules, dedicated agent proxy layers |
| Both (partial) | Varies | Provider-level safety (OpenAI Moderation API; model-level safety training such as Anthropic’s Constitutional AI) |
Most investment today is in execution governance. Message-flow governance is often improvised: a few lines in a webhook handler, an if-statement checking user roles. The gap between “we check some things” and “we have systematic message-flow governance” is where message-layer attacks like the one described at the start of this article get through.
The regulatory perspective
The EU AI Act, which entered into force in August 2024 and applies in phases from 2025 onward, makes this two-layer model practically mandatory for high-risk AI systems:
Article 12 (Record-keeping) requires providers to ensure their AI systems have automatic logging of events throughout the system’s lifetime. This maps directly to message-flow governance — an audit trail of every message in and out, independent of the agent’s own logs.
Article 14 (Human oversight) requires that AI systems be designed so humans can effectively oversee their operation. This requires both layers: execution governance enables oversight of the agent’s actions, and message-flow governance enables oversight of what reaches the agent and what it sends back.
Article 9 (Risk management) requires continuous identification and mitigation of risks. A single governance layer leaves an entire class of risks (message-layer attacks, unauthorized access, unfiltered outputs) unaddressed.
Organizations building agent systems for the EU market — or simply following governance best practices — need to demonstrate both layers. “We have guardrails on the agent” is an incomplete answer to an auditor’s question about AI system governance.
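For the record-keeping point above, one common technique for making a log tamper-evident is a hash chain, where each entry includes the hash of the entry before it. The Act does not prescribe this or any other mechanism; the sketch below is one possible approach, and `AuditLog` and its field names are hypothetical. Timestamps are omitted to keep the example deterministic, and a real record would carry them.

```python
import hashlib
import json


class AuditLog:
    """Append-only log where each entry commits to the hash of the previous one."""

    GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

    def __init__(self):
        self.entries = []

    def append(self, direction: str, sender: str, content: str) -> dict:
        prev = self.entries[-1]["hash"] if self.entries else self.GENESIS
        record = {"direction": direction, "sender": sender,
                  "content": content, "prev": prev}
        record["hash"] = self._digest(record)
        self.entries.append(record)
        return record

    @staticmethod
    def _digest(record: dict) -> str:
        # Hash every field except the hash itself, in a stable key order.
        body = {k: v for k, v in record.items() if k != "hash"}
        payload = json.dumps(body, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def verify(self) -> bool:
        """Recompute the chain and report whether any entry was altered."""
        prev = self.GENESIS
        for record in self.entries:
            if record["prev"] != prev or record["hash"] != self._digest(record):
                return False
            prev = record["hash"]
        return True
```

A chain like this detects edits to individual entries, but someone who can rewrite the whole log can also recompute every hash. Storing the latest hash somewhere the agent runtime cannot write to is what makes the trail independent in the sense this article uses.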
A governance checklist
For teams evaluating their agent governance posture, here are the questions to ask across both layers:
Agent-execution governance
- Are tool calls subject to approval policies?
- Is the agent’s reasoning chain monitored for unsafe patterns?
- Are there action budgets (max steps, max tokens, max tool calls)?
- Is the agent’s output validated before returning?
- Can execution governance be updated without redeploying the agent?
Message-flow governance
- Are sender identities authenticated through the source platform?
- Are per-user, per-agent authorization policies enforced?
- Is inbound content filtered for injection and policy violations?
- Is outbound content filtered for PII and compliance?
- Is there an immutable audit trail independent of the agent runtime?
- Can message-flow governance be updated without modifying the agent?
Independence
- Do the two layers operate independently?
- Can one layer fail without compromising the other?
- Is the agent unable to modify its own message-flow governance?
If the answer to any of these questions is no, there’s a gap in your governance architecture. The two-layer model isn’t about adding complexity; it’s about ensuring that the inevitable failure of any single layer doesn’t compromise the entire system.
Security is about defense in depth. Agent governance is no different.