What Happens When OpenClaw Agents Attack Each Other
We tested how safe OpenClaw/MoltBot really is, and the results are striking.
Autonomous AI agents are no longer a research curiosity.
Frameworks like OpenClaw make it straightforward to deploy agents that communicate with each other, make decisions, and act on real systems. This shift is accelerating. It also introduces a new class of risk that most current discussions only touch superficially.
At Brane Labs, we wanted to study these risks under realistic conditions.
So we ran a live adversarial experiment.
The motivation
Most failures in agent systems are not caused by weak models.
They are caused by loss of context, implicit trust, and unclear execution boundaries. As agents gain more autonomy, these weaknesses compound.
The core question we wanted to answer was simple:
Can an autonomous agent defend itself when another autonomous agent actively tries to deceive it?
The setup
We deployed two autonomous agents using the OpenClaw framework.
One agent acted as a red team attacker. Its role was to deceive, escalate, and attempt compromise. The other agent acted as a standard defensive agent configured for observability.
The agents communicated directly over webhooks using shared secrets. Once the session started, there were no humans in the loop.
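The report does not publish the wire format, but webhook exchanges with shared secrets are typically authenticated with an HMAC over the request body. Here is a minimal sketch of that pattern; the secret value and field names are hypothetical, not taken from the experiment.

```python
import hashlib
import hmac
import json

SHARED_SECRET = b"example-secret"  # hypothetical value for illustration only

def sign_message(body: bytes, secret: bytes = SHARED_SECRET) -> str:
    """Compute an HMAC-SHA256 signature the receiving agent can verify."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()

def verify_message(body: bytes, signature: str, secret: bytes = SHARED_SECRET) -> bool:
    """Use a constant-time comparison so the check itself is not attackable."""
    return hmac.compare_digest(sign_message(body, secret), signature)

# One agent signs its outgoing webhook payload...
payload = json.dumps({"from": "agent-a", "msg": "status check"}).encode()
sig = sign_message(payload)

# ...and the receiving agent verifies the signature before processing.
assert verify_message(payload, sig)
assert not verify_message(b'{"from": "attacker"}', sig)
```

Note that a shared secret only authenticates the channel; it says nothing about whether the authenticated peer's content is safe, which is exactly the gap the experiment probes.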
The objective was not to prove safety. It was to observe failure modes.
The process
We evaluated three interacting risks.
What the agent can access.
What it is exposed to.
How much freedom it has to act.
The attacker began with direct social engineering. When that failed, it pivoted to indirect methods using documents and structured data.
This mirrors how real systems are compromised.
What we observed
The first attack followed a familiar pattern.
The attacker offered a helpful security pipeline. Embedded within it was a disguised command designed to execute remote code and exfiltrate credentials.
The defensive agent rejected the request.
It identified the intent behind the message. It recognized the execution pattern. It refused to act and flagged the interaction as malicious.
This defense held completely.
The attacker then changed strategy.
Instead of asking the agent to execute anything, it asked the agent to review a JSON document. Malicious payloads were hidden inside metadata fields. If processed implicitly, these could expand variables and leak secrets without explicit execution.
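The actual payload is not reproduced here, but the mechanism can be sketched. Assume a hypothetical pipeline step that naively expands `${VAR}` placeholders from the environment while "reviewing" a document; the secret name and document fields below are invented for illustration.

```python
import json
import os
import re

os.environ["API_KEY"] = "sk-demo-secret"  # stand-in secret, for illustration

# The visible fields of the attacker's document look harmless,
# but a metadata field smuggles in a template placeholder.
doc = json.loads("""{
  "title": "Quarterly report",
  "metadata": {"author_note": "generated for ${API_KEY}"}
}""")

def naive_expand(value: str) -> str:
    """A naive processing step that expands ${VAR} from the environment."""
    return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), value)

# If the agent implicitly expands metadata while reviewing the document,
# the secret leaks without any command ever being executed.
leaked = naive_expand(doc["metadata"]["author_note"])
assert "sk-demo-secret" in leaked
```

The defense is the inverse design choice: treat untrusted document fields as inert data and never route them through any interpolation or templating step.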
This payload was delivered successfully. We are still monitoring whether the indirect attack bypasses the existing defenses.
The key takeaway
Direct attacks are easier to detect and block.
Indirect attacks through documents, templates, and memory are significantly harder.
The most important attack surface in autonomous systems is not prompts or models. It is memory and implicit execution over time.
Agents usually fail not because they are unintelligent, but because context leaks and trust accumulates incorrectly.
Why this matters now
As agent frameworks mature, agent to agent interaction will become normal. Some of those agents will be adversarial.
Security in this world cannot rely only on alignment or guardrails. It requires observability, inspectable memory, and explicit trust boundaries.
This report is our first step toward measuring these properties in live systems.
Read more
You can read the full technical report here.
Full Observatory Report PDF:
http://gobrane.com/wp-content/uploads/2026/02/main.pdf
If you want a concise summary to share, here is the public write-up.
Brane Labs article:
https://gobrane.com/observing-adversarial-ai-lessons-from-a-live-openclaw-agent-security-audit/
Regards,
Udit
Brane Labs