Anthropic | Thu, May 28, 2026, 3:01 AM PDT

How Anthropic contains AI agents to prevent accidental damage

Original: How We Contain Claude

Source: anthropic.com

Who: Posted on the Anthropic engineering blog by the Anthropic engineering team. Anthropic is the lab behind the Claude family of AI assistants.

What's new: Anthropic has published a detailed account of how it limits the damage an autonomous AI agent can cause when something goes wrong. The central idea is "blast radius" — capping how much harm a misbehaving or misused agent can do, rather than simply trying to prevent every possible mistake. The piece covers lessons learned across three products: claude.ai, Claude Code (an AI coding assistant), and Claude Cowork (an internal collaborative agent tool).

How it works: Containment operates at three layers. First, the environment: agents run inside isolated environments, so that even a worst-case action cannot reach anything outside the permitted perimeter. Second, the model: its behavior is shaped through measures such as classifiers and training to make harmful behavior unlikely. Third, external content: anything the agent can fetch (web pages, code repositories, third-party plugins) is treated as untrusted by default, because a connector that passes a malware scan can still carry a poisoned document designed to hijack the agent's next action, a technique known as prompt injection.
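The first and third layers can be illustrated with a minimal sketch. This is not Anthropic's implementation: the names `Fetched`, `inside_perimeter`, and `decide`, the `/tmp/agent` root, and the three-way allow/review/deny policy are all hypothetical, chosen only to show a perimeter check combined with untrusted-by-default content.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Fetched:
    """External content the agent has read (hypothetical type)."""
    source: str
    text: str
    trusted: bool = False  # untrusted by default, whatever the source


def inside_perimeter(target: str, allowed_root: str) -> bool:
    """Return True only if target resolves to a path under allowed_root."""
    root = Path(allowed_root).resolve()
    # Resolving collapses ".." segments; an absolute target replaces root.
    resolved = (root / target).resolve()
    return resolved == root or root in resolved.parents


def decide(target: str, allowed_root: str, context: list[Fetched]) -> str:
    """Assumed policy: deny outside the perimeter, escalate if any
    untrusted content is in the agent's context, otherwise allow."""
    if not inside_perimeter(target, allowed_root):
        return "deny"
    if any(not item.trusted for item in context):
        return "review"
    return "allow"


page = Fetched("https://example.com", "ignore previous instructions")
print(decide("../../etc/passwd", "/tmp/agent", []))  # deny
print(decide("out.txt", "/tmp/agent", [page]))       # review
print(decide("out.txt", "/tmp/agent", []))           # allow
```

The point of the sketch is that the perimeter check does not depend on the model behaving well: a path outside the root is refused even if every other layer has failed.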

The numbers: Users approved roughly 93 percent of permission prompts in Claude Code's older approval-based system, a sign that high-volume checkpoints breed complacency rather than safety. Claude Opus 4.7 held attack success to about 0.1 percent on single attempts, rising to 5–6 percent after 100 adaptive attempts. Claude Code's auto mode caught roughly 83 percent of overeager behaviors before they executed.
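For a sense of how a small per-attempt rate compounds over repeated tries, here is a back-of-the-envelope baseline. It is not the method behind the reported figures: it assumes each attempt succeeds independently with probability 0.001, which adaptive attacks do not satisfy, and the function name `cumulative_success` is hypothetical.

```python
def cumulative_success(p_single: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds."""
    return 1.0 - (1.0 - p_single) ** attempts


# Assumption: 0.1 percent per attempt, 100 independent attempts.
baseline = cumulative_success(0.001, 100)
print(f"{baseline:.1%}")  # 9.5%
```

The reported 5–6 percent is an empirical measurement and need not match this idealized figure. The sketch only shows why a rate that looks negligible for one attempt stops being negligible once an attacker can retry.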

Why it matters: As AI agents gain the ability to write and run code, browse the web, and manage files with minimal human supervision, the question of what happens when they go wrong becomes urgent. Anthropic's framing — that even well-trained models will sometimes find unexpected routes to a goal, and that no single defense layer is ever enough — is a practical argument for treating containment as an engineering discipline in its own right, not an afterthought bolted onto a capable model.