AI Agent Security Is an Infrastructure Problem — Not a Model Problem
By Sarah Collins, Director of Research & Intelligence — TPM Content Research
Bigger models do not buy you safer agents. That is the uncomfortable finding from the most concrete security research of June 2026, and it is the one TPMs shipping agent programs in regulated industries need to hear before their next architecture review. The attack surface scales with capability. The defense surface has to be designed in, not inferred from the model's parameters.
Three independent signals from the past week point to the same conclusion. Zico Kolter and Matt Fredrikson's Gray Swan research found no correlation between model capability (GPQA Diamond scores) and robustness to adversarial prompt injection. Frontier models break at least as easily as smaller ones, sometimes more so. Hugging Face's OpenClaw PR-triage case study shipped to production this month with a restricted shell ("reposhell") as the primary security control, not a better model. AWS's Bedrock AgentCore Payments launch with Ampersend addresses payment-layer governance with session budgets and scoped IAM roles the model cannot override. None of these teams coordinated. They all landed on the same answer: agent security is earned through infrastructure design, not model selection.
The lethal trifecta is a checklist, not a theory
Gray Swan's most useful contribution is not the red-teaming tool. It is the framing. The "lethal trifecta" for prompt injection has three legs: the agent ingests untrusted data, the agent accesses private information, and the agent can exfiltrate. If any leg is missing, the attack fails. If all three are present, no amount of model capability closes the gap. This is the same shape as the classic web security triad (input + trust + execution authority), and it is solved the same way: by removing a leg at the architecture layer, not by getting a smarter model to spot the attack.
For a TPM owning an agent rollout, the lethal trifecta is the first question in the security review, not the last. Walk the agent's data flow on a whiteboard. Where does untrusted input enter? What private data does the agent see? What are the exfiltration paths? If all three questions have a real answer, the model is irrelevant. You have a sandboxing problem. The model is not going to save you.
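The whiteboard walk can be written down as a small check. This is a minimal sketch under my own assumptions: `AgentDataFlow`, `trifecta_legs`, and `has_lethal_trifecta` are hypothetical names, and the example entries are illustrative rather than taken from any of the write-ups above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentDataFlow:
    """Hypothetical hand-filled summary of one agent's data flow."""
    untrusted_inputs: tuple[str, ...]     # where untrusted input enters
    private_data: tuple[str, ...]         # private data the agent can see
    exfiltration_paths: tuple[str, ...]   # ways data can leave


def trifecta_legs(flow: AgentDataFlow) -> dict[str, bool]:
    """Report which of the three legs are present in this design."""
    return {
        "untrusted_input": bool(flow.untrusted_inputs),
        "private_data": bool(flow.private_data),
        "exfiltration": bool(flow.exfiltration_paths),
    }


def has_lethal_trifecta(flow: AgentDataFlow) -> bool:
    """True only when all three legs are present at once."""
    return all(trifecta_legs(flow).values())


# Illustrative agent: reads untrusted text and private code, but has no way out.
triage_bot = AgentDataFlow(
    untrusted_inputs=("pull request descriptions",),
    private_data=("private repository contents",),
    exfiltration_paths=(),
)
print(trifecta_legs(triage_bot))
print(has_lethal_trifecta(triage_bot))  # False: the exfiltration leg is removed
```

The value of writing it down is the empty tuple. It forces the review to name which leg the design removes instead of assuming the model will notice the attack.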
Reposhell is the pattern, not the product
The Hugging Face reposhell story is the cleanest illustration of the week. The team's OpenClaw PR-triage agent reads code, opens pull requests, and posts comments. The security-critical step is "execute git operations." Rather than trust the LLM to choose the right git subcommand safely, they shipped a deterministic restricted shell that permits only read-only git operations. The model can ask for diffs. It cannot push, force-push, merge, or rewrite history. The LLM drives the labeling decision. A rule drives the execution path. The two layers are deliberately separated.
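To make the separation concrete, here is a minimal sketch of a deterministic git allowlist. It is not the Hugging Face implementation: the function name, the allowed subcommands, and the blocked options are all my assumptions, and the blocked-option list is illustrative rather than exhaustive.

```python
import shlex

# Assumed allowlist of read-only git subcommands (hypothetical; the real
# reposhell's list is not given in the write-up).
READ_ONLY_GIT = frozenset({"diff", "log", "show", "status", "blame"})

# Options that can write files or launch programs even under a read-only
# subcommand. Illustrative only, not a complete list.
BLOCKED_OPTION_PREFIXES = ("--output", "--ext-diff")


def validate_git_command(command: str) -> list[str]:
    """Return an argv list if the command is allowed, else raise PermissionError."""
    argv = shlex.split(command)
    if len(argv) < 2 or argv[0] != "git":
        raise PermissionError("only git commands are permitted")
    # The subcommand must come straight after "git", so global options such
    # as "git -c key=value ..." are rejected by this check too.
    if argv[1] not in READ_ONLY_GIT:
        raise PermissionError(f"git subcommand not allowed: {argv[1]!r}")
    for arg in argv[2:]:
        if arg.startswith(BLOCKED_OPTION_PREFIXES):
            raise PermissionError(f"git option not allowed: {arg!r}")
    return argv


for requested in ("git diff HEAD~1", "git push --force origin main"):
    try:
        print("allowed:", validate_git_command(requested))
    except PermissionError as err:
        print("denied:", err)
```

The returned list is meant to be handed to something like `subprocess.run(argv, shell=False)`, so shell metacharacters in model output are passed to git as plain arguments and never interpreted by a shell. The model proposes; the rule decides.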
This is the architecture every TPM should be looking for in their agent stack. Identify the actions that should never vary — file system access, network egress, payment execution, credential use — and route them through deterministic allowlists, not LLM judgment. Reserve LLM judgment for the parts of the task that genuinely need it. The size of that reserved surface is the security perimeter, and shrinking it is the highest-leverage move you can make.
A useful way to partition any agent rollout is into three layers: perimeter, identity, and data.
- Perimeter. What the agent can see, where its inputs come from, and what it can send outputs to. The perimeter is read-only by default; write access is a deliberate exception, not a baseline.
- Identity. Whose credentials the agent runs with, and at what scope. A read-only service account for a HIPAA-scoped EHR query is not the same identity as a write-capable account for a SOC 2-mapped production deploy. The two identities should never be available to the same agent in the same session.
- Data. What records the agent can read, what it can write, and what the audit log captures. A PCI-scope agent that touches cardholder-data-adjacent systems needs a deterministic data-access allowlist, not a prompt-instructed one. If the LLM is the only thing standing between the agent and the data, the architecture is wrong.
If you cannot sketch the three layers for your agent in five minutes, your security review is the next thing you should schedule.
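The identity and data layers can be sketched in a few lines. This is an illustration under my own assumptions, not a reference design: `Identity`, `AgentSession`, and the table names are hypothetical. A session is bound to exactly one identity, and every access decision is made by a lookup and written to the audit log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identity:
    """Hypothetical service identity with explicit data allowlists."""
    name: str
    readable: frozenset[str]
    writable: frozenset[str]


class AgentSession:
    """One session, one identity. No method here swaps or adds identities."""

    def __init__(self, identity: Identity) -> None:
        self._identity = identity
        self.audit_log: list[tuple[str, str, bool]] = []

    @property
    def identity(self) -> Identity:
        return self._identity

    def authorize(self, action: str, table: str) -> bool:
        """Deterministic allowlist check; the prompt has no say in the result."""
        if action == "read":
            allowed = table in self._identity.readable
        elif action == "write":
            allowed = table in self._identity.writable
        else:
            allowed = False  # unknown actions are denied by default
        self.audit_log.append((action, table, allowed))  # log denials too
        return allowed


ehr_reader = Identity("ehr-readonly", frozenset({"encounters"}), frozenset())
session = AgentSession(ehr_reader)
print(session.authorize("read", "encounters"))   # True
print(session.authorize("write", "encounters"))  # False
print(session.audit_log)
```

A write-capable deploy identity would get its own `AgentSession`, which is the code-level version of never giving both identities to the same agent in the same session.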
Red-teaming is a launch milestone, not a quarterly exercise
The third finding is the one that surprised me. Gray Swan's automated red-teaming tool, Shade, part of the same launch as the findings above, beats human red-teamers at breaking models within fixed time windows. Not by a little. Consistently. The implication is that you no longer need a five-person internal red team to run adversarial evaluation. The tooling has crossed a threshold. A TPM can stand up an automated adversarial pass against their agent in a week, not a quarter, and the result will be more thorough than what most security teams produce manually.
Build this into the launch criteria. Before the agent ships to production, before the pilot opens up to a second team, before the rollout review — run Shade or equivalent against it. If the agent breaks under automated adversarial pressure, the architecture is wrong. Fix the architecture. A model upgrade will not save you. The Gray Swan finding is that bigger models introduce new attack surfaces faster than they close existing ones. Your security review is not an evaluation of the model. It is an audit of the architecture around it.
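A launch gate of this kind can be as simple as the sketch below. The result format is my assumption: Shade's actual output is not described in the source, so `AttackResult` and `launch_gate` are hypothetical names standing in for whatever your red-teaming tool reports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttackResult:
    """Hypothetical record of one automated adversarial attempt."""
    attack_id: str
    succeeded: bool  # True if the attack reached its goal, e.g. exfiltrated data


def launch_gate(results: list[AttackResult], max_successes: int = 0) -> tuple[bool, list[str]]:
    """Return (passed, breached_attack_ids) for a completed red-team run."""
    if not results:
        # An empty run is not a clean run; refuse to pass on no evidence.
        raise ValueError("no red-team results: the gate cannot pass on an empty run")
    breaches = [r.attack_id for r in results if r.succeeded]
    return len(breaches) <= max_successes, breaches


run = [
    AttackResult("indirect-injection-001", succeeded=False),
    AttackResult("tool-output-poisoning-002", succeeded=True),
]
passed, breaches = launch_gate(run)
print(passed, breaches)  # False ['tool-output-poisoning-002']
```

Each breached attack ID should map back to a trifecta leg or a missing allowlist. That is what turns a failed gate into an architecture fix and not a request for a model upgrade.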
What this does not solve
I want to be specific about the limits, because the same enthusiasm pattern is showing up around agent security that showed up around agent reliability two weeks ago.
Agent security architecture does not solve capability gaps. If your agent cannot read the documents or call the tools it needs to do the job, no sandboxing fix will close that. Security architecture is the layer on top of capability, not a substitute for it. The programs that confuse "we shipped a sandboxed agent" with "the agent can do the work" are the programs that discover the gap six months after launch.
Agent security architecture does not solve stakeholder alignment. A permission-scoped agent still has to land with the people whose data and tools it touches. If legal, security, and the data owners have not been walked through what the agent does and does not have access to, the rollout will stall in change management, not architecture. The TPM owns that conversation. The architecture is the artifact that makes the conversation go faster, not the conversation itself.
Agent security architecture does not solve procurement. If your procurement process treats agent frameworks, MCP servers, or vector databases as new vendor categories, you still have a procurement problem. The architecture gives you a defensible answer to "is this safe to ship." It is not a procurement shortcut.
The signal this week is the architecture, not the model
The signal that matters most this week is the practitioner convergence on the "infrastructure hardening is the primary security lever" framing. The June 22 Latent Space Gray Swan deep-dive, the Hugging Face reposhell write-up, and the AWS AgentCore Payments architecture all argue the same case from different angles. The conversation has shifted from "which model is safest" to "which architecture is defensible." The audience that funds agent programs reads these writeups. The audience that approves rollouts is in these threads. The agent security review is the new model evaluation.
The takeaway you can screenshot
If your agent program is stuck between pilot and production, the answer is not a safer model. It is the lethal trifecta audit, the reposhell-style deterministic controls, and the automated red-teaming milestone. The signals are public. The pattern is yours to copy.
Send me your agent security architecture — three layers (perimeter, identity, data), one sentence each, with the lethal trifecta leg your design removes. DM me on LinkedIn (Doron Katz). I am collecting working patterns into a public agent rollout playbook; five architectures would let me ship the security chapter next month.