Vaishnavi GudurWhat Happened Last month, the UK Government's AI Safety Institute merged AgentThreatBench...
Last month, the UK Government's AI Safety Institute merged AgentThreatBench into their official inspect_evals framework — the same framework they use to evaluate frontier AI models from OpenAI, Anthropic, and Google DeepMind.
AgentThreatBench is an open-source adversarial benchmark I built that contains 200+ attack payloads specifically designed to test whether AI agents can resist memory poisoning attacks.
AI agents are increasingly being deployed with persistent memory — they remember past conversations, user preferences, and context across sessions. This creates a new attack surface: memory poisoning.
An attacker who can inject malicious content into an agent's memory can:
The OWASP Agentic Security Initiative identified this as ASI06 — Agent Memory Poisoning.
The benchmark covers 5 attack categories:
| Category | Payloads | Description |
|---|---|---|
| Prompt Injection | 40+ | Instructions disguised as memory content |
| Protected Key Tampering | 40+ | Attempts to overwrite system-level keys |
| Sensitive Data Leakage | 40+ | PII/credential exfiltration via memory |
| Size Anomaly | 40+ | Memory inflation / resource exhaustion |
| Behavioral Drift | 40+ | Gradual personality/instruction shifts |
pip install agentthreatbench
# Run the full benchmark against your agent
atb run --target your_agent_endpoint --output results.json
# Or use individual attack categories
atb run --category prompt_injection --target your_agent_endpoint
The UK Government's AI Safety Institute uses inspect_evals to:
Having AgentThreatBench merged into this framework means it's now part of the official government toolkit for AI safety evaluation.
If you're building AI agents with persistent memory, I'd love to hear how you're thinking about memory security. What attack vectors concern you most?