DoomArena Notes

DoomArena: A Framework for Testing AI Agents Against Evolving Security Threats

(I honestly probably shouldn’t have written notes on this and should have just played around with the library instead)

Introduction:

  • Common security failure modes:
    • Jailbreaking
    • Behavior injection / backdooring
    • Environment tampering
  • Safety Testing Taxonomy:
    • Static Benchmarks:
    • Use a static set of curated prompts to assess harmful AI behaviors such as fraud, hate speech, etc.
    • Stateful Benchmarks:
      • Dynamic tests that respond to model outputs
      • Normally ask a model to execute tasks in an environment – it should complete the safe tasks and refuse the harmful ones
    • Security evaluation frameworks:
      • Not tests themselves, but frameworks for easily applying attacks and evaluating models on tasks
      • Typically limited to a single environment – e.g., web browsing
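The static vs. stateful distinction above can be sketched as follows (a minimal illustration with hypothetical `model`/`env` interfaces, not tied to any actual benchmark suite):

```python
def static_benchmark(model, prompts, is_harmful):
    """Static: a fixed set of curated prompts, scored once per prompt."""
    return sum(is_harmful(model(p)) for p in prompts) / len(prompts)

def stateful_benchmark(agent, env, max_steps=10):
    """Stateful: the environment reacts to each agent action in turn."""
    obs = env.reset()
    for _ in range(max_steps):
        action = agent(obs)
        obs, done = env.step(action)  # next observation depends on the action
        if done:
            break
    return env.task_completed_safely()
```

The key difference: the static benchmark's inputs are fixed ahead of time, while the stateful benchmark's observations depend on what the agent actually does.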

DoomArena Design:

DoomArena

  • Security evaluation framework
  • Based on three principles:
    • Plug-in, integrates easily into agent frameworks
    • Configurable, allows for detailed threat modelling. Allows you to specify which components can be attacked.
    • Modular, specification of attacks is separate from environment specifications, allowing different types of attacks to be easily scaled across a wide range of environments
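The modularity principle can be illustrated like this: an attack defined once, then applied to observations from different environments. The names below are hypothetical, not the actual DoomArena API:

```python
from dataclasses import dataclass

@dataclass
class PromptInjectionAttack:
    """An attack specified independently of any particular environment."""
    payload: str

    def apply(self, observation: str) -> str:
        # Inject the payload into whatever the agent observes,
        # regardless of which environment produced the observation.
        return observation + "\n" + self.payload

attack = PromptInjectionAttack(payload="IGNORE PREVIOUS INSTRUCTIONS")
web_obs = attack.apply("<html>product page</html>")     # web environment
tool_obs = attack.apply('{"tool_result": "ok"}')        # tool-use environment
```

Because the attack only depends on the observation interface, the same attack object scales across environments without modification.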

Design:

  • Tasks
    • Come with a verifier that detects whether the task was completed successfully
  • Attacks
  • Attack Configs – composed of
    • Success filters – model the attacker's objective and determine whether the attack succeeded
    • Attackable components – which components of the user-agent-environment loop can be intervened on as part of an attack
    • Attack choice – which attack to apply
  • Attack Gateways
    • Gateways that actually intercept behavior in the user-agent-environment loop
  • Defenses
    • Mechanisms that monitor, constrain, or interrupt agents at inference time
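A minimal sketch of how these pieces could fit together – an attack config, a gateway that intercepts the attackable component, and an optional defense. All names are hypothetical; the real DoomArena classes may differ:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AttackConfig:
    attackable_component: str              # e.g. "env_output"
    attack: Callable[[str], str]           # attack choice: transformation to apply
    success_filter: Callable[[str], bool]  # did the attacker achieve its target?

class AttackGateway:
    """Wraps an environment and intercepts the attackable component."""
    def __init__(self, env, config: AttackConfig):
        self.env, self.config = env, config

    def step(self, action):
        obs = self.env.step(action)
        if self.config.attackable_component == "env_output":
            obs = self.config.attack(obs)  # tamper with what the agent sees
        return obs

def run(agent, gateway, config, defense=None, steps=3):
    obs = gateway.env.reset()
    for _ in range(steps):
        obs = gateway.step(agent(obs))
        if defense and defense(obs):       # inference-time monitor/interrupt
            return "blocked"
        if config.success_filter(obs):
            return "attack_succeeded"
    return "safe"
```

Note how the separation of concerns mirrors the principles above: the environment knows nothing about the attack, the attack knows nothing about the environment, and the config is the only place where the threat model (which component is attackable, which attack applies) is specified.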