DoomArena: A Framework for Testing AI Agents Against Evolving Security Threats
(I honestly probably shouldn’t have written notes on this and should have just played around with the library instead)
Introduction:
- Common security failure modes:
- Jailbreaking
- Behavior injection / backdooring
- Environment tampering
- Safety Testing Taxonomy:
- Static Benchmarks:
- Use a static set of curated prompts to assess AI harmful behaviors like fraud, hate speech, etc.
- Stateful Benchmarks:
- Dynamic tests that respond to model outputs
- Normally ask a model to execute a task in an environment – it should execute the safe tasks and not execute the harmful ones
- Security evaluation frameworks:
- Not tests, frameworks for easily applying attacks and testing models on tasks
- Only within its own environment – e.g., browsing on the web
- Static Benchmarks:
DoomArena Design:
DoomArena
- Security evauation framework
- Based on three principles:
- Plug-in, integrates easily into agent frameworks
- Configurable, allows for detailed threat modelling. Allows you to specify which components can be attacked.
- Modular, specification of attacks are separate to environment specifications, allowing different types of attacks to be easily scaled across a wide range of environments
Design:
- Tasks
- Come with a verifier that detects whether it is completed successfully
- Attacks
- Attack Configs – composed of
- Success filters – model the target of the attacker, determine whether successful
- Attackable components – which components of the user-agent-environment loop can be intervened on as part of an attack
- Attack choice – which attack to apply
- Attack Gateways
- Gateways that actually intercept behavior in the user-agent-enviroment loop
- Defenses
- Mechanisms that monitor, constraint, or interrupt agents at inference time