DoomArena Notes

DoomArena: A Framework for Testing AI Agents Against Evolving Security Threats

(I honestly probably shouldn’t have written notes on this and should have just played around with the library instead)

Introduction:

  • Common security failure modes:
    • Jailbreaking
    • Behavior injection / backdooring
    • Environment tampering
  • Safety Testing Taxonomy:
    • Static Benchmarks:
    • Use a static set of curated prompts to assess harmful AI behaviors such as fraud, hate speech, etc.
    • Stateful Benchmarks:
      • Dynamic tests that respond to model outputs
      • Normally ask a model to execute tasks in an environment – it should complete the safe tasks and refuse the harmful ones
    • Security evaluation frameworks:
      • Not tests themselves, but frameworks for easily applying attacks and evaluating models on tasks
      • Typically limited to a single environment – e.g., web browsing
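The static vs. stateful distinction above can be sketched as follows (a minimal illustration with hypothetical `model`/`env` interfaces, not tied to any actual benchmark suite):

```python
def static_benchmark(model, prompts, is_harmful):
    """Static: a fixed set of curated prompts, scored once per prompt."""
    return sum(is_harmful(model(p)) for p in prompts) / len(prompts)

def stateful_benchmark(agent, env, max_steps=10):
    """Stateful: the environment reacts to each agent action in turn."""
    obs = env.reset()
    for _ in range(max_steps):
        action = agent(obs)
        obs, done = env.step(action)  # next observation depends on the action
        if done:
            break
    return env.task_completed_safely()
```

The key difference: the static benchmark's inputs are fixed ahead of time, while the stateful benchmark's observations depend on what the agent actually does.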

DoomArena Design:

DoomArena

  • Security evaluation framework
  • Based on three principles:
    • Plug-in, integrates easily into agent frameworks
    • Configurable, allows for detailed threat modelling. Allows you to specify which components can be attacked.
    • Modular, specification of attacks is separate from environment specifications, allowing different types of attacks to be easily scaled across a wide range of environments
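The modularity principle can be illustrated like this: an attack defined once, then applied to observations from different environments. The names below are hypothetical, not the actual DoomArena API:

```python
from dataclasses import dataclass

@dataclass
class PromptInjectionAttack:
    """An attack specified independently of any particular environment."""
    payload: str

    def apply(self, observation: str) -> str:
        # Inject the payload into whatever the agent observes,
        # regardless of which environment produced the observation.
        return observation + "\n" + self.payload

attack = PromptInjectionAttack(payload="IGNORE PREVIOUS INSTRUCTIONS")
web_obs = attack.apply("<html>product page</html>")     # web environment
tool_obs = attack.apply('{"tool_result": "ok"}')        # tool-use environment
```

Because the attack only depends on the observation interface, the same attack object scales across environments without modification.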

Design:

  • Tasks
    • Come with a verifier that detects whether the task was completed successfully
  • Attacks
  • Attack Configs – composed of
    • Success filters – model the attacker's objective and determine whether the attack succeeded
    • Attackable components – which components of the user-agent-environment loop can be intervened on as part of an attack
    • Attack choice – which attack to apply
  • Attack Gateways
    • Gateways that actually intercept behavior in the user-agent-environment loop
  • Defenses
    • Mechanisms that monitor, constrain, or interrupt agents at inference time
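A minimal sketch of how these pieces could fit together – an attack config, a gateway that intercepts the attackable component, and an optional defense. All names are hypothetical; the real DoomArena classes may differ:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AttackConfig:
    attackable_component: str              # e.g. "env_output"
    attack: Callable[[str], str]           # attack choice: transformation to apply
    success_filter: Callable[[str], bool]  # did the attacker achieve its target?

class AttackGateway:
    """Wraps an environment and intercepts the attackable component."""
    def __init__(self, env, config: AttackConfig):
        self.env, self.config = env, config

    def step(self, action):
        obs = self.env.step(action)
        if self.config.attackable_component == "env_output":
            obs = self.config.attack(obs)  # tamper with what the agent sees
        return obs

def run(agent, gateway, config, defense=None, steps=3):
    obs = gateway.env.reset()
    for _ in range(steps):
        obs = gateway.step(agent(obs))
        if defense and defense(obs):       # inference-time monitor/interrupt
            return "blocked"
        if config.success_filter(obs):
            return "attack_succeeded"
    return "safe"
```

Note how the separation of concerns mirrors the principles above: the environment knows nothing about the attack, the attack knows nothing about the environment, and the config is the only place where the threat model (which component is attackable, which attack applies) is specified.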