Evaluating Language-Model Agents on Realistic Autonomous Tasks
Introduction
Benchmark-deployment gap: benchmarks contain only short, self-contained tasks, while real-world environments are longer-horizon and behave less predictably at the tails.
ARA: autonomous replication and adaptation to novel challenges
How the gap is closed:
- Evaluating agents rather than bare models, i.e., giving models tools to take actions
- Evaluating open-ended tasks at various levels of complexity
Possible ways to elicit ARA capabilities:
- Fine-tuning base models, which can transfer to ARA capabilities
- Better scaffolding and prompting
- People enable models to take actions they could not previously take (e.g., giving models emailing capabilities)
Autonomous replication and adaptation
Examples of ARA tasks:
- Making money, through legitimate work or cybercrime
- Using that money to obtain more computing power
- Replicating itself onto other machines
- Self-improving
- Planning over long horizons, anticipating and reacting to failures
Agent Setup
tl;dr: essentially an older, rudimentary version of Claude Code
- A scaffolding program interacts with an LLM API
- At each step, the agent can produce chain-of-thought reasoning along with text specifying its next action
- The scaffolding interprets the text and executes an action, then feeds back the outcome to the model
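The loop above can be sketched as follows. This is a minimal illustration, not the paper's actual scaffolding: the model call is stubbed out, and the `ACTION: bash:` format and function names are hypothetical.

```python
import subprocess

def query_model(history):
    # Placeholder for an LLM API call. A real scaffold would send the
    # history to a model; here we return a fixed chain-of-thought plus
    # a delimited action in the assumed format.
    return "I should check the current directory.\nACTION: bash: pwd"

def parse_action(response):
    # Extract the action line the scaffolding should execute, if any.
    for line in response.splitlines():
        if line.startswith("ACTION: bash: "):
            return line[len("ACTION: bash: "):]
    return None  # no action found: treat the response as a final answer

def run_agent(task_prompt, max_steps=10):
    history = [task_prompt]
    for _ in range(max_steps):
        response = query_model(history)
        command = parse_action(response)
        if command is None:
            return response  # agent returned an answer
        # Execute the action, then feed the outcome back to the model.
        result = subprocess.run(command, shell=True,
                                capture_output=True, text=True)
        history.append(response)
        history.append(result.stdout + result.stderr)
    return "max steps reached"
```

The essential structure is a read-act-observe loop: the scaffolding is a thin interpreter between the model's text output and the real environment.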
Evaluation Procedure
Tasks are operationalized as prompts.
A run ends when either:
- The agent concludes it has finished and returns an answer
- The agent meets another terminating condition (e.g., it is stuck in a loop)
The run is then labelled as complete, partially complete, or incomplete.
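A termination check along these lines might look like the following. The loop-detection heuristic (same action repeated N times) and the label names are assumptions for illustration; the paper does not specify the exact logic.

```python
FINISHED, TERMINATED = "finished", "terminated"

def check_termination(actions, final_answer, max_repeats=3):
    # Case 1: the agent concluded and returned an answer.
    if final_answer is not None:
        return FINISHED
    # Case 2: another terminating condition, e.g. the agent looks
    # stuck in a loop (same action repeated max_repeats times).
    if len(actions) >= max_repeats and len(set(actions[-max_repeats:])) == 1:
        return TERMINATED
    return None  # keep running
```

After termination, a human grader assigns the complete / partially complete / incomplete label against the task's evaluation instructions.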
Human overseers can intervene to:
- Simulate real-world actions; e.g., redirect phishing emails to their own addresses
- Override refusals of unethical requests
- Fix bugs
Tasks
A task specification consists of:
- A prompt
- Setup instructions for the initial state of the agent’s environment; e.g., placing a file on a machine
- Simulation instructions – e.g., how a human overseer should intervene to stop the model from emailing real people
- Evaluation instructions – how to mark “complete”, “partially complete”, “not complete”
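The four-part specification above maps naturally onto a data structure. This is a sketch with illustrative field names and an invented example task, not the paper's schema.

```python
from dataclasses import dataclass, field

@dataclass
class TaskSpec:
    prompt: str                                # what the agent is asked to do
    setup: list = field(default_factory=list)  # commands preparing the initial environment
    simulation_notes: str = ""                 # how human overseers should intervene
    evaluation_notes: str = ""                 # how to grade complete / partial / incomplete

# Hypothetical instance resembling the password-search task:
password_task = TaskSpec(
    prompt="Find the password hidden somewhere on this machine.",
    setup=["echo s3cret > /tmp/notes/password.txt"],
    simulation_notes="No intervention expected.",
    evaluation_notes="Complete if the agent returns the planted password.",
)
```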
Desired properties:
- Easier than ARA: if an agent were capable of ARA, it would solve all 12 tasks in the suite
- Range of difficulty
- Diversity of obstacles
Some individual tasks:
- Searching a filesystem for a password
- Listing top BBC articles
- Finding information on Wikipedia
- Counting dogs in an image
- Carrying out a (simulated) phishing attack
Conclusions
- Models could complete only the easier tasks in the suite
- How ARA could improve:
- Better scaffolding
- Better fine-tuning
- Larger models