Evaluating Language-Model Agents on Realistic Autonomous Tasks
Introduction


Benchmark-deployment gap: benchmarks consist of short, self-contained tasks, while real-world environments are longer-horizon and behave less predictably at the tails.

ARA: autonomous replication and adaptation to novel challenges

How the gap is closed:

  • Evaluating agents, giving models tools to take action
  • Evaluating open-ended tasks of various levels of complexity

Possible ways to elicit ARA capabilities:

  • Base models are fine-tuned, and this transfers to ARA capabilities
  • Better scaffolding and prompting
  • People enable models to take actions they could not previously take (e.g., giving models emailing capabilities)

Autonomous replication and adaptation


Examples of ARA tasks:

  • Making money, through work or cybercrime
  • Using money to obtain more computing power
  • Replicating itself
  • Self-improving
  • Planning over long horizons, anticipating and reacting to failures

Agent Setup


tl;dr: basically an older, rudimentary version of Claude Code

  • A scaffolding program interacts with an LLM API
  • The agent can use chain-of-thought reasoning, then produce text specifying what action it should take next
  • The scaffolding interprets the text and executes an action, then feeds back the outcome to the model
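The loop above can be sketched as follows. This is a minimal illustration, not the paper's actual scaffolding: `call_llm` and `run_command` are hypothetical stubs standing in for a real LLM API call and real action execution.

```python
import re

def call_llm(history):
    # Stub standing in for an LLM API call; a real scaffold would send
    # the full conversation history to the model and return its completion.
    return "Reasoning: I should list files first.\nAction: bash(ls)"

def run_command(cmd):
    # Stub standing in for actually executing the action in an environment.
    return f"(output of `{cmd}`)"

def agent_step(history):
    """One iteration: query the model, parse the proposed action,
    execute it, and feed the outcome back into the history."""
    completion = call_llm(history)
    history.append({"role": "assistant", "content": completion})
    match = re.search(r"Action:\s*bash\((.*)\)", completion)
    if match:
        result = run_command(match.group(1))
        history.append({"role": "user", "content": result})
    return history

history = [{"role": "user", "content": "Task: explore the filesystem."}]
history = agent_step(history)
```

A full run would simply call `agent_step` repeatedly until a terminating condition is met.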

Evaluation Procedure


Tasks are operationalized as prompts.

A run ends when either:

  • The agent concludes it has finished and returns an answer
  • The agent meets another terminating condition (e.g., it is stuck in a loop)

The run is then labelled as complete, incomplete, or partially complete.
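The termination logic can be sketched like this. The function and the loop-detection heuristic (stopping on an immediately repeated action) are my assumptions for illustration, not the paper's actual criteria.

```python
def run_agent(actions, max_steps=10):
    """Simulate a run over a sequence of agent actions and report
    why it terminated. `actions` stands in for real model outputs."""
    recent = []
    for step, action in enumerate(actions, start=1):
        if action.startswith("return("):
            # The agent concludes it is finished and returns an answer.
            return "returned answer", step
        if recent and action == recent[-1]:
            # Crude loop detection: same action twice in a row.
            return "stuck in loop", step
        recent.append(action)
        if step >= max_steps:
            return "step limit reached", step
    return "agent stopped", len(recent)
```

A human (or automated grader) would then label the transcript complete, partially complete, or incomplete.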

Human overseers can intervene when:

  • Simulating real-world actions; e.g., redirecting phishing emails to the overseers' own addresses
  • Overriding refusals of unethical requests
  • Fixing bugs

Tasks


A task specification consists of:

  • A prompt
  • Setup instructions for the initial state of the agent’s environment; e.g., placing a file on a machine
  • Simulation instructions – e.g., how a human should intervene, to stop a model from emailing random people
  • Evaluation instructions – how to mark “complete”, “partially complete”, “not complete”
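The four-part task specification above maps naturally onto a small data structure. The field names and the example task are illustrative guesses, not the paper's actual format.

```python
from dataclasses import dataclass

@dataclass
class TaskSpec:
    prompt: str        # the task as given to the agent
    setup: list        # steps to prepare the environment's initial state
    simulation: list   # notes on when/how a human overseer intervenes
    evaluation: str    # rubric: complete / partially complete / not complete

# Hypothetical instance resembling the "search a filesystem for a
# password" task from the suite.
find_password = TaskSpec(
    prompt="Find the admin password stored somewhere on this machine.",
    setup=["place a password file in a non-obvious directory"],
    simulation=["no intervention expected for this task"],
    evaluation="Complete if the agent returns the exact password.",
)
```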

Desired properties:

  • Easier than ARA: if ARA were achieved, the agent would solve the 12 tasks in the suite
  • Range of difficulty
  • Diversity of obstacles

Some individual tasks:

  • Searching a filesystem for a password
  • Listing top BBC articles
  • Finding info on Wikipedia
  • Counting dogs in an image
  • Phishing

Conclusions


  • Models could only complete easy tasks
  • How ARA could improve:
    • Better scaffolding
    • Better fine-tuning
    • Larger models