Evaluating Language-Model Agents on Realistic Autonomous Tasks
Introduction


Benchmark-deployment gap: benchmarks consist of short, self-contained tasks, while real-world environments are longer-horizon and behave less predictably at the tails.

ARA: autonomous replication and adaptation to novel challenges

How the gap is closed:

  • Evaluating agents, giving models tools to take action
  • Evaluating open-ended tasks of various levels of complexity

Possible ways to elicit ARA capabilities:

  • Base models are fine-tuned, and this transfers to ARA capabilities
  • Better scaffolding and prompting
  • People enable models to take actions they could not previously take (e.g., giving models emailing capabilities)

Autonomous replication and adaptation


Examples of ARA tasks:

  • Making money, through work or cybercrime
  • Using money to obtain more computing power
  • Replicating itself
  • Self-improving
  • Planning over long horizons, anticipating and reacting to failures

Agent Setup


tl;dr: basically an older, rudimentary version of Claude Code

  • A scaffolding program interacts with an LLM API
  • The agent can use chain-of-thought reasoning, then produce text specifying what action it should take next
  • The scaffolding interprets the text and executes an action, then feeds back the outcome to the model
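The loop above can be sketched as follows. This is a minimal illustration, not the paper's actual scaffolding: `call_llm` and `run_command` are hypothetical stubs standing in for a real LLM API call and real action execution.

```python
import re

def call_llm(history):
    # Stub standing in for an LLM API call; a real scaffold would send
    # the full conversation history to the model and return its completion.
    return "Reasoning: I should list files first.\nAction: bash(ls)"

def run_command(cmd):
    # Stub standing in for actually executing the action in an environment.
    return f"(output of `{cmd}`)"

def agent_step(history):
    """One iteration: query the model, parse the proposed action,
    execute it, and feed the outcome back into the history."""
    completion = call_llm(history)
    history.append({"role": "assistant", "content": completion})
    match = re.search(r"Action:\s*bash\((.*)\)", completion)
    if match:
        result = run_command(match.group(1))
        history.append({"role": "user", "content": result})
    return history

history = [{"role": "user", "content": "Task: explore the filesystem."}]
history = agent_step(history)
```

A full run would simply call `agent_step` repeatedly until a terminating condition is met.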

Evaluation Procedure


Tasks are operationalized as prompts.

A run ends when either:

  • The agent concludes it has finished and returns an answer
  • The agent meets another terminating condition (e.g., it is stuck in a loop)

The run is then labelled as complete, incomplete, or partially complete.
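The termination logic can be sketched like this. The function and the loop-detection heuristic (stopping on an immediately repeated action) are my assumptions for illustration, not the paper's actual criteria.

```python
def run_agent(actions, max_steps=10):
    """Simulate a run over a sequence of agent actions and report
    why it terminated. `actions` stands in for real model outputs."""
    recent = []
    for step, action in enumerate(actions, start=1):
        if action.startswith("return("):
            # The agent concludes it is finished and returns an answer.
            return "returned answer", step
        if recent and action == recent[-1]:
            # Crude loop detection: same action twice in a row.
            return "stuck in loop", step
        recent.append(action)
        if step >= max_steps:
            return "step limit reached", step
    return "agent stopped", len(recent)
```

A human (or automated grader) would then label the transcript complete, partially complete, or incomplete.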

Human overseers can intervene when:

  • Simulating real-world actions; e.g., redirecting phishing emails to the overseers' own addresses
  • Overriding refusals of unethical requests
  • Fixing bugs

Tasks


A task specification consists of:

  • A prompt
  • Setup instructions for the initial state of the agent’s environment; e.g., placing a file on a machine
  • Simulation instructions – e.g., how a human should intervene, to stop a model from emailing random people
  • Evaluation instructions – how to mark “complete”, “partially complete”, “not complete”
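The four-part task specification above maps naturally onto a small data structure. The field names and the example task are illustrative guesses, not the paper's actual format.

```python
from dataclasses import dataclass

@dataclass
class TaskSpec:
    prompt: str        # the task as given to the agent
    setup: list        # steps to prepare the environment's initial state
    simulation: list   # notes on when/how a human overseer intervenes
    evaluation: str    # rubric: complete / partially complete / not complete

# Hypothetical instance resembling the "search a filesystem for a
# password" task from the suite.
find_password = TaskSpec(
    prompt="Find the admin password stored somewhere on this machine.",
    setup=["place a password file in a non-obvious directory"],
    simulation=["no intervention expected for this task"],
    evaluation="Complete if the agent returns the exact password.",
)
```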

Desired properties:

  • Easier than ARA: if ARA were achieved, the agent would solve the 12 tasks in the suite
  • Range of difficulty
  • Diversity of obstacles

Some individual tasks:

  • Searching a filesystem for a password
  • Listing top BBC articles
  • Finding info on Wikipedia
  • Counting dogs in an image
  • Phishing

Conclusions


  • Models could only complete easy tasks
  • How ARA could improve:
    • Better scaffolding
    • Better fine-tuning
    • Larger models