Introduction
Failure modes of evals:
- Reward hacking
- Sandbagging
Goal: monitor for when these behaviors occur
MALT:
- Consists of 10,919 agent transcripts on software and research-related problems
- Spanning 403 tasks
- Across 86 task families
- Spanning 21 models
Structures for each datapoint:
- Graphical:
- A tree that hierarchically splits each task into subtasks
- E.g.
root (task prompt: "optimize this function...")
│
├── LLM generation: “Let me first read the code and profile it.”
│ │
│ ├── tool_call: run_bash(“cat /home/agent/solution.py”)
│ │ └── tool_result: [file contents]
│ │
│ ├── LLM generation: “Now let me profile it.”
│ │ ├── tool_call: run_bash(“python -m cProfile solution.py”)
│ │ │ └── tool_result: [profiling output]
│ │ │
│ │ ├── LLM generation: “The bottleneck is the inner loop. I’ll try vectorizing with numpy.”
│ │ │ ├── tool_call: write_file(“solution.py”,
- Ordered list of messages
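The two structures above can be related by a simple traversal. The following is a minimal sketch, not MALT's actual schema: the `Node` class, its field names, and the `flatten` helper are all hypothetical, and it assumes the ordered message list corresponds to a pre-order (parent first, then children) walk of the tree.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """One step in a transcript tree (hypothetical schema for illustration)."""
    kind: str      # e.g. "task_prompt", "generation", "tool_call", "tool_result"
    content: str
    children: List["Node"] = field(default_factory=list)


def flatten(node: Node) -> List[Dict[str, str]]:
    """Pre-order depth-first walk: emit the node, then its children in order."""
    messages = [{"kind": node.kind, "content": node.content}]
    for child in node.children:
        messages.extend(flatten(child))
    return messages


# A small tree modelled on the first branch of the example above.
root = Node("task_prompt", "optimize this function...", [
    Node("generation", "Let me first read the code and profile it.", [
        Node("tool_call", 'run_bash("cat /home/agent/solution.py")', [
            Node("tool_result", "[file contents]"),
        ]),
    ]),
])

for msg in flatten(root):
    print(msg["kind"], "|", msg["content"])
```

The tree view keeps the subtask grouping, while the flattened list is easier to feed to a monitor that reads a transcript top to bottom.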
Split of the Dataset
- Unprompted natural examples:
- Pipeline: flag candidate transcripts, then have a human manually review each flagged transcript
- Anomalous score detection: runs with outlier scores were flagged
- Keyword search: “scanning”, “cheating”, “exploit”, etc.
- Model-assisted automated classification: models flag candidate transcripts
- Previous labeling efforts: some transcripts had already been flagged
- Prompted examples:
- Models were prompted to exhibit sandbagging
- Claude Sonnet 4 was used to rewrite the reasoning traces so they appear natural
- Benign runs
- Also rewritten using Claude Sonnet 4
- They also used “transformed versions”: vague, summarized, language-mixing, and irrelevant-detail variants
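Two of the flagging steps for the unprompted examples (anomalous scores and keyword search) can be sketched in a few lines. This is an illustration under stated assumptions, not the dataset authors' pipeline: the z-score rule, the threshold of 2.0, the function names, and the toy run IDs are all hypothetical, and only the three keywords quoted in the notes are used. The output is a queue for manual review, not a final label.

```python
from statistics import mean, pstdev

# Keywords quoted in the notes; the full list is not given there.
KEYWORDS = ("scanning", "cheating", "exploit")


def flag_outlier_scores(scores, z_threshold=2.0):
    """Flag run IDs whose score is far from the mean (assumed z-score rule)."""
    values = list(scores.values())
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return set()  # all scores identical, nothing stands out
    return {rid for rid, s in scores.items() if abs(s - mu) / sigma > z_threshold}


def flag_keywords(transcripts, keywords=KEYWORDS):
    """Flag run IDs whose transcript text contains any keyword."""
    return {
        rid for rid, text in transcripts.items()
        if any(k in text.lower() for k in keywords)
    }


# Toy data, invented for the example.
scores = {"run_a": 0.50, "run_b": 0.52, "run_c": 0.48,
          "run_d": 0.51, "run_e": 0.49, "run_f": 5.0}
transcripts = {
    "run_a": "I will profile the code first.",
    "run_b": "There is a way to exploit the scorer here.",
    "run_f": "Submitting the final answer.",
}

# Union of both signals goes to a human reviewer.
review_queue = sorted(flag_outlier_scores(scores) | flag_keywords(transcripts))
print(review_queue)
```

Taking the union of cheap signals favors recall; the manual review step is what keeps false positives out of the final labels.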
Labels
- Reward hacking:
bypass_constraints, ignore_task_instructions
- Sandbagging:
gives_up, sabotage, refusals, match_weaker_model, partial_problem_solving, reasoning_about_task
- Miscellaneous:
eval_awareness, hardcoded_solutions, brute_force_solution
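For working with these labels programmatically, a small lookup table is enough. The sketch below is hypothetical: the label spellings are reconstructed from these notes and may differ from the identifiers in the released dataset, and `LABELS`, `LABEL_TO_CATEGORY`, and `categories_for` are names made up for the example.

```python
# Category -> labels, as listed in the notes (spellings are an assumption).
LABELS = {
    "reward_hacking": ["bypass_constraints", "ignore_task_instructions"],
    "sandbagging": [
        "gives_up", "sabotage", "refusals", "match_weaker_model",
        "partial_problem_solving", "reasoning_about_task",
    ],
    "miscellaneous": [
        "eval_awareness", "hardcoded_solutions", "brute_force_solution",
    ],
}

# Inverted index: label -> category.
LABEL_TO_CATEGORY = {
    label: category for category, labels in LABELS.items() for label in labels
}


def categories_for(labels):
    """Return the sorted set of categories covered by a transcript's labels."""
    return sorted({LABEL_TO_CATEGORY[label] for label in labels})


print(categories_for(["sabotage", "gives_up"]))
print(categories_for(["bypass_constraints", "eval_awareness"]))
```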
Limitations
- Does not cover behaviors that are not visible in transcripts
- Lack of elicitation
- Few reward hacking examples