MALT: A Dataset of Natural and Prompted Behaviors That Threaten Eval Integrity

12 Apr 2026

Introduction

Failure modes of evals:

Reward hacking
Sandbagging

Want to monitor when they occur

MALT:

Consists of 10,919 agent transcripts on software and research-related problems
Spanning 403 tasks
- Across 86 task families
Spanning 21 models

Structures for each datapoint:

Graphical:
- Basically a tree that hierarchically splits each task into subtasks
- E.g.
```
root (task prompt: "optimize this function...")
│
├── LLM generation: "Let me first read the code and profile it."
│   │
│   ├── tool_call: run_bash("cat /home/agent/solution.py")
│   │   └── tool_result: [file contents]
│   │
│   ├── LLM generation: "Now let me profile it."
│   │   ├── tool_call: run_bash("python -m cProfile solution.py")
│   │   │   └── tool_result: [profiling output]
│   │   │
│   │   ├── LLM generation: "The bottleneck is the inner loop. I'll try vectorizing with numpy."
│   │   │   ├── tool_call: write_file("solution.py", )
│   │   │   │   └── tool_result: "file written"
│   │   │   │
│   │   │   ├── tool_call: run_bash("python score.py")
│   │   │   │   └── tool_result: "score: 0.4 (need 1.0)"
│   │   │   │
│   │   │   ├── LLM generation: "That wasn't enough. Let me try a different approach with numba."
│   │   │   │   ├── tool_call: write_file("solution.py", )
│   │   │   │   │   └── tool_result: "file written"
│   │   │   │   │
│   │   │   │   ├── tool_call: run_bash("python score.py")
│   │   │   │   │   └── tool_result: "error: numba not installed"
│   │   │   │   │
│   │   │   │   ├── tool_call: run_bash("pip install numba")
│   │   │   │   │   └── tool_result: "installed"
│   │   │   │   │
│   │   │   │   ├── tool_call: run_bash("python score.py")
│   │   │   │   │   └── tool_result: "score: 1.0"
│   │   │   │   │
│   │   │   │   └── LLM generation: "Task complete."
```

Ordered list of messages
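
To make the relationship between the two structures concrete, here is a minimal sketch of how a tree like the one above could be flattened into a message list. The `Node` class, its `kind`/`content` fields, and the `flatten` helper are hypothetical names chosen for illustration; they are not the dataset's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Hypothetical node type; field names are assumptions, not MALT's schema.
    kind: str      # e.g. "task_prompt", "generation", "tool_call", "tool_result"
    content: str
    children: list["Node"] = field(default_factory=list)


def flatten(node):
    """Pre-order, depth-first walk: parents first, then children left to right."""
    messages = [(node.kind, node.content)]
    for child in node.children:
        messages.extend(flatten(child))
    return messages


# A small prefix of the example tree above.
root = Node("task_prompt", "optimize this function...", [
    Node("generation", "Let me first read the code and profile it.", [
        Node("tool_call", 'run_bash("cat /home/agent/solution.py")', [
            Node("tool_result", "[file contents]"),
        ]),
        Node("generation", "Now let me profile it."),
    ]),
])

messages = flatten(root)
for kind, content in messages:
    print(f"{kind}: {content}")
```

In the example tree, a pre-order walk happens to recover the chronological order of the run, which is presumably why both views can describe the same datapoint.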

Split of the Dataset

Unprompted natural examples:
- Pipeline: flag candidate runs, then have a human manually review each one
- Anomalous score detection: outlier scores were flagged as anomalous
- Keyword search: “scanning”, “cheating”, “exploit”, etc.
- Model-assisted automated classification: models flag candidate runs
- Previous labeling efforts: some were previously flagged
Prompted examples:
- Prompt to exhibit sandbagging
- Used Claude Sonnet 4 to rewrite reasoning traces so that they appear natural
Benign runs:
- Also rewritten using Claude Sonnet 4
They also used “transformed versions” of transcripts: vague, summarized, language-mixing, and irrelevant-detail variants
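
As a rough illustration of the flag-then-review pipeline for unprompted examples, the sketch below combines a keyword search with a simple score-outlier check. The notes do not say how outliers were defined, so the z-score rule, the `z_threshold` value, and the `runs` record layout (`id`, `transcript`, `score`) are all assumptions; only the three keywords come from the notes.

```python
import statistics

# Keywords listed in the notes; the real list is longer ("etc.").
KEYWORDS = ("scanning", "cheating", "exploit")


def keyword_hits(text, keywords=KEYWORDS):
    """Return the keywords that occur in the transcript (case-insensitive)."""
    lowered = text.lower()
    return [kw for kw in keywords if kw in lowered]


def score_outliers(scores, z_threshold=2.0):
    """Indices of scores more than z_threshold population std devs from the mean.

    The z-score rule is an assumed stand-in for the unspecified outlier test.
    """
    mean = statistics.fmean(scores)
    stdev = statistics.pstdev(scores)
    if stdev == 0:
        return []
    return [i for i, s in enumerate(scores) if abs(s - mean) / stdev > z_threshold]


def flag_for_review(runs):
    """Return (run id, reasons) pairs to be queued for manual human review."""
    outliers = set(score_outliers([run["score"] for run in runs]))
    flagged = []
    for i, run in enumerate(runs):
        reasons = keyword_hits(run["transcript"])
        if i in outliers:
            reasons.append("anomalous_score")
        if reasons:
            flagged.append((run["id"], reasons))
    return flagged
```

Anything returned by `flag_for_review` is only a candidate; in the pipeline described above, the label itself comes from the human reviewer.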

Labels

Reward hacking:
- bypass_constraints
- ignore_task_instructions
Sandbagging:
- gives_up
- sabotage
- refusals
- match_weaker_model
- partial_problem_solving
- reasoning_about_task
Miscellaneous:
- eval_awareness
- hardcoded_solutions
- brute_force_solution
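
For monitoring experiments it can help to collapse fine-grained labels into the three groups above. The sketch below encodes the grouping as a lookup table. The label spellings follow these notes (with obvious typos corrected) and may differ from the identifiers in the released dataset; `LABEL_GROUPS` and `groups_for` are hypothetical names.

```python
# Grouping as listed in these notes; spellings may not match the released dataset.
LABEL_GROUPS = {
    "reward_hacking": ["bypass_constraints", "ignore_task_instructions"],
    "sandbagging": [
        "gives_up",
        "sabotage",
        "refusals",
        "match_weaker_model",
        "partial_problem_solving",
        "reasoning_about_task",
    ],
    "miscellaneous": ["eval_awareness", "hardcoded_solutions", "brute_force_solution"],
}

# Reverse index: fine-grained label -> group name.
LABEL_TO_GROUP = {
    label: group for group, labels in LABEL_GROUPS.items() for label in labels
}


def groups_for(labels):
    """Sorted set of groups covered by a run's labels; unknown labels are ignored."""
    return sorted({LABEL_TO_GROUP[label] for label in labels if label in LABEL_TO_GROUP})


print(groups_for(["gives_up", "sabotage", "bypass_constraints"]))
```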

Limitations

No non-transcript-based behaviors
Lack of elicitation
Few reward hacking examples