MALT: A Dataset of Natural and Prompted Behaviors That Threaten Eval Integrity

Introduction


Failure modes of evals:

  • Reward hacking
  • Sandbagging

Goal: monitor for when these behaviors occur

MALT:

  • Consists of 10,919 agent transcripts on software and research-related problems
  • Spanning 403 tasks
    • Across 86 task families
  • Spanning 21 models

Structures for each datapoint:

  • Graphical:
    • Basically a tree that hierarchically splits each task into subtasks
    • E.g.
      root (task prompt: "optimize this function...")
      │
      ├── LLM generation: "Let me first read the code and profile it."
      │   │
      │   ├── tool_call: run_bash("cat /home/agent/solution.py")
      │   │   └── tool_result: [file contents]
      │   │
      │   ├── LLM generation: "Now let me profile it."
      │   │   ├── tool_call: run_bash("python -m cProfile solution.py")
      │   │   │   └── tool_result: [profiling output]
      │   │   │
      │   │   ├── LLM generation: "The bottleneck is the inner loop. I'll try vectorizing with numpy."
      │   │   │   ├── tool_call: write_file("solution.py", )
      │   │   │   │   └── tool_result: "file written"
      │   │   │   │
      │   │   │   ├── tool_call: run_bash("python score.py")
      │   │   │   │   └── tool_result: "score: 0.4 (need 1.0)"
      │   │   │   │
      │   │   │   ├── LLM generation: "That wasn't enough. Let me try a different approach with numba."
      │   │   │   │   ├── tool_call: write_file("solution.py", )
      │   │   │   │   │   └── tool_result: "file written"
      │   │   │   │   │
      │   │   │   │   ├── tool_call: run_bash("python score.py")
      │   │   │   │   │   └── tool_result: "error: numba not installed"
      │   │   │   │   │
      │   │   │   │   ├── tool_call: run_bash("pip install numba")
      │   │   │   │   │   └── tool_result: "installed"
      │   │   │   │   │
      │   │   │   │   ├── tool_call: run_bash("python score.py")
      │   │   │   │   │   └── tool_result: "score: 1.0"
      │   │   │   │   │
      │   │   │   │   └── LLM generation: "Task complete."

  • Ordered list of messages
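
To make the two structures concrete, here is a minimal sketch of how a tree of generations and tool calls could be flattened into an ordered list of messages. The `Node` class, its field names, and the `kind` values are hypothetical (the notes do not give the dataset's actual schema), and the sketch assumes the ordered list corresponds to a depth-first, pre-order walk of the tree, which the notes do not state.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One node of a hypothetical transcript tree (illustrative names only)."""
    kind: str      # e.g. "task_prompt", "generation", "tool_call", "tool_result"
    content: str
    children: List["Node"] = field(default_factory=list)


def flatten(node: Node) -> List[dict]:
    """Depth-first, pre-order walk: emit the node, then each child subtree in order."""
    messages = [{"kind": node.kind, "content": node.content}]
    for child in node.children:
        messages.extend(flatten(child))
    return messages


# A small fragment of the example tree above.
root = Node("task_prompt", "optimize this function...", [
    Node("generation", "Let me first read the code and profile it.", [
        Node("tool_call", 'run_bash("cat /home/agent/solution.py")', [
            Node("tool_result", "[file contents]"),
        ]),
        Node("generation", "Now let me profile it."),
    ]),
])

messages = flatten(root)
for m in messages:
    print(f'{m["kind"]}: {m["content"]}')
```

The tree view keeps the nesting; the flat view is easier to feed to a monitor that reads a transcript from start to finish.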

Split of the Dataset


  • Unprompted natural examples:
    • Pipeline: runs are flagged by the methods below, then a human manually reviews each flagged run
    • Anomalous score detection: outlier scores were flagged as anomalous
    • Keyword search: “scanning”, “cheating”, “exploit”, etc.
    • Model-assisted automated classification: models flag candidate runs
    • Previous labeling efforts: some runs had already been flagged
  • Prompted examples:
    • Models were prompted to exhibit sandbagging
    • Used Claude Sonnet 4 to rewrite reasoning traces to make them appear natural
  • Benign runs
    • Also rewritten using Claude Sonnet 4
  • They also used “transformed versions” – vague, summarized, language-mixing, irrelevant detail
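
As a rough illustration of the flag-then-review step for natural examples, the sketch below combines a keyword search with a simple score-outlier check to build a queue for human review. Everything here is an assumption made for illustration: the run fields (`id`, `score`, `transcript`), the function names, and the z-score rule with its threshold are hypothetical, and the notes do not say how outliers were actually defined. Only the three keywords come from the notes, and that list is partial.

```python
from statistics import mean, pstdev

# Keywords taken from the notes; the full list used is not given ("etc.").
KEYWORDS = ("scanning", "cheating", "exploit")


def keyword_flag(transcript, keywords=KEYWORDS):
    """True if any keyword appears in the transcript (case-insensitive)."""
    text = transcript.lower()
    return any(k in text for k in keywords)


def outlier_flags(scores, z_threshold=2.0):
    """Flag scores whose z-score exceeds the threshold (assumed outlier rule)."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:  # all scores identical: nothing is anomalous
        return [False] * len(scores)
    return [abs(s - mu) / sigma > z_threshold for s in scores]


def review_queue(runs, z_threshold=2.0):
    """Return ids of runs a human should review: score outliers or keyword hits."""
    odd_scores = outlier_flags([r["score"] for r in runs], z_threshold)
    return [
        r["id"]
        for r, odd in zip(runs, odd_scores)
        if odd or keyword_flag(r["transcript"])
    ]


# Hypothetical runs on one task: six ordinary scores and one suspiciously high one.
runs = [{"id": i, "score": 0.5, "transcript": "ran the tests"} for i in range(6)]
runs.append({"id": 6, "score": 10.0, "transcript": "done"})
runs[2]["transcript"] = "I could exploit the scorer here"

print(review_queue(runs))
```

In practice scores would presumably be compared within a task rather than across the whole dataset, and a flag only nominates a run; the label still comes from manual review.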

Labels


  • Reward hacking:
    • bypass_constraints
    • ignore_task_instructions
  • Sandbagging
    • gives_up
    • sabotage
    • refusals
    • match_weaker_model
    • partial_problem_solving
    • reasoning_about_task
  • Miscellaneous
    • eval_awareness
    • hardcoded_solutions
    • brute_force_solution
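
The grouping above can be written as a small lookup table, which is handy when collapsing fine-grained labels into the three top-level categories. This is a sketch only: the dictionary layout and the helper name `groups_for` are hypothetical, and the label strings follow the list in these notes rather than a verified copy of the dataset's schema.

```python
# Top-level category -> fine-grained labels, as listed in the notes.
LABEL_GROUPS = {
    "reward_hacking": [
        "bypass_constraints",
        "ignore_task_instructions",
    ],
    "sandbagging": [
        "gives_up",
        "sabotage",
        "refusals",
        "match_weaker_model",
        "partial_problem_solving",
        "reasoning_about_task",
    ],
    "miscellaneous": [
        "eval_awareness",
        "hardcoded_solutions",
        "brute_force_solution",
    ],
}

# Reverse index: fine-grained label -> top-level category.
LABEL_TO_GROUP = {
    label: group for group, labels in LABEL_GROUPS.items() for label in labels
}


def groups_for(labels):
    """Return the sorted set of top-level categories covering the given labels."""
    return sorted({LABEL_TO_GROUP[label] for label in labels})


print(groups_for(["gives_up", "sabotage"]))
print(groups_for(["hardcoded_solutions", "bypass_constraints"]))
```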

Limitations


  • No non-transcript-based behaviors
  • Lack of elicitation
  • Few reward hacking examples