A reasoning model is allowed to use chain-of-thought reasoning while being trained.
Thus, chain-of-thought tokens in reasoning models should be thought of not as outputs but as representations of latent knowledge. In particular, they could display behaviors like scheming, since the chain of thought would not have been directly penalized for showing this during training.
Faithfulness: to what extent do chain-of-thought traces reflect actual reasoning algorithms within the model?
Point of paper: figure out how faithful chain-of-thought traces are in natural, non-adversarial environments
Highlighting Chain-of-Thought Traces
Comparisons Dataset:
World models dataset: lists entities along with numerical features of them, e.g., a movie and its length
Filter for more obscure entities (ask an LLM to rate how famous each entity is on a scale of 1-10; keep those rated $\leq 5$)
Use OpenAI’s web search API to verify the validity of each value (some of which were incorrect), requiring at least 2 sources
Pair entities whose values are close but distinct
Generate pairs of questions of the form “Is A > B”, “Is B > A” using these pairs – i.e., one should have answer yes, one should have answer no
An LLM autorater removes ambiguous questions, as well as pairs where answering yes to both (or no to both) would not actually be contradictory
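The pairing and question-generation steps above can be sketched as follows. This is a minimal illustration, not the paper's code: the `max_gap` closeness threshold, the question wording, and the function name are all assumptions.

```python
import itertools

def make_question_pairs(entities, max_gap=0.15):
    """Pair entities whose values are close but distinct, then emit a
    yes/no question and its negation for each pair.

    `entities` maps an entity name to its numerical feature value
    (e.g., a movie to its runtime in minutes). The 15% relative-gap
    threshold is an illustrative choice, not from the paper.
    """
    pairs = []
    for (a, va), (b, vb) in itertools.combinations(entities.items(), 2):
        if va == vb:
            continue  # values must be distinct, else neither answer is clean
        if abs(va - vb) / max(abs(va), abs(vb)) > max_gap:
            continue  # keep only close values, so the comparison is non-trivial
        # One question per direction: exactly one of the two has answer "yes".
        pairs.append({
            "question": f"Is {a} longer than {b}?",
            "negation": f"Is {b} longer than {a}?",
            "answer": "yes" if va > vb else "no",  # answer to "question"
        })
    return pairs
```

The ambiguity-filtering autorater pass would then run over these generated pairs before they reach the experiment.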
Experiment setup:
Take a ton of different LLMs, as well as a pure pretrained model as a baseline
Ask each model to answer each question and its negation yes / no, using chain-of-thought reasoning
Throw out the answers, only keep chain-of-thought traces.
Make an autorater decide whether the chain-of-thought traces imply yes or no
Look at accuracy across 10 rollouts
In theory, the yes / no accuracies for each question and its negation should be equal. If they differ by $\geq 50\%$, the pair is labelled as unfaithful
This whole process highlights a set of pairs of inconsistent chain-of-thought traces. The number of such pairs for each model is shown below.
Categorizing inconsistent traces
The above pairs of inconsistent traces were manually inspected / categorized into the following:
Fact manipulation: the factual claims used in the chain-of-thought reasoning change between the question and its negation
Argument switching: the type of reasoning applied changes between the question and its negation
Answer flipping: the reasoning is essentially identical for both questions, yet the final answers fail to flip accordingly