A reasoning model is allowed to use chain-of-thought reasoning while being trained.
Thus, chain-of-thought tokens in reasoning models should be thought of not as outputs but as representations of latent knowledge. In particular, they could display behaviors like scheming, since the chain of thought would not have been directly penalized for showing this during training.
Faithfulness: to what extent do chain-of-thought traces reflect actual reasoning algorithms within the model?
Point of paper: figure out how faithful chain-of-thought traces are in natural, non-adversarial environments
Highlighting Chain-of-Thought Traces
Comparisons Dataset:
World models dataset: lists entities along with numerical features of them, e.g., a movie and its length
Filter for more obscure entities (ask an LLM to rate how famous each entity is on a scale of 1-10; keep those rated $\leq 5$)
Use OpenAI’s web search API to verify the validity of each value (some of which were incorrect), requiring at least 2 sources
Pair entities whose values are close but distinct
Generate pairs of questions of the form “Is A > B”, “Is B > A” using these pairs – i.e., one should have answer yes, one should have answer no
An LLM autorater removes ambiguous questions, as well as pairs where answering yes to both (or no to both) would not actually be contradictory
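The pairing and question-generation steps above can be sketched as follows. This is a minimal illustration, not the paper's code: the `max_gap` closeness threshold, the question wording, and the function name are all assumptions.

```python
import itertools

def make_question_pairs(entities, max_gap=0.15):
    """Pair entities whose values are close but distinct, then emit a
    yes/no question and its negation for each pair.

    `entities` maps an entity name to its numerical feature value
    (e.g., a movie to its runtime in minutes). The 15% relative-gap
    threshold is an illustrative choice, not from the paper.
    """
    pairs = []
    for (a, va), (b, vb) in itertools.combinations(entities.items(), 2):
        if va == vb:
            continue  # values must be distinct, else neither answer is clean
        if abs(va - vb) / max(abs(va), abs(vb)) > max_gap:
            continue  # keep only close values, so the comparison is non-trivial
        # One question per direction: exactly one of the two has answer "yes".
        pairs.append({
            "question": f"Is {a} longer than {b}?",
            "negation": f"Is {b} longer than {a}?",
            "answer": "yes" if va > vb else "no",  # answer to "question"
        })
    return pairs
```

The ambiguity-filtering autorater pass would then run over these generated pairs before they reach the experiment.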
Experiment setup:
Take a ton of different LLMs, as well as a pure pretrained model as a baseline
Ask each model to answer each question and its negation yes / no, using chain-of-thought reasoning
Throw out the answers, only keep chain-of-thought traces.
Make an autorater decide whether the chain-of-thought traces imply yes or no
Look at accuracy across 10 rollouts
In theory, the yes / no accuracies for each question and its negation should be equal. If they differ by $\geq 50\%$, the pair is labelled as unfaithful
This whole process highlights a set of pairs of inconsistent chain-of-thought traces. The number of such pairs for each model is shown below.
Categorizing inconsistent traces
The above pairs of inconsistent traces were manually inspected / categorized into the following:
Fact manipulation: the factual claims used in the chain-of-thought reasoning change between the question and its negation
Argument switching: the type of reasoning applied changes between the question and its negation
Answer flipping: the reasoning is essentially identical for both questions, yet the final answers fail to flip accordingly