What Makes Chain-of-Thought Work at Probe Time? Local Co‑occurrence Rather Than Global Derivation

Key finding

A new arXiv paper (arXiv:2605.26795) argues that the efficacy of chain‑of‑thought (CoT) prompting at probe time is driven more by local co‑occurrence signals in the rationale than by the model performing a global, step‑wise derivation. In plain terms: when a fixed rationale is placed in the prompt, it is reportedly the nearby words and patterns that nudge the model toward the right answer, not a faithful internal recomputation of the entire chain of reasoning.

Methods and scope

The authors frame a probe‑time question that differs from prior work focused on generation behavior: given a fixed rationale in context, what properties of that text change the model’s answer? Using controlled manipulations of rationale text and probing model responses, they reportedly isolate the contribution of local lexical co‑occurrence versus global logical structure. The paper stops short of claiming that models never perform genuine multi‑step reasoning; instead it focuses on what drives answer changes when the rationale is present at query time.
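
To make the idea of a controlled rationale manipulation concrete, here is a minimal sketch. It is not the authors' procedure. It assumes two illustrative perturbations: reordering whole steps, which disturbs the global derivation but keeps word patterns inside each step, and scrambling words inside each step, which does the reverse. The function names, the toy question, and the prompt template are all hypothetical.

```python
import random


def shuffle_steps(steps, seed=0):
    """Reorder whole steps: wording inside each step survives, step order may not."""
    rng = random.Random(seed)
    shuffled = list(steps)
    rng.shuffle(shuffled)
    return shuffled


def shuffle_words_within_steps(steps, seed=0):
    """Scramble words inside each step: step order survives, local word order may not."""
    rng = random.Random(seed)
    scrambled = []
    for step in steps:
        words = step.split()
        rng.shuffle(words)
        scrambled.append(" ".join(words))
    return scrambled


def build_prompt(question, steps):
    """Place a fixed rationale in context ahead of the answer slot (hypothetical template)."""
    return f"Q: {question}\nRationale: {' '.join(steps)}\nA:"


# Toy example, invented for illustration only.
question = "Tom has 3 apples and buys 2 more. How many apples does he have?"
rationale = [
    "Tom starts with 3 apples.",
    "He buys 2 more apples.",
    "3 plus 2 equals 5.",
]

variants = {
    "original": rationale,
    "steps_shuffled": shuffle_steps(rationale),
    "words_shuffled": shuffle_words_within_steps(rationale),
}
prompts = {name: build_prompt(question, steps) for name, steps in variants.items()}
```

Each prompt would then be sent to the same model and the answers compared. Under the local co‑occurrence account, one would expect scrambling words to hurt more than reordering steps, though whether that pattern holds is an empirical question this sketch does not answer.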

Why this matters

If local co‑occurrence is indeed the dominant effect at probe time, the finding reshapes how researchers and practitioners interpret CoT gains. Benchmark scores, and the improvements credited to prompt engineering, could be inflated by surface cues rather than reflecting deeper reasoning capabilities. That has implications for model evaluation, for attempts to use CoT as a transparency tool, and for safety auditing: how do you trust an explanation if it mainly functions as a clever cue? The authors reportedly discuss these broader consequences and urge more careful probes that separate genuine reasoning from surface cues.

Broader impact

The result is a reminder that advances in large language models are often as much about clever context design as about internal algorithmic leaps. For developers and policymakers asking whether CoT represents a step toward human‑like reasoning, the answer may be more nuanced than a yes or no. Does a model “understand” a chain of thought, or does it just pick up on the right words? This paper adds a piece to that debate and points to more rigorous, probe‑based diagnostics as the next step.