arXiv, 2026-04-02

Therefore I Am. I Think — arXiv paper finds early, decodable “choices” inside reasoning models

Key finding

A new preprint on arXiv (arXiv:2604.01202) argues that large reasoning models often encode discrete choices very early in their internal activations, before the model's chain-of-thought tokens are produced. The authors present evidence that a simple linear probe can reliably decode these early-encoded signals, including intended tool calls and other discrete selections. That suggests what looks like deliberative "thinking" in a generated chain-of-thought may be shaped by prior, latent commitments inside the model.
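
To make "a simple linear probe" concrete: the recipe is to cache a hidden state from inside the model, taken before any reasoning tokens exist, and fit an ordinary linear classifier on it. The sketch below is a minimal illustration under assumed placeholders (a small gpt2 stand-in, made-up prompts and tool labels), not the paper's actual models or data.

```python
# Minimal sketch of the probing idea, assuming a small stand-in model
# (gpt2) rather than the reasoning models the paper studies. Prompts and
# tool labels are hypothetical placeholders.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, output_hidden_states=True)
model.eval()

def prompt_activation(prompt: str, layer: int = 6) -> torch.Tensor:
    """Hidden state of the final prompt token at one layer, captured
    before the model has emitted any reasoning tokens."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids)
    return out.hidden_states[layer][0, -1]  # shape: (hidden_dim,)

# Hypothetical dataset: prompts paired with the discrete choice the model
# eventually makes (e.g. which of two tools it calls). A real experiment
# needs many examples and held-out evaluation.
prompts = ["What is 17 * 24?", "What is the capital of Peru?"]
labels = [0, 1]  # 0 = calculator tool, 1 = search tool (illustrative)

X = torch.stack([prompt_activation(p) for p in prompts]).numpy()
probe = LogisticRegression(max_iter=1000).fit(X, labels)
print("training accuracy:", probe.score(X, labels))
```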

Method and what the paper does — and does not — claim

The paper frames the question bluntly: does a reasoning model "think then decide," or "decide then think"? Using probing techniques standard in interpretability research, the team trains lightweight classifiers on intermediate activations to predict a model's downstream choices. They report that such probes extract decision-relevant information well before the corresponding tokens appear in the output. The work stops short of attributing agency or consciousness to models; instead, it treats these signals as mechanistic, decodable patterns in high-dimensional activations.
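
A rough sketch of that timing question, again under assumed placeholders rather than the paper's protocol: cache activations for a batch of equal-length prompts across layers and token positions, then map where a probe's cross-validated accuracy first rises.

```python
# Sketch of the timing question, assuming activations have been cached
# for a batch of equal-length prompts: at which layer, and how early in
# the sequence, can a linear probe decode the eventual choice? The data
# here is synthetic; real activations would come from a model like the
# one in the earlier sketch.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def decodability_map(acts: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """acts: (n_examples, n_layers, n_positions, hidden_dim).
    Returns cross-validated probe accuracy per (layer, position)."""
    n_examples, n_layers, n_positions, _ = acts.shape
    acc = np.zeros((n_layers, n_positions))
    for layer in range(n_layers):
        for pos in range(n_positions):
            probe = LogisticRegression(max_iter=1000)
            acc[layer, pos] = cross_val_score(
                probe, acts[:, layer, pos, :], labels, cv=5
            ).mean()
    return acc

# Synthetic stand-in: 40 examples, 4 layers, 8 prompt positions, 64 dims.
rng = np.random.default_rng(0)
acts = rng.normal(size=(40, 4, 8, 64))
labels = rng.integers(0, 2, size=40)

# If accuracy is already high at the last prompt position, before any
# chain-of-thought token exists, the choice is "early-encoded" in the
# paper's sense: decodable prior to the visible reasoning.
print(decodability_map(acts, labels).round(2))
```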

Why this matters

If decisions are made early and then rationalized by the subsequently generated reasoning, that changes how researchers and regulators should read chain-of-thought outputs. For proponents of using model explanations to audit behavior or align systems, the result raises a thorny question: is a chain-of-thought a faithful trace of internal deliberation, or mainly a post-hoc narrative? The finding is also relevant to policy debates in the U.S., the EU, and China about model transparency and safety, particularly as governments consider rules requiring interpretability or recordable decision provenance.

Limitations and next steps

The authors present compelling probes but caution that decoding success depends on model architecture, probe design, and the datasets used. More work is needed to generalize across model families and to connect decoded signals to causal interventions. For practitioners and policymakers alike, the paper is a timely reminder: understanding what a model "decides" requires looking under the hood, not just at the text it produces.
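
For readers wondering what such a causal intervention might look like, a common technique in the interpretability literature is activation steering: push a layer's activations along the probe's decision direction and see whether behavior changes. The sketch below is a hypothetical illustration of that idea, not a method from the paper.

```python
# Hypothetical causal follow-up: steer a layer's residual activations
# along a trained probe's weight direction and check whether the model's
# eventual choice flips. Layer index and scale are illustrative; this is
# not a procedure reported in the paper.
import torch

def steer_along_probe(layer_module, direction: torch.Tensor, scale: float):
    """Register a forward hook that shifts a module's output activations
    along `direction` (e.g. a probe's coefficient vector)."""
    direction = direction / direction.norm()

    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + scale * direction  # broadcasts over batch, seq
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

    return layer_module.register_forward_hook(hook)

# Usage with the gpt2 stand-in and probe from the earlier sketch:
# direction = torch.tensor(probe.coef_[0], dtype=torch.float32)
# handle = steer_along_probe(model.transformer.h[6], direction, scale=4.0)
# ... generate text, observe whether the decoded choice changes ...
# handle.remove()
```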
