Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal

Key finding

A new arXiv preprint (arXiv:2605.26772) argues that chain-of-thought (CoT) reasoning in large reasoning models (LRMs) breaks a common assumption used to steer model behavior: that refusal can be captured and controlled by a single linear direction in the model’s internal activations. In instruction-tuned large language models, refusal often aligns with one such direction. LRMs, by contrast, generate intermediate reasoning traces (CoT) that change the model’s internal state token by token. The result: a single steering vector that flips a model’s direct, one-shot response may fail once the model is allowed to “think” through a chain of intermediate steps.
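For readers unfamiliar with the single-direction assumption, the sketch below shows what it typically means in practice. It uses a difference-of-means direction, a common recipe in earlier steering work, and is not taken from the preprint. The activations are synthetic, and the function names (refusal_direction, ablate, add_direction), the dimension, and the steering strength are all illustrative assumptions.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations (hypothetical helper)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate(acts, direction):
    """Remove each activation's component along a unit direction."""
    return acts - np.outer(acts @ direction, direction)

def add_direction(acts, direction, alpha):
    """Shift every activation by alpha along a unit direction."""
    return acts + alpha * direction

# Synthetic stand-in for activations at one layer and one token position.
# Assumption: "harmful" prompts differ from "harmless" ones along one axis.
rng = np.random.default_rng(0)
d = 16
true_dir = np.zeros(d)
true_dir[0] = 1.0
harmless = rng.normal(size=(200, d))
harmful = rng.normal(size=(200, d)) + 4.0 * true_dir

r_hat = refusal_direction(harmful, harmless)
ablated = ablate(harmful, r_hat)                      # projection onto r_hat removed
steered = add_direction(harmless, r_hat, alpha=4.0)   # projection onto r_hat raised by 4
```

In this toy setting one vector is enough by construction. The preprint's argument is that this stops being a safe assumption once the internal state keeps moving during a chain of thought.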

Why it matters

The practical consequence is straightforward and worrying for safety engineers and red teams. If refusal depends on a sequence of internal states rather than a fixed subspace, standard steering and projection techniques can be bypassed, at least transiently, by inducing particular CoT trajectories. How do you make a model refuse when its thoughts keep changing? The authors demonstrate failure modes and argue that new steering mechanisms will need to account for token-level dynamics and the internal trajectories that produce final outputs.
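One way to picture the token-level concern is to estimate a separate direction at every token position and check how well each one aligns with the first. The sketch below does this on synthetic data in which the separating direction is made to rotate across positions. That rotation is an assumption chosen for illustration, not a result from the preprint, and the helper names (per_token_directions, cosine_to_first) are hypothetical.

```python
import numpy as np

def per_token_directions(harmful, harmless):
    """Difference-of-means unit direction at each token position.

    Inputs have shape (n_prompts, n_tokens, d); output has shape (n_tokens, d).
    """
    diff = harmful.mean(axis=0) - harmless.mean(axis=0)
    return diff / np.linalg.norm(diff, axis=1, keepdims=True)

def cosine_to_first(directions):
    """Cosine similarity between each position's direction and position 0's."""
    return directions @ directions[0]

# Toy trajectories: the separating direction rotates from axis 0 to axis 1
# over the course of the sequence (an illustrative assumption).
rng = np.random.default_rng(0)
n, T, d = 100, 8, 16
angles = np.linspace(0.0, np.pi / 2, T)
signal = np.zeros((T, d))
signal[:, 0] = np.cos(angles)
signal[:, 1] = np.sin(angles)

harmless = rng.normal(size=(n, T, d))
harmful = rng.normal(size=(n, T, d)) + 5.0 * signal

dirs = per_token_directions(harmful, harmless)
cos = cosine_to_first(dirs)   # starts at 1.0 and falls as the direction rotates
```

If the cosine stays near 1 at every position, a single vector is a reasonable summary. If it decays, as it does here by design, a vector estimated at one position captures less and less of what separates the two groups later in the sequence.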

Industry and policy context

This is not just an academic quibble. LRMs are being pursued by research groups and companies worldwide, and the mismatch between control assumptions and model dynamics could complicate deployment and auditing. Geopolitics also matters: tighter export controls on advanced AI accelerators and ongoing debates about regulation shape who can train and iterate on such models, and therefore who is most exposed to these control challenges. Researchers say the findings make it more urgent to develop robust interpretability tools and safety evaluations that probe intermediate reasoning traces, not only final answers.

Next steps

The paper calls for methods that explicitly model or constrain chain-of-thought trajectories rather than relying on single-direction steering, and for benchmarks that test refusal under dynamic internal states. These are necessary next steps if LRMs are to be used safely in high-stakes settings. The full preprint is available on arXiv (https://arxiv.org/abs/2605.26772) for researchers and practitioners who want to examine the experiments and proposed mitigations in detail.