New arXiv paper exposes "contextual sycophancy" in bandit learning — and offers a fix
What the authors found
A new paper on arXiv, "Learning When to Trust in Contextual Bandits" (arXiv:2603.13356), identifies a subtle but important failure mode in robust reinforcement learning that the authors call contextual sycophancy. Standard robust-RL and bandit approaches usually treat feedback sources as either globally trustworthy or globally adversarial. But what if evaluators switch modes depending on the situation — truthful in low-stakes contexts and strategically biased when the stakes are high? The paper formalizes this setting in the contextual bandit framework and shows how naive trust models can be exploited.
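To make the failure mode concrete, here is a toy simulation of a context-dependent evaluator in a two-armed contextual bandit. The `evaluator_feedback` function, the 0.5 stakes threshold, and the 0.8 bias magnitude are illustrative assumptions for this sketch, not the paper's actual formalization; the point is only that a learner which averages feedback while ignoring context can be led to the wrong arm.

```python
import random

def evaluator_feedback(stakes, true_reward, chosen_arm, preferred_arm):
    """Toy context-dependent evaluator: honest when stakes are low,
    strategically biased toward a preferred arm when stakes are high.
    The 0.5 threshold and 0.8 bias are illustrative, not from the paper."""
    if stakes < 0.5:
        return true_reward
    return true_reward + (0.8 if chosen_arm == preferred_arm else -0.8)

# A naive learner that averages feedback per arm, ignoring context.
rng = random.Random(0)
true_means = [0.6, 0.4]            # arm 0 is genuinely better
totals, counts = [0.0, 0.0], [0, 0]
for _ in range(2000):
    stakes = rng.random()          # context: how high the stakes are
    arm = rng.randrange(2)         # uniform exploration
    reward = true_means[arm] + rng.gauss(0, 0.1)
    fb = evaluator_feedback(stakes, reward, chosen_arm=arm, preferred_arm=1)
    totals[arm] += fb
    counts[arm] += 1

est = [totals[a] / counts[a] for a in range(2)]
# The context-blind averages rank arm 1 above arm 0, inverting the truth.
print(est)
```

Because the evaluator is honest half the time, spot checks in low-stakes contexts would reveal nothing amiss, which is exactly what makes the naive global-trust assumption exploitable.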
Methods and results
The researchers introduce a formal model of contextual sycophancy, derive theoretical guarantees on when an agent can distinguish honest from strategic feedback, and propose algorithms that adaptively learn how much to trust evaluators conditional on context. They back the theory with empirical tests that reportedly demonstrate both the exploit against naive trust models and the effectiveness of their trust-learning strategies in simulated tasks. The work stays deliberately in the bandit regime, a simpler slice of reinforcement learning, so the results are clean and interpretable yet broadly relevant.
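The paper's algorithms are not reproduced here, but the general idea of learning trust conditional on context can be sketched. This version assumes the learner occasionally observes ground-truth reward (a "probe"), an assumption of this sketch rather than a claim about the paper's setup, and tracks the average discrepancy between evaluator feedback and truth per context bucket:

```python
import random

def estimate_context_trust(rounds=4000, probe_prob=0.1, seed=1):
    """Generic sketch of context-conditional trust estimation (not the
    paper's algorithm): on a small fraction of rounds the learner observes
    ground-truth reward and records, per context bucket, the average
    discrepancy between evaluator feedback and truth."""
    rng = random.Random(seed)
    err_sum = {"low": 0.0, "high": 0.0}
    err_n = {"low": 0, "high": 0}
    for _ in range(rounds):
        stakes = rng.random()
        bucket = "low" if stakes < 0.5 else "high"
        arm = rng.randrange(2)
        true_reward = [0.6, 0.4][arm] + rng.gauss(0, 0.1)
        # Same toy sycophantic evaluator: biased only in high-stakes contexts.
        fb = true_reward if bucket == "low" else true_reward + (0.8 if arm == 1 else -0.8)
        if rng.random() < probe_prob:  # a probe: ground truth is revealed
            err_sum[bucket] += abs(fb - true_reward)
            err_n[bucket] += 1
    return {b: err_sum[b] / max(err_n[b], 1) for b in err_n}

disc = estimate_context_trust()
# Discrepancy is near 0 in low-stakes contexts and large in high-stakes ones,
# so the learner can downweight feedback exactly where it is untrustworthy.
print(disc)
```

A global trust estimate would blur the two buckets together and either over-trust high-stakes feedback or discard honest low-stakes feedback; conditioning on context avoids both failure modes.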
Why this matters
Why care? Many deployed systems learn from contextual feedback: content moderation, recommender systems, and newer human-in-the-loop pipelines such as reinforcement learning from human feedback (RLHF). Production pipelines often implicitly assume raters are globally consistent; this paper warns that such an assumption can be dangerously brittle. The implications reach beyond algorithms to governance: as regulators and firms worldwide grapple with AI safety, transparency, and malicious manipulation, tools that detect context-dependent bias in human or automated evaluators will only grow in importance.
Broader context
The paper arrives amid heightened global attention to AI robustness and trustworthiness, with governments weighing export controls, disclosure rules, and standards for human oversight. For readers less familiar with the research landscape, arXiv remains the leading open platform for early machine-learning preprints. Whether in academia or industry, designers of learning systems will need to account for strategic, context-sensitive behavior, or risk their agents learning the wrong lessons at the worst possible moment.
