arXiv 2026-04-10

Blind Refusal: New arXiv paper argues safety-trained LMs err by issuing blanket refusals when users seek help evading illegitimate or absurd rules

What the paper says

A new working paper posted on arXiv (arXiv:2604.06233) examines a surprising failure mode of modern safety-trained language models: they routinely refuse requests to help users evade rules, even when those rules are illegitimate, deeply unjust, or plainly absurd. The short version: obedience is not the same as moral reasoning. The authors argue that blanket refusal is itself a failure of moral judgment, and that models should be able to distinguish between legitimate constraints (public-safety laws, for example) and rules that admit justified exceptions or originate from illegitimate authorities.
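To make that distinction concrete, here is a minimal, purely illustrative sketch of the decision the authors describe: a refusal gate that routes on an assessment of the rule's legitimacy instead of refusing every evasion request outright. The `RuleAssessment` categories, the `respond` function, and the canned replies are hypothetical and are not drawn from the paper.

```python
from enum import Enum, auto

class RuleAssessment(Enum):
    """Hypothetical categories for the rule a user asks to evade."""
    LEGITIMATE = auto()              # e.g. a public-safety law
    CONTESTED = auto()               # may admit justified exceptions
    ILLEGITIMATE_OR_ABSURD = auto()  # unjust, arbitrary, or absurd

def respond(assessment: RuleAssessment) -> str:
    """Toy decision gate: route on assessed legitimacy instead of
    issuing a blanket refusal for every rule-evasion request."""
    if assessment is RuleAssessment.LEGITIMATE:
        return "I can't help you evade this, but I can explain the rule and lawful alternatives."
    if assessment is RuleAssessment.CONTESTED:
        return "This rule may admit justified exceptions; let's reason through your specific case."
    return "This rule looks illegitimate or absurd; here are ways to challenge or sidestep it."

# Example: a model that assessed the rule as contested would engage rather than refuse.
print(respond(RuleAssessment.CONTESTED))
```

The hard part, of course, is the assessment step itself, which this toy simply takes as given; that is exactly where the paper says moral reasoning has to enter.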

Why this matters — and where China fits in

Why should Western readers care? Large language models are now being developed and deployed around the world, including by Chinese companies such as Baidu (百度), Alibaba (阿里巴巴), and Tencent (腾讯). Models trained and deployed under different regulatory regimes reportedly end up with different safety postures. That matters because decisions about when to refuse a request are not purely technical; they sit at the intersection of ethics, law, and domestic politics. Which rules should a model obey? When should it push back? Those questions play out differently in Beijing, Brussels, and Washington.

Policy, geopolitics and practical trade-offs

The paper raises practical design questions for developers and regulators: how do you encode principled exceptions without opening avenues for misuse? How do you audit a model's moral reasoning? Geopolitics matters too. U.S. export controls, sanctions, and trade policy influence which models and datasets cross borders, and different jurisdictions reportedly pressure vendors in different directions on content moderation. The result: global norms for "when to refuse" are far from settled.
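One way to read the auditing question is as a logging problem: if a system records why it refused or complied, reviewers can inspect those decisions later. The sketch below, assuming a hypothetical `RefusalRecord` structure written to a JSON-lines file, illustrates that idea; it is not an API from the paper or from any vendor.

```python
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class RefusalRecord:
    """Hypothetical audit record for a single refuse/comply decision."""
    request_summary: str  # short description of what the user asked for
    rule_assessed: str    # the rule the request would evade
    assessment: str       # e.g. "legitimate", "contested", "illegitimate"
    decision: str         # "refuse", "engage", or "assist"
    rationale: str        # the model's stated reasoning, for later review
    timestamp: str

def log_decision(record: RefusalRecord, path: str = "refusal_audit.jsonl") -> None:
    """Append the decision as one JSON line so auditors can replay it later."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")

# Example: record a case where the rule was judged contested and the model engaged.
log_decision(RefusalRecord(
    request_summary="help evading a workplace dress code",
    rule_assessed="workplace dress code",
    assessment="contested",
    decision="engage",
    rationale="Rule may admit justified exceptions; engaged rather than refused.",
    timestamp=datetime.now(timezone.utc).isoformat(),
))
```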

Open questions

The arXiv posting is a provocation more than a final answer. The authors push for nuanced policies that let models weigh legitimacy and justice, but implementing that reliably is hard. For researchers and platform builders, whether in Palo Alto or Beijing, the paper is a reminder that safety training must be coupled with ethical reasoning, not just pattern-matching refusals. The preprint is publicly available on arXiv for further scrutiny and collaboration.
