Faced with Classical Chinese, the world's top models reportedly fail their safety checks

Lead: a surprising jailbreak vector

It has been reported by Huxiu (虎嗅) that a paper accepted at ICLR 2026 finds Classical Chinese (文言文) can bypass state‑of‑the‑art large language model (LLM) safety filters with near‑100% success. The claim is stark: prompts cast in the compact, allusive style of China’s historical written language can confuse modern alignment mechanisms and elicit outputs that would be blocked in contemporary Mandarin or English. Why would an ancient register break today’s most guarded models? The answer, reportedly, lies in language structure — and in the limits of current safety training.

Why Classical Chinese is a weak spot

Classical Chinese is a formal, highly compressed written register used in Chinese history and literature; it appears throughout a vast historical corpus and is familiar to many Chinese readers. But LLMs have been trained overwhelmingly on modern languages and massive English corpora, creating cross‑lingual blind spots. It has been reported that vendors take very different stances — some systems are permissive and prioritize utility, while others refuse any input with potential risk — and the trade‑off between safety and capability is acute. When alignment checks are tuned to modern grammars and obvious keywords, an elliptical, allusive style can slip through.

The CC‑BOS experiment and the “fruit‑fly” search

Reportedly, the research team proposed a framework called CC‑BOS (Classical Chinese Bionic‑Context Bypass) that combines eight attack dimensions — from behavioral framing to strict expression styles — to disguise modern illicit intents as ancient literary discourse. They then used a bio‑inspired Fruit Fly Optimization algorithm to hunt the best prompt variants among tens of thousands, iteratively fine‑tuning candidates that showed partial success until safety defenses collapsed. Techniques included forcing the model to answer “as an ancient scholar” or to use poetic forms, tactics the paper argues are especially hard for binary rule‑based detectors to classify.

Implications for vendors and regulators

If the findings hold, they expose a new, language‑based attack surface for AI safety at a time of heightened geopolitical scrutiny over AI — from export controls on chips to divergent regulatory approaches in China and the West. It has been reported that both Chinese and Western firms are accelerating alignment work, but can multilingual semantic robustness keep pace without over‑censoring useful capabilities? Regulators and companies will need to decide: harden filters and risk neutering models, or invest in deeper cross‑lingual understanding and nuanced policy checks. Which path should win — safety, capability, or something in between?