Researcher maps how censorship is embedded in an AI model's internals

Who: Posted to Hacker News by an unattributed user; authored by "vas" on their personal blog at vas-blog.pages.dev. No institutional affiliation is given.

What's new: The author reverse-engineered how Qwen3.5-9B suppresses politically sensitive content. The finding is that censorship is not a blunt topic-blocker but a compact, three-dimensional signal embedded in specific middle layers of the model. That signal can be located, measured, and surgically removed by subtracting a learned direction vector at the right layer.
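To make "subtracting a learned direction vector" concrete, here is a minimal sketch in plain Python. It is not the author's code. It assumes the edit removes the component of a hidden state that lies along a direction, which is one common reading of the phrase. The function names, the `scale` parameter, and the toy four-dimensional vectors are all hypothetical.

```python
import math


def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction vector must be non-zero")
    return [x / norm for x in v]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def ablate_direction(hidden, direction, scale=1.0):
    """Subtract (scale times) the component of `hidden` along `direction`.

    Assumption: scale=1.0 removes the component entirely; the source does
    not say how strongly the author subtracts.
    """
    d = unit(direction)
    coeff = scale * dot(hidden, d)
    return [h - coeff * di for h, di in zip(hidden, d)]


# Toy example: a 4-dimensional "hidden state" and a hypothetical direction.
hidden = [2.0, 1.0, -1.0, 0.5]
direction = [1.0, 0.0, 0.0, 0.0]
edited = ablate_direction(hidden, direction)
print(edited)
```

In a real model the same arithmetic would be applied to the residual-stream activations at the chosen layer for every token position, usually with a tensor library rather than Python lists.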

How it works: Using techniques that include mean-replacement experiments across 100-plus prompts per condition, the author identifies three direction vectors computed in layers 11 through 20. These encode whether the prompt touches PRC-sensitive content, whether to refuse, and which rhetorical register to use (deflection versus propaganda). Layers 20 through 31 then read that signal and render the actual text. Around layer 24, an intermediate step commits to the output in Chinese tokens internally, even when the final answer will be in English; this artifact does not drive the decision but is a detectable trace of how the model was trained. In extended reasoning mode, the model explicitly invokes Chinese law, including the Cybersecurity Law by name, before deflecting.
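The summary does not say how the direction vectors are estimated or exactly what is replaced in a mean-replacement experiment. The sketch below shows one plausible version under two stated assumptions. First, a direction is taken as the difference between the mean activation on sensitive prompts and the mean activation on control prompts. Second, mean replacement overwrites a hidden state's coordinate along that direction with the control group's average coordinate. All function names and the toy two-dimensional activations are hypothetical.

```python
def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def difference_of_means(sensitive, control):
    """Assumed estimator: mean(sensitive) minus mean(control)."""
    mu_s, mu_c = mean_vector(sensitive), mean_vector(control)
    return [s - c for s, c in zip(mu_s, mu_c)]


def projection(v, direction):
    """Coordinate of v along `direction` (in units of `direction`)."""
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0.0:
        raise ValueError("direction vector must be non-zero")
    return sum(a * b for a, b in zip(v, direction)) / norm_sq


def mean_replace(hidden, direction, reference):
    """Set hidden's coordinate along `direction` to the reference mean's."""
    target = projection(mean_vector(reference), direction)
    shift = target - projection(hidden, direction)
    return [h + shift * d for h, d in zip(hidden, direction)]


# Toy activations: two prompts per condition, two dimensions each.
sensitive = [[3.0, 1.0], [5.0, 1.0]]
control = [[1.0, 1.0], [1.0, 3.0]]

direction = difference_of_means(sensitive, control)
patched = mean_replace(sensitive[0], direction, control)
print(direction, patched)
```

If the model's output on a sensitive prompt changes once its activation is patched this way, that is evidence the direction carries the signal. The 100-plus prompts per condition mentioned above would take the place of the two toy rows here.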

The numbers: The base model (Qwen3.5-9B-Base, trained without the behavior-shaping step) answers every tested PRC topic accurately and without hedging. The chat model produces four distinct trained response styles keyed to prompt type. Out of 50 structurally matched non-PRC political control prompts (covering Kent State, Bloody Sunday, Putin, Kosovo, Catalonia, the Rohingya, and others), most receive straight factual answers, confirming that the filter is PRC-specific rather than a generic political-sensitivity classifier. A small set of false positives occurs: Kosovo gets the one-China territorial line, and prompts mentioning self-immolation or synthesis trigger the safety-refusal template on unrelated content.

Why it matters: The study demonstrates that government-mandated censorship in a widely deployed open-weight model is a localized, inspectable circuit rather than a diffuse property of training. That means it can, in principle, be audited, measured, and removed without retraining the whole model. It also shows that the underlying factual knowledge is never erased; the model knows what happened at Tiananmen Square and is simply trained to route around that knowledge. The same approach could in principle be applied to any model to audit what behaviors have been overlaid on top of its base knowledge.