arXiv | Wendy K. Tam | Mon, Jun 8, 2026, 10:00 AM PDT

RLHF masks partisan bias without erasing it from language models

Original: The Neutral Mask: How RLHF Provides Shallow Alignment while Leaving Partisan Structure Intact in a Large Language Model

Source: arxiv.org
