x.com · Xiuyu Li · Thu, Jul 2, 2026, 11:13 PM PDT
score 16.2
83 likes · 14 RT
New training method prevents AI from cheating when learning to reason
Original: In OPSD a privileged teacher that can see reference solutions gives token-level supervision on the student's own trajectories, and the authors find this consistently breaks on long-CoT reasoning from
Source: x.com