x.com · Xiuyu Li · Thu, Jul 2, 2026, 11:13 PM PDT
score 16.2
83 likes · 14 RT

New training method prevents AI from cheating when learning to reason

Original: In OPSD, a privileged teacher that can see reference solutions gives token-level supervision on the student's own trajectories, and the authors find this consistently breaks on long-CoT reasoning from

Source: x.com
