New training method prevents AI from cheating when learning to reason

Original: In OPSD a privileged teacher that can see reference solutions gives token-level supervision on the student's own trajectories, and the authors find this consistently breaks on long-CoT reasoning from

Source: x.com ↗

Writing ELI5 summary…