Self-play training for language models fails due to data filtering, not reward design

Original: 🚨 Why does Self-Play RL for LLMs keep collapsing? Most fixes focus on the reward signal. In our new paper "Survive or Collapse", we show that's the wrong lever. The true binding constraint is actuall

Source: x.com
