x.com · Sophia Xiao Pu · Fri, May 22, 2026, 5:37 AM PDT
score 16.3
81 likes · 12 RT · 1 reply
Self-play training for language models fails due to data filtering, not reward design
Original: 🚨 Why does Self-Play RL for LLMs keep collapsing? Most fixes focus on the reward signal. In our new paper "Survive or Collapse", we show that's the wrong lever. The true binding constraint is actuall
Source: x.com ↗