arXiv · Zichun Yu, Chenyan Xiong · Sun, May 17, 2026, 9:44 PM PDT

Generating synthetic variations of real text to stretch training data

Original: Generating Pretraining Tokens from Organic Data for Data-Bound Scaling

Source: arxiv.org
