arXiv · Blake Bullwinkel, Eugenia Kim, Amanda Minnich, Mark Russinovich · Mon, Jun 8, 2026, 9:21 AM PDT
score 17.1

Training AI models to attack and defend themselves simultaneously

Original: Learning to Attack and Defend: Adaptive Red Teaming of Language Models via GRPO

Source: arxiv.org
