arXiv | Yongxi Zhou, Lai Yun Choi, Jiaxi Wen, Wenbo Ye | Sat, May 30, 2026, 4:03 PM PDT

LLMs produce inconsistent code across repeated runs, hiding real quality gaps

Original: Accuracy, Stability, and Repeated-Run Reliability of Large Language Models on Deterministic Programming Tasks

Source: arxiv.org
