arXiv
Yongxi Zhou, Lai Yun Choi, Jiaxi Wen, Wenbo Ye
Sat, May 30, 2026, 4:03 PM PDT
score 15.7
LLMs produce inconsistent code across repeated runs, hiding real quality gaps
Original: Accuracy, Stability, and Repeated-Run Reliability of Large Language Models on Deterministic Programming Tasks
Source: arxiv.org