Snowflake speeds up AI training with redundancy elimination

Who: Posted by Stas Bekman (systems engineer at Snowflake AI Research), sharing a Snowflake engineering blog post authored by the Snowflake AI Research systems and modeling teams, including Samyam Rajbhandari, Ye Wang, and roughly a dozen collaborators.

What's new: Snowflake's research team built ZoRRo (Zero Redundancy Rollouts), a set of engineering optimizations that cut training time by 3.5x and reduce memory use enough to support 3.2x longer inputs. The techniques were used to train a model that converts plain-English questions into database query code and outperforms much larger frontier models on Snowflake's internal benchmark.

How it works: The core insight is that standard RL training processes the same question prompt dozens of times, once for each candidate answer, and most of that computation is redundant. ZoRRo uses three techniques to eliminate the waste. First, split attention processes each unique prompt only once during training and reuses that result across all of the prompt's answers, rather than recomputing it from scratch each time. Second, the system keeps the shared prompt data in fast on-chip memory during answer generation, avoiding slow re-reads from the chip's main memory for every candidate answer. Third, it uses a lightweight helper model to predict several words at once, reducing the number of full processing passes needed.
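
To make the first idea concrete, here is a toy sketch in plain Python. It is not Snowflake's implementation: it assumes a single causal attention layer with no learned projections (each token vector serves as its own query, key, and value), and the function names naive_rollout_pass and shared_prompt_pass are hypothetical. It only illustrates that handling a shared prompt once gives the same per-answer results as reprocessing it for every candidate answer, while touching far fewer tokens.

```python
import math
import random


def attend(query, keys, values):
    """Softmax attention of one query vector over a list of key/value vectors."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def causal_attention(tokens, prefix=()):
    """Causal self-attention over `tokens`, which may also see an earlier `prefix`.

    Toy assumption: each token vector is used directly as query, key, and value.
    """
    context = list(prefix)
    outputs = []
    for tok in tokens:
        context.append(tok)  # a token attends to itself and everything before it
        outputs.append(attend(tok, context, context))
    return outputs


def naive_rollout_pass(prompt, answers):
    """Reprocess the full prompt + answer sequence for every candidate answer."""
    outputs, tokens_processed = [], 0
    for ans in answers:
        full = causal_attention(prompt + ans)
        tokens_processed += len(prompt) + len(ans)
        outputs.append(full[len(prompt):])  # keep only the answer positions
    return outputs, tokens_processed


def shared_prompt_pass(prompt, answers):
    """Process the prompt once, then let each answer attend to the shared prompt."""
    causal_attention(prompt)  # prompt positions are computed a single time
    tokens_processed = len(prompt)
    outputs = []
    for ans in answers:
        outputs.append(causal_attention(ans, prefix=prompt))
        tokens_processed += len(ans)
    return outputs, tokens_processed


random.seed(0)
dim = 4
prompt = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(6)]
answers = [[[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
           for _ in range(8)]

naive_out, naive_tokens = naive_rollout_pass(prompt, answers)
shared_out, shared_tokens = shared_prompt_pass(prompt, answers)

print(naive_tokens, shared_tokens)  # 72 30
```

With a 6-token prompt and eight 3-token answers, the naive pass touches 8 x (6 + 3) = 72 tokens while the shared pass touches 6 + 8 x 3 = 30. The gap widens as prompts get longer relative to answers. A real system would also have to handle multiple layers, learned projections, and the backward pass, none of which this sketch covers.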

The numbers: On a cluster of 32 GPUs, training the model dropped from over five days to under 36 hours. The actor update step ran 6x faster and the answer-generation step ran 1.7x faster. The supported input length grew from 20,000 tokens to 64,000 tokens. The resulting model beat frontier models on Snowflake's own benchmark despite being significantly smaller.

Why it matters: The optimizations are not task-specific. They apply to any RL training workload where the same prompt is paired with many candidate answers, which is the standard setup for this kind of training. Snowflake says it will open-source the underlying system, called Arctic RL, in the coming weeks, which would make these speedups available to any team doing similar training without requiring them to rebuild the engineering from scratch.