Technique cuts AI model memory use 5x with extreme compression
Original: OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond
Source: arxiv.org ↗
Who: Posted on arXiv by a team of twelve researchers — Zunhai Su, Rui Yang, Chao Zhang, and colleagues — drawn from multiple institutions including affiliations with Hongxia Yang and Ngai Wong, who are known for work on efficient deep learning and hardware-aware model compression.
What's new: Running a large with a very long context — think feeding it an entire book at once — requires storing a large temporary scratchpad called a . The authors introduce (Omni-Scaled Canalized Rotation), a method that compresses that scratchpad far more aggressively than existing approaches without meaningfully degrading the model's answers. The code is publicly available on GitHub.
How it works: Most compression schemes work by reducing numerical precision — storing numbers with fewer digits, the way a photograph saved at low resolution takes less space. The existing standard approach, [per-channel quantization](#term:a compression strategy that groups numbers in a memory table by column and picks one