arXiv · Zunhai Su, Rui Yang, Chao Zhang, Yaxiu Liu, Yifan Zhang, Wei Wu, Jing Xiong, Dayou Du, Xialie Zhuang, Yulei Qian, Yuchen Xie, Yik-Chung Wu, Hongxia Yang, Ngai Wong · Tue, May 19, 2026, 3:53 AM PDT

Technique cuts AI model memory use 5x with extreme compression

Original: OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

Source: arxiv.org

Who: Posted on arXiv by a team of fourteen researchers from multiple institutions, including Zunhai Su, Rui Yang, and Chao Zhang. The author list also includes Hongxia Yang and Ngai Wong, who are known for work on efficient deep learning and hardware-aware model compression.

What's new: Running a large language model (LLM) with a very long context (think feeding it an entire book at once) requires storing a large temporary scratchpad called a KV cache. The authors introduce OScaR (Omni-Scaled Canalized Rotation), a method that compresses that scratchpad far more aggressively than existing approaches without meaningfully degrading the model's answers. The code is publicly available on GitHub.
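To get a feel for why the KV cache becomes a bottleneck at long context, here is a rough back-of-the-envelope sketch in Python. The model shape (32 layers, 8 key/value heads, head dimension 128), the 128,000-token context, and the 3-bit width are illustrative assumptions, not figures from the paper. The estimate also ignores the small overhead of the scales and offsets that a real quantization scheme stores alongside the compressed values.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bits=16):
    """Approximate KV cache size in bytes for a single sequence.

    All default dimensions are hypothetical and chosen only for illustration.
    """
    # Each layer stores two tensors (keys and values); each holds one
    # head_dim-sized vector per token per key/value head.
    n_values = 2 * n_layers * n_kv_heads * head_dim * seq_len
    return n_values * bits / 8


tokens = 128_000  # assumed context length, roughly a long book
full = kv_cache_bytes(tokens, bits=16)  # uncompressed 16-bit cache
low = kv_cache_bytes(tokens, bits=3)    # assumed low-bit cache, metadata ignored

print(f"16-bit cache: {full / 2**30:.2f} GiB")
print(f" 3-bit cache: {low / 2**30:.2f} GiB")
print(f"reduction:    {full / low:.1f}x")
```

Under these assumptions the cache size grows linearly with context length, and the saving from quantization is simply the ratio of the bit widths, which is why pushing to very low precision matters so much for long inputs.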

How it works: Most compression schemes work by reducing numerical precision — storing numbers with fewer digits, the way a photograph saved at low resolution takes less space. The existing standard approach, [per-channel quantization](#term:a compression strategy that groups numbers in a memory table by column and picks one