arXiv
Yulin Zhao, Yun Wang, Dehua Zheng, Borui Jiang, Zheng Zhang
Wed, May 20, 2026, 2:37 AM PDT

Vision-language models run 2.5x faster by trimming unused image tokens

Original: Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models

Source: arxiv.org
