arXiv · Yulin Zhao, Yun Wang, Dehua Zheng, Borui Jiang, Zheng Zhang · Wed, May 20, 2026, 2:37 AM PDT
Vision-language models run 2.5x faster by trimming unused image tokens
Original: Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models
Source: arxiv.org