Moment-Video: Diagnosing Temporal Fidelity of Video MLLMs on Momentary Visual Events

Video multimodal large language models (MLLMs) have made rapid progress on general and long-form video understanding, yet their ability to preserve brief answer-critical visual evidence remains underexplored. Many practical questions are determined by momentary visual events: localized actions or state transitions that may last only a few frames. Such evidence can be skipped by sparse frame sampling, suppressed by visual-token compression, or diluted by coarse temporal aggregation, causing failu

Paper · arXiv

cs.CV

Authors: Xiaolin Liu, Yilun Zhu, Xiangyu Zhao, Xuehui Wang, Yan Li + 7 more
Published: 2026-06-01
Categories: cs.CV, cs.AI

via arXiv · 2606.02522