LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence

We introduce LLaVA-OneVision-2 (LLaVA-OV-2), the most capable vision-language model in the LLaVA-OneVision series to date, achieving superior performance across a broad range of multimodal benchmarks. The model builds on a native OneVision-Encoder and incorporates Windowed Attention for efficient local computation while maintaining native resolution. Its key advance is codec-stream tokenization: it treats compressed video as a continuous bit-cost stream, where bit-cost dynamics determine adaptiv

Paper · arXiv

cs.CV

Authors: Xiang An, Yin Xie, Feilong Tang, Yunyao Yan, Huajie Tan + 25 more
Published: 2026-05-25

via arXiv · 2605.25979