DRScaffold: Boosting Dense-Scene Reasoning in Lightweight Vision Language Models

Lightweight vision-language models perform competitively on standard benchmarks yet fail systematically in dense-scene reasoning, where multiple objects, attributes, and relations must be jointly grounded and resolved through multi-step inference. Such capability is critical for real-world applications where models must reliably interpret cluttered environments. Yet existing training signals provide no explicit grounding between reasoning steps and the underlying visual entities and relations, l

Paper · arXiv

Authors: Xinrui Shi, Kai Liu, Ziqing Zhang, Jianze Li, Anqi Li + 1 more
Published: 2026-05-25
Categories: cs.CV, cs.AI

via arXiv · 2605.26038