AgentGrounder: Zero-Shot 3D Visual Pointcloud Grounding using Multimodal Language Models

3D Visual Grounding (3DVG) is an essential capability for embodied AI, requiring agents to localize objects in 3D scenes based on natural language descriptions. Recent zero-shot methods leverage 2D vision-language models (LVLMs). However, they often rely on existing sets of multi-view images and struggle with the limited semantic and spatial details provided by standard 3D segmentation tools. We present AgentGrounder, a zero-shot 3D visual grounding framework that operates directly on

Paper · arXiv

Authors: Cuong Huynh, Maxim Popov, Denis Gridusov, Sergey Kolyubin
Published: 2026-05-25
Categories: cs.CV, cs.RO

via arXiv · 2605.25901