MAGIC: Multimodal Alignment & Grounding-aware Instruction Coreset for Vision-Language Models

Instruction tuning of large vision-language models (LVLMs) increasingly depends on massive multimodal corpora, yet these datasets contain samples with substantial redundancy, low visual dependency, and highly imbalanced coverage of multimodal reasoning behaviors. As a result, uniform subsampling or naive score-based selection often yields suboptimal training subsets. We introduce MAGIC, a training-free, forward-only coreset selection method designed to construct compact yet behaviorally faithful

Paper · arXiv

cs.CV

Authors: Shristi Das Biswas, Kaushik Roy
Published: 2026-05-25
Categories: cs.CV, cs.CL

via arXiv · 2605.26004
