AI Summary • Published on Dec 3, 2025
Current Multimodal Large Language Models (MLLMs) struggle with transparent reasoning. While they often produce accurate final answers for visual tasks like Visual Question Answering and Visual Grounding, their internal decision-making processes, especially the step-by-step visual evidence that leads to a conclusion, remain largely opaque. This contrasts with human intelligence, which naturally follows a traceable chain of visual reasoning. Existing evaluation methods primarily focus on final prediction accuracy, overlooking the crucial intermediate grounding steps, thus limiting the assessment of true model understanding and interpretability.
To address the transparency gap in MLLMs, the authors introduce the Visual Reasoning Tracer (VRT) task, which requires models to explicitly localize and predict the intermediate objects that form a reasoning path, in addition to the final target. The task is defined by a multimodal reasoning trace: a sequence of textual reasoning steps paired with spatially grounded segmentation masks. To facilitate research, they contribute VRT-Bench, a human-annotated benchmark for evaluating reasoning paths; VRT-80k, a large-scale training dataset; and two new metrics, Logical Quality (LQ) and Visual Quality (VQ), which assess the fidelity of the reasoning steps and of their visual grounding, respectively. VRT-80k is generated with a two-stage pipeline: first, dense object-caption pairs are produced using models such as SAM, RAM++, APE, and DAM; second, a powerful MLLM is prompted to generate complex reasoning question-answer pairs grounded in these objects. Models are trained with supervised fine-tuning (SFT) on VRT-80k, followed by reinforcement learning (RL) to strengthen logical reasoning, with a reward that combines a thinking-format term, a segmentation-format term, and a matching-based IoU term.
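A rough sketch of the two-stage data generation pipeline is shown below, under stated assumptions. The callables `segmenter`, `tagger`, `captioner`, and `reasoner_mllm` merely stand in for the roles the summary assigns to SAM/APE, RAM++, DAM, and the prompting MLLM; their signatures and the record format are illustrative, not the authors' implementation.

```python
# Hypothetical outline of VRT-80k generation; interfaces are assumptions.
from typing import Callable, Dict, List


def stage1_object_captions(image, segmenter: Callable, tagger: Callable,
                           captioner: Callable) -> List[Dict]:
    """Stage 1: build dense object-caption pairs for a single image."""
    records = []
    for mask in segmenter(image):           # class-agnostic masks (SAM / APE role)
        tags = tagger(image, mask)           # open-vocabulary tags (RAM++ role)
        caption = captioner(image, mask)     # region-level caption (DAM role)
        records.append({"mask": mask, "tags": tags, "caption": caption})
    return records


def stage2_reasoning_samples(image, records: List[Dict],
                             reasoner_mllm: Callable) -> Dict:
    """Stage 2: prompt a strong MLLM for a grounded reasoning QA pair."""
    object_list = "\n".join(f"- {r['caption']}" for r in records)
    prompt = (
        "Using the objects below, write a question whose answer requires "
        "reasoning over several of them, plus the step-by-step trace that "
        "references each object it uses:\n" + object_list
    )
    # Expected to return a question, ordered reasoning steps tied to the
    # objects they ground to, and the final answer.
    return reasoner_mllm(image, prompt)
```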
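The composite RL reward could look roughly like the following minimal sketch. The tag conventions (`<think>` tags, a `[SEG]` token), the weights, and the Hungarian-matching normalization are assumptions for illustration, not the paper's exact formulation.

```python
# Hypothetical composite reward: thinking format + segmentation format +
# matching-based IoU between predicted and reference masks.
import re
import numpy as np
from scipy.optimize import linear_sum_assignment


def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean segmentation masks of the same shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 0.0


def matched_iou(pred_masks, gt_masks) -> float:
    """Hungarian-match predicted to ground-truth masks, then average IoU."""
    if not pred_masks or not gt_masks:
        return 0.0
    scores = np.zeros((len(pred_masks), len(gt_masks)))
    for i, p in enumerate(pred_masks):
        for j, g in enumerate(gt_masks):
            scores[i, j] = mask_iou(p, g)
    rows, cols = linear_sum_assignment(scores, maximize=True)
    # Normalize by the number of ground-truth objects so missing masks are penalized.
    return float(scores[rows, cols].sum()) / len(gt_masks)


def vrt_reward(response: str, pred_masks, gt_masks,
               w_think=0.25, w_seg=0.25, w_iou=0.5) -> float:
    """Combine the three reward terms named in the summary (weights assumed)."""
    # 1) Thinking-format reward: reasoning is wrapped in <think>...</think>.
    think_ok = 1.0 if re.search(r"<think>.*?</think>", response, re.S) else 0.0
    # 2) Segmentation-format reward: the response emits segmentation tokens.
    seg_ok = 1.0 if re.search(r"\[SEG\]", response) else 0.0
    # 3) Matching-based IoU reward over intermediate and final masks.
    return w_think * think_ok + w_seg * seg_ok + w_iou * matched_iou(pred_masks, gt_masks)
```

On this reading, the matching-based IoU term is what pushes the intermediate masks, not just the final target, to align with the annotated reasoning path.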
Experiments on VRT-Bench show that while state-of-the-art MLLMs (such as Gemini-2.5 Pro and Qwen3-VL) can achieve general accuracy, they largely fail to generate valid intermediate reasoning outputs, scoring zero on R-LQ and R-VQ. In contrast, the R-Sa2VA model, fine-tuned specifically on VRT-80k, performs robustly, integrating visual perception with deep reasoning across categories such as Visual Details, Location, Function, and Comparison. Further analysis shows that incorporating a stronger language thinking model (R-Sa2VA-Qwen3VL-4B-Thinking) improves functional reasoning (R-LQ up by 4.0 points). Reinforcement learning primarily enhances logical reasoning (R-LQ from 66.3 to 67.0) and final-answer performance (from 59.5 to 62.1) but does not significantly improve intermediate visual quality (R-VQ). Joint training with standard referring segmentation datasets improves results on those benchmarks but slightly degrades the VRT metrics, a gap attributed to domain differences. Despite some noise in VRT-80k, training on it substantially boosts performance on the visual reasoning tracer benchmarks.
This work provides a foundational framework for developing and evaluating MLLMs that are more interpretable and reliable by explicitly integrating visual evidence into language reasoning. The Visual Reasoning Tracer (VRT) task, along with VRT-Bench and VRT-80k, moves beyond merely producing correct answers toward verifiably grounded decision-making. Future directions include expanding the scale and diversity of VRT-80k, exploring advanced training strategies (such as unified segmentation-and-reasoning models or segmentation-aware rewards) to improve visual quality during reasoning, and building more comprehensive benchmarks with finer-grained reasoning types, multi-image settings, and cross-dataset evaluations. Addressing failure cases, particularly in dense or text-rich scenes where subtle details are missed, is also crucial for achieving more comprehensive scene coverage.