AI Summary • Published on Dec 2, 2025
Vision-Language Models (VLMs) excel at general visual comprehension but often struggle with the precise, metrically accurate spatial reasoning essential for real-world embodied applications such as robotics. Current approaches to strengthening these capabilities typically fine-tune on extensive datasets or rely on predefined, often handcrafted, tool-use pipelines. While reinforcement learning (RL) shows promise for teaching VLMs tool use, scaling it to coordinate multiple diverse tools simultaneously produces an unmanageably large search space, restricting its effectiveness to single-tool scenarios. This gap prevents VLMs from autonomously discovering optimal multi-tool coordination strategies for complex spatial tasks.
The authors introduce Double Interactive Reinforcement Learning (DIRL), a two-phase training framework that enables VLMs to coordinate multiple tools effectively. The first, "teaching," phase performs Supervised Fine-Tuning (SFT) on a dataset combining demonstrations from an Interactive-RL-trained single-tool specialist with traces from a powerful frontier model using the full toolset; this establishes foundational tool-use skills. The second, "exploration," phase refines multi-tool coordination through continued Interactive Reinforcement Learning (IRL) with access to all available tools; the strong initialization from the teaching phase prevents exploration collapse in the larger action space. To support this interactive training, the authors built "Toolshed," a scalable, distributed platform that hosts diverse, computationally intensive computer vision and robotics tools as on-demand services, enabling efficient, asynchronous tool interaction during training. The RL algorithm is Group Relative Policy Optimization (GRPO; details appear in the paper's appendix), with task-specific rewards designed for spatial reasoning problems including multiple-choice questions, bounding-box localization, pointing, pose estimation, and grasp estimation.
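At a high level, the two phases compose sequentially. The sketch below is illustrative only: the helper names (`sft_train`, `grpo_step`), signatures, and dataset mixing are assumptions standing in for the paper's actual pipeline, which the summary does not detail.

```python
# Hypothetical skeleton of DIRL's two training phases.
# All names and signatures are illustrative, not the authors' code.

def teaching_phase(policy, specialist_demos, frontier_traces, sft_train):
    """Phase 1 (teaching): SFT on single-tool specialist demonstrations
    mixed with full-toolset traces from a frontier model."""
    dataset = specialist_demos + frontier_traces
    return sft_train(policy, dataset)

def exploration_phase(policy, full_tool_env, grpo_step, num_steps):
    """Phase 2 (exploration): continued interactive RL with every tool
    available; the phase-1 initialization keeps exploration from
    collapsing in the enlarged action space."""
    for _ in range(num_steps):
        policy = grpo_step(policy, full_tool_env)
    return policy

def dirl(policy, specialist_demos, frontier_traces, full_tool_env,
         sft_train, grpo_step, num_steps=1000):
    policy = teaching_phase(policy, specialist_demos,
                            frontier_traces, sft_train)
    return exploration_phase(policy, full_tool_env, grpo_step, num_steps)
```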
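GRPO itself is not paper-specific: its defining step replaces a learned critic with group-relative reward normalization over a batch of rollouts sampled for the same prompt. A minimal, standard version of that step:

```python
import numpy as np

def grpo_advantages(group_rewards: np.ndarray) -> np.ndarray:
    """GRPO advantage estimate: normalize each rollout's reward by the
    mean and std of its own sample group, avoiding a value network."""
    mean = group_rewards.mean()
    std = group_rewards.std()
    return (group_rewards - mean) / (std + 1e-8)

# Example: 8 rollouts sampled for one prompt, scored 0/1 for success.
rewards = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
print(grpo_advantages(rewards))  # successes receive positive advantage
```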
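The summary names five reward families but not their formulas. The shapes below are plausible reconstructions under common conventions (exact match for multiple choice, IoU for boxes, region membership for pointing), not the published definitions.

```python
from typing import Sequence

Box = Sequence[float]  # (x1, y1, x2, y2) in pixels

def area(b: Box) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def mcq_reward(pred: str, gold: str) -> float:
    """Exact-match reward for multiple-choice answers."""
    return float(pred.strip().lower() == gold.strip().lower())

def bbox_reward(pred: Box, gold: Box) -> float:
    """IoU between the predicted and ground-truth boxes."""
    ix1, iy1 = max(pred[0], gold[0]), max(pred[1], gold[1])
    ix2, iy2 = min(pred[2], gold[2]), min(pred[3], gold[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = area(pred) + area(gold) - inter
    return inter / union if union > 0 else 0.0

def pointing_reward(point: Sequence[float], gold: Box) -> float:
    """1 if the predicted (x, y) point lands inside the target region."""
    x, y = point
    return float(gold[0] <= x <= gold[2] and gold[1] <= y <= gold[3])

print(bbox_reward((0, 0, 10, 10), (5, 5, 15, 15)))  # 25/175 ≈ 0.143
```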
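On the infrastructure side, the property the summary attributes to Toolshed is that heavy tools run as on-demand services queried asynchronously, so a slow model does not stall the rest of a rollout batch. A toy client-side pattern showing that shape, with invented tool names and a sleep standing in for remote inference:

```python
import asyncio

async def call_tool(name: str, payload: dict) -> dict:
    """Stand-in for an RPC/HTTP request to a remote tool worker."""
    await asyncio.sleep(0.1)  # simulated network + model latency
    return {"tool": name, "result": f"output for {payload['query']}"}

async def rollout(i: int) -> dict:
    """One trajectory's tool calls; other rollouts proceed concurrently."""
    det = await call_tool("object_detector", {"query": f"image_{i}"})
    depth = await call_tool("depth_estimator", {"query": f"image_{i}"})
    return {"rollout": i, "trace": [det, depth]}

async def main():
    # Tool-call latency across rollouts overlaps instead of serializing.
    results = await asyncio.gather(*(rollout(i) for i in range(8)))
    print(len(results), "rollouts completed")

asyncio.run(main())
```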
The proposed model, SpaceTools, achieved state-of-the-art performance across multiple spatial reasoning benchmarks, including RoboSpatial-Home, BLINK, and BOP-ASK. SpaceTools surpassed proprietary models, improving over Gemini-ER 1.5 by +7.5% on RoboSpatial, over Claude Sonnet 4.5 by +24.4% on pose estimation, and over GPT-5 by +8.3% on grasp prediction. Crucially, tool-augmented training with DIRL yielded substantial gains on RoboSpatial over tool-free SFT (+12%) and tool-free RL (+16%) baselines built on the same base model. In real-world robotic manipulation, SpaceTools achieved an 86% success rate on pick-and-place tasks with a 7-DOF robot, effectively orchestrating perception and action tools. Ablation studies confirmed that both the teaching and exploration phases of DIRL are critical, and that interactive RL is necessary for learning consistent reasoning over complex tool sequences. The framework also showed improved out-of-domain generalization.
This work demonstrates that VLMs can acquire sophisticated spatial reasoning and robotic control capabilities through learned tool coordination, offering a scalable alternative to extensive architectural modifications or large-scale data-driven fine-tuning. The DIRL framework and Toolshed infrastructure provide a robust foundation for developing agentic VLMs that leverage external tools for complex tasks. Future work could extend the approach to longer-horizon, multi-stage tasks; integrate richer environments for more diverse experience; and strengthen methodological aspects such as reasoning over visual tool outputs, tool-error recovery, and alternative RL formulations. System-level improvements to Toolshed, including advanced scheduling and caching, are also promising directions for boosting efficiency and supporting larger-scale training and real-robot execution.