AI Summary • Published on Dec 2, 2025
Vision-Language Models (VLMs) have demonstrated impressive capabilities in visual question answering, but their performance heavily relies on a large number of visual tokens, which results in substantial memory and computational costs, especially for high-resolution images. Existing approaches to improve VLM efficiency primarily focus on fixed-ratio visual token compression or limited dynamic decisions, meaning they passively reduce tokens without adapting to the specific demands of each task. This lack of adaptability raises a crucial question: how can VLMs intelligently determine the precise and minimal amount of visual information needed for a given sample, mimicking human active vision processes?
The authors propose AdaptVision, a novel VLM framework that uses an adaptive, coarse-to-fine visual acquisition approach. The model initially processes a low-resolution version of an image, using a compressed set of visual tokens. If this initial information is insufficient to answer a question, AdaptVision can then selectively acquire additional, fine-grained visual information by invoking a bounding box tool to crop specific key regions from the original high-resolution image.

This entire process is learned through a reinforcement learning framework designed to balance accuracy and efficiency. A key innovation is Decoupled Turn Policy Optimization (DTPO), which addresses the ambiguous credit assignment and imbalanced optimization inherent in standard RL algorithms like GRPO. DTPO decouples the learning objective into two parts: tool learning (optimizing correct tool usage) and accuracy improvement (refining responses). It likewise decouples advantage estimation, computing separate advantages for the tokens associated with each objective, which leads to more stable and effective training. The reward function combines an Outcome Reward (for answer correctness, format, and tool-call balance) with a Tool Reward (for incentivizing effective, minimal cropping).
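The summary describes DTPO only at a high level, so the following is a minimal, illustrative sketch of the decoupled advantage estimation it mentions, assuming GRPO-style group normalization over rollouts for the same prompt. The function names, the rollout fields (`outcome_reward`, `tool_reward`, `token_roles`), and the binary tool/answer token split are hypothetical conveniences for illustration, not the paper's actual implementation.

```python
import numpy as np

def group_normalize(rewards, eps=1e-6):
    """GRPO-style advantage: normalize rewards across a group of rollouts."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def dtpo_advantages(rollouts):
    """Compute per-token advantages for one prompt's group of rollouts.

    Each rollout is a dict with hypothetical fields:
      outcome_reward : float      # correctness + format + tool-call balance
      tool_reward    : float      # effective, minimal cropping
      token_roles    : list[str]  # "tool" for tool-call tokens, "answer" otherwise
    """
    # Decoupled advantage estimation: one group-normalized advantage per
    # objective (tool learning vs. accuracy improvement).
    adv_outcome = group_normalize([r["outcome_reward"] for r in rollouts])
    adv_tool = group_normalize([r["tool_reward"] for r in rollouts])

    # Credit tool-call tokens with the tool advantage and answer tokens with
    # the outcome advantage, so neither reward signal drowns out the other.
    return [
        [a_tool if role == "tool" else a_out for role in r["token_roles"]]
        for a_out, a_tool, r in zip(adv_outcome, adv_tool, rollouts)
    ]
```

Under this reading, the decoupling simply routes each reward signal to the tokens responsible for it, which is the mechanism the summary credits with avoiding the ambiguous credit assignment and imbalanced optimization of vanilla GRPO.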
AdaptVision demonstrates superior performance across multiple VQA benchmarks, including ChartQA, OCRBench, and MathVerse, while consuming significantly fewer visual tokens than state-of-the-art efficient VLM methods such as FastV, SparseVLM, VisionZip, and VisionThink. Specifically, AdaptVision improves accuracy by 5.8% over a down-sampled baseline with only a 7% increase in visual tokens, and it achieves an overall 1.67x inference speedup compared to vanilla high-resolution models and other dynamic methods. Ablation studies confirm the critical role of the balance reward in preventing excessive tool use and the necessity of the tool reward for effective exploration. DTPO proved crucial for stable and efficient training, avoiding the instability and over-reliance on tool calls observed with vanilla GRPO. The model adaptively invokes tools based on task complexity, using them more for complex visual tasks that require fine details (e.g., MathVerse, ChartQA) and less for general understanding tasks (e.g., POPE).
AdaptVision represents a significant step towards more computationally efficient and biologically inspired Vision-Language Models. By enabling VLMs to autonomously and adaptively acquire visual information, the framework addresses a key limitation of existing methods and reduces computational overhead while maintaining high accuracy. This research advances the development of practical and efficient VLMs. Future work could explore expanding the visual toolset beyond just bounding boxes, allowing for dynamic resolution selection, and enabling more complex multi-turn reasoning processes to tackle even more challenging visual tasks.