AI Summary • Published on Dec 23, 2025
Abstract reasoning benchmarks such as ARC and ARC-AGI are widely used to evaluate AI models' "fluid intelligence," or core reasoning ability. Although these tasks are simple for humans, state-of-the-art vision-language models (VLMs) often struggle with them, which has led to the assumption that machines lack strong reasoning capabilities. This paper challenges that view, proposing that the performance gap may stem primarily from limitations in visual perception rather than from a deficiency in inductive reasoning. Verifying this hypothesis is difficult because perception and reasoning are intertwined in these tasks.
To isolate perception from reasoning, the authors developed a two-stage experimental pipeline (sketched below). In the first "perception stage," each raw image from an ARC-style task is independently converted into a natural language description, ensuring that no inductive signal leaks across images; the descriptions draw on generic human perceptual priors such as object identification and recognition of colors and shapes. In the second "reasoning stage," an AI model uses only these natural language descriptions to induce a rule and apply it to solve the task. Experiments were conducted on three ARC-style datasets: Mini-ARC, ACRE, and Bongard-LOGO. The pipeline was evaluated in two settings: using the same VLM for both stages, and using a stronger VLM for perception paired with a weaker VLM for reasoning; both settings were compared against standard end-to-end (one-stage) VLM performance. Error attribution was also performed by manually inspecting model outputs and categorizing failures.
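A minimal sketch of this two-stage setup, assuming a generic `call_model(model, prompt, image)` placeholder for the VLM/LLM API and illustrative prompts and data structures (the paper's actual prompts, models, and task format are not reproduced here):

```python
from dataclasses import dataclass

@dataclass
class ArcTask:
    train_pairs: list   # [(input_image, output_image), ...] as raw images
    test_input: object  # raw image whose output must be predicted


def call_model(model: str, prompt: str, image=None) -> str:
    """Placeholder for a VLM/LLM API call; returns the model's text response."""
    raise NotImplementedError


def describe(perception_model: str, image) -> str:
    # Perception stage: each image is described independently, so no inductive
    # signal can leak across images. The prompt encodes only generic perceptual
    # priors (objects, colors, shapes, positions).
    prompt = ("Describe this image in natural language: list the objects, "
              "their colors, shapes, and positions. Do not infer any rule.")
    return call_model(perception_model, prompt, image=image)


def solve_two_stage(task: ArcTask, perception_model: str, reasoning_model: str) -> str:
    # Stage 1: perception. Every raw image becomes a text description.
    pairs = [(describe(perception_model, x), describe(perception_model, y))
             for x, y in task.train_pairs]
    test_desc = describe(perception_model, task.test_input)

    # Stage 2: reasoning. A (possibly different) model sees only the text,
    # induces the rule, and applies it to the described test input.
    examples = "\n\n".join(f"Example {i + 1}:\nInput: {x}\nOutput: {y}"
                           for i, (x, y) in enumerate(pairs))
    prompt = (f"{examples}\n\nInfer the transformation rule from the examples "
              f"above, then apply it to this input and give the output:\n{test_desc}")
    return call_model(reasoning_model, prompt)

# Settings compared in the paper:
#   same VLM for both stages:          solve_two_stage(task, "vlm", "vlm")
#   strong perception, weak reasoning: solve_two_stage(task, "strong_vlm", "weak_vlm")
#   one-stage baseline: a single VLM receives the raw images directly in one call.
```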
The two-stage pipeline significantly outperformed the standard one-stage evaluation across all three datasets, improving success rates by 11-13 percentage points; on Mini-ARC, for example, performance rose from 8.05% to 20.13%. When a stronger VLM handled the perception stage and a weaker VLM handled reasoning, the hybrid two-stage pipeline performed comparably to the stronger VLM run end-to-end, indicating that perceptual capability is a dominant factor. Detailed error attribution revealed that approximately 80% of model failures in the one-stage settings stemmed from perception errors (e.g., failing to identify visual objects), and the majority of the two-stage pipeline's gains were attributable to a reduction in these perception errors rather than to improvements in reasoning itself.
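As a rough illustration of how such an attribution is tallied (the category names below are assumptions, and the example counts are hypothetical values chosen only to echo the reported magnitudes, not the paper's data):

```python
from collections import Counter

def attribute_errors(labels: list[str]) -> dict[str, float]:
    """Summarize manually assigned outcome labels for one evaluation setting.

    `labels` holds one entry per task, e.g. "correct", "perception_error",
    or "reasoning_error" (illustrative names; in the paper these categories
    are assigned by manual inspection of model outputs).
    """
    counts = Counter(labels)
    failures = counts["perception_error"] + counts["reasoning_error"]
    return {
        "success_rate": counts["correct"] / len(labels),
        # Share of failures caused by perception, i.e. the kind of figure
        # behind the ~80% reported for the one-stage setting.
        "perception_share_of_failures": (
            counts["perception_error"] / failures if failures else 0.0
        ),
    }

# Hypothetical tally: 8 correct, 74 perception errors, 18 reasoning errors.
print(attribute_errors(["correct"] * 8
                       + ["perception_error"] * 74
                       + ["reasoning_error"] * 18))
# -> {'success_rate': 0.08, 'perception_share_of_failures': 0.804...}
```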
The findings suggest that ARC-style benchmarks conflate visual perception and inductive reasoning challenges, potentially overstating the deficiencies in AI models' reasoning abilities. The study highlights a significant perception bottleneck in these influential benchmarks and underscores the critical need for evaluation protocols that explicitly disentangle perception from reasoning when assessing progress toward general artificial intelligence. This calls for caution in interpreting current benchmark scores as direct measures of pure reasoning capability and emphasizes the importance of developing future benchmarks that more cleanly isolate specific cognitive abilities.