All Tags
Browse through all available tags to find articles on topics that interest you.
Showing 2 results for this tag.
From Accuracy to Readiness: Metrics and Benchmarks for Human-AI Decision-Making
This paper introduces a measurement framework for evaluating human-AI decision-making, shifting the focus from model accuracy alone to the readiness of human-AI teams for safe and effective collaboration. It proposes a taxonomy of metrics and connects them to the Understand–Control–Improve lifecycle, assessing calibration, error recovery, and governance in real-world deployments.
Your Reasoning Benchmark May Not Test Reasoning: Revealing Perception Bottleneck in Abstract Reasoning Benchmarks
This paper challenges the common interpretation of AI models' performance on abstract reasoning benchmarks such as ARC, hypothesizing that visual perception limitations, rather than reasoning deficiencies, are the primary bottleneck. It introduces a two-stage pipeline that separates perception from reasoning, showing that most model failures stem from perception errors and that addressing them yields significant performance improvements.