AI Summary • Published on Mar 18, 2026
The increasing capabilities of large language models (LLMs) and AI agents have automated significant parts of the data science workflow. Yet it remains unclear how AI agents compare with human experts on domain-specific data science tasks, and where human expertise remains crucial. Existing benchmarks often focus on generic code generation and lack the complexity of real-world problems that demand domain-specific knowledge and multimodal data integration. This gap limits our understanding of effective human-AI collaboration and of AI's true potential in specialized data science applications.
The authors introduce AgentDS, a comprehensive benchmark and competition designed to evaluate AI agents and human-AI collaboration in domain-specific data science. AgentDS features 17 challenges spanning six industries: commerce, food production, healthcare, insurance, manufacturing, and retail banking. The challenges are built on synthetic datasets carefully designed to require domain-specific insight, multimodal data integration (e.g., images, text, structured files), and real-world plausibility. The datasets are deliberately constructed so that generic machine learning pipelines perform poorly, making domain-informed feature engineering and data processing necessary for strong results. The evaluation framework uses a quantile-based scoring system to normalize performance across challenges. A 10-day open competition drew 29 teams and 80 participants. To establish baselines, two AI-only approaches were also evaluated: a direct prompting baseline using GPT-4o and an agentic coding baseline using Claude Code, with performance compared against the human-AI collaborative solutions from the competition.
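The summary does not spell out the exact quantile formula, so the following is only a minimal sketch of one plausible formulation: a team's score on a challenge is the fraction of competing submissions it outperforms (mapping every challenge onto a common [0, 1] scale regardless of the underlying metric), and the overall score is the mean across challenges. Function names are illustrative, not from the paper.

```python
from statistics import mean

def quantile_score(team_value: float, all_values: list[float],
                   higher_is_better: bool = True) -> float:
    """Hypothetical per-challenge score: the fraction of all submissions
    this team strictly outperforms on the challenge's raw metric.
    Normalizes heterogeneous metrics (RMSE, AUC, ...) onto [0, 1]."""
    if higher_is_better:
        beaten = sum(1 for v in all_values if team_value > v)
    else:
        beaten = sum(1 for v in all_values if team_value < v)
    return beaten / len(all_values)

def overall_score(per_challenge_scores: list[float]) -> float:
    """Aggregate per-challenge quantile scores (here: simple mean)."""
    return mean(per_challenge_scores)

# Example challenge scored by RMSE, where lower is better.
# all_rmse includes every submission, ours among them.
all_rmse = [0.90, 0.75, 0.60, 0.55, 0.80]
print(quantile_score(0.60, all_rmse, higher_is_better=False))  # → 0.6
print(overall_score([0.2, 0.6, 1.0]))                          # → 0.6
```

The appeal of a quantile scale is that a 0.458 in one challenge means the same thing as a 0.458 in another: the team beat 45.8% of the field, independent of the metric's units.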
The competition and baseline evaluations revealed three key findings. First, current autonomous AI agents struggle significantly with domain-specific reasoning, especially when integrating multimodal signals; many teams that began with fully autonomous agents shifted to human-guided workflows because of these limitations. Second, human expertise remains essential: participants diagnosed model failures, injected domain knowledge through feature design, and made strategic decisions about model selection and generalization that the agents could not. Human participants frequently created features based on clinical protocols or business rules not inferable from the data alone. Third, human-AI collaboration consistently outperformed either humans or AI working alone; the most successful approaches combined human strategic guidance with AI's speed at coding, experimentation, and iteration. Quantitatively, the GPT-4o direct prompting baseline achieved an overall quantile score of 0.143 (ranking 17th out of 29), while the Claude Code agentic baseline scored 0.458 (ranking 10th); both fell below the top human-AI collaborative teams across all domains.
The findings from AgentDS challenge the notion of fully autonomous data science through AI, suggesting that advanced AI systems are currently better used as collaborative tools than as replacements for human experts. The study underscores the enduring importance of human expertise in tasks requiring strategic problem diagnosis, domain knowledge encoding, critical evaluation of AI-suggested solutions, and judgment beyond simple validation scores. Future progress in AI for data science should focus on designing systems that effectively support and augment human reasoning, facilitate domain knowledge integration, and enable iterative problem-solving in human-AI feedback loops. AgentDS provides a valuable benchmark for continuing to study these dynamics and for developing AI systems that enhance human capabilities rather than striving for complete automation.