AI Summary • Published on Jan 12, 2026
Text-to-SQL techniques are increasingly critical for data analytics, and public benchmarks like BIRD and Spider 2.0 are essential for comparing and selecting these methods. However, the reliability of these benchmarks hinges on the accuracy of human annotations. Previous work has indicated the presence of annotation errors in benchmarks such as BIRD, yet a comprehensive investigation into the extent of these errors and their impact on text-to-SQL agent performance and leaderboard rankings has been lacking. Such inaccuracies can significantly misguide researchers in advancing text-to-SQL technologies and practitioners in choosing optimal agents for real-world applications.
The authors conducted an empirical study built around a rigorous human-in-the-loop, three-stage audit to measure annotation error rates in BIRD Mini-Dev and Spider 2.0-Snow. To support this audit, they developed SAR-Agent (SQL Annotation Reviewer agent), an AI tool that assists SQL experts in detecting annotation errors by incrementally verifying annotations through multi-turn interactions with a database. Errors were categorized into four patterns: E1 (mismatches between SQL query semantics and natural language logic), E2 (mismatches between SQL query semantics and database schema/data), E3 (mismatches with external domain knowledge), and E4 (ambiguity in the natural language input). They also introduced SAPAR (SQL Annotation Pipeline with an AI Agent Reviewer), which integrates SAR-Agent into the annotation workflow, and used it to manually correct a sampled subset of 100 examples from the BIRD Development set. This correction process involved revising natural language questions, external knowledge, and ground-truth SQL queries, and in some cases modifying database data to ensure errors were detectable. Finally, they re-evaluated 16 open-source text-to-SQL agents from the BIRD leaderboard on both the original and the corrected subsets to quantify the impact of annotation errors on performance and rankings.
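To make the review loop concrete, the following is a minimal, hypothetical sketch of how an annotation could be incrementally verified against a database in the spirit of SAR-Agent's multi-turn interaction. The `ErrorPattern` enum mirrors the E1-E4 taxonomy above; everything else (SQLite as the backing store, the `review_annotation` and `run_probe` helpers, the heuristic probe checks) is an illustrative assumption, not the paper's actual implementation.

```python
"""Minimal sketch of an annotation-review loop in the spirit of SAR-Agent.

Assumptions (not taken from the paper's code): the benchmark database is a
local SQLite file, and each "turn" is a simple probe query rather than an
LLM-driven reasoning step.
"""
import sqlite3
from dataclasses import dataclass
from enum import Enum


class ErrorPattern(Enum):
    """The four error patterns reported in the study."""
    E1 = "SQL semantics mismatch the natural-language logic"
    E2 = "SQL semantics mismatch the database schema/data"
    E3 = "SQL semantics mismatch external domain knowledge"
    E4 = "The natural-language input is ambiguous"


@dataclass
class Annotation:
    question: str   # natural-language question
    evidence: str   # external knowledge supplied with the example
    gold_sql: str   # ground-truth SQL to be audited


def run_probe(conn: sqlite3.Connection, sql: str):
    """Execute one verification query; return (rows, error_message)."""
    try:
        return conn.execute(sql).fetchall(), None
    except sqlite3.Error as exc:
        return None, str(exc)


def review_annotation(db_path: str, ann: Annotation, probes: list[str]):
    """Incrementally verify an annotation with a sequence of probe queries.

    Each probe plays the role of one turn of the agent's interaction with the
    database; a real reviewer would let an LLM choose the next probe based on
    the previous result and finally assign one of the E1-E4 patterns.
    """
    conn = sqlite3.connect(db_path)
    findings = []
    # Turn 0: the gold SQL must at least execute against the schema.
    rows, err = run_probe(conn, ann.gold_sql)
    if err is not None:
        findings.append((ErrorPattern.E2, f"gold SQL fails to execute: {err}"))
    # Later turns: targeted probes (e.g. value existence, join sanity).
    for sql in probes:
        rows, err = run_probe(conn, sql)
        if err is None and rows == []:
            findings.append((ErrorPattern.E2, f"probe returned no rows: {sql}"))
    conn.close()
    return findings
```

In the actual pipeline the flagged examples go to human SQL experts; the sketch only illustrates the incremental, database-grounded checking that makes such a review scalable.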
The study uncovered alarmingly high annotation error rates: 52.8% in BIRD Mini-Dev and 62.8% in Spider 2.0-Snow, with errors related to understanding database schema or data (E2) being the most prevalent. These errors were found to significantly distort text-to-SQL agent performance and leaderboard rankings. Re-evaluation of the 16 agents on the corrected BIRD Dev subset showed relative performance changes ranging from a 7% decrease to a 31% increase, with ranking shifts of up to 9 positions. For instance, the CHESS agent improved its execution accuracy from 62% to 81%, climbing from 7th to 1st place. A weak correlation (Spearman's r_s = 0.32, p = 0.23) between rankings on the original and corrected Dev subsets underscored the unreliability of existing leaderboards. Furthermore, SAR-Agent proved effective in error detection, achieving 83% precision on BIRD Mini-Dev and 89% on Spider 2.0-Snow, and identifying 41.6% more errors in BIRD Mini-Dev than previous human expert audits.
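For readers unfamiliar with the statistic, the rank correlation reported above can be computed as in the short sketch below. The two ranking lists are placeholders for the 16 agents' positions on the original and corrected subsets, not the paper's actual data; only the interpretation of the output mirrors the reported result.

```python
from scipy.stats import spearmanr

# Placeholder rankings for 16 agents on the original vs. corrected BIRD Dev
# subset (1 = best). Illustrative values only, NOT the paper's data.
rank_original  = [7, 1, 3, 5, 2, 10, 4, 12, 6, 9, 8, 15, 11, 16, 13, 14]
rank_corrected = [1, 4, 2, 9, 6, 5, 11, 8, 3, 14, 7, 10, 16, 12, 15, 13]

r_s, p_value = spearmanr(rank_original, rank_corrected)
print(f"Spearman r_s = {r_s:.2f}, p = {p_value:.2f}")

# A low r_s with a non-significant p-value (the paper reports r_s = 0.32,
# p = 0.23) means the original leaderboard order is a poor predictor of the
# order obtained after annotation errors are corrected.
```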
The pervasive annotation errors discovered in prominent text-to-SQL benchmarks severely compromise their integrity, potentially misleading researchers in their scientific pursuits and practitioners in their selection of text-to-SQL agents for deployment. The toolkit developed in this work, comprising SAR-Agent and SAPAR, provides a robust and efficient solution for detecting and correcting these errors, thereby paving the way for the creation of higher-quality and more reliable benchmarks. This research strongly advocates for the integration of such advanced error detection and correction pipelines into future benchmark development processes, ensuring more accurate evaluations and fostering meaningful advancements in both text-to-SQL research and its practical applications.