AI Summary • Published on Sep 26, 2025
The WikiSQL dataset, a widely used resource for training and evaluating text-to-SQL systems, suffers from significant structural and annotation issues. These include data type mismatches, inconsistencies in case sensitivity, syntax errors, and natural language questions that do not yield answers even when their corresponding SQL queries are syntactically correct. These limitations compromise the dataset's reliability in practical applications and research, leading to potentially inflated performance metrics for models. Furthermore, the original WikiSQL format, designed for pointer-network models, is not well-suited for modern Large Language Models (LLMs) that generate full SQL queries as plain text.
The authors systematically revised and transformed WikiSQL into LLMSQL to address its inherent issues. Key problems identified and rectified included approximately 140 tables with incomplete column names, various datatype conflicts (e.g., numbers stored as strings with formatting), and numerous duplicate tables and questions. A significant portion of queries (49.25%) returned empty results, with 41.22% attributed to case sensitivity mismatches, which were resolved by programmatically adjusting string literal cases to match the natural language query or table values.
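The case-sensitivity repair described above can be sketched as follows. This is a minimal illustration, not the authors' actual code: the helper name `fix_literal_case` and the strategy of looking up the stored value with a case-insensitive comparison are assumptions about how such a normalization could be done in SQLite.

```python
import sqlite3

def fix_literal_case(cursor, table, column, literal):
    """Replace a WHERE-clause string literal with the casing actually
    stored in the table, so that exact-match comparisons stop returning
    empty results. (Illustrative helper, not from the paper.)"""
    # Case-insensitive lookup of the value as stored in the table.
    cursor.execute(
        f'SELECT "{column}" FROM "{table}" '
        f'WHERE "{column}" = ? COLLATE NOCASE LIMIT 1',
        (literal,),
    )
    row = cursor.fetchone()
    # Fall back to the original literal when no casing variant exists.
    return row[0] if row else literal
```

With a table storing `Budapest`, a query literal `budapest` would be rewritten to `Budapest`, while literals absent from the table are left unchanged.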
The non-intuitive, numeric-placeholder SQL format of WikiSQL was replaced with standard, human-readable SQL queries, making the dataset compatible with any standard SQL database. For evaluating LLMs, the authors designed prompts that include few-shot examples with synthetic sample rows to guide generation. Evaluation was based on execution accuracy: a generated query is considered correct if executing it on SQLite (via Python 3.11.11) yields results identical to those of the ground-truth query. A regex-based extraction strategy was implemented to parse SQL queries from potentially verbose LLM outputs, considering up to 10 candidate queries. Additionally, a fine-tuning scenario was conducted, dividing LLMSQL into train, validation, and test splits and training models with cross-entropy loss under consistent hyperparameters.
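For context, the numeric-placeholder format being replaced looks roughly like the sketch below. The operator tables match those published with the original WikiSQL release; the function name and the quoting choices are illustrative, not the authors' actual conversion code.

```python
# Operator tables from the original WikiSQL release: the logical form
# stores indices into these lists instead of SQL text.
AGG_OPS = ["", "MAX", "MIN", "COUNT", "SUM", "AVG"]
COND_OPS = ["=", ">", "<", "OP"]

def wikisql_to_sql(query, columns, table="t"):
    """Render a WikiSQL logical form, e.g.
    {"sel": 1, "agg": 3, "conds": [[0, 0, "Alice"]]},
    as a plain, human-readable SQL string."""
    sel_col = f'"{columns[query["sel"]]}"'
    agg = AGG_OPS[query["agg"]]
    select = f"{agg}({sel_col})" if agg else sel_col
    # Each condition is a (column index, operator index, value) triple.
    where = " AND ".join(
        f'"{columns[c]}" {COND_OPS[o]} {value!r}'
        for c, o, value in query["conds"]
    )
    sql = f'SELECT {select} FROM "{table}"'
    return f"{sql} WHERE {where}" if where else sql
```

Under this sketch, `{"sel": 1, "agg": 3, "conds": [[0, 0, "Alice"]]}` over columns `["name", "age"]` becomes `SELECT COUNT("age") FROM "t" WHERE "name" = 'Alice'`, which any standard SQL engine can execute directly.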
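The extraction-plus-execution loop can be sketched as below. The regex pattern, the candidate limit, and the direct row-list comparison are assumptions for illustration; the paper's exact pattern and comparison details are not given here, and an order-insensitive comparison may be preferable for queries without ORDER BY.

```python
import re
import sqlite3

# Illustrative pattern: grab SELECT statements up to a semicolon or
# end of text from a possibly verbose model response.
SQL_RE = re.compile(r"SELECT\b.*?(?:;|$)", re.IGNORECASE | re.DOTALL)

def extract_candidates(text, limit=10):
    """Pull up to `limit` candidate SELECT statements out of raw
    model output (markdown fences, explanations, etc.)."""
    return [m.group(0).rstrip(";").strip()
            for m in SQL_RE.finditer(text)][:limit]

def execution_match(conn, predicted, gold):
    """Execution accuracy: the prediction counts as correct when its
    result set equals the gold query's result set on the same DB."""
    try:
        pred_rows = conn.execute(predicted).fetchall()
    except sqlite3.Error:
        # Queries that fail to execute are simply wrong.
        return False
    return pred_rows == conn.execute(gold).fetchall()
```

A prediction is scored correct only if at least one extracted candidate both executes and reproduces the ground-truth result set.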
In zero-shot and few-shot settings, model performance generally correlated with size, though not strictly monotonically (e.g., Gemma 3 4B outperformed Mistral 7B). Accuracy consistently improved from 0-shot to 1-shot and 5-shot. DeepSeek R1 0528 achieved the highest accuracy at 88.4% in 0-shot, closely followed by OpenAI o4-mini at 86.45% in 5-shot. Larger models showed a performance plateau, suggesting they adapted quickly to the task instructions without relying heavily on few-shot examples. Many models, however, generated overly complex or unsupported SQL constructs. In the fine-tuning scenario, smaller models (e.g., Gemma3 4B IT, Llama3.2 1B Instruct, Phi3.5 Mini Instruct, Qwen2.5 1.5B Instruct) improved substantially, achieving over 90% execution accuracy and indicating an enhanced grasp of dataset-specific structures. Larger models also improved but generally remained below 90% accuracy, suggesting a need for more specialized fine-tuning strategies. The authors highlight LLMSQL's relevance given that real-world SQL workloads often involve simpler query patterns.
LLMSQL serves as a reliable, structurally sound, and semantically consistent benchmark for modern LLMs on Text-to-SQL tasks, effectively revitalizing the classic WikiSQL dataset for current research. The work demonstrates both that large, reasoning-oriented LLMs perform well in zero-shot settings and that relatively small models can surpass 90% execution accuracy when fine-tuned on the cleaned LLMSQL dataset. This resource is expected to promote more transparent and practical research in natural language interfaces to databases. Planned enhancements for LLMSQL include adding more questions per table, introducing JOIN queries and new data types such as dates and times to increase complexity, implementing multilingual support, and integrating with other benchmarks to expand its scope and diversity while maintaining its core simplicity and usability.