AI Summary • Published on Jan 31, 2025
Large Language Models (LLMs) hold immense potential for revolutionizing data analytics by simplifying tasks like data discovery and SQL query synthesis through natural language. However, current LLMs, trained primarily on general web data, often struggle with the "messy" realities of real-world data lakes. They lack a deep understanding of data management concepts, complex database schemas, and the nuances of integrating diverse data sources, which leads to poor performance, and often outright erroneous answers, on crucial analytics tasks. Existing approaches, such as prompt engineering or fine-tuning for narrow table-understanding tasks, fall short of the comprehensive understanding needed to reason over complex multi-table relationships and map business concepts to data.
The authors developed a novel data recipe for post-training LLMs, leading to CoddLLM, a 12-billion-parameter foundation model built on Mistral-NeMo-12B. The training corpus is organized into three chapters:
1) Chapter 1 (Domain Knowledge): A scalable synthetic data generation approach that uses an "extraction-and-synthesis" strategy over web corpora to create more than 8.8 million instruction-response pairs (0.9 billion tokens). Grounding responses in reference documents improves diversity and reduces hallucination, with a focus on fundamental data management and analysis knowledge.
2) Chapter 2 (Table-Text Alignment): Two new tasks, Text-to-Schema (generating database schemas from textual descriptions) and Row-to-Text (generating natural language descriptions of table rows), that bridge the gap between natural language and tabular data understanding (see the sketch below).
3) Chapter 3 (Downstream Analytics Tasks): Training examples for real-world analytics tasks, specifically Table Selection (identifying the tables relevant to a natural language question) and Text-to-SQL conversion.
To evaluate the model, the authors introduce two new benchmarks: AnalyticsMMLU (multiple-choice questions on database management, data analysis, and machine learning) and WikiPage-TS (a human-annotated multi-table selection benchmark with complex multi-hop reasoning questions), alongside existing datasets such as BIRD-TS and Open-WikiTable-TS.
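To make the two table-text alignment tasks concrete, here is a minimal sketch of what such instruction-response training pairs might look like. The prompt wording, field names, and example content are illustrative assumptions, not the paper's actual templates.

```python
# Illustrative (assumed) formats for Chapter 2 "table-text alignment" training pairs.
# The exact prompt templates used for CoddLLM are not reproduced here.

text_to_schema_example = {
    "instruction": (
        "Generate a database schema for the following description: "
        "An online store tracks customers and the orders they place, "
        "including the order date and total amount."
    ),
    "response": (
        "CREATE TABLE customers (\n"
        "    customer_id INTEGER PRIMARY KEY,\n"
        "    name TEXT,\n"
        "    email TEXT\n"
        ");\n"
        "CREATE TABLE orders (\n"
        "    order_id INTEGER PRIMARY KEY,\n"
        "    customer_id INTEGER REFERENCES customers(customer_id),\n"
        "    order_date DATE,\n"
        "    total_amount REAL\n"
        ");"
    ),
}

row_to_text_example = {
    "instruction": (
        "Describe this row from the orders table in natural language: "
        "{'order_id': 1042, 'customer_id': 7, "
        "'order_date': '2024-05-03', 'total_amount': 89.5}"
    ),
    "response": (
        "Order 1042 was placed by customer 7 on May 3, 2024, "
        "for a total of $89.50."
    ),
}
```

Pairing schemas and rows with natural language in both directions is what lets the model connect business phrasing in a question to the concrete columns and values it must query.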
CoddLLM demonstrated superior performance across various data analytics benchmarks, achieving the highest overall average accuracy of 0.697. It consistently outperformed other open-source LLMs and several closed-source models. Specifically, CoddLLM surpassed GPT-3.5-Turbo on AnalyticsMMLU. For the Table Selection task, it outperformed GPT-4o by 12.1%. In Text-to-SQL evaluation, CoddLLM achieved an average execution accuracy of 0.576, showing a 24.9% improvement over the base model. Notably, on the unseen WikiPage-TS dataset, CoddLLM achieved a 93.7% relative improvement compared to its base model, demonstrating strong generalization capabilities to new, complex multi-table reasoning scenarios. Ablation studies further confirmed that instruction-aware data is crucial and that incorporating all three chapters of data significantly boosts performance.
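The execution accuracy reported above is typically computed by running the predicted and gold SQL against the same database and comparing their result sets. Below is a minimal sketch of that check, assuming SQLite databases and an unordered comparison of rows; the benchmarks' official scorers may apply additional normalization.

```python
import sqlite3


def execution_match(db_path: str, pred_sql: str, gold_sql: str) -> bool:
    """Return True if the predicted and gold queries yield the same result set.

    Simplified version of execution accuracy: rows are compared as an
    unordered collection, and a query that fails to run counts as incorrect.
    """
    conn = sqlite3.connect(db_path)
    try:
        pred_rows = conn.execute(pred_sql).fetchall()
        gold_rows = conn.execute(gold_sql).fetchall()
    except sqlite3.Error:
        return False  # invalid or non-executable SQL is scored as a miss
    finally:
        conn.close()
    # Compare row multisets; repr() sidesteps ordering issues with mixed types.
    return sorted(map(repr, pred_rows)) == sorted(map(repr, gold_rows))


# Execution accuracy over a benchmark is then the fraction of questions
# whose predicted SQL passes this check.
```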
This research marks a significant step towards developing expert foundation models for data analytics. By providing a meticulously curated, instruction-aware training corpus and introducing new challenging benchmarks, the authors demonstrate that LLMs can be effectively specialized to handle the complexities of real-world data management and analytical tasks. CoddLLM's strong performance across diverse tasks, particularly in data discovery and Text-to-SQL, suggests a promising future for natural language interfaces in data analytics, potentially enabling users to interact with data more intuitively without deep technical expertise. Future work could focus on integrating Retrieval-Augmented Generation (RAG) systems and advanced tool usage to further enhance model capabilities.