AI Summary • Published on Jul 30, 2025
The processing of tabular data by Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) presents unique difficulties due to tables' two-dimensional, flexible, and often complex structures. Unlike linear text, tables require specialized approaches, which has led to a fragmented field with many diverse input representations, tasks, and methods. This fragmentation makes it difficult for researchers to navigate the landscape and recognize the broader opportunities. Existing benchmarks often focus on retrieval-style tasks that require minimal reasoning, struggle with complex or large-scale tables, and reveal limited model generalization across tabular formats, indicating a need for more robust evaluation and a firmer foundational understanding.
This survey addresses these challenges through a systematic, comprehensive overview of tabular data understanding with LLMs. The authors introduce a taxonomy of tabular input representations, covering textual serialization, database schemas, image-based formats, and specialized table encoders. They also categorize a broad range of table understanding tasks, including Table Question Answering (TQA), Table-to-Text generation, and Table Fact Verification, along with emerging applications such as leaderboard construction. The paper identifies key benchmarks for these tasks and analyzes their strengths and limitations, particularly in assessing higher-level reasoning and complex input scenarios.
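To make the representation taxonomy concrete, here is a minimal Python sketch (not from the survey) that renders one toy table into a few of the textual formats the taxonomy covers; the table contents, helper names, and prompt wording are illustrative assumptions.

```python
import json

# Illustrative sketch: one small table rendered into several textual
# serializations commonly fed to LLMs. The table contents are made up.
rows = [
    {"country": "France", "capital": "Paris", "population_m": 68.2},
    {"country": "Japan",  "capital": "Tokyo", "population_m": 124.5},
]

def to_markdown(rows):
    """Markdown serialization: header row, separator row, then data rows."""
    headers = list(rows[0])
    lines = ["| " + " | ".join(headers) + " |",
             "| " + " | ".join("---" for _ in headers) + " |"]
    lines += ["| " + " | ".join(str(r[h]) for h in headers) + " |" for r in rows]
    return "\n".join(lines)

def to_json_records(rows):
    """JSON 'records' serialization: one object per row."""
    return json.dumps(rows, indent=2)

def to_schema(table_name, rows):
    """Database-schema-style serialization: column names only, with
    Python type names standing in for SQL types in this toy example."""
    cols = ", ".join(f"{h} {type(rows[0][h]).__name__}" for h in rows[0])
    return f"CREATE TABLE {table_name} ({cols});"

# A simple TQA-style prompt built from one of the serializations.
question = "Which country has the larger population?"
prompt = f"{to_markdown(rows)}\n\nQuestion: {question}\nAnswer:"
print(prompt)
```

Which of these serializations a model handles best is itself an open question the survey raises, since performance can shift with the chosen format.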
The survey highlights several critical findings. First, current models are saturating many retrieval-focused benchmarks, with advanced methods achieving high accuracy on datasets like WikiTableQuestions and TabFact; however, these benchmarks often rely on simple tables or on queries that can be solved with basic logical operations. Second, models struggle with complex table structures (e.g., hierarchical tables), large-scale tables, lengthy contextual information, and multi-table scenarios, where humans still far outperform them: on benchmarks such as HiTab and MULTIHIERTT, model accuracy falls below 50% while human accuracy exceeds 80%. Finally, models generalize poorly across diverse tabular representations, with performance varying markedly depending on how closely the input format matches what was seen during pre-training. The lack of a universal representation and the inconsistent input formats used across benchmarks exacerbate this issue.
These findings point to several promising directions for future research. Benchmarks need to move beyond simple retrieval toward tasks that demand higher-order reasoning, such as insight identification, forecasting, and prescriptive thinking (e.g., creating a chart from an ambiguous query). Building models that can handle complex inputs such as hierarchical tables, multi-table settings, and large-scale data is equally important. Research should also aim to improve generalization across tabular representations through standardized serialization options, serialization-to-serialization tasks, and studies of which representations work best for complex structures. Combining image-based and text-based inputs, so that models capture both overall structure and fine-grained cell content, remains an underexplored avenue (sketched below). Finally, the survey underscores the importance of developing more realistic and challenging evaluation benchmarks that mirror real-world complexity.
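As a rough sketch of that image-plus-text direction, the snippet below pairs a rendered table image with its textual serialization in a single multimodal message. The "content parts" payload shape mirrors a pattern used by several multimodal chat APIs but is an assumption here, not an interface described in the survey; the function name and file path are hypothetical.

```python
import base64

def build_multimodal_message(table_image_path: str, table_markdown: str, question: str) -> dict:
    """Hypothetical sketch: combine a table image and its text serialization
    in one prompt, so the model sees both layout and exact cell values."""
    with open(table_image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "role": "user",
        "content": [
            # Image part: conveys overall layout, merged cells, and hierarchy.
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            # Text part: conveys exact cell values for precise lookups.
            {"type": "text",
             "text": f"{table_markdown}\n\nQuestion: {question}"},
        ],
    }
```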