AI Summary • Published on Jun 13, 2025
Current financial statement auditing processes are largely manual, leading to inefficiencies and frequent errors despite widely available information technology. This reliance on human labor often results in missed misstatements and a failure to meet stakeholder expectations for transparency and accuracy. The exponential growth in data volume and complexity further compounds these challenges, creating an urgent need for intelligent automation. Previous research has primarily focused on specific aspects such as improved sampling techniques or efficient data extraction, but has not adequately addressed low-level auditing challenges such as the extensive cross-verification of transaction data against financial disclosures, or the critical task of understanding and applying accounting standards with proper justification.
This research proposes leveraging Large Language Models (LLMs) to automate financial statement auditing. To evaluate their capabilities, the authors developed a novel, comprehensive benchmark. Its curated dataset combines real-world financial tables manually extracted from S&P 500 companies (yielding 371 textual financial statements) with high-quality synthetic historical transaction data generated by GPT-4 and verified by humans. The dataset intentionally contains a balanced mix of correct and erroneous tables, with errors systematically injected from four types: missing rows, numerical errors, redundant rows, and misclassifications. A rigorous five-stage diagnostic framework assesses LLM performance: (1) General Judgment (binary classification of correctness), (2) Error Identification (identifying error types and problematic entries), (3) Error Resolution (proposing natural-language amendments), (4) Standards Citation (referencing relevant accounting standards), and (5) Financial Statement Revision (applying direct modifications to tables). State-of-the-art LLMs, specifically GPT-3.5-turbo and GPT-4, were evaluated using metrics including Exact Match (EM) Score, BERTScore, top-k retrieval-based EM Score, and BLEU, along with an overall Success Rate.
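The summary does not spell out how errors were injected; as a hedged sketch of the idea, the four named error types could be applied to a simple tabular statement as follows (the row representation, function name, and perturbation ranges are illustrative assumptions, not the paper's actual procedure, which relied on GPT-4 generation with human verification):

```python
import random

ERROR_TYPES = ["missing_row", "numerical_error", "redundant_row", "misclassification"]

def inject_error(rows, error_type, rng=random):
    """Return a corrupted copy of `rows` (a list of [label, amount] pairs).

    Illustrative only: models the four error types named in the benchmark.
    """
    rows = [list(r) for r in rows]  # copy so the original statement is untouched
    i = rng.randrange(len(rows))
    if error_type == "missing_row":          # drop a line item entirely
        del rows[i]
    elif error_type == "numerical_error":    # perturb one reported amount
        rows[i][1] = round(rows[i][1] * rng.uniform(0.5, 1.5), 2)
    elif error_type == "redundant_row":      # duplicate a line item
        rows.insert(i, list(rows[i]))
    elif error_type == "misclassification":  # swap two line-item labels
        j = rng.randrange(len(rows))
        rows[i][0], rows[j][0] = rows[j][0], rows[i][0]
    else:
        raise ValueError(f"unknown error type: {error_type}")
    return rows
```

Injecting a known error type per table, as here, is what makes the downstream Error Identification stage scorable against a ground-truth label.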
The evaluation revealed that current LLMs can accurately identify error-free financial statements, suggesting their potential for an initial screening step in auditing. However, while both GPT-3.5-turbo and GPT-4 excel at detecting general misalignments between transaction data and financial statements, their performance deteriorates significantly on more complex auditing tasks. For instance, GPT-4 achieved only about 50% accuracy in identifying specific error types. A critical limitation was the models' struggle to provide clear explanations for detected errors, cite relevant accounting standards, and execute comprehensive financial statement revisions; this decline was even more pronounced when statements contained multiple errors. The study attributes these limitations to two key challenges: a lack of domain-specific accounting knowledge, which hinders accurate interpretation and standards citation, and difficulty performing joint reasoning across tabular data and unstructured text, which leads to fragmented analysis and incorrect conclusions during error location and correction.
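Accuracy figures like the roughly 50% error-type identification rate come down to comparing model outputs against gold labels; a minimal sketch of exact-match scoring and its aggregation is below (the whitespace/case normalization is an illustrative assumption, not the paper's exact scoring code):

```python
def exact_match(pred: str, gold: str) -> float:
    """1.0 if prediction equals gold after lowercasing and collapsing
    whitespace, else 0.0. A simple proxy for an EM-style metric."""
    norm = lambda s: " ".join(s.lower().split())
    return float(norm(pred) == norm(gold))

def success_rate(preds: list[str], golds: list[str]) -> float:
    """Fraction of predictions that exactly match their gold label,
    e.g. over predicted error types for a set of corrupted tables."""
    assert len(preds) == len(golds) and golds
    return sum(exact_match(p, g) for p, g in zip(preds, golds)) / len(golds)
```

Note that EM-style scoring is strict: a model that locates the right entry but names the error type with a synonym scores zero, which is one reason benchmarks of this kind also report softer metrics such as BERTScore and BLEU.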
The findings underscore a significant gap in the domain-specific accounting knowledge of current LLMs, indicating they are not yet reliable as standalone auditors. To bridge this gap, future research should concentrate on two primary areas: first, integrating extensive domain-specific knowledge in accounting and finance, whether by fine-tuning LLMs on specialized datasets or by employing techniques such as retrieval-augmented generation (RAG) and knowledge graphs to supply authoritative information; second, enhancing hybrid reasoning across diverse data modalities, such as tabular and textual data, possibly through multi-modal architectures and tailored pretraining strategies. Advances in these areas promise to significantly improve the precision, reliability, and practical applicability of LLMs in financial auditing and other highly specialized fields. The benchmark and evaluation framework introduced in this paper lay a crucial foundation for more effective automated auditing tools, which could substantially boost the accuracy and efficiency of real-world financial statement auditing practices.
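The RAG direction suggested above can be sketched minimally: retrieve the most relevant accounting-standard snippets for an audit query, then prepend them to the LLM prompt. The sketch below uses a bag-of-words cosine similarity over a toy corpus; the standard IDs are real ASC section numbers, but the snippet texts, function names, and retrieval method are illustrative assumptions (a production system would use dense embeddings over actual standard text):

```python
from collections import Counter
import math

# Toy corpus of accounting-standard snippets (paraphrased topics,
# NOT actual standard text).
STANDARDS = {
    "ASC 606": "revenue recognition from contracts with customers",
    "ASC 842": "lease accounting right of use assets and liabilities",
    "ASC 330": "inventory measurement lower of cost or net realizable value",
}

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_standard(query: str, k: int = 1) -> list[str]:
    """Return the k standard IDs most similar to the query. In a full
    RAG pipeline, their texts would be prepended to the LLM prompt so
    the Standards Citation stage is grounded in authoritative sources."""
    q = Counter(query.lower().split())
    ranked = sorted(
        STANDARDS,
        key=lambda sid: cosine(q, Counter(STANDARDS[sid].lower().split())),
        reverse=True,
    )
    return ranked[:k]
```

Grounding citations this way directly targets the two failure modes the study identifies: the retrieved text injects the missing domain knowledge, and anchoring the prompt to a specific standard narrows the joint reasoning the model must perform.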