AI Summary • Published on Dec 3, 2025
Existing legal benchmarks for Large Language Models (LLMs) tend to be narrow, focusing on isolated tasks and outcomes rather than evaluating legal general intelligence as a whole. They frequently overlook aspects of "soft legal intelligence" such as ethical judgment and societal impact. They also suffer from data contamination, since they draw on static public datasets, and they lack a structured framework for assessing the distinct stages of legal reasoning, which means they can reward pattern mimicry as if it were genuine legal understanding.
To address these limitations, the authors propose LexGenius, an expert-level Chinese legal benchmark constructed from recent legal exam questions and judicial cases to ensure originality and minimize data leakage. LexGenius is built on a "Dimension-Task-Ability" framework spanning seven dimensions, eleven tasks, and twenty atomic legal intelligence abilities, informed by educational, legal hermeneutic, and problem-solving theories. Multiple-choice questions (MCQs) are produced through a hybrid approach that combines LLM-based generation with the adaptation of existing questions. A three-step workflow of data collection and structuring, MCQ construction, and double-blind manual review by law master's candidates ensures the legal accuracy, reasoning rigor, and competency alignment of the 8,385 resulting MCQs. For evaluation, the benchmark applies its three-level framework to compute scores at the ability, task, and dimension levels. The study assesses twelve state-of-the-art LLMs under both naive and Chain-of-Thought (CoT) prompting strategies and establishes a human baseline for comparison.
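The three-level scoring can be pictured as a roll-up from per-question correctness to ability, task, and dimension scores. The Python sketch below is a minimal illustration of that aggregation, not the paper's implementation: the field names (`dimension`, `task`, `ability`) and the unweighted macro-averaging at each level are assumptions, and the authors may weight or aggregate differently.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Record:
    """One graded MCQ: the dimension/task/ability it probes and whether the model answered correctly."""
    dimension: str
    task: str
    ability: str
    correct: bool


def aggregate_scores(records: list[Record]) -> dict:
    """Roll question-level correctness up the Dimension-Task-Ability hierarchy.

    Assumed scheme (hypothetical): ability score = accuracy over that ability's
    questions; task score = mean of its ability scores; dimension score = mean
    of its task scores.
    """
    # Group correctness flags by (dimension, task, ability).
    by_ability = defaultdict(list)
    for r in records:
        by_ability[(r.dimension, r.task, r.ability)].append(r.correct)
    ability_scores = {key: sum(v) / len(v) for key, v in by_ability.items()}

    # Task level: average the ability scores that belong to each task.
    task_members = defaultdict(list)
    for (dim, task, _), score in ability_scores.items():
        task_members[(dim, task)].append(score)
    task_scores = {key: sum(v) / len(v) for key, v in task_members.items()}

    # Dimension level: average the task scores under each dimension.
    dim_members = defaultdict(list)
    for (dim, _), score in task_scores.items():
        dim_members[dim].append(score)
    dimension_scores = {key: sum(v) / len(v) for key, v in dim_members.items()}

    return {"ability": ability_scores, "task": task_scores, "dimension": dimension_scores}
```

Under this assumed scheme, an ability with few questions still counts equally toward its task score, which keeps rare but important abilities from being drowned out by question volume.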
The evaluation of twelve state-of-the-art LLMs on LexGenius revealed substantial gaps between current models and human legal experts, particularly in areas requiring nuanced value judgments, contextual trade-offs, and institutional understanding, such as legal reasoning, judicial practice, and legal ethics. While the LLMs performed adequately on static, knowledge-based tasks, they were considerably weaker at dynamic reasoning, legal application analysis, and case reasoning. Even with Chain-of-Thought prompting, improvements were largely superficial and failed to close the gap in higher-order legal intelligence. The study also highlighted the systematic immaturity of LLMs' soft legal intelligence and their struggles with ambiguity, conflict, and ethical considerations in legal language. Further analysis revealed a "triple decoupling" phenomenon: model scale did not correlate linearly with legal intelligence, CoT could cause negative transfer on closed-solution legal tasks, and supervised fine-tuning sometimes induced catastrophic forgetting in stronger models. The authors therefore argue that domain-specific pretraining and reinforcement learning whose rewards are aligned with evaluative capacity are more promising routes for advancing legal AI than scaling, prompting, or generic fine-tuning alone.
LexGenius provides a critical framework for evaluating legal general intelligence, but the authors outline several future directions to enhance its realism and generalizability. These include incorporating multimodal legal tasks (e.g., images, audio) to assess visual perception and cross-modal reasoning, expanding linguistic and jurisdictional coverage beyond Chinese civil law to support global legal services, and introducing dynamic temporal reasoning to evaluate models' understanding of time-bound legal applicability and evolving legal frameworks. These planned enhancements aim to better reflect the complexities of real-world legal environments and foster the development of more robust, universally applicable AI systems in the legal domain.