AI Summary • Published on Nov 8, 2025
Unit tests are crucial for verifying code correctness and documenting behavior, but they often lack concise summaries, making them difficult and time-consuming to understand, especially in large or automatically generated test suites. Earlier template-based and deep-learning approaches attempted to generate such summaries automatically, but frequently produced verbose or redundant output. Large Language Models (LLMs) show promise for code comprehension, yet applying them to test code summarization is uniquely challenging: unlike general code, test methods validate expected behavior through assertions, so LLMs must reason about validation intent rather than implementation logic. Prior studies of LLM-based test summarization were limited in scale and did not adequately explore test-specific structural features such as assertion statements and assertion messages.
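To make these test-specific features concrete, here is a minimal, hypothetical JUnit 5 example that is not drawn from the study's benchmark: the `ShoppingCartTest` class, its nested `ShoppingCart` method under test, the assertion messages, and the comments suggesting possible assertion semantics are all invented for illustration.

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertThrows;

import java.util.ArrayList;
import java.util.List;
import org.junit.jupiter.api.Test;

class ShoppingCartTest {

    // Hypothetical method under test (MUT), included only so the example is self-contained.
    static class ShoppingCart {
        private final List<double[]> items = new ArrayList<>(); // each entry: {quantity, unitPrice}

        void add(String name, int quantity, double unitPrice) {
            if (quantity <= 0) {
                throw new IllegalArgumentException("quantity must be positive");
            }
            items.add(new double[] {quantity, unitPrice});
        }

        double total() {
            return items.stream().mapToDouble(i -> i[0] * i[1]).sum();
        }
    }

    @Test
    void totalSumsPriceTimesQuantity() {
        ShoppingCart cart = new ShoppingCart();
        cart.add("apple", 2, 1.50);
        cart.add("bread", 1, 3.00);

        // Assertion statement; the trailing string is the assertion message.
        // A derived assertion semantic might read: "the cart total equals the
        // sum of unit price times quantity over all added items".
        assertEquals(6.00, cart.total(), 0.001,
                "total should equal the sum of unitPrice * quantity over all items");

        // Second assertion validating the MUT's error handling.
        // A derived assertion semantic might read: "adding an item with a
        // non-positive quantity is rejected with an IllegalArgumentException".
        assertThrows(IllegalArgumentException.class,
                () -> cart.add("milk", -1, 2.00),
                "adding a non-positive quantity should throw");
    }
}
```

A summary of such a test describes what behavior the assertions validate (here, the total calculation and rejection of invalid quantities), not how the test or the MUT is implemented.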
This research introduces a benchmark of 91 real-world Java test cases, each paired with a developer-written summary. An ablation study over seven distinct prompt configurations investigated the influence of four test-code components: the test method itself, the Method Under Test (MUT), assertion messages, and natural-language assertion semantics. The assertion semantics were generated by passing individual assertion statements to GPT-4o, which produced concise natural-language descriptions of their intent. Four instruction-tuned code LLMs (Codex, Codestral, DeepSeek, and Qwen-Coder) were evaluated. The evaluation used n-gram metrics (BLEU, ROUGE-L, METEOR) for lexical overlap, BERTScore for semantic similarity, and an LLM-based evaluation (LLM-Eval) in which GPT-4o acted as a judge of human-aligned quality, clarity, and usefulness.
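The paper's exact prompt templates are not reproduced here; the sketch below only illustrates how such an ablation might assemble prompt variants from the four components, with the LLM calls elided. `PromptBuilder`, `buildPrompt`, the component labels, and the instruction wording are assumptions for illustration, and the assertion-semantics strings merely stand in for GPT-4o-derived descriptions.

```java
import java.util.List;

/**
 * Minimal sketch of prompt composition for an assertion-focused ablation.
 * Component labels and instruction wording are assumptions, not the study's
 * actual templates.
 */
public class PromptBuilder {

    /** Builds one prompt variant from whichever components are provided (null = omitted). */
    static String buildPrompt(String testMethod,
                              String methodUnderTest,
                              List<String> assertionMessages,
                              List<String> assertionSemantics) {
        StringBuilder prompt = new StringBuilder(
                "Summarize the intent of the following unit test in one sentence.\n\n");
        prompt.append("Test method:\n").append(testMethod).append("\n");
        if (methodUnderTest != null) {
            prompt.append("\nMethod under test:\n").append(methodUnderTest).append("\n");
        }
        if (assertionMessages != null && !assertionMessages.isEmpty()) {
            prompt.append("\nAssertion messages:\n");
            assertionMessages.forEach(m -> prompt.append("- ").append(m).append("\n"));
        }
        if (assertionSemantics != null && !assertionSemantics.isEmpty()) {
            prompt.append("\nAssertion semantics (natural-language intent of each assertion):\n");
            assertionSemantics.forEach(s -> prompt.append("- ").append(s).append("\n"));
        }
        return prompt.toString();
    }

    public static void main(String[] args) {
        String test = "@Test void totalSumsPriceTimesQuantity() { ... }"; // abbreviated test source
        List<String> messages = List.of(
                "total should equal the sum of unitPrice * quantity over all items");
        List<String> semantics = List.of(
                "the cart total equals the sum of unit price times quantity over all added items");

        // Two of several possible configurations: test-only vs. assertion-enhanced.
        String testOnly = buildPrompt(test, null, null, null);
        String assertionEnhanced = buildPrompt(test, null, messages, semantics);

        System.out.println(testOnly);
        System.out.println(assertionEnhanced);
        // Each variant would then be sent to a code LLM for summary generation;
        // that call, and the GPT-4o step that derives the semantics, are elided here.
    }
}
```

The generated summaries from each configuration would then be scored against the developer-written ground truths using the lexical, semantic, and LLM-as-judge metrics described above.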
The study found that prompting LLMs with assertion semantics significantly improved the quality of generated test summaries. Configurations that included assertion semantics improved LLM-Eval scores by an average of 0.10 points (2.3%) over prompts relying solely on the full MUT context, while also requiring fewer input tokens. Codex and Qwen-Coder consistently produced summaries that aligned most closely with the human-written ground truths across metrics. In contrast, DeepSeek consistently underperformed on the LLM-based evaluation despite sometimes achieving high lexical overlap. Prompts that included the full context (test code, MUT, assertion messages, and assertion semantics) did not always outperform more concise, assertion-focused inputs, suggesting diminishing returns from excessive context. Test-only prompts consistently yielded the lowest-quality summaries.
The findings provide practical guidance for prompt engineering in test code summarization, emphasizing the effectiveness of lightweight, behavior-focused assertion-level features (assertion messages and their natural-language semantic interpretations) for guiding LLMs. These assertion-enhanced prompts were more token-efficient and consistently yielded higher-quality, more concise summaries than prompts relying solely on the full Method Under Test context. The research also highlights a key limitation of traditional n-gram metrics such as BLEU and ROUGE-L: they do not fully capture the quality of LLM-generated summaries, underscoring the need for semantic and human-aligned evaluation. The benchmark and evaluation pipeline developed in this study are publicly available to support future research on test code comprehension and LLM-based documentation tools.