AI Summary • Published on Dec 3, 2025
Large Language Models (LLMs) have significantly impacted various sectors, including healthcare, by enhancing capabilities such as text generation and comprehension. However, their ability to perform zero-shot information extraction from complex and variable clinical language, particularly in non-English Electronic Health Records (EHRs) such as those written in Italian, remains largely underexplored. Traditional Natural Language Processing (NLP) techniques often fall short in handling the nuanced semantics of free-text clinical notes. This research assesses whether LLMs can effectively extract comorbidity information from Italian EHRs in a zero-shot, on-premises setting, and whether they can serve as a viable substitute for established regular expression-based extraction methods.
The study evaluated six open-source multilingual LLMs from the OpenLLaMA, Mistral, and Qwen2.5 families (including 3B, 7B, and 8x7B models) in a zero-shot configuration for comorbidity extraction. The dataset comprised 8223 Italian patient records, focusing on five critical cardiac comorbidities. Initially, regular expressions, developed in collaboration with clinical experts, were used for automated data annotation to establish a baseline for comparison. To refine the ground truth and address potential inaccuracies in the regex labels, 100 "false negative" records were manually annotated by two clinicians. The LLMs were then tasked with classifying each comorbidity individually per EHR using a standard zero-shot prompt. Their performance was evaluated against both the automated (regex) and manual annotations using classification accuracy, F1-score, precision, and recall.
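The per-comorbidity evaluation loop described above can be sketched as follows. This is a minimal illustration, not the study's actual code: `ask_llm` is a hypothetical stand-in for the on-premises model call, and the prompt wording is an assumed English placeholder rather than the prompt used in the paper.

```python
from typing import Callable, Dict, List

# Hypothetical zero-shot prompt template; the study's actual prompt
# (presumably in or about Italian clinical text) is not reproduced here.
PROMPT_TEMPLATE = (
    "Read the following Italian clinical note and answer only 'yes' or 'no': "
    "does the patient have {comorbidity}?\n\nNote:\n{note}"
)

def classify_comorbidity(ask_llm: Callable[[str], str],
                         note: str, comorbidity: str) -> int:
    """One zero-shot query per (EHR, comorbidity) pair, as in the study design."""
    answer = ask_llm(PROMPT_TEMPLATE.format(comorbidity=comorbidity, note=note))
    return 1 if answer.strip().lower().startswith("yes") else 0

def binary_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 of predictions against the
    regex (or manual) reference labels for a single comorbidity."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(y_true),
            "precision": precision, "recall": recall, "f1": f1}
```

Reporting precision and recall alongside accuracy matters here because, as the results show, a model can score well on one global metric while failing badly on another.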
When compared against automated regular expression annotations, OpenLLaMA 3B and Mixtral 8x7B exhibited low overall accuracies, falling below 35%. In contrast, OpenLLaMA 7B, Mistral 7B, and both Qwen2.5 3B and 7B achieved overall accuracies exceeding 70%, with Mistral 7B demonstrating the highest performance at 82.67%. However, a detailed analysis of F1-score, precision, and recall revealed that many LLMs struggled to generalize. For example, OpenLLaMA 3B showed high recall but poor precision, indicating a propensity for false positives, while Mistral 7B achieved excellent precision but low recall, suggesting it missed many actual cases. Against manual annotations, OpenLLaMA 3B again performed very poorly (<10% overall accuracy). While other models, including OpenLLaMA 7B, Mistral 7B, Mixtral 8x7B, and Qwen2.5 (3B and 7B), achieved over 80% accuracy in this comparison (with significant improvements for Qwen2.5 and Mixtral 8x7B), a confusion matrix analysis of the best (Mistral 7B) and worst (OpenLLaMA 3B) performers showed that each defaulted to predicting one class, either mostly negatives or mostly positives, rather than demonstrating genuine semantic understanding. Ultimately, none of the evaluated LLMs matched the regular expression approach, which achieved 92.2% accuracy against the manual annotation.
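The confusion matrix finding above reflects a well-known pitfall with imbalanced labels: a degenerate predictor that always answers one class can still post a high accuracy while its recall collapses. A minimal illustration, with made-up prevalence numbers rather than the study's data:

```python
# Why high accuracy can mask a degenerate predictor: with a 10% prevalence
# comorbidity (illustrative, not the study's actual figure), a model that
# always answers "absent" scores 90% accuracy but has zero recall.
y_true = [1] * 10 + [0] * 90   # 10 positive records out of 100
y_pred = [0] * 100             # model that always predicts "absent"

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
recall = tp / (tp + fn) if tp + fn else 0.0

print(accuracy)  # 0.9
print(recall)    # 0.0
```

This is why the study's per-class metrics and confusion matrices, rather than overall accuracy alone, drive its conclusions.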
The findings strongly suggest that deploying multilingual LLMs in a zero-shot, on-premises setting for extracting comorbidity information from Italian EHRs is not advisable. Despite some initially promising accuracy figures, a deeper analysis reveals that these models lack the necessary generalization capabilities and trustworthiness required for high-risk domains like healthcare, where misclassifications can lead to severe consequences. The study concludes that the selected LLMs cannot effectively substitute traditional pattern-matching approaches in extraction pipelines in their current zero-shot state. The authors emphasize the critical importance of thorough testing, the selection of appropriate global metrics, and continuous monitoring for issues like hallucinations before LLMs are deployed in sensitive fields. Future research will explore advanced techniques such as In-Context Learning (ICL) and fine-tuning to enhance LLM performance and foster greater trustworthiness in healthcare-related language processing tasks.