AI Summary • Published on Oct 28, 2025
The diagnosis of most mental disorders relies primarily on subjective dialogue between psychiatrists and patients during psychiatric evaluations. This subjectivity introduces significant variability, leading to inconsistent diagnoses across clinicians and patients. Such variability can result in misdiagnosis, delayed treatment, and unreliable outcomes, and is further exacerbated by differences in clinical experience, interpersonal dynamics, cultural context, interpretation of symptom severity, time constraints, and potential biases.
This research proposes a Fine-Tuned Large Language Model (LLM) Consortium and OpenAI-gpt-oss Reasoning LLM-enabled Decision Support System to standardize psychiatric diagnoses. The system operates through four main layers: a Data Lake, an LLM Agent Layer, an LLM Layer, and an OpenAI-gpt-oss Reasoning Layer. The Data Lake serves as a repository for extensive, curated conversational datasets that simulate psychiatrist–patient interactions, complete with clinician-verified, DSM-5-aligned diagnoses. The LLM Layer consists of a consortium of state-of-the-art LLMs (Llama-3, Mistral, Qwen2) fine-tuned on these domain-specific datasets using the Unsloth library and 4-bit Quantized Low-Rank Adaptation (QLoRA) for efficient deployment; these fine-tuned models analyze conversational data to produce preliminary diagnostic predictions. The LLM Agent Layer acts as an orchestrator, dynamically generating prompts for the fine-tuned LLMs, collecting their individual predictions, and structuring them for the reasoning engine. The OpenAI-gpt-oss Reasoning Layer, the final decision-making engine, synthesizes, evaluates, and refines the preliminary diagnoses from the consortium, applying structured clinical logic to yield a robust, consensus-driven final diagnosis aligned with DSM-5 criteria.

The prototype was developed in collaboration with the U.S. Army Medical Research Team, using the OpenAI Agents SDK and the Google Agent Development Kit for agent implementation, and Google Colab with NVIDIA A100 GPUs and Tesla TPUs for fine-tuning on a dataset of approximately 2,000 annotated records.
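To make the fine-tuning step concrete, here is a minimal sketch of 4-bit QLoRA training for one consortium member with the Unsloth library and TRL, following Unsloth's published example workflow. The checkpoint name, LoRA hyperparameters, and the `psych_dialogues.jsonl` dataset path are illustrative assumptions, not details taken from the paper, and exact argument names vary across TRL versions.

```python
# Sketch: 4-bit QLoRA fine-tuning of one consortium member (e.g. Llama-3) with Unsloth.
# Assumes a JSONL dataset of psychiatrist-patient dialogues, each formatted into a
# single "text" field ending with the clinician-verified DSM-5 diagnosis.
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# Load the base model in 4-bit precision (QLoRA) so it fits on a single A100-class GPU.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-Instruct-bnb-4bit",  # assumed checkpoint
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach low-rank adapters; only these small matrices are trained.
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

# ~2,000 annotated conversation records (hypothetical file name).
dataset = load_dataset("json", data_files="psych_dialogues.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        output_dir="llama3-psych-qlora",
    ),
)
trainer.train()
```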
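The orchestration and reasoning steps can be pictured with the hedged sketch below: the agent layer fans a conversation out to the fine-tuned consortium, collects the preliminary diagnoses, and hands them to the gpt-oss reasoning model for reconciliation. The paper names the OpenAI Agents SDK and Google Agent Development Kit but does not publish its prompts or wiring, so the endpoints, prompt wording, and model tags here are illustrative assumptions (each model is assumed to sit behind an OpenAI-compatible server such as vLLM or Ollama).

```python
# Sketch: LLM Agent Layer orchestrating the fine-tuned consortium and the
# gpt-oss reasoning layer. Endpoints, prompts, and model names are assumed.
from openai import OpenAI

# Fine-tuned consortium members behind hypothetical local OpenAI-compatible endpoints.
CONSORTIUM = {
    "llama3-psych":  OpenAI(base_url="http://localhost:8001/v1", api_key="none"),
    "mistral-psych": OpenAI(base_url="http://localhost:8002/v1", api_key="none"),
    "qwen2-psych":   OpenAI(base_url="http://localhost:8003/v1", api_key="none"),
}
REASONER = OpenAI(base_url="http://localhost:8010/v1", api_key="none")  # gpt-oss server

DIAGNOSIS_PROMPT = (
    "You are a psychiatric decision-support assistant. Read the conversation "
    "and return a concise preliminary diagnosis with its DSM-5 code.\n\n{dialogue}"
)

def consortium_predictions(dialogue: str) -> dict[str, str]:
    """Collect one preliminary diagnosis per fine-tuned model."""
    preds = {}
    for name, client in CONSORTIUM.items():
        resp = client.chat.completions.create(
            model=name,
            messages=[{"role": "user",
                       "content": DIAGNOSIS_PROMPT.format(dialogue=dialogue)}],
        )
        preds[name] = resp.choices[0].message.content
    return preds

def final_diagnosis(dialogue: str) -> str:
    """Ask the gpt-oss reasoning layer to reconcile the consortium's outputs."""
    preds = consortium_predictions(dialogue)
    synthesis_prompt = (
        "Three fine-tuned clinical models produced these preliminary diagnoses:\n"
        + "\n".join(f"- {m}: {p}" for m, p in preds.items())
        + "\n\nApply DSM-5 criteria to the conversation below, resolve any "
          "disagreement, and state a single final diagnosis with its DSM-5 code.\n\n"
        + dialogue
    )
    resp = REASONER.chat.completions.create(
        model="gpt-oss-20b",  # assumed model tag
        messages=[{"role": "user", "content": synthesis_prompt}],
    )
    return resp.choices[0].message.content
```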
The evaluation demonstrated the effectiveness of the fine-tuning process: training and validation loss curves showed rapid learning and stabilization, with a modest generalization gap of approximately 2.41 indicating good generalization. Diagnostic performance of the fine-tuned LLM consortium improved significantly over the baseline models. Fine-tuned Llama-3, Mistral, and Qwen2 consistently produced concise, clinically valid diagnoses with accurate DSM-5 codes inferred directly from patient-physician conversations, whereas their untuned counterparts were often verbose and imprecise. For instance, Llama-3 accurately identified Major Depressive Disorder and Bipolar I Disorder, Mistral precisely classified Panic Disorder and PTSD, and Qwen2 correctly classified Schizophrenia and Generalized Anxiety Disorder. The diagnostic reasoning performance of the OpenAI-gpt-oss LLM was then assessed by comparing its final diagnoses against the predictions of the individual fine-tuned LLMs. The results showed that OpenAI-gpt-oss effectively reconciled diverging predictions from the consortium, applying structured clinical logic to arrive at accurate, DSM-5-aligned outcomes and thereby improving diagnostic reliability.
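As a rough illustration of this kind of comparison, the snippet below scores predicted diagnoses against clinician-verified labels by exact match on the extracted DSM-5 code. The record format and the code-extraction regex are assumptions for illustration; the paper does not specify its scoring procedure.

```python
# Sketch: scoring diagnostic predictions against clinician-verified DSM-5 labels.
# The regex and record format are assumed, not taken from the paper.
import re

# Matches legacy DSM codes (e.g. 296.23) or ICD-10-CM style codes used in DSM-5 (e.g. F32.1).
DSM5_CODE = re.compile(r"\b\d{3}\.\d{1,2}\b|\bF\d{2}(?:\.\d{1,2})?\b")

def extract_code(text: str) -> str | None:
    """Pull the first DSM-5 code out of a model's free-text diagnosis."""
    match = DSM5_CODE.search(text)
    return match.group(0) if match else None

def code_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of cases where the predicted code matches the clinician's code."""
    hits = sum(
        extract_code(p) is not None and extract_code(p) == extract_code(r)
        for p, r in zip(predictions, references)
    )
    return hits / len(references)

# Hypothetical usage: compare one model's outputs (or the gpt-oss final diagnoses)
# against clinician-verified labels held out from the ~2,000-record dataset.
preds = ["Major Depressive Disorder, single episode, moderate (F32.1)"]
refs  = ["Major Depressive Disorder (F32.1)"]
print(code_accuracy(preds, refs))  # -> 1.0
```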
This AI-assisted diagnostic framework offers a transformative approach to mental health diagnostics: a scalable, interpretable decision support tool that enhances the accuracy, consistency, and transparency of psychiatric diagnosis. By addressing the inherent subjectivity of traditional assessments, the system delivers data-driven insights, reduces diagnostic variability, and paves the way for more reliable patient care. Low-rank adapters and quantization keep deployment efficient on consumer-grade hardware, making the system accessible in diverse clinical and remote-care settings. This work represents the first end-to-end integration of fine-tuned LLMs with a reasoning engine for standardizing psychiatric diagnoses, establishing a foundation for future AI-powered eHealth systems. Future research will focus on clinical validation, multilingual adaptation, and the integration of multimodal inputs such as voice, facial expressions, and affective signals to deepen diagnostic understanding and empathy.