AI Summary • Published on Dec 3, 2025
Individual large language models (LLMs) struggle with reliability and consistency in clinical decision support tasks like medication recommendation, often leading to hallucinations and inconsistencies. While ensembling LLMs can improve performance, existing methods frequently overlook how models interact, potentially amplifying errors or biases instead of leveraging complementary strengths effectively. This inconsistency makes medication recommendations error-prone, especially when derived from unstructured clinical notes, and naive ensembles often fail to deliver stable and credible outputs.
The authors propose a Multi-LLM Collaboration approach, building on their "LLM Chemistry" framework, which quantitatively measures and leverages the collaborative compatibility among LLMs. This framework enables the creation of ensembles that are effective, stable, and calibrated by explicitly modeling synergistic and antagonistic relationships between models. The collaboration operates in two stages: Generation and Evaluation. In the Generation stage, a user request is distributed to a set of N=3 response generators (LLMs), chosen by one of four sampling strategies (REMOTE, LOCAL, RANDOM, or CHEMISTRY). Each LLM independently generates a medication recommendation. In the Evaluation stage, other LLMs anonymously review and grade these responses (on a scale of 0.0-1.0), ensuring balanced and independent assessment. A consensus-based estimation method, inspired by the Vancouver crowdsourcing algorithm, aggregates these grades to identify high-quality, consensus-backed recommendations and quantify the reliability of each participating model.
Experiments utilized a dataset of 2020 synthetic clinical vignettes paired with domain-expert validated medication recommendations, generated by prompting LLMs in reverse. The study included ten proprietary LLMs (OpenAI’s GPT, Anthropic’s Claude, Google’s Gemini) and four open-source models.
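The grade aggregation in the Evaluation stage can be illustrated with a minimal Python sketch. This is not the Vancouver algorithm and not the authors' implementation; it is a simplified reliability-weighted averaging loop in the same spirit, in which reviewers whose grades sit closer to the consensus receive more weight. The function name `consensus`, the labels `model_a` and `resp_a`, the iteration count, the smoothing constant, and the example grades are all assumptions made for illustration.

```python
from statistics import fmean


def consensus(grades, iterations=10, eps=1e-3):
    """Simplified consensus estimate (illustrative, not the Vancouver algorithm).

    grades[reviewer][response] is a score in [0.0, 1.0].
    Returns (consensus score per response, reliability weight per reviewer).
    """
    reviewers = list(grades)
    responses = sorted({resp for given in grades.values() for resp in given})
    weights = {r: 1.0 for r in reviewers}  # start with equal trust in everyone
    scores = {}
    for _ in range(iterations):
        # Step 1: consensus score = reliability-weighted mean of the grades.
        for resp in responses:
            pairs = [(weights[r], grades[r][resp])
                     for r in reviewers if resp in grades[r]]
            total = sum(w for w, _g in pairs)
            scores[resp] = sum(w * g for w, g in pairs) / total
        # Step 2: a reviewer's weight is the inverse of its mean squared
        # deviation from the current consensus (eps avoids division by zero).
        for r in reviewers:
            mse = fmean((g - scores[resp]) ** 2 for resp, g in grades[r].items())
            weights[r] = 1.0 / (mse + eps)
    return scores, weights


# Hypothetical grades: each of three models reviews the other two responses.
grades = {
    "model_a": {"resp_b": 0.9, "resp_c": 0.4},
    "model_b": {"resp_a": 0.7, "resp_c": 0.5},
    "model_c": {"resp_a": 0.8, "resp_b": 0.9},
}
scores, weights = consensus(grades)
best = max(scores, key=scores.get)  # response with the highest consensus score
print(best, scores, weights)
```

In this toy setup no model grades its own response, mirroring the summary's description of review by other LLMs; the reviewer weights play the role of the per-model reliability estimate.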
The CHEMISTRY-based multi-LLM collaboration strategy was rigorously evaluated across four key dimensions:
1. Efficiency: The CHEMISTRY ensemble demonstrated a significant efficiency advantage, producing recommendations in an average of 11 seconds. This was approximately nine times faster than the RANDOM (94.5 s) and REMOTE (97.2 s) strategies, and nearly 49 times faster than ensembles composed exclusively of LOCAL LLMs.
2. Effectiveness (Accuracy): The CHEMISTRY ensemble achieved an accuracy of 0.78, close to, though below, that of the REMOTE ensembles (0.84), and notably higher than the remaining sampling strategies (LOCAL and RANDOM).
3. Stability: The CHEMISTRY ensemble demonstrated stability comparable to multi-LLM ensembles composed of REMOTE models and significantly outperformed both LOCAL and RANDOM counterparts, with no observed execution failures, highlighting its consistent and dependable performance.
4. Calibration: The CHEMISTRY ensemble showed superior calibration with a mean variance of agreement of 0.05, which was substantially lower than REMOTE (0.11) and LOCAL (1.05) ensembles, indicating strong internal consensus and reduced noise among the collaborating models.
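The speed-up figures in the efficiency result can be checked with a few lines of arithmetic. The sketch below uses only the latencies reported in the summary; the LOCAL latency is not reported, so the value computed here is merely what "nearly 49 times" would imply and should be read as a rough upper estimate rather than a measured number. Variable names are illustrative.

```python
# Average latencies reported in the summary, in seconds.
chemistry_s = 11.0
reported_s = {"RANDOM": 94.5, "REMOTE": 97.2}

# Speed-up of CHEMISTRY relative to each strategy with a reported latency.
speedup = {name: t / chemistry_s for name, t in reported_s.items()}
for name, factor in speedup.items():
    print(f"{name}: {factor:.1f}x slower than CHEMISTRY")

# The LOCAL latency is not given; "nearly 49 times" implies roughly this bound.
implied_local_s = 49 * chemistry_s
print(f"LOCAL (implied, approximate): about {implied_local_s:.0f} s")
```

Both reported ratios fall a little under nine, which is consistent with the summary's "approximately nine times faster".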
This work demonstrated the feasibility and effectiveness of CHEMISTRY-based multi-LLM collaboration for generating reliable medication recommendations from brief clinical notes. The proposed approach yielded efficient, effective, stable, and calibrated multi-LLM collaboration, matching or outperforming the other sampling strategies (LOCAL, REMOTE, and RANDOM) on most of these reliability dimensions, with accuracy slightly below that of REMOTE. These findings suggest that LLM Chemistry-guided collaboration offers a promising and practical path towards more reliable and trustworthy AI assistants for critical healthcare tasks. Future work will extend the evaluation to larger, real-world datasets that include richer patient context (e.g., detailed notes, current medications, allergy information) and will incorporate retrieval-augmented generation (RAG) to improve grounding and transparency, with the aim of better generalizability and clinical safety.