AI Summary • Published on Nov 24, 2025
AI agents are becoming essential in enterprise workflows, and their reliance on shared tool libraries and pre-trained components introduces significant supply chain security vulnerabilities. Prior research has demonstrated behavioral backdoor detection within individual Large Language Model (LLM) architectures, but how these detection methods generalize across different LLMs has remained largely unexplored. This gap poses serious risks for organizations that deploy multiple AI systems, because a detector trained on a single model may fail to catch backdoors in agents built on a different LLM.
The researchers conducted the first systematic study of cross-LLM behavioral backdoor detection, evaluating how well detectors generalize across six diverse production LLMs (GPT-5.1, Claude Sonnet 4.5, Grok 4.1, Llama 4 Maverick, GPT-OSS 120B, and DeepSeek Chat V3.1). They generated 1,198 execution traces from 100 varied agent tasks and simulated two types of backdoors: data poisoning via few-shot examples and tool manipulation. A detection system was developed, consisting of trace collection, extraction of 51 behavioral features across temporal, sequence, action, and data-flow categories, and classification using Support Vector Machine (SVM) and Random Forest models. They performed 36 cross-model experiments to quantify the generalization gap and analyzed feature stability using the coefficient of variation (CV). Four detection strategies were compared: single-model, pooled training, ensemble voting, and model-aware detection, in which model identity is included as an additional feature.
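To make that pipeline concrete, here is a minimal sketch using scikit-learn, assuming a simple JSON-like trace schema: each execution trace is mapped to a small behavioral feature vector spanning temporal, sequence, action, and data-flow signals, the two classifier types named above are trained, and a helper computes the coefficient of variation used in the stability analysis. The trace fields and the handful of features are illustrative stand-ins for the paper's 51-feature set, not its actual implementation.

```python
# Hypothetical sketch of the trace -> features -> classifier pipeline.
# The trace schema and the handful of features below are illustrative
# stand-ins for the paper's 51 temporal / sequence / action / data-flow features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def extract_features(trace: dict) -> np.ndarray:
    """Map one agent execution trace to a behavioral feature vector."""
    steps = trace["steps"]  # assumed list of tool-call records
    latencies = np.array([s["latency_ms"] for s in steps], dtype=float)
    tools = [s["tool"] for s in steps]
    return np.array([
        latencies.mean(), latencies.std(), latencies.max(),   # temporal
        len(steps), len(set(tools)),                          # sequence
        sum(1 for s in steps if s.get("is_write", False)),    # action
        sum(len(str(s.get("output", ""))) for s in steps),    # data flow
    ], dtype=float)


def coefficient_of_variation(feature_values: np.ndarray) -> float:
    """CV = std / mean, used to gauge how stable a feature is across LLMs."""
    return float(feature_values.std() / (abs(feature_values.mean()) + 1e-9))


def train_detectors(traces: list, labels: list):
    """Train SVM and Random Forest detectors on labeled traces (1 = backdoored)."""
    X = np.stack([extract_features(t) for t in traces])
    y = np.array(labels)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    svm.fit(X_tr, y_tr)
    rf.fit(X_tr, y_tr)
    return svm, rf, (X_te, y_te)
```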
The study found a critical generalization gap: single-model detectors achieved 92.7% accuracy on the LLM distribution they were trained on, but only 49.2% when applied to different LLMs, a 43.4 percentage point drop that leaves them at roughly the level of random guessing. This failure was attributed to model-specific behavioral signatures, particularly in temporal features, which exhibited high variance (CV > 0.8) across LLM architectures. Each LLM expressed backdoor activity through distinct behavioral patterns, so a detector trained on one model's signature could not recognize another's. A simple deployment strategy, "model-aware detection," which incorporates the LLM's identity as an additional feature, achieved 90.6% universal accuracy across all evaluated models, demonstrating that while the generalization gap is severe, it can be effectively mitigated with appropriate multi-LLM training.
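The model-aware strategy can be sketched in a few lines, assuming behavioral feature vectors have already been extracted as above: the source LLM's identity is one-hot encoded and appended to each feature vector, and a single classifier is trained on traces pooled from all deployed models. The model list and encoding choice here are illustrative, not the paper's exact configuration.

```python
# Hypothetical sketch of model-aware detection: append a one-hot encoding of
# the source LLM's identity to each behavioral feature vector, then train one
# classifier on traces pooled from all deployed models.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative model list; any stable identifier per deployed LLM would do.
MODELS = [
    "gpt-5.1", "claude-sonnet-4.5", "grok-4.1",
    "llama-4-maverick", "gpt-oss-120b", "deepseek-chat-v3.1",
]


def model_aware_features(behavioral: np.ndarray, model_name: str) -> np.ndarray:
    """Concatenate behavioral features with a one-hot model-identity vector."""
    one_hot = np.zeros(len(MODELS))
    one_hot[MODELS.index(model_name)] = 1.0
    return np.concatenate([behavioral, one_hot])


def train_model_aware(X_behavioral: np.ndarray, labels: list, model_names: list):
    """X_behavioral: (n_traces, n_features) matrix of behavioral features."""
    X = np.stack([
        model_aware_features(x, m) for x, m in zip(X_behavioral, model_names)
    ])
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X, np.array(labels))
    return clf
```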
These findings have immediate practical implications. Organizations that use multiple LLMs cannot rely on single-model backdoor detectors and need a unified detection strategy. The model-aware approach offers a practical path forward for heterogeneous LLM deployments, though it requires representative training data for each deployed model and retraining after major model updates. For security-critical deployments, confidence thresholding and human review are recommended to manage false positives and negatives. The research also suggests that LLM providers consider publishing model-specific behavioral baselines and standardizing trace formats. Future work should focus on robustness against adaptive adversaries, few-shot adaptation to new LLMs, truly model-agnostic detection, and evaluating detection at enterprise scale.
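For the confidence-thresholding recommendation, one possible policy is sketched below (the threshold values are arbitrary placeholders, not figures from the paper): act automatically only on high-confidence predictions and escalate the uncertain band to human review.

```python
# Hypothetical confidence-thresholding policy for security-critical deployments:
# act automatically only on confident predictions, escalate the rest.
def route_prediction(p_backdoor: float,
                     flag_above: float = 0.9,
                     pass_below: float = 0.1) -> str:
    if p_backdoor >= flag_above:
        return "quarantine"    # high-confidence detection: isolate the agent/tool
    if p_backdoor <= pass_below:
        return "allow"         # high-confidence clean trace
    return "human_review"      # uncertain band: escalate for manual triage
```

In practice, p_backdoor would come from the detector's predicted probability for the backdoor class, and the two thresholds would be tuned to the false positive and false negative rates the deployment can tolerate.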