Problem
As Large Language Models (LLMs) are increasingly integrated into critical applications within organizations, ensuring their compliance with internal policies and external regulations has become paramount. Current content moderation and alignment methods, such as guardrails, rely on fixed categories or handcrafted rules and often lack the robustness and flexibility to handle nuanced organizational policies. Alternatives such as "LLM-as-a-judge" or fine-tuning, while more flexible, introduce significant latency, lack interpretability, and often require extensive computational resources and large curated datasets, hindering real-time deployment and adaptation to evolving policies.
Method
This work proposes a novel, training-free, and efficient method that frames policy violation detection as an out-of-distribution (OOD) detection problem within the activation space of LLMs. The core idea is that policy-compliant responses occupy a consistent region in the activation space, while violations represent deviations from this distribution. The method involves several key steps:
- Activation-level Policy Modeling: Policy adherence is cast as an OOD detection problem, where hidden activations from policy-compliant user-LLM interactions are modeled as an in-distribution manifold.
- Whitening Transform: A small set of in-policy samples is used to estimate the empirical mean and covariance of activations for each transformer layer. A linear whitening transformation (specifically, PCA-based whitening) is then applied to decorrelate the hidden activations, standardizing them to zero mean and unit variance so that the covariance matrix is approximately the identity. This creates a standardized activation space in which deviations are uniformly measurable (see the first sketch after this list).
- Compliance Scoring: In this whitened space, the Euclidean norm of the transformed activation vector serves as a compliance score. A lower norm indicates stronger in-policy conformity, and this norm equals the Mahalanobis distance computed in the raw activation space.
- Operational Layer Selection and Threshold Calibration: A separate, small mixed calibration set (containing both in- and out-of-policy samples) is used to select the operational layer, i.e., the layer with the highest ROC-AUC for separating compliant from violating responses, and to set a decision threshold that maximizes Youden's J statistic, balancing true- and false-positive rates (see the calibration sketch after this list).
- Runtime Detection: During deployment, the activation of each model response at the chosen operational layer is centered, whitened with the precomputed matrix, and scored; if the score exceeds the calibrated threshold, the response is flagged as a policy violation (see the runtime sketch below). This works both in white-box settings (direct activation access) and in black-box settings (using a surrogate model to provide activation proxies), adding negligible latency.
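The whitening and scoring steps amount to an empirical-covariance estimate followed by PCA whitening. The sketch below is illustrative rather than the authors' implementation: `acts` is assumed to be an (n_samples, hidden_dim) matrix of hidden activations collected at a single layer from policy-compliant interactions, and the function names, the optional top-k truncation, and the `eps` regularizer are hypothetical choices.

```python
import numpy as np

def fit_whitener(acts, k=None, eps=1e-6):
    """Estimate the mean and a PCA-whitening matrix from in-policy activations."""
    mu = acts.mean(axis=0)
    centered = acts - mu
    cov = np.cov(centered, rowvar=False)           # empirical covariance at this layer
    eigvals, eigvecs = np.linalg.eigh(cov)         # eigendecomposition (ascending order)
    order = np.argsort(eigvals)[::-1]              # reorder to descending eigenvalues
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    if k is not None:                              # optionally keep the top-k components
        eigvals, eigvecs = eigvals[:k], eigvecs[:, :k]
    W = eigvecs / np.sqrt(eigvals + eps)           # project onto principal axes, rescale to unit variance
    return mu, W

def compliance_score(x, mu, W):
    """Euclidean norm of the whitened activation; in the raw space this equals
    the Mahalanobis distance of x from the in-policy distribution."""
    z = (x - mu) @ W                               # center, then whiten
    return float(np.linalg.norm(z))
```

With k equal to the full hidden dimension (and up to the small `eps` regularizer), the squared score reduces exactly to the Mahalanobis quadratic form (x − μ)ᵀ Σ⁻¹ (x − μ).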
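Calibration then reduces to standard ROC analysis on the small mixed set. A minimal sketch under the same assumptions, using scikit-learn purely for illustration: `layer_scores` maps each layer index to the compliance scores of the calibration examples, and `labels` marks violations as 1 and compliant responses as 0.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def calibrate(layer_scores, labels):
    """Pick the layer with the highest ROC-AUC and a threshold maximizing Youden's J."""
    aucs = {layer: roc_auc_score(labels, scores) for layer, scores in layer_scores.items()}
    best_layer = max(aucs, key=aucs.get)                      # operational layer
    fpr, tpr, thresholds = roc_curve(labels, layer_scores[best_layer])
    threshold = thresholds[np.argmax(tpr - fpr)]              # Youden's J = TPR - FPR
    return best_layer, threshold
```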
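At inference time the check costs one centering, one matrix multiplication, and a comparison, consistent with the negligible latency reported in the results. A hypothetical wrapper reusing the helpers above (in white-box settings the activation would come from a forward hook; in black-box settings, from a surrogate model):

```python
def is_policy_violation(activation, mu, W, threshold):
    """Flag a response whose whitened-norm score exceeds the calibrated threshold."""
    return compliance_score(activation, mu, W) > threshold
```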
Results
The proposed method was extensively evaluated on the challenging DynaBench policy dataset, utilizing open-source models like Llama 3.1-8B and Qwen2.5-7B. Key findings include:
- The approach achieved state-of-the-art results, surpassing existing guardrails (LlamaGuard) and fine-tuned baselines (DynaGuard) by up to 9% in F1 score, and also outperforming LLM-as-a-judge models (GPT-4o-mini, Qwen3-8B).
- The approach proved robust and sample-efficient: performance remained stable across a wide range of retained principal components (k) and improved only marginally with larger calibration sets. For instance, 100 samples per category yielded an F1 score of 74.3%, rising only modestly to 77.7% with 750 samples.
- Layer-wise analysis revealed that policy-specific signals emerge at varying depths across different policy categories, with most clustering in mid-to-late layers, but some appearing earlier, justifying the per-layer guard selection strategy.
- The method introduced negligible latency: 0.03–0.05 seconds in white-box settings and under one second in black-box scenarios, making it highly suitable for real-time monitoring and large-scale deployment.
Implications
This work provides organizations with a practical, statistically grounded, and deployable framework for policy-aware oversight of LLMs, advancing the broader goal of AI governance. Because the method is training-free, efficient, and interpretable, it can be adapted rapidly to new policies, requiring only a small number of illustrative samples and no fine-tuning or external evaluators. Its modular design supports per-class guards, enabling continuous monitoring of score distributions and recalibration as policies evolve. Its lightweight footprint and flexibility across access regimes position it as a principled building block for trustworthy, policy-compliant LLM systems in sensitive domains.