All Tags
Browse through all available tags to find articles on topics that interest you.
Showing 1 result for this tag.
Training-Free Policy Violation Detection via Activation-Space Whitening in LLMs
This paper introduces a training-free method for detecting policy violations in Large Language Models by framing detection as an out-of-distribution problem in activation space. The approach whitens the model's activations and uses the Euclidean norm of the whitened activation as a compliance score. According to the paper, it outperforms existing guardrails and fine-tuned models while remaining interpretable and efficient.
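To make the idea of whitening followed by a Euclidean norm concrete, here is a minimal NumPy sketch. It is not the paper's implementation: the function names `fit_whitening` and `compliance_score`, the regularizer `eps`, the ZCA-style whitening matrix, and the synthetic "activations" are all assumptions made for illustration. The summary above does not say which layer the activations come from or how the score is thresholded, so both are left out.

```python
import numpy as np

def fit_whitening(acts, eps=1e-5):
    """Estimate the mean and a whitening matrix from reference activations.

    acts: array of shape (n_samples, dim). Hypothetical helper, not from the paper.
    """
    mu = acts.mean(axis=0)
    centered = acts - mu
    cov = centered.T @ centered / (len(acts) - 1)
    # The covariance is symmetric, so eigh gives real eigenvalues/eigenvectors.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # ZCA-style whitening (an assumed choice): W = V diag(1/sqrt(lambda + eps)) V^T
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return mu, W

def compliance_score(x, mu, W):
    """Euclidean norm of the whitened activation(s).

    A larger value means the input lies farther from the reference distribution.
    """
    z = (np.atleast_2d(x) - mu) @ W
    return np.linalg.norm(z, axis=1)

# Synthetic stand-in for hidden activations (assumption: 8 dimensions).
rng = np.random.default_rng(0)
scales = np.linspace(0.5, 4.0, 8)
reference = rng.normal(size=(2000, 8)) * scales
shifted = rng.normal(size=(200, 8)) * scales + 3.0 * scales

mu, W = fit_whitening(reference)
in_scores = compliance_score(reference, mu, W)
shifted_scores = compliance_score(shifted, mu, W)

print("mean score, reference:", in_scores.mean())
print("mean score, shifted:  ", shifted_scores.mean())
```

After whitening, the reference activations have roughly identity covariance, so their squared norm averages close to the dimensionality, and inputs drawn from a shifted distribution score noticeably higher. No training is involved beyond estimating a mean and a covariance, which is the sense in which such a score can be called training-free.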