AI Summary • Published on Jan 12, 2026
Modern AI systems often exhibit similar predictive performance despite employing vastly different internal decision-making processes. Such a group of equally performing models is known as a Rashomon set. Explainable artificial intelligence (XAI) aims to shed light on these internal workings and to help distinguish models on criteria beyond accuracy (e.g., fairness), but the explanations themselves can be inconsistent and require rigorous evaluation. Current evaluation methods, both ground-truth-based and sensitivity-based, frequently fail to capture genuine behavioral differences among models in a Rashomon set. They are also susceptible to "fairwashing" attacks, in which adversarial manipulations hide discriminatory model behavior behind misleading explanations, eroding trust and hindering responsible AI deployment.
To address the shortcomings of existing evaluation techniques, this paper introduces three principles for evaluating feature-importance explanations: local contextualization, model relativism, and on-manifold evaluation. Local contextualization requires that explanations be specific to the datapoint being explained. Model relativism requires that explanations reflect the internal mechanisms of each individual model, so that behavioral differences within a Rashomon set become visible. On-manifold evaluation requires that explanation quality be judged only from model behavior observable on the data manifold, disregarding off-manifold anomalies. Building on these principles, the authors propose AXE (Agnostic eXplanation Evaluation), a framework that assesses explanation quality by measuring the "predictiveness" of the top-N most important features. Concretely, AXE trains a separate k-Nearest Neighbors (k-NN) model for each datapoint, using only that datapoint's top-N features, to predict the original model's output rather than the true label. By construction this satisfies all three principles: scores are computed per datapoint, relative to each model's own predictions, and without off-manifold perturbations, so AXE neither relies on ground-truth explanations nor can be misled by off-manifold behavior.
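The per-datapoint k-NN construction described above lends itself to a short illustration. The Python sketch below is a minimal rendition of that idea, not the paper's reference implementation: the function name axe_score, the defaults for top_n and k, the leave-one-out split, and the use of the k-NN's predicted probability of the model's own output as the "predictiveness" score are all assumptions made here for clarity.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def axe_score(model_predict, X, explanations, top_n=3, k=5):
    """Sketch: per-datapoint k-NN 'predictiveness' of the top-N features,
    scored against the model's own outputs rather than the true labels."""
    X = np.asarray(X)
    explanations = np.asarray(explanations)      # shape (n_points, n_features)
    y_model = np.asarray(model_predict(X))       # labels assigned by the model, not ground truth
    n = len(X)
    scores = []
    for i in range(n):
        top = np.argsort(-np.abs(explanations[i]))[:top_n]  # this point's top-N features
        X_sub = X[:, top]
        mask = np.arange(n) != i                             # leave the explained point out
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_sub[mask], y_model[mask])
        proba = knn.predict_proba(X_sub[i:i + 1])[0]
        # probability the k-NN assigns to the model's own prediction at point i
        match = np.flatnonzero(knn.classes_ == y_model[i])
        scores.append(proba[match[0]] if match.size else 0.0)
    return float(np.mean(scores))
```

Scoring against the model's predictions rather than the true labels is what makes the measure model-relative: two Rashomon-set models with different internal mechanisms yield different target labels for the k-NN, and hence different scores for the same explanations.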
The evaluation showed that conventional ground-truth-based metrics, such as Feature Agreement and Rank Agreement, often violate model relativism and local contextualization. These metrics can assign identical quality scores to explanations from models within a Rashomon set even when the underlying decision processes differ substantially, obscuring crucial behavioral distinctions. Critically, in experiments simulating adversarial fairwashing attacks, in which discriminatory models are manipulated to produce seemingly benign explanations, existing sensitivity-based metrics such as PGI and PGU failed to identify the true discriminatory behavior in 50% of cases. In contrast, AXE achieved a 100% success rate across real-world datasets, including German Credit, COMPAS, and Communities and Crime, and against models designed to fool prominent explainers such as SHAP and LIME. AXE can therefore pinpoint when protected attributes are used for predictions, even under sophisticated adversarial manipulation.
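The summary does not spell out the exact detection protocol for the fairwashing experiments; a natural reading is that a metric catches the attack when it scores the honest explanations, which expose reliance on the protected attribute, above the manipulated ones. The sketch below encodes that check under this assumption; the function detects_fairwashing and all argument names are hypothetical, and axe_score refers to the sketch above.

```python
def detects_fairwashing(metric, model_predict, X, honest_expl, fairwashed_expl):
    """A metric 'catches' the attack if it rates the honest explanations
    (which expose reliance on the protected attribute) above the
    manipulated, seemingly benign ones."""
    return metric(model_predict, X, honest_expl) > metric(model_predict, X, fairwashed_expl)

# Hypothetical wiring with the axe_score sketch above:
# caught = detects_fairwashing(axe_score, model.predict, X_test,
#                              honest_shap_values, fairwashed_shap_values)
```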
AXE and its underlying principles mark a significant step toward rigorous evaluation of XAI methods. By providing a framework that can differentiate models within a Rashomon set and reliably detect adversarial fairwashing, this work strengthens the trustworthiness and accountability of AI systems. It helps practitioners select models for deployment not only on predictive accuracy but also on the transparency and fairness of their decision-making. AXE's ability to expose attempts to mask discriminatory behavior is crucial for upholding fairness, promoting responsible innovation, and fostering public confidence in AI technologies.