AI Summary • Published on Feb 23, 2026
Vision-Language Models (VLMs) like CLIP have become highly versatile for multimodal tasks, yet their internal mechanisms often remain opaque. This opacity raises critical questions about the nature of machine "understanding," especially when these models are applied to fields rich in historical and semantic meaning, such as art history. Artworks are interpreted through complex cultural conventions, not merely as labels, raising concerns about biases encoded during training on large, uncurated datasets. The central problem addressed is whether Explainable Artificial Intelligence (XAI) methods can render CLIP's visual reasoning legible to human interpreters and thereby improve the methodological robustness of VLMs in art-historical analysis.
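To make the zero-shot setup concrete, the following is a minimal, hypothetical sketch of the scoring step CLIP performs: normalize the image and text embeddings, compare them by cosine similarity, and softmax over the class prompts. The embeddings and the `zero_shot_scores` helper are placeholders for illustration, not the paper's code; real CLIP embeddings are 512-dimensional or larger.

```python
import numpy as np

def zero_shot_scores(image_emb, text_embs, temperature=100.0):
    """Score one image against class-prompt embeddings by cosine similarity,
    then softmax over prompts -- the core of CLIP's zero-shot classification."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * (txt @ img)   # one cosine similarity per prompt
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

# Toy 4-dim embeddings standing in for real CLIP vectors.
image = np.array([1.0, 0.2, 0.0, 0.1])
prompts = np.array([
    [0.9, 0.25, 0.05, 0.1],   # e.g. "a painting of a snake" (near the image)
    [0.0, 1.0, 0.0, 0.0],     # e.g. "a painting of a sphinx"
    [0.0, 0.0, 1.0, 0.5],     # e.g. "a painting of a saint"
])
probs = zero_shot_scores(image, prompts)
print(probs.argmax())  # → 0: the prompt nearest the image wins
```

No fine-tuning is involved: class labels are expressed as text prompts, and classification reduces to nearest-neighbor search in the shared embedding space.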
The research employed a two-stage evaluation of XAI methods. First, a quantitative case study assessed the zero-shot localization accuracy of seven XAI techniques: Grad-CAM, Grad-CAM++, LayerCAM, LeGrad, ScoreCAM, gScoreCAM, and CLIP Surgery. It used two art-historical datasets, IconArt and ArtDL, comprising nearly 2,000 images, to determine how effectively each method could identify and delineate objects in domain-specific imagery without fine-tuning. Localization accuracy was measured as box accuracy (BoxAcc) at several Intersection over Union (IoU) thresholds. Second, an online human interpretability study involved 33 participants trained in art history. They annotated regions of relevance in artworks for specified classes and then ranked the saliency maps generated by the same seven XAI methods by how well the maps aligned with their own visual judgments. This aimed to reveal how human users perceive and interpret these visual explanations.
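The BoxAcc metric can be sketched as follows: a predicted box counts as a hit when its IoU with the ground-truth box reaches the threshold, and BoxAcc is the hit rate over the dataset. This is a simplified illustration (one box per image); the actual benchmark also specifies how a box is extracted from each saliency map, which is omitted here.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def box_acc(pred_boxes, gt_boxes, threshold=0.30):
    """Fraction of images whose predicted box reaches the IoU threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(pred_boxes, gt_boxes))
    return hits / len(gt_boxes)

preds = [(10, 10, 50, 50), (0, 0, 20, 20)]
gts   = [(12, 12, 55, 55), (60, 60, 90, 90)]
print(box_acc(preds, gts, threshold=0.30))  # → 0.5: one hit, one miss
```

Lower thresholds such as 0.30 reward rough localization; raising the threshold tightens the requirement on box placement, which is why accuracy is reported across several IoU values.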
Quantitatively, CLIP Surgery consistently demonstrated superior localization accuracy across both the IconArt and ArtDL datasets, particularly at an IoU threshold of 0.30, with LeGrad performing as the second-best method. CLIP Surgery showed higher accuracy for small, medium, and large objects. Differences in performance between the datasets were attributed to factors like the higher proportion of small objects and more historically charged motifs in IconArt compared to the broader, more generic categories in ArtDL. Qualitatively, the human interpretability study revealed that participants generally preferred CLIP Surgery, LeGrad, and ScoreCAM, ranking their saliency maps as most closely aligned with their own annotations. Inter-rater agreement was high for visually well-defined and spatially localized targets (e.g., "snake") but significantly lower for more diffuse or abstract categories (e.g., "lustful" or "Sphinx"), indicating the intrinsic ambiguity of higher-order visual concepts.
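The summary does not name the agreement statistic used for the ranking study. One standard choice for full rankings from multiple raters is Kendall's coefficient of concordance W, sketched below with hypothetical raters and four methods; the data are illustrative, not the study's.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance for m raters ranking n items.
    rankings[r][i] is rater r's rank (1..n) for item i; W = 1.0 means
    all raters produced the identical ranking, W near 0 means no agreement."""
    m, n = len(rankings), len(rankings[0])
    totals = [sum(r[i] for r in rankings) for i in range(n)]
    mean = sum(totals) / n
    s = sum((t - mean) ** 2 for t in totals)  # spread of summed ranks
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Three hypothetical raters ranking four saliency methods (1 = best).
agree = [[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]  # a "snake"-like case
mixed = [[1, 2, 3, 4], [4, 3, 2, 1], [2, 1, 4, 3]]  # a "lustful"-like case
print(kendalls_w(agree))  # → 1.0
print(kendalls_w(mixed))  # well below 1.0
```

The contrast mirrors the study's finding: agreement is high when the target is visually well defined and drops sharply for diffuse or abstract categories.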
The study highlights that the effectiveness of XAI methods in art history is contingent on the conceptual stability and representational availability of the examined categories within the VLM's latent space. While methods like CLIP Surgery can localize visually distinct concepts, their accuracy diminishes for semantically complex or ambiguous motifs, suggesting that the model perceives statistical residues rather than historical context. Saliency maps can reproduce aspects of perceptual attention but do not fully replicate the interpretive depth of the art-historical gaze. The visual legibility of XAI outputs can be deceptive, as it exposes internal dynamics but conceals the cultural and linguistic priors that give activations meaning. Therefore, XAI in digital art history should foster a critical dialogue between human and machine vision, treating explanations as prompts for further hermeneutic inquiry rather than definitive statements. Additionally, computational efficiency considerations, such as the number of forward passes required, impact the practicality of XAI methods for real-time applications.
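The forward-pass point can be illustrated with a toy cost model. The counts below are assumptions about these method families, not figures from the paper: gradient-based CAMs and CLIP Surgery typically need a single pass per image, ScoreCAM re-scores one masked input per activation channel, and gScoreCAM restricts scoring to a gradient-selected top-k subset of channels.

```python
# Toy per-image cost model (illustrative assumptions, not measured numbers).
GRADIENT_BASED = {"Grad-CAM", "Grad-CAM++", "LayerCAM", "LeGrad", "CLIP Surgery"}

def forward_passes(method, num_channels=512, top_k=100):
    """Approximate forward passes needed to explain one image."""
    if method in GRADIENT_BASED:
        return 1                  # one forward (plus backward) pass
    if method == "ScoreCAM":
        return num_channels       # one masked-input pass per channel
    if method == "gScoreCAM":
        return top_k              # only the top-k gradient-ranked channels
    raise ValueError(f"unknown method: {method}")

print(forward_passes("ScoreCAM"))  # → 512
print(forward_passes("Grad-CAM"))  # → 1
```

Under this model, perturbation-style methods cost two to three orders of magnitude more inference than gradient-based ones, which is what limits their use in interactive or large-scale settings.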