AI Summary • Published on Mar 4, 2026
Deep Convolutional Neural Networks (CNNs), despite their high performance in computer vision, suffer from a lack of transparency in their decision-making processes. This opacity is a significant concern in critical applications like medical diagnosis and autonomous driving, where understanding model predictions is crucial. Explainable AI (XAI) addresses this by making models interpretable, with Class Activation Map (CAM) methods being widely used to visualize influential input regions. However, existing CAM approaches face trade-offs: gradient-based methods like Grad-CAM offer fine-grained, discriminative details but are often noisy and highlight only salient parts, missing complete object coverage. Conversely, region-based methods such as Score-CAM provide broader object coverage but can be over-smoothed and less sensitive to subtle features. Ensemble methods that combine these often have limitations like suppressing relevant activations or using fixed, non-adaptive merging rules, leading to incomplete or less precise explanations.
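To ground the gradient-based paradigm discussed above: Grad-CAM weights each convolutional feature map by the global-average-pooled gradient of the target class score, sums the weighted maps, and applies a ReLU. A minimal NumPy sketch (in practice the activation and gradient arrays would come from a real network's forward and backward passes):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Gradient-based CAM (Grad-CAM): weight each feature map by the
    global-average-pooled gradient of the class score, sum, then ReLU.

    activations, gradients: arrays of shape (K, H, W) taken from the
    last convolutional layer for the target class.
    """
    weights = gradients.mean(axis=(1, 2))             # alpha_k, shape (K,)
    cam = np.einsum('k,khw->hw', weights, activations)
    cam = np.maximum(cam, 0.0)                        # ReLU keeps positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                         # normalize to [0, 1]
    return cam
```

The ReLU is what makes the map class-discriminative, keeping only regions whose increased activation raises the class score; it is also why such maps highlight salient parts while dropping weakly supporting context.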
Fusion-CAM proposes a novel three-step framework to integrate the complementary strengths of gradient-based and region-based Class Activation Maps (CAMs).

The process begins with Gradient-Based CAM Denoising, which filters low-activation noise out of gradient-based maps (e.g., Grad-CAM) using a percentile threshold, empirically found to work best between 10% and 20%. This step yields cleaner, more focused maps while preserving the discriminative activations.

Next comes the Combination of Denoised Gradient and Region-Based CAMs: the refined gradient-based map is linearly merged with a region-based map (e.g., Score-CAM). The aggregation is guided by contribution weights, determined by applying each map as a spatial mask on the input image and measuring its impact on the model's class score relative to a neutral baseline. This step enriches region-level maps with fine-grained detail and restores the target object's full spatial extent.

The final and core innovation is CAMs Fusion, an adaptive, pixel-level, similarity-based mechanism. The combined map and the region-based map are re-weighted using updated contribution scores, and a pixel-wise similarity measure quantifies the agreement between the two weighted maps. Where similarity is high, the maximum activation is kept to reinforce consistent evidence; where the maps conflict, a soft average is computed instead. This adaptive strategy ensures that Fusion-CAM produces activation maps that are both spatially coherent and highly class-discriminative, preserving complementary information while emphasizing reliable activations.
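The three steps can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the maps are normalized to [0, 1], the contribution weights would in practice come from masked forward passes through the model (they are passed in here), and the similarity measure and its threshold `sim_thr` are illustrative placeholders rather than the paper's exact definitions:

```python
import numpy as np

def denoise(cam, pct=15.0):
    """Step 1: zero out activations below the pct-th percentile
    (the summary reports 10-20% as the effective range)."""
    thr = np.percentile(cam, pct)
    return np.where(cam >= thr, cam, 0.0)

def combine(grad_map, region_map, w_grad, w_region):
    """Step 2: contribution-weighted linear merge. The weights are the
    masked-input class-score gains over a neutral baseline, computed
    with the model elsewhere and supplied here."""
    total = w_grad + w_region
    if total == 0:
        return 0.5 * (grad_map + region_map)
    return (w_grad * grad_map + w_region * region_map) / total

def fuse(combined, region_map, w_comb, w_region, sim_thr=0.8):
    """Step 3: pixel-wise, similarity-gated fusion: take the maximum
    where the re-weighted maps agree, a soft average where they conflict."""
    a = w_comb * combined
    b = w_region * region_map
    # Illustrative similarity: 1 minus the normalized absolute difference.
    denom = np.maximum(np.maximum(a, b), 1e-8)
    sim = 1.0 - np.abs(a - b) / denom
    fused = np.where(sim >= sim_thr, np.maximum(a, b), 0.5 * (a + b))
    if fused.max() > 0:
        fused = fused / fused.max()
    return fused
```

A full pipeline would call `denoise` on the Grad-CAM output, `combine` it with the Score-CAM output, then `fuse` the result with the region-based map after re-computing the contribution weights.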
Fusion-CAM's performance was rigorously evaluated on general-purpose datasets (ImageNet, PASCAL VOC) and specialized plant disease detection datasets, using various CNN architectures including VGG16, ResNet50, and MobileNet. The method consistently outperformed existing Class Activation Map (CAM) variants, such as Grad-CAM, Score-CAM, and Union-CAM, both qualitatively and quantitatively. Qualitatively, Fusion-CAM generated more precise, well-localized, and complete visual explanations, effectively capturing entire objects, fine-grained details in complex scenarios like plant diseases, and accurately localizing multiple instances. Quantitatively, it demonstrated superior faithfulness to the CNN's decision-making process. Fusion-CAM achieved the lowest Average Drop (AD) and highest Average Increase (AI) in confidence across all tested datasets, with notable results such as an AD of 13.25% and AI of 42.25% on ILSVRC 2012, and an AD of 6.17% and AI of 12.80% in plant disease detection. Furthermore, it attained the highest overall scores on the deletion and insertion metrics, indicating strong alignment between the highlighted regions and the model's predictions. Ablation studies confirmed that each stage of the Fusion-CAM pipeline—denoising, confidence-weighted aggregation, and similarity-aware pixel blending—contributes cumulatively to its performance. While computationally more intensive than single-paradigm methods, Fusion-CAM offers a superior balance between explanation quality and generation time compared to other ensemble approaches.
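The Average Drop and Average Increase metrics cited above follow their standard definitions: the mean relative drop in class confidence when the model sees only the explanation-masked input, and the percentage of images whose confidence actually rises under the mask. A small sketch, assuming the original and masked class scores have been collected beforehand:

```python
import numpy as np

def average_drop_increase(orig_scores, masked_scores):
    """Average Drop (AD): mean relative confidence drop when the model
    sees only the CAM-highlighted region (lower is better).
    Average Increase (AI): share of images whose confidence rises under
    the mask (higher is better). Both returned as percentages."""
    y = np.asarray(orig_scores, dtype=float)   # scores on full images
    o = np.asarray(masked_scores, dtype=float) # scores on masked images
    ad = 100.0 * np.mean(np.maximum(0.0, y - o) / y)
    ai = 100.0 * np.mean(o > y)
    return ad, ai
```

The deletion and insertion metrics are computed analogously but over a sequence of masks, tracking the class score as the most-activated pixels are progressively removed from (or added to) the input and measuring the area under that curve.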
Fusion-CAM offers a robust and effective solution for enhancing the interpretability of deep convolutional neural networks. By generating saliency maps that are both highly discriminative and provide comprehensive contextual coverage, it allows users to better understand the rationale behind model predictions. This improved transparency is particularly critical for deploying deep learning models in safety-critical domains. The novel fusion paradigm introduced by Fusion-CAM holds significant promise for adaptation to emerging deep learning architectures, such as Vision Transformers, where understanding complex decision processes is paramount. Ultimately, by providing more faithful and complete visual explanations, Fusion-CAM contributes to building greater trust and confidence in the responsible and effective deployment of AI systems in real-world applications.