AI Summary • Published on Mar 12, 2026
The proliferation of Artificial Intelligence Generated Content (AIGC) has made image manipulation increasingly accessible, posing significant challenges for image forgery detection and localization (IFDL). Current approaches built on Vision-Language Models (VLMs) often suffer from inherent biases: these models are typically pretrained on data that emphasizes semantic plausibility rather than authenticity, which can degrade IFDL performance and hinder the interpretability of results. The main challenges for IFDL are the joint use of high-level semantics and low-level artifacts, generalization across diverse forgery types, and the explainability and trustworthiness of detection and localization outcomes. Existing VLM-based methods such as SIDA and FakeShield integrate components like CLIP, LLMs, and SAM, but their pretraining does not inherently equip them with forgery-specific concepts, so applying them directly is problematic: they remain insensitive to subtle visual inconsistencies.
To address the limitations of existing VLM-based IFDL methods, the authors propose IFDL-VLM, a novel two-stage framework that decouples detection and localization from language explanation generation. In Stage 1, a Vision Transformer (ViT) encoder is jointly trained with SAM to perform image forgery detection and localization. This stage focuses on learning forgery-specific cues, where a global CLS token is used for classification and a SEG token guides SAM to generate precise localization masks. Unlike prior work, this stage develops an expert ViT model tailored for IFDL, mitigating the VLM’s biases. In Stage 2, the derived detection and localization results from Stage 1, particularly the localization masks, are used as auxiliary inputs to fine-tune the VLM. These masks explicitly encode forgery-related concepts, which eases the VLM’s optimization process for generating language explanations and significantly enhances interpretability. This region-aware visual feature enhancement strategy enriches the VLM’s visual tokens with low-level cues from forged regions, thereby improving the distinction between authentic and manipulated image representations.
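The region-aware visual feature enhancement in Stage 2 can be illustrated with a short sketch. Note that the function name `enhance_visual_tokens`, the additive-injection form, and the `alpha` hyperparameter are assumptions made for illustration; the paper's actual implementation details are not reproduced here.

```python
import numpy as np

def enhance_visual_tokens(tokens, mask, alpha=0.5):
    """Hypothetical sketch of region-aware feature enhancement:
    inject a low-level forgery cue into patch tokens that fall
    inside the Stage-1 localization mask.

    tokens : (H, W, D) grid of patch features from the vision encoder
    mask   : (H, W) binary localization mask from the Stage-1 expert model
    alpha  : assumed hyperparameter controlling enhancement strength
    """
    mask = mask[..., None].astype(tokens.dtype)        # (H, W, 1)
    denom = mask.sum()
    if denom == 0:                                     # authentic image: leave tokens untouched
        return tokens
    # Mean feature over the predicted forged region serves as a region-level cue.
    region_cue = (tokens * mask).sum(axis=(0, 1)) / denom
    # Add the cue only to tokens inside the predicted forged region,
    # sharpening the contrast between authentic and manipulated representations.
    return tokens + alpha * mask * region_cue
```

In the actual framework the enhanced tokens feed the VLM's language component during fine-tuning; this sketch only shows the masking arithmetic that enriches forged-region tokens while leaving authentic regions unchanged.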
The IFDL-VLM framework was evaluated on 9 popular benchmarks, setting a new state of the art in detection, localization, and interpretability. For detection, the method achieved 99.7% accuracy and a 99.8% F1 score on the SID-Set. For localization, IFDL-VLM obtained 0.65 IoU on the SID-Set, a substantial 21% absolute improvement over SIDA, and an average IoU of 0.47 across 8 diverse datasets, outperforming FakeShield by 13%. Interpretability was also significantly higher, with an overall GPT-5 score of 2.44, a 0.77 improvement over SIDA. A user study corroborated these findings: 65.2% of human evaluators preferred IFDL-VLM's explanations over SIDA-13B's. The model also led on Cosine Semantic Similarity (CSS), with a weighted CSS score of 0.62, an 8.8% increase over SIDA-13B, and consistently outperformed baselines on standard natural language generation metrics such as BLEU-1, ROUGE-L, METEOR, and CIDEr.
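For readers unfamiliar with the quantitative metrics above, the localization IoU and the detection accuracy/F1 are standard binary measures. A minimal sketch (helper names `mask_iou` and `detection_metrics` are illustrative, not from the paper):

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-Union between two binary localization masks."""
    pred = np.asarray(pred, bool)
    gt = np.asarray(gt, bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    # Convention: two empty masks (authentic image, no prediction) agree perfectly.
    return inter / union if union else 1.0

def detection_metrics(y_true, y_pred):
    """Accuracy and F1 for binary forged/authentic labels."""
    y_true = np.asarray(y_true, bool)
    y_pred = np.asarray(y_pred, bool)
    tp = np.sum(y_true & y_pred)
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    acc = float(np.mean(y_true == y_pred))
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 1.0
    return acc, f1
```

Dataset-level scores such as the 0.47 average IoU are then simple means of these per-image values over each benchmark.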
This research shows how Vision-Language Models can be leveraged effectively for image forgery detection and localization by systematically addressing their inherent bias toward semantic plausibility. The proposed IFDL-VLM framework highlights the value of decoupling the core detection and localization tasks from VLM-based explanation generation. By explicitly injecting forgery-related concepts through localization masks, the framework eases VLM optimization and improves interpretability. The state-of-the-art performance in detection, localization, and interpretability, coupled with strong generalization across diverse datasets, positions IFDL-VLM as a robust solution for combating image manipulation in the era of AIGC. Furthermore, the Stage-2 LLM acts as a secondary verification mechanism: it is robust to imprecise localization masks and mitigates the risk of hallucinated explanations, thereby increasing the trustworthiness of the generated analyses.