All Tags
Browse through all available tags to find articles on topics that interest you.
Showing 8 results for this tag.
Rethinking VLMs for Image Forgery Detection and Localization
This paper investigates how to effectively utilize Vision-Language Models (VLMs) for image forgery detection and localization (IFDL). It introduces IFDL-VLM, a novel two-stage pipeline that decouples the core IFDL task from VLM-based explanation generation and leverages localization masks to significantly enhance VLM interpretability, achieving state-of-the-art results across multiple benchmarks.
On the Explainability of Vision-Language Models in Art History
This paper investigates the effectiveness of Explainable Artificial Intelligence (XAI) methods in making Vision-Language Models (VLMs), specifically CLIP, interpretable within art-historical contexts. It evaluates seven XAI methods through zero-shot localization experiments and human interpretability studies, concluding that their effectiveness depends on the conceptual stability and representational availability of the examined categories.
Chameleon: Adaptive Adversarial Agents for Scaling-Based Visual Prompt Injection in Multimodal AI Systems
This paper introduces Chameleon, an adaptive adversarial framework that exploits image downscaling vulnerabilities in Vision-Language Models (VLMs) to inject hidden malicious visual prompts. By employing an iterative, feedback-driven optimization mechanism, Chameleon can craft imperceptible perturbations that hijack VLM execution and compromise agentic decision-making systems.
MemLoRA: Distilling Expert Adapters for On-Device Memory Systems
This paper introduces MemLoRA, a novel memory system that enables efficient, on-device deployment of memory-augmented language models by equipping small models with specialized LoRA adapters. It also presents MemLoRA-V, extending these capabilities to multimodal contexts with native visual understanding.
STARE-VLA: Progressive Stage-Aware Reinforcement for Fine-Tuning Vision-Language-Action Models
This paper introduces Stage-Aware Reinforcement (StARe), a novel module that decomposes long-horizon robotic manipulation tasks into semantically meaningful stages, providing dense, interpretable reinforcement signals. Integrated into the Imitation → Preference → Interaction (IPI) fine-tuning pipeline, StARe significantly improves the performance and robustness of Vision-Language-Action (VLA) models on complex manipulation tasks.
SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL
This paper introduces SpaceTools, a vision-language model trained with Double Interactive Reinforcement Learning (DIRL) to achieve precise spatial reasoning and real-world robot manipulation by effectively coordinating multiple external tools. It demonstrates state-of-the-art performance on various spatial understanding benchmarks.
AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
AdaptVision introduces an efficient VLM paradigm in which the model autonomously determines the minimum number of visual tokens required for each sample through a coarse-to-fine visual acquisition strategy, delivering superior performance with significantly reduced computational overhead.
DeepSeek-OCR: Contexts Optical Compression
DeepSeek-OCR explores the innovative approach of compressing long text contexts into visual tokens to overcome computational challenges in Large Language Models. This vision-language model, featuring a novel DeepEncoder and DeepSeek3B-MoE decoder, demonstrates impressive OCR precision with significant compression ratios and state-of-the-art performance on document parsing benchmarks.