All Tags
Browse through all available tags to find articles on topics that interest you.
Showing 6 results for this tag.
STARE-VLA: Progressive Stage-Aware Reinforcement for Fine-Tuning Vision-Language-Action Models
This paper introduces Stage-Aware Reinforcement (StARe), a novel module that decomposes long-horizon robotic manipulation tasks into semantically meaningful stages, providing dense, interpretable reinforcement signals. Integrated into the Imitation → Preference → Interaction (IPI) fine-tuning pipeline, StARe significantly improves the performance and robustness of Vision-Language-Action (VLA) models on complex manipulation tasks.
MemLoRA: Distilling Expert Adapters for On-Device Memory Systems
This paper introduces MemLoRA, a novel memory system that enables efficient, on-device deployment of memory-augmented language models by equipping small models with specialized LoRA adapters. It also presents MemLoRA-V, extending these capabilities to multimodal contexts with native visual understanding.
Chameleon: Adaptive Adversarial Agents for Scaling-Based Visual Prompt Injection in Multimodal AI Systems
This paper introduces Chameleon, an adaptive adversarial framework that exploits image downscaling vulnerabilities in Vision-Language Models (VLMs) to inject hidden malicious visual prompts. By employing an iterative, feedback-driven optimization mechanism, Chameleon can craft imperceptible perturbations that hijack VLM execution and compromise agentic decision-making systems.
SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL
This paper introduces SpaceTools, a vision-language model trained with Double Interactive Reinforcement Learning (DIRL) to achieve precise spatial reasoning and real-world robot manipulation by effectively coordinating multiple external tools. It demonstrates state-of-the-art performance on various spatial understanding benchmarks.
AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
AdaptVision introduces an efficient VLM paradigm that autonomously determines the minimum number of visual tokens required for each sample through a coarse-to-fine visual acquisition strategy, achieving superior performance with significantly reduced computational overhead.
DeepSeek-OCR: Contexts Optical Compression
DeepSeek-OCR explores compressing long text contexts into visual tokens to reduce the computational cost of long-context processing in Large Language Models. This vision-language model, built around a novel DeepEncoder and a DeepSeek3B-MoE decoder, achieves high OCR precision at significant compression ratios and state-of-the-art performance on document parsing benchmarks.