All Tags
Browse through all available tags to find articles on topics that interest you.
Showing 6 results for this tag.
STARE-VLA: Progressive Stage-Aware Reinforcement for Fine-Tuning Vision-Language-Action Models
This paper introduces Stage-Aware Reinforcement (StARe), a novel module that decomposes long-horizon robotic manipulation tasks into semantically meaningful stages, providing dense, interpretable reinforcement signals. Integrated into the Imitation → Preference → Interaction (IPI) fine-tuning pipeline, StARe significantly improves the performance and robustness of Vision-Language-Action (VLA) models on complex manipulation tasks.
MemLoRA: Distilling Expert Adapters for On-Device Memory Systems
This paper introduces MemLoRA, a novel memory system that enables efficient, on-device deployment of memory-augmented language models by equipping small models with specialized LoRA adapters. It also presents MemLoRA-V, extending these capabilities to multimodal contexts with native visual understanding.
Chameleon: Adaptive Adversarial Agents for Scaling-Based Visual Prompt Injection in Multimodal AI Systems
This paper introduces Chameleon, an adaptive adversarial framework that exploits image downscaling vulnerabilities in Vision-Language Models (VLMs) to inject hidden malicious visual prompts. By employing an iterative, feedback-driven optimization mechanism, Chameleon can craft imperceptible perturbations that hijack VLM execution and compromise agentic decision-making systems.
SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL
This paper introduces SpaceTools, a vision-language model trained with Double Interactive Reinforcement Learning (DIRL) to achieve precise spatial reasoning and real-world robot manipulation by effectively coordinating multiple external tools. It demonstrates state-of-the-art performance on various spatial understanding benchmarks.
AdaptVision: Efficient Vision-Language Models via Adaptive Visual Acquisition
AdaptVision introduces an efficient VLM paradigm that autonomously determines the minimum number of visual tokens required for each sample through a coarse-to-fine visual acquisition strategy, achieving superior performance with significantly reduced computational overhead.
DeepSeek-OCR: Contexts Optical Compression
DeepSeek-OCR explores compressing long text contexts into visual tokens to reduce the computational cost of long-context processing in Large Language Models. This vision-language model, built around a novel DeepEncoder and a DeepSeek3B-MoE decoder, achieves high OCR precision at significant compression ratios and state-of-the-art performance on document parsing benchmarks.