AI Summary • Published on Apr 13, 2026
The rapid advancement of Vision-Language Models (VLMs) has significantly expanded their capabilities while creating a much larger and less constrained attack surface for adversarial exploits. Existing multimodal jailbreak techniques focus primarily on superficial pixel manipulations, typographic embeddings, or overtly harmful images; they largely overlook the complex semantic structures inherent in natural visual data, leaving deep-seated semantic vulnerabilities within unmodified images largely unexamined. Current VLM safety mechanisms are optimized predominantly for detecting explicit textual maliciousness or clear perceptual perturbations, making them ineffective against attacks that use semantic camouflage and disperse malicious intent across multimodal reasoning chains; as a result, models generate policy-violating content while misclassifying it as legitimate visual analysis. Furthermore, existing attack paradigms suffer from static heuristics, stateless execution (no persistent memory for strategy refinement), and latent-blind prompting (ignoring the model's internal safety representations), often leading to premature refusals and inefficient attacks.
To address these limitations, the authors introduce MemJack, a Memory-augmented multi-agent Jailbreak Attack framework. MemJack operates through a coordinated three-stage pipeline driven by three agents and supported by two persistent memory modules:
Strategic Planning Agent: This agent analyzes an image to identify exploitable visual anchors and maps them to malicious attack goals aligned with the victim model's safety-policy categories. It inspects images through four threat priority levels and applies a realism constraint.
Iterative Attack Agent: Using the identified anchors and goals, this agent generates adversarial prompts through six complementary attack angles (e.g., Visual Intuitive Association, Hypothetical Reasoning, Practical Knowledge Exploration) and employs Monte Carlo Tree Search for refinement. Crucially, a Null-Space Semantic Filter, based on Iterative Nullspace Projection (INLP), pre-screens candidate prompts against the model's internal safety latent space to minimize premature refusals.
Evaluation & Feedback Agent: This agent closes the loop by using a Safety Guard (Qwen3Guard-Gen) to evaluate the victim VLM's response and assign a continuous risk score. If an attack fails, a Reflection module diagnoses the defense pattern, recommends angle adjustments, and generates corrected prompts, especially for "near-miss" cases. If all angles under an anchor are exhausted, control returns to the Strategic Planning Agent for replanning with a new anchor.
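The Null-Space Semantic Filter mentioned above is based on Iterative Nullspace Projection (INLP), which learns linear directions that predict a property (here, triggering a refusal) and projects them out. The following is a minimal sketch of the core projection and pre-screening step only, under stated assumptions: `W` stands in for the learned refusal-classifier weight rows, `embed` is a caller-supplied prompt encoder, and the `threshold` is a hypothetical tuning knob; the paper's actual iterative training loop and latent-space details are not reproduced here.

```python
import numpy as np

def nullspace_projection(W: np.ndarray) -> np.ndarray:
    """Projection matrix onto the nullspace of the rows of W
    (the linear directions a classifier uses to predict refusal)."""
    # Orthonormal basis for the row space of W via SVD
    _, s, Vt = np.linalg.svd(W, full_matrices=False)
    rank = int(np.sum(s > 1e-10))
    B = Vt[:rank]                        # basis of the refusal subspace
    return np.eye(W.shape[1]) - B.T @ B  # I - B^T B removes that subspace

def refusal_score(embedding: np.ndarray, P_null: np.ndarray) -> float:
    """Fraction of the embedding's norm lying in the refusal subspace;
    a high value suggests the candidate prompt will trip a refusal."""
    residual = embedding - P_null @ embedding
    return float(np.linalg.norm(residual) / (np.linalg.norm(embedding) + 1e-12))

def prescreen(candidates, embed, P_null, threshold=0.35):
    """Keep only candidates whose embeddings sit mostly outside the
    refusal subspace (threshold is an illustrative assumption)."""
    return [c for c in candidates if refusal_score(embed(c), P_null) < threshold]
```

In this reading, pre-screening is a cheap linear-algebra check: candidates are scored before ever being sent to the victim model, which is what lets the Iterative Attack Agent avoid wasting query budget on prompts that would be refused outright.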
MemJack also incorporates two persistent modules:
Multimodal Experience Memory: This module stores and retrieves successful attack strategies across different images using embedding similarity, enabling cross-image strategy transfer. Strategies are updated via temporal-difference learning.
Jailbreak Knowledge Graph: This module records causal relationships between visual anchors, attack goals, strategies, and defense patterns, providing structured priors that guide future angle selection and prompt refinement.
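The two persistent modules can be sketched as simple data structures: an experience store queried by embedding similarity and updated with a standard temporal-difference rule, and a counts-based graph of typed relations. Everything below is an illustrative reconstruction, not the paper's implementation: the storage schema, the `(head, relation, tail)` edge taxonomy, and the hyperparameters `alpha`/`gamma` are assumptions.

```python
from collections import defaultdict
import numpy as np

class ExperienceMemory:
    """Sketch of the Multimodal Experience Memory: strategies keyed by
    image embeddings, with TD-updated value estimates."""
    def __init__(self, alpha=0.1, gamma=0.9):
        self.keys, self.strategies, self.values = [], [], []
        self.alpha, self.gamma = alpha, gamma  # TD step size / discount (assumed)

    def add(self, image_emb, strategy, init_value=0.0):
        self.keys.append(np.asarray(image_emb, dtype=float))
        self.strategies.append(strategy)
        self.values.append(init_value)

    def retrieve(self, image_emb, k=3):
        """Top-k strategies by cosine similarity of image embeddings,
        enabling cross-image strategy transfer."""
        q = np.asarray(image_emb, dtype=float)
        sims = [float(q @ key / (np.linalg.norm(q) * np.linalg.norm(key) + 1e-12))
                for key in self.keys]
        order = np.argsort(sims)[::-1][:k]
        return [self.strategies[i] for i in order]

    def td_update(self, idx, reward, next_value=0.0):
        """One temporal-difference step: V <- V + alpha * (r + gamma*V' - V)."""
        td_error = reward + self.gamma * next_value - self.values[idx]
        self.values[idx] += self.alpha * td_error

class JailbreakKG:
    """Sketch of the Jailbreak Knowledge Graph: typed edges between
    anchors, goals, strategies, and defense patterns, with normalized
    edge counts serving as priors for future angle selection."""
    def __init__(self):
        self.edges = defaultdict(lambda: defaultdict(int))  # (head, rel) -> {tail: n}

    def record(self, head, relation, tail):
        self.edges[(head, relation)][tail] += 1

    def prior(self, head, relation):
        tails = self.edges[(head, relation)]
        total = sum(tails.values())
        return {t: c / total for t, c in tails.items()} if total else {}
```

The design intuition this captures is that the memory answers "what worked on similar images?" while the graph answers "given this kind of anchor, which angle historically succeeded?", and both are cheap lookups compared to another round of victim-model queries.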
Together, these components form a closed-loop "plan–attack–reflect" cycle within each image, while the memory modules continuously accumulate knowledge across images, improving attack efficiency on new visual contexts.
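The plan–attack–reflect cycle reduces to the control flow below. This is only a structural skeleton under assumed interfaces: in the actual framework `attack`, `evaluate`, and `reflect` are the LLM-driven agents and the Safety Guard, and the `risk_threshold` and `max_rounds` values here are placeholders, not the paper's settings.

```python
from dataclasses import dataclass

@dataclass
class AttackResult:
    success: bool
    rounds: int

def run_episode(anchors, angles, attack, evaluate, reflect,
                max_rounds=30, risk_threshold=0.5):
    """Skeleton of the per-image plan-attack-reflect loop.
    attack/evaluate/reflect are caller-supplied callables standing in
    for the Iterative Attack Agent, Safety Guard, and Reflection module."""
    rounds = 0
    for anchor in anchors:              # Strategic Planning: try one anchor at a time
        hint = None
        for angle in angles:            # Iterative Attack: cycle through attack angles
            rounds += 1
            if rounds > max_rounds:
                return AttackResult(False, rounds - 1)
            response = attack(anchor, angle, hint)
            risk = evaluate(response)   # Safety Guard risk score in [0, 1]
            if risk >= risk_threshold:
                return AttackResult(True, rounds)
            hint = reflect(response, risk)  # Reflection: diagnose, adjust next attempt
        # all angles under this anchor exhausted -> replan with the next anchor
    return AttackResult(False, rounds)
```

The key structural point the sketch makes explicit is the two nested fallbacks: reflection adjusts prompts within an angle, and only when every angle under an anchor fails does control return to planning with a fresh anchor.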
Extensive empirical evaluations demonstrated MemJack's high effectiveness, query efficiency, and broad generalization. When tested on 5,000 unmodified COCO val2017 photographs with Qwen3-VL-Plus as the victim VLM, MemJack achieved an Attack Success Rate (ASR) of 71.48%, with an average of only 5.18 rounds required for a successful attack. This ASR increased to 90% when the attack budget was extended to 100 rounds, indicating that most natural images can eventually be weaponized.
MemJack also showed strong generalization across diverse image distributions, maintaining ASRs between 62% and 91% across seven additional image benchmarks. Furthermore, it proved effective against eleven different VLMs (both commercial APIs and open-source models), with ASRs ranging from 35% (Gemini-3-Flash) to 82% (Mistral-Medium-3), confirming its ability to bypass various architectural and safety alignment strategies.
Compared to existing baselines, MemJack substantially outperformed both text-only methods (e.g., GCG, AutoDAN-Turbo) and multimodal attacks (e.g., Visual-Adv, FigStep, HADES, QR-Attack), achieving 72% ASR in black-box settings and 53% in white-box settings, significantly higher than the best baseline's 30% and 18% respectively. Ablation studies confirmed that the Multimodal Experience Memory was the most impactful component, contributing a 34-percentage-point increase in ASR and nearly halving the average rounds to success. The Reflection module and Dynamic Replanning contributed an additional 5 and 6 percentage points of ASR, respectively.
Qualitative analysis revealed that Visual Intuitive Association, Practical Knowledge, and Hypothetical Reasoning were the most common successful attack angles, succeeding by grounding harmful queries in concrete visual elements. While direct refusal was the most frequent defense, benign reframing proved the most resilient. The study also produced MemJack-Bench, a comprehensive dataset of over 113,000 interactive multimodal jailbreak attack trajectories, which differentiates VLM robustness more effectively than existing static benchmarks.
MemJack demonstrates a significant and previously under-explored vulnerability in Vision-Language Models: their susceptibility to jailbreak attacks initiated through benign, unmodified natural images. This research highlights the need for VLMs to evolve beyond text-centric safety mechanisms and develop more robust defenses against semantic camouflage in visual data. The framework provides an efficient, automated paradigm for generating large-scale, diverse safety alignment datasets (MemJack-Bench), which is crucial for training inherently more robust VLMs.
The findings also suggest that future VLM defense strategies should prioritize sophisticated responses like benign reframing over direct refusals, as direct refusals inadvertently provide clear gradient signals that adaptive attackers can exploit. By understanding these deep-seated visual-semantic vulnerabilities and providing tools like MemJack-Bench, this work paves the way for the development of more secure and resilient multimodal AI systems.