All Tags
Browse through all available tags to find articles on topics that interest you.
Showing 3 results for this tag.
Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark
Multimodal Large Language Models (MLLMs) often reason opaquely, returning a final prediction without intermediate steps or visual evidence. This paper introduces the Visual Reasoning Tracer (VRT) task and its associated benchmarks (VRT-Bench, VRT-80k), which require models to localize the intermediate objects along their reasoning paths, with the goal of making model outputs more interpretable and reliable.
DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation
DraCo introduces Draft-as-CoT, an interleaved reasoning paradigm for text-to-image generation that uses both textual and visual content. It addresses limitations of existing methods by first generating low-resolution draft images for visual planning and verification, which improves the generation of rare attribute combinations and overall image quality.
TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning
This paper introduces TempR1, a temporal-aware multi-task reinforcement learning framework designed to improve the temporal understanding of Multimodal Large Language Models (MLLMs). By combining diverse temporal tasks with tailored reward functions, TempR1 achieves state-of-the-art performance across various video understanding benchmarks and improves generalization.