AI Summary • Published on Dec 2, 2025
Multimodal Large Language Models (MLLMs) currently face limitations in comprehensively understanding temporal dynamics within long-form videos, a crucial skill for advanced tasks like temporal localization, action detection, and time-sensitive question answering. While reinforcement learning (RL) has been explored to improve temporal reasoning, existing methods are often restricted to narrow task types and limited datasets, thereby inhibiting their ability to generalize across diverse temporal scenarios. Similarly, supervised fine-tuning (SFT) approaches frequently lead to overfitting on specific temporal datasets and can compromise the model's general reasoning capabilities due to their rigid supervision. This highlights a critical need for a unified framework capable of systematically enhancing MLLMs' temporal comprehension across a broader spectrum of tasks and complex temporal structures.
TempR1 introduces a temporal-aware multi-task reinforcement learning framework built on the Group Relative Policy Optimization (GRPO) algorithm to ensure stable and efficient cross-task optimization. The framework draws on a multi-task corpus of over 60,000 samples curated from multiple datasets, covering five key temporal understanding tasks: Temporal Grounding (TG), Dense Temporal Grounding (DTG), Video Highlight Detection (VHD), Grounded Video Question Answering (GVQA), and Temporal Action Localization (TAL). To handle the diverse temporal properties of these tasks, TempR1 groups them into three types according to the correspondence between predicted intervals and ground-truth instances, and designs a tailored localization reward for each type. A universal format reward enforces machine-parsable outputs, and an additional classification reward is applied specifically to the GVQA task.
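To make the reward design concrete, the sketch below shows how per-rollout rewards and GRPO's group-relative advantages might be computed for a simple one-to-one grounding case. The answer-tag schema, the GVQA weighting, and all function names are illustrative assumptions, not the paper's exact definitions.

```python
import re
import statistics

def format_reward(output: str) -> float:
    """Universal format reward: 1.0 only if the output contains a
    machine-parsable interval such as <answer>[12.3, 45.6]</answer>
    (the tag schema here is a hypothetical placeholder)."""
    pattern = r"<answer>\s*\[\s*[\d.]+\s*,\s*[\d.]+\s*\]\s*</answer>"
    return 1.0 if re.search(pattern, output) else 0.0

def interval_iou(pred, gt) -> float:
    """Temporal IoU between two [start, end] intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def gvqa_reward(pred_interval, gt_interval, pred_choice, gt_choice,
                loc_weight=0.5, cls_weight=0.5) -> float:
    """GVQA reward: a localization term plus a classification term
    (the 0.5/0.5 weighting is illustrative, not taken from the paper)."""
    loc = interval_iou(pred_interval, gt_interval)
    cls = 1.0 if pred_choice == gt_choice else 0.0
    return loc_weight * loc + cls_weight * cls

def grpo_advantages(rewards):
    """Group-relative advantages as in GRPO: each rollout's reward is
    normalized by the mean and standard deviation of its sampled group,
    avoiding the need for a learned value function."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]
```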
TempR1 achieved state-of-the-art performance across a diverse range of temporal understanding benchmarks. In Temporal Grounding, it surpassed previous methods on both Charades-STA and ActivityNet-Captions. For Video Highlight Detection on the QVHighlights dataset, TempR1 demonstrated a significant performance leap, outperforming the second-best model by 5.2 mIoU. The framework also performed strongly on more complex tasks, including Dense Temporal Grounding on ActivityNet, Grounded Video QA on NExT-GQA, and particularly Temporal Action Localization (TAL) on ActivityNet-v1.3, where it achieved a substantial 13.0 mF1 point improvement over MUSEG. Ablation studies confirmed the critical contribution of TempR1's tailored localization reward components for TAL, highlighting the importance of both the instance number reward and the dynamic programming-based matching strategy. Furthermore, TempR1 retained stronger general video reasoning capabilities than SFT-based approaches, which often degraded general understanding. The results also indicated that multi-task training consistently yielded performance improvements, underscoring a synergistic effect among the diverse temporal understanding tasks.
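The two ablated components for TAL map naturally onto a reward of the following shape. The sketch below illustrates one plausible form of a dynamic-programming matching between predicted and ground-truth action instances together with an instance-number term; the DP formulation, weights, and count penalty are assumptions for illustration rather than the paper's exact reward.

```python
def interval_iou(pred, gt) -> float:
    """Temporal IoU between two [start, end] intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def match_intervals_dp(preds, gts) -> float:
    """Order-preserving one-to-one matching between predicted and
    ground-truth intervals that maximizes total temporal IoU, solved
    with a sequence-alignment-style dynamic program."""
    n, m = len(preds), len(gts)
    # dp[i][j]: best total IoU using the first i predictions and first j ground truths
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j],      # leave prediction i unmatched
                dp[i][j - 1],      # leave ground truth j unmatched
                dp[i - 1][j - 1] + interval_iou(preds[i - 1], gts[j - 1]),
            )
    return dp[n][m]

def tal_localization_reward(preds, gts, match_weight=0.5, count_weight=0.5) -> float:
    """TAL-style reward: a matching term (mean IoU of the DP alignment)
    plus an instance-number term penalizing over- or under-prediction.
    Weights and the exact count penalty are illustrative assumptions."""
    if not gts:
        return 0.0
    match_score = match_intervals_dp(preds, gts) / len(gts)
    count_score = max(0.0, 1.0 - abs(len(preds) - len(gts)) / len(gts))
    return match_weight * match_score + count_weight * count_score
```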
The TempR1 framework establishes a scalable and principled paradigm for multi-task reinforcement learning within video temporal understanding. By systematically enhancing the temporal reasoning abilities of MLLMs through joint training on diverse tasks and the application of adaptable, task-aligned reward functions, TempR1 facilitates the development of more robust and generalizable MLLMs. This advancement is crucial for enabling more sophisticated and comprehensive long-form video analysis. The research demonstrates that reinforcement fine-tuning, coupled with carefully designed and scalable reward mechanisms, can significantly boost both the temporal comprehension and overall generalization capabilities of MLLMs across a wide array of video understanding scenarios.