AI Summary • Published on Dec 3, 2025
Efficient streaming video generation is essential for simulating interactive and dynamic virtual worlds. Existing methods distill few-step video diffusion models with sliding-window attention, which suffers from error accumulation; retaining the initial frames as attention-sink tokens alleviates this, but introduces a new problem: an over-reliance on static initial tokens, which dampens motion dynamics, yields repetitive initial frames, and prevents natural evolution in subsequent frames. Standard distribution matching distillation struggles with this issue because videos with poor motion dynamics can still have good visual quality and closely match the teacher distribution, making motion hard to optimize for. Furthermore, the limited attention window in current autoregressive models discards older frames, an information bottleneck that erodes global awareness and causes temporal inconsistencies and quality degradation over longer video sequences.
Reward Forcing is a novel framework for efficient streaming video generation with high visual and dynamic fidelity, built on two key technical innovations. First, it introduces EMA-Sink, an exponential-moving-average (EMA) state-packaging mechanism. Instead of relying on static initial tokens, EMA-Sink maintains a fixed-size set of sink tokens initialized from the first frames and continuously updates them by fusing in tokens evicted from the sliding window via an EMA. This compresses global context to preserve attention performance while incorporating recent dynamics, preventing over-attention to the initial frames and ensuring long-term consistency at no extra computational cost. Second, to better distill motion dynamics from teacher models, the paper proposes Rewarded Distribution Matching Distillation (Re-DMD). Unlike traditional distribution matching, Re-DMD distinguishes and prioritizes samples with greater dynamics: a powerful vision-language model serves as a reward function that rates each sample's motion quality, and the distribution-matching gradients are weighted by these scores. This biases the model toward generating high-quality motion while preserving data fidelity. The full framework generates video chunks autoregressively, conditioning on its own previous outputs through a KV cache and thereby bridging the train-test gap.
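To make the mechanism concrete, the following minimal PyTorch sketch shows one plausible EMA-Sink update. The class name, the adaptive-pooling fusion of evicted tokens, and the `decay` value are illustrative assumptions; the summary only specifies that fixed-size sink tokens are initialized from the first frames and refreshed by an EMA over evicted tokens.

```python
import torch
import torch.nn.functional as F

class EMASink:
    """Fixed-size sink tokens refreshed by an EMA over evicted window tokens."""

    def __init__(self, init_tokens: torch.Tensor, decay: float = 0.99):
        # init_tokens: (num_sink, dim), taken from the first frames' states.
        self.sink = init_tokens.clone()
        self.decay = decay  # assumed hyperparameter, not given in the summary

    def update(self, evicted: torch.Tensor) -> None:
        # evicted: (num_evicted, dim), tokens that just exited the sliding window.
        # Pool them down to the sink size (adaptive average pooling is an
        # assumption; the actual fusion rule may differ), then blend via EMA.
        pooled = F.adaptive_avg_pool1d(
            evicted.t().unsqueeze(0), self.sink.shape[0]
        ).squeeze(0).t()                                   # (num_sink, dim)
        self.sink = self.decay * self.sink + (1.0 - self.decay) * pooled

    def extend_cache(self, window_kv: torch.Tensor) -> torch.Tensor:
        # Attention sees a constant-size global summary plus the recent window,
        # so per-step cost stays fixed while global context is retained.
        return torch.cat([self.sink, window_kv], dim=0)
```

In a streaming rollout, each new chunk would attend to `extend_cache(window_kv)`, and tokens leaving the window would be passed to `update` rather than discarded; this is how the sink tracks recent dynamics instead of freezing on the first frames.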
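Similarly, here is a hedged sketch of a reward-weighted distribution-matching loss. It uses the common DMD surrogate (pushing generated samples along the gap between the fake critic's and teacher's predictions) and simply scales each sample's gradient by a normalized motion reward; `teacher_pred`, `fake_pred`, and `motion_rewards` are placeholder inputs, and the normalization scheme is an assumption rather than the paper's exact formulation.

```python
import torch

def re_dmd_loss(x_gen: torch.Tensor,
                teacher_pred: torch.Tensor,
                fake_pred: torch.Tensor,
                motion_rewards: torch.Tensor) -> torch.Tensor:
    # x_gen:          (B, ...) generator outputs; gradients flow through these.
    # teacher_pred:   (B, ...) teacher score prediction (treated as constant).
    # fake_pred:      (B, ...) fake-score critic prediction (treated as constant).
    # motion_rewards: (B,) VLM motion-quality ratings; higher means more dynamic.

    # Standard DMD direction: the gap between fake and teacher predictions,
    # normalized per sample for stability (a common implementation choice).
    grad = (fake_pred - teacher_pred).detach()
    scale = grad.abs().flatten(1).mean(dim=1).clamp_min(1e-8)
    grad = grad / scale.view(-1, *([1] * (grad.dim() - 1)))

    # Surrogate loss whose gradient w.r.t. x_gen equals `grad` elementwise.
    target = (x_gen - grad).detach()
    per_sample = 0.5 * ((x_gen - target) ** 2).flatten(1).sum(dim=1)

    # Re-DMD weighting: rewards normalized to mean 1, so high-motion samples
    # receive proportionally larger gradients without rescaling the batch loss.
    w = (motion_rewards / motion_rewards.mean().clamp_min(1e-8)).detach()
    return (w * per_sample).mean()
```

Because the weights multiply the distribution-matching gradient rather than replacing it, static but visually plausible samples still match the teacher; they simply contribute less, which is how the objective separates motion quality from visual quality.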
Reward Forcing achieves state-of-the-art performance across benchmarks for both short and long video generation. Quantitatively, the method attained a real-time generation speed of 23.1 FPS on a single H100 GPU, a 47.14x speedup over SkyReels-V2 and a 1.36x speedup over Self Forcing. On the 5-second VBench benchmark, Reward Forcing achieved an overall score of 84.13, outperforming all existing baselines. For long video generation (60 seconds, evaluated with MovieGen prompts and VBenchLong metrics), it scored 81.41, significantly surpassing the state-of-the-art LongLive (79.53). Notably, it boosted the dynamics metric by 88.38% (to 66.95) while maintaining consistent quality. Further evaluation with Qwen3-VL on 55-60 second videos confirmed superior visual quality, motion dynamics, and text alignment. A user study with 20 participants likewise favored the method, with high scores for long-range temporal consistency (3.60), dynamic complexity (3.72), and overall preference (3.75) on a 4-point Likert scale. Ablation studies showed that both EMA-Sink and Re-DMD contribute crucially to maintaining dynamism and consistency.
Reward Forcing effectively addresses motion stagnation in efficient streaming video generation, balancing high visual fidelity with strong dynamic motion. The work sets a new benchmark for performance and efficiency in generating dynamic, interactive virtual worlds and offers a general-purpose, plug-and-play solution that integrates into existing video generation architectures. Its reduced computational demands could broaden access to video synthesis technology and support more sustainable AI development. The authors acknowledge the potential for misuse, such as creating deepfakes or spreading misinformation, and advocate for digital watermarking, detection tools, clear content labeling, and robust usage policies to mitigate these risks. Future research will focus on more sophisticated reward models that capture nuanced aspects of video quality, alongside continued attention to ethical considerations in generative video technologies.