AI Summary • Published on Dec 2, 2025
Traditional attention mechanisms, core to modern foundation models, face a significant bottleneck due to their quadratic computational complexity, especially on the long sequences that arise in advanced video understanding and generation tasks. This quadratic cost makes processing high-resolution, long-duration videos prohibitively expensive, with attention coming to dominate inference time. Current efficient attention methods rely primarily on sparsity, using binary masks to retain or discard entire key-value blocks. At high sparsity levels, however, this approach incurs substantial information loss, as critical contextual information can be discarded outright. Some attempts to alleviate this rely on token permutation, but permutation often conflicts with causal attention masks and introduces additional computational overhead, making it inefficient in practice.
Pyramid Sparse Attention (PSA) is proposed as a versatile module that mitigates the limitations of existing sparse attention mechanisms. Instead of rigid binary masking, PSA introduces multi-level pooled Key-Value (KV) representations: a hierarchical pyramid of KV blocks is built through progressive 1D mean pooling, yielding coarse-to-fine contextual representations. A Multi-Level Mask Generator dynamically estimates the importance of each query-key block pair, using a sampling-based strategy for video generation and antidiagonal scoring with intra-block similarity verification for video understanding. From these importance scores it produces a multi-level mask that assigns finer pooling levels to critical KV blocks and coarser levels to less important ones, skipping the least important blocks entirely; precision therefore degrades gradually rather than through abrupt information loss. The Adaptive Pyramid Attention module then computes attention by fetching each KV block at its assigned pyramid level and applying a scaling factor to keep the attention probability distribution consistent across pooling levels. PSA is complemented by a hardware-friendly kernel with a decoupled block-tile design that separates the logical block size from the hardware tile size, ensuring efficient GPU utilization even with the heterogeneous block sizes produced by pooling, while remaining fully compatible with existing techniques like FlashAttention.
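To make the mechanism concrete, the minimal PyTorch sketch below illustrates the core idea: KV blocks are mean-pooled into a pyramid, each block is assigned a pooling level from an importance score, and attention is computed over the resulting mix of fine and coarse keys. The function names, the threshold scheme, and the log-of-pool-size logit correction used here to keep the softmax distribution consistent are illustrative assumptions rather than the paper's exact formulation, and the hardware-level block-tile kernel is not modeled.

```python
import math
import torch
import torch.nn.functional as F


def build_kv_pyramid(k, v, num_levels):
    """Level 0 keeps full-resolution blocks; level l mean-pools each block by 2**l."""
    # k, v: [num_blocks, block_size, head_dim]
    pyramid = []
    for lvl in range(num_levels):
        factor = 2 ** lvl
        nb, bs, d = k.shape
        k_l = k.reshape(nb, bs // factor, factor, d).mean(dim=2)
        v_l = v.reshape(nb, bs // factor, factor, d).mean(dim=2)
        pyramid.append((k_l, v_l))
    return pyramid


def assign_levels(importance, thresholds):
    """Map per-block importance scores to pyramid levels; blocks below every
    threshold are skipped (-1). thresholds must be given in descending order."""
    levels = torch.full_like(importance, -1, dtype=torch.long)
    for lvl, thr in enumerate(thresholds):  # higher importance -> finer level
        levels = torch.where((levels == -1) & (importance >= thr),
                             torch.full_like(levels, lvl), levels)
    return levels


def pyramid_attention(q_block, pyramid, levels, head_dim):
    """Attention for one query block over key blocks at heterogeneous pooling levels."""
    keys, values, log_counts = [], [], []
    for kb, lvl in enumerate(levels.tolist()):
        if lvl < 0:  # least important blocks are dropped entirely
            continue
        k_l, v_l = pyramid[lvl]
        keys.append(k_l[kb])
        values.append(v_l[kb])
        # Assumed scaling: each pooled key stands in for 2**lvl original keys, so
        # adding log(2**lvl) to its logits keeps its softmax mass roughly comparable.
        log_counts.append(torch.full((k_l.shape[1],), math.log(2 ** lvl)))
    k_cat, v_cat = torch.cat(keys), torch.cat(values)
    logits = q_block @ k_cat.T / math.sqrt(head_dim) + torch.cat(log_counts)
    return F.softmax(logits, dim=-1) @ v_cat


# Toy usage: 8 key blocks of 64 tokens, 3 pyramid levels, random importance scores.
torch.manual_seed(0)
num_blocks, block_size, head_dim = 8, 64, 32
k = torch.randn(num_blocks, block_size, head_dim)
v = torch.randn(num_blocks, block_size, head_dim)
q_block = torch.randn(block_size, head_dim)

pyramid = build_kv_pyramid(k, v, num_levels=3)
importance = torch.rand(num_blocks)                  # stand-in for the mask generator
levels = assign_levels(importance, thresholds=[0.7, 0.4, 0.1])
out = pyramid_attention(q_block, pyramid, levels, head_dim)
print(out.shape)  # torch.Size([64, 32])
```

The per-block Python loop above only mirrors the logical computation; in PSA's actual kernel the gathers are handled by the decoupled block-tile layout so that heterogeneous block sizes still map onto fixed hardware tiles.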
PSA consistently demonstrates superior performance and efficiency across both video understanding and generation benchmarks compared to existing sparse attention baselines. In training-free video generation experiments with Wan2.1 models, PSA preserved sharp details and temporal coherence, achieving better fidelity metrics (higher PSNR and SSIM, lower LPIPS) and perceptual quality scores (Aesthetic, Background, Imaging Quality) at comparable or higher sparsity levels. When integrated with the distillation framework TDM on CogVideoX-5B, PSA enabled a 30.2x denoising-time speedup without compromising generation quality and even surpassed the original 50-step model's VBench scores. For video understanding, PSA matched or exceeded full-attention accuracy on the Video-MME dataset with Qwen2.5-VL-7B, with particular gains on medium and long videos at significantly higher sparsity. Ablation studies confirmed that the multi-level masking strategy reduces information loss relative to binary masks, and that threshold-based mask generation, combined with a cosine-similarity constraint on pooling, yields the best performance. The decoupled block-tile hardware implementation achieved up to a 10x speedup over a naive PSA implementation.
Pyramid Sparse Attention offers a significant advancement in addressing the computational bottlenecks of attention mechanisms in long-context video models. By introducing a multi-level pooling approach, PSA effectively mitigates the information loss inherent in binary sparse attention, leading to superior quality-efficiency trade-offs for both video understanding and generation. Its hardware-friendly design ensures practical deployment on modern accelerators, making it a viable solution for scaling foundation models to handle increasingly complex video data. The demonstrated compatibility with other optimization techniques like distillation further highlights PSA's versatility and potential to drive future innovations in efficient and high-quality video processing.