AI Summary • Published on Dec 3, 2025
The Visual Geometry Grounded Transformer (VGGT) is a powerful 3D vision foundation model capable of handling arbitrary-length image sequences for geometric perception and 3D reconstruction. Its core limitation, however, is its frame-global attention mechanism, whose computational and memory cost grows quadratically with the total number of image tokens. This inefficiency makes VGGT impractical for large-scale scenes comprising hundreds or thousands of images; vanilla VGGT, for instance, often runs out of memory with only 500 input images, and even optimized variants require substantial time and resources on long sequences. Existing workarounds fall short: sequential input handling sacrifices the advantageous single-pass, end-to-end capability; model quantization typically demands time-consuming cross-scene calibration, hurting generality; and generic token merging strategies, while useful in other domains, ignore the geometrically coupled nature of VGGT's tokens, losing critical geometric detail while leaving computational redundancy in place.
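To make the quadratic scaling concrete, here is a back-of-the-envelope sketch; the per-image token count, head count, and fp16 precision are illustrative assumptions, not VGGT's actual configuration. Memory-efficient attention kernels avoid materializing the full score matrix, but compute still scales quadratically with sequence length.

```python
# Back-of-the-envelope sketch: memory to materialize one layer's
# frame-global attention score matrix. Token count per image, head
# count, and fp16 storage are illustrative assumptions only.

def attention_scores_gib(num_images: int,
                         tokens_per_image: int = 1024,
                         num_heads: int = 16,
                         bytes_per_elem: int = 2) -> float:
    n = num_images * tokens_per_image      # global sequence length
    return num_heads * n * n * bytes_per_elem / 1024**3

for imgs in (100, 500, 1000):
    print(f"{imgs:4d} images -> {attention_scores_gib(imgs):9.0f} GiB")
# 100 images -> ~312 GiB; 10x the images -> 100x the memory.
```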
LiteVGGT proposes a novel geometry-aware cached token merging strategy to overcome the efficiency issues of VGGT while preserving high accuracy. The design is based on two key insights: tokens from local image regions exhibit inherent geometric correlations and redundancy, and token similarity remains stable across adjacent network layers. The method consists of several innovations: geometry-aware token merging, which reduces the token count by exploiting the geometric correlations among tokens from local image regions; cached merge indices, which reuse merge decisions across adjacent layers rather than recomputing them, since token similarity changes slowly between neighboring layers; and FP8 quantization paired with light fine-tuning to recover accuracy after these aggressive optimizations.
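To illustrate the mechanism, here is a minimal sketch, not LiteVGGT's actual implementation: ToMe-style bipartite token merging wrapped around an attention block, with the merge plan computed once and reused for several adjacent layers. The function names, the cosine-similarity matching rule, and the reuse interval are all illustrative assumptions; LiteVGGT's geometry-aware merging criterion differs.

```python
import torch
import torch.nn.functional as F

def compute_merge_plan(x: torch.Tensor, r: int):
    """x: (N, C) tokens, N even. Pick the r 'source' tokens (odd slots)
    most similar to some 'destination' token (even slots)."""
    src_pool, dst_pool = x[1::2], x[0::2]
    sim = F.normalize(src_pool, dim=-1) @ F.normalize(dst_pool, dim=-1).T
    best, dst = sim.max(dim=-1)              # best destination per source
    order = best.argsort(descending=True)
    merged, kept = order[:r], order[r:]      # most-similar sources merge
    return merged, dst[merged], kept

def merge(x: torch.Tensor, plan):
    """Fold the selected source tokens into their destinations."""
    merged, dst, kept = plan
    dst_tok, src_tok = x[0::2].clone(), x[1::2]
    # simple mean; duplicate destinations are resolved arbitrarily here
    dst_tok[dst] = (dst_tok[dst] + src_tok[merged]) / 2
    return torch.cat([dst_tok, src_tok[kept]], dim=0)

def unmerge(y: torch.Tensor, plan, n_tokens: int):
    """Restore the original token count by copying each destination's
    output back to the source positions that were merged into it."""
    merged, dst, kept = plan
    n_dst = n_tokens // 2
    src = torch.empty(n_tokens // 2, y.shape[-1])
    src[kept] = y[n_dst:]
    src[merged] = y[:n_dst][dst]
    out = torch.empty(n_tokens, y.shape[-1])
    out[0::2], out[1::2] = y[:n_dst], src
    return out

x = torch.randn(8192, 64)                    # dummy global token sequence
plan, reuse_interval = None, 2               # recompute plan every 2 layers
for layer in range(6):
    if layer % reuse_interval == 0:
        plan = compute_merge_plan(x, r=2048) # cached merge indices
    z = merge(x, plan)                       # attention runs on fewer tokens
    z = z + 0.1 * torch.tanh(z)              # stand-in for the attention block
    x = unmerge(z, plan, x.shape[0])
```

Because tokens are unmerged after every block, the sequence length is restored each layer, which is what makes reusing the cached merge plan across adjacent layers valid.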
LiteVGGT delivers substantial gains in efficiency and scalability while maintaining competitive accuracy in both 3D reconstruction and camera pose estimation across diverse datasets. It achieves up to a 10x speedup and significant memory reduction over VGGT, enabling efficient processing of 1000-image scenes that cause vanilla VGGT to run out of memory. Geometry-aware token merging alone provides over a 4x latency reduction; caching merge indices cuts latency by a further roughly 25%; and FP8 quantization contributes an additional 33% latency reduction and about 25% memory savings. In 3D reconstruction, LiteVGGT delivers the lowest Chamfer Distance error on ScanNet-50 and accuracy comparable to state-of-the-art methods on 7Scenes and NRGBD. It also surpasses FastVGGT on DTU and Tanks & Temples, achieving nearly a 10x speedup over VGGT while preserving strong completeness and geometric consistency in the reconstructed point clouds, though some qualitative comparisons show slightly less fine detail than the original VGGT. For camera pose estimation, LiteVGGT matches VGGT's accuracy and outperforms FastVGGT on certain metrics, even with all of its aggressive optimizations applied. Ablation studies confirm that each component (geometry-aware token merging, cached merge indices, fine-tuning, and FP8 quantization) contributes to the overall accuracy and efficiency. Finally, practical robotic grasping experiments validate LiteVGGT's reliability, demonstrating sufficient accuracy for real-world applications.
LiteVGGT provides an efficient and scalable solution to the computational and memory challenges faced by 3D vision foundation models like VGGT. By demonstrating that high geometric fidelity can be preserved through carefully designed token reduction strategies without resorting to computationally intensive global attention, this work opens avenues for more practical and deployable large-scale 3D applications. The significant speedups and memory savings achieved by LiteVGGT enable it to process long image sequences that were previously beyond the capabilities of VGGT, making it highly valuable for real-world scenarios such as autonomous navigation, augmented reality, and robotics. This research highlights the potential of lightweight architectures in pushing the boundaries of what is feasible in large-scale 3D reconstruction. Future work aims to extend this framework to handle video inputs and even more complex and expansive scenes.