AI Summary • Published on Dec 3, 2025
The Visual Geometry Grounded Transformer (VGGT) is a powerful 3D vision foundation model capable of handling arbitrary-length image sequences for geometric perception and 3D reconstruction. Its core limitation, however, is its frame-global attention mechanism, whose computational and memory cost grows quadratically with the total number of image tokens. This inefficiency makes VGGT impractical for large-scale scenes comprising hundreds or thousands of images; vanilla VGGT, for instance, often runs out of memory with only 500 input images, and even optimized variants require substantial time and resources on long sequences. Existing workarounds fall short: sequential input handling sacrifices the advantageous single-pass, end-to-end capability; model quantization typically demands time-consuming cross-scene calibration, hurting generality; and generic token merging strategies, while useful in other domains, ignore the geometrically coupled nature of VGGT's tokens, losing critical geometric detail while leaving computational redundancy in place.
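To make the quadratic scaling concrete, here is a back-of-the-envelope sketch; the per-image token count, head count, and fp16 precision are illustrative assumptions, not VGGT's actual configuration. Memory-efficient attention kernels avoid materializing the full score matrix, but compute still scales quadratically with sequence length.

```python
# Back-of-the-envelope sketch: memory to materialize one layer's
# frame-global attention score matrix. Token count per image, head
# count, and fp16 storage are illustrative assumptions only.

def attention_scores_gib(num_images: int,
                         tokens_per_image: int = 1024,
                         num_heads: int = 16,
                         bytes_per_elem: int = 2) -> float:
    n = num_images * tokens_per_image      # global sequence length
    return num_heads * n * n * bytes_per_elem / 1024**3

for imgs in (100, 500, 1000):
    print(f"{imgs:4d} images -> {attention_scores_gib(imgs):9.0f} GiB")
# 100 images -> ~312 GiB; 10x the images -> 100x the memory.
```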
LiteVGGT proposes a novel geometry-aware cached token merging strategy to overcome the efficiency issues of VGGT while preserving high accuracy. The design is based on two key insights: tokens from local image regions exhibit inherent geometric correlations and redundancy, and token similarity remains stable across adjacent network layers. The method consists of several innovations: geometry-aware token merging, which reduces the token count by exploiting the geometric correlations among tokens from local image regions; cached merge indices, which reuse merge decisions across adjacent layers rather than recomputing them, since token similarity changes slowly between neighboring layers; and FP8 quantization paired with light fine-tuning to recover accuracy after these aggressive optimizations.
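To illustrate the mechanism, here is a minimal sketch, not LiteVGGT's actual implementation: ToMe-style bipartite token merging wrapped around an attention block, with the merge plan computed once and reused for several adjacent layers. The function names, the cosine-similarity matching rule, and the reuse interval are all illustrative assumptions; LiteVGGT's geometry-aware merging criterion differs.

```python
import torch
import torch.nn.functional as F

def compute_merge_plan(x: torch.Tensor, r: int):
    """x: (N, C) tokens, N even. Pick the r 'source' tokens (odd slots)
    most similar to some 'destination' token (even slots)."""
    src_pool, dst_pool = x[1::2], x[0::2]
    sim = F.normalize(src_pool, dim=-1) @ F.normalize(dst_pool, dim=-1).T
    best, dst = sim.max(dim=-1)              # best destination per source
    order = best.argsort(descending=True)
    merged, kept = order[:r], order[r:]      # most-similar sources merge
    return merged, dst[merged], kept

def merge(x: torch.Tensor, plan):
    """Fold the selected source tokens into their destinations."""
    merged, dst, kept = plan
    dst_tok, src_tok = x[0::2].clone(), x[1::2]
    # simple mean; duplicate destinations are resolved arbitrarily here
    dst_tok[dst] = (dst_tok[dst] + src_tok[merged]) / 2
    return torch.cat([dst_tok, src_tok[kept]], dim=0)

def unmerge(y: torch.Tensor, plan, n_tokens: int):
    """Restore the original token count by copying each destination's
    output back to the source positions that were merged into it."""
    merged, dst, kept = plan
    n_dst = n_tokens // 2
    src = torch.empty(n_tokens // 2, y.shape[-1])
    src[kept] = y[n_dst:]
    src[merged] = y[:n_dst][dst]
    out = torch.empty(n_tokens, y.shape[-1])
    out[0::2], out[1::2] = y[:n_dst], src
    return out

x = torch.randn(8192, 64)                    # dummy global token sequence
plan, reuse_interval = None, 2               # recompute plan every 2 layers
for layer in range(6):
    if layer % reuse_interval == 0:
        plan = compute_merge_plan(x, r=2048) # cached merge indices
    z = merge(x, plan)                       # attention runs on fewer tokens
    z = z + 0.1 * torch.tanh(z)              # stand-in for the attention block
    x = unmerge(z, plan, x.shape[0])
```

Because tokens are unmerged after every block, the sequence length is restored each layer, which is what makes reusing the cached merge plan across adjacent layers valid.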
LiteVGGT delivers substantial gains in efficiency and scalability while maintaining competitive accuracy in both 3D reconstruction and camera pose estimation across diverse datasets. It achieves up to a 10x speedup and significant memory reduction over VGGT, enabling efficient processing of 1000-image scenes that cause vanilla VGGT to run out of memory. Geometry-aware token merging alone provides over a 4x latency reduction; caching merge indices cuts latency by a further roughly 25%; and FP8 quantization contributes an additional 33% latency reduction and about 25% memory savings. In 3D reconstruction, LiteVGGT delivers the lowest Chamfer Distance error on ScanNet-50 and accuracy comparable to state-of-the-art methods on 7Scenes and NRGBD. It also surpasses FastVGGT on DTU and Tanks & Temples, achieving nearly a 10x speedup over VGGT while preserving strong completeness and geometric consistency in the reconstructed point clouds, though some qualitative comparisons show slightly less fine detail than the original VGGT. For camera pose estimation, LiteVGGT matches VGGT's accuracy and outperforms FastVGGT on certain metrics, even with all of its aggressive optimizations applied. Ablation studies confirm that each component (geometry-aware token merging, cached merge indices, fine-tuning, and FP8 quantization) contributes to the overall accuracy and efficiency. Finally, practical robotic grasping experiments validate LiteVGGT's reliability, demonstrating sufficient accuracy for real-world applications.
LiteVGGT provides an efficient and scalable solution to the computational and memory challenges faced by 3D vision foundation models like VGGT. By demonstrating that high geometric fidelity can be preserved through carefully designed token reduction strategies without resorting to computationally intensive global attention, this work opens avenues for more practical and deployable large-scale 3D applications. The significant speedups and memory savings achieved by LiteVGGT enable it to process long image sequences that were previously beyond the capabilities of VGGT, making it highly valuable for real-world scenarios such as autonomous navigation, augmented reality, and robotics. This research highlights the potential of lightweight architectures in pushing the boundaries of what is feasible in large-scale 3D reconstruction. Future work aims to extend this framework to handle video inputs and even more complex and expansive scenes.