AI Summary • Published on Dec 2, 2025
Current methods for reconstructing and understanding 3D scenes, particularly those based on 3D Gaussian Splatting, often generate an excessive number of redundant per-pixel Gaussians from unposed sparse views. This leads to substantial memory consumption and inefficient multi-view feature aggregation, which in turn degrades performance on tasks such as novel view synthesis and 3D scene understanding. It also raises a fundamental question: are such dense, pixel-aligned representations necessary at all?
C3G is a feed-forward framework that represents a scene with only 2,048 Gaussians. Instead of per-pixel prediction, it uses a transformer-based architecture in which learnable query tokens discover and decode the essential 3D Gaussians from multi-view image features. These tokens aggregate relevant visual information across views through self-attention and learn geometrically coherent regions without explicit depth supervision. The framework also introduces C3G-F, a view-invariant feature decoder that reuses the attention patterns learned during Gaussian decoding to efficiently lift arbitrary 2D features into 3D, addressing multi-view inconsistency and the difficulty of identifying correspondences. Training relies on photometric reconstruction combined with a progressive low-pass filter that ensures robust Gaussian localization.
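A minimal PyTorch sketch of the query-token idea is shown below. All class names, dimensions, and the single cross-attention reuse step are illustrative assumptions rather than the authors' released implementation: a fixed set of learnable tokens cross-attends to flattened multi-view image features, each token is decoded into one Gaussian's parameters, and the same attention weights can then be reused to pool 2D features per Gaussian, in the spirit of C3G-F.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class QueryGaussianDecoder(nn.Module):
    """Hypothetical sketch: learnable query tokens -> a compact set of 3D Gaussians."""

    def __init__(self, num_queries=2048, dim=256, num_layers=4, num_heads=8):
        super().__init__()
        # One learnable token per predicted Gaussian (2,048 in the paper).
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Per-token head: 3D mean, log-scale, rotation quaternion, opacity, RGB.
        self.to_params = nn.Linear(dim, 3 + 3 + 4 + 1 + 3)

    def forward(self, view_feats):
        # view_feats: (B, N_views * H * W, dim), flattened multi-view image features.
        B = view_feats.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        tokens = self.decoder(q, view_feats)          # queries attend across all views
        p = self.to_params(tokens)
        means, log_scales, quats, opacity, rgb = p.split([3, 3, 4, 1, 3], dim=-1)
        return {
            "means": means,
            "scales": log_scales.exp(),
            "rotations": F.normalize(quats, dim=-1),
            "opacities": opacity.sigmoid(),
            "colors": rgb.sigmoid(),
        }


def lift_features(attn_weights, pixel_feats):
    # attn_weights: (B, num_queries, N_pixels), e.g. cross-attention averaged over heads.
    # pixel_feats:  (B, N_pixels, feat_dim) from any 2D encoder (DINOv2, VGGT, ...).
    # Reusing the decoder's attention to pool 2D features per Gaussian captures the
    # spirit of C3G-F; the exact aggregation in the paper may differ.
    return attn_weights @ pixel_feats                 # (B, num_queries, feat_dim)
```

Because the number of queries is fixed, the Gaussian count, and with it memory and rendering cost, is independent of image resolution and of how many input views are provided.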
C3G demonstrates competitive visual quality for novel view synthesis on the RealEstate10K dataset while using 65 times fewer Gaussians than per-pixel methods, which enables significantly faster rendering. In 3D open-vocabulary segmentation on ScanNet and Replica, the compact Gaussians, coupled with multi-view aggregated semantic features, consistently outperform previous feed-forward approaches and match or exceed optimization-based methods. Furthermore, C3G-F substantially improves two-view correspondence results across various visual encoders (VGGT, DINOv2, DINOv3), validating it as a view-invariant feature decoder and showing it to be a stronger feature upsampler than existing solutions such as AnyUp.
The C3G framework offers a memory-efficient and performant alternative to dense 3D Gaussian Splatting methods for scene reconstruction and understanding. By showing that a compact, geometrically meaningful representation suffices for high-quality results, it addresses key issues of redundancy and computational overhead. This opens new directions for feed-forward 3D computer vision, with potential benefits for robotics and multimodal AI, and enables more scalable processing of 3D scene data without extensive per-pixel estimation. Future work could explore extending this compact representation to dynamic scenes.