AI Summary • Published on Dec 3, 2025
Standard Vision Transformers (ViT) process images by flattening 2D patch grids into 1D sequences, which disrupts the inherent spatial relationships. Under this flattening, spatially distant patches, such as the last patch of one row and the first patch of the next, become immediate neighbors in the 1D sequence. While Rotary Positional Embedding (RoPE) is effective in 1D, its direct extension to 2D typically treats the spatial axes independently, failing to distinguish true spatial distance from the artificial proximity introduced by sequential flattening. This leads to weak cross-axis interactions and can compromise the model's ability to capture genuine geometric structure, a challenge further amplified in multi-modal learning.
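A minimal illustration of this adjacency distortion (not from the paper; grid size and patch indices are arbitrary): under row-major flattening, two patches that are one step apart in the 1D sequence can be far apart on the 2D grid.

```python
import numpy as np

# A 4x4 grid of patch indices, flattened row-major as a ViT does.
H, W = 4, 4
grid = np.arange(H * W).reshape(H, W)
seq = grid.flatten()

# Patch 3 (end of row 0) and patch 4 (start of row 1) are immediate
# neighbors in the 1D sequence, but sit at opposite ends of their rows.
a, b = 3, 4
ya, xa = divmod(a, W)
yb, xb = divmod(b, W)
print("1D sequence distance:", abs(a - b))                  # 1
print("2D Manhattan distance:", abs(ya - yb) + abs(xa - xb))  # 4
```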
The proposed Geometric Positional Embedding (GeoPE) framework extends RoPE's 2D complex-plane rotations into 3D Euclidean space using quaternions, explicitly modeling coupled rotations over structured tensors. Rather than encoding each axis independently, GeoPE treats the spatial dimensions as a unified geometric entity. To address the non-commutativity of quaternion multiplication and ensure a consistent spatial prior, GeoPE constructs a unified rotational operator by computing the geometric mean in the logarithmic tangent space, drawing on Lie theory. This symmetric coupling prevents the model from collapsing 2D structure back into 1D patterns. GeoPE also extends naturally to three spatial dimensions for data such as video or volumetric scans. A 'Linear GeoPE' variant additionally enforces linearity in the Lie algebra, so that relative rotations depend solely on spatial displacement, analogous to 1D RoPE.
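A minimal sketch of the idea of symmetric coupling in the log (tangent) space, assuming the unified operator is formed by averaging per-axis quaternion generators and mapping back with the exponential; the frequency `omega`, the axis assignments, and the averaging step are illustrative assumptions, not necessarily the paper's exact construction.

```python
import numpy as np

def quat_from_axis_angle(axis, theta):
    """Unit quaternion [w, x, y, z]: rotation by theta about a unit axis."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    return np.concatenate([[np.cos(theta / 2)], np.sin(theta / 2) * axis])

def quat_mul(a, b):
    """Hamilton product a * b (non-commutative in general)."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    ])

def quat_log(q):
    """Log map of a unit quaternion into its tangent space (half-angle * axis)."""
    w, v = q[0], q[1:]
    nv = np.linalg.norm(v)
    return np.zeros(3) if nv < 1e-12 else np.arctan2(nv, w) * (v / nv)

def quat_exp(u):
    """Exp map from the tangent space back to a unit quaternion."""
    t = np.linalg.norm(u)
    if t < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])
    return np.concatenate([[np.cos(t)], np.sin(t) * (u / t)])

# Hypothetical per-axis rotations for a patch at grid position (x, y).
omega, x, y = 0.1, 3, 5
q_x = quat_from_axis_angle([1, 0, 0], omega * x)  # encodes the x coordinate
q_y = quat_from_axis_angle([0, 1, 0], omega * y)  # encodes the y coordinate

# Coupling by a plain quaternion product is order-dependent:
print(np.allclose(quat_mul(q_x, q_y), quat_mul(q_y, q_x)))  # False

# Symmetric coupling: average the generators in the tangent space and map back;
# the result is independent of axis ordering by construction.
q_unified = quat_exp(0.5 * (quat_log(q_x) + quat_log(q_y)))
```

The log/exp round trip is what lets the two axis rotations be combined without choosing an order, which is the property the summary attributes to GeoPE's geometric-mean construction.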
GeoPE was evaluated extensively across tasks. On ImageNet-1K image classification, GeoPE consistently improved Top-1 accuracy over standard baselines such as APE and CPE on ViT models, and matched or exceeded RoPE-Mixed on Swin Transformers. For object detection on the MS-COCO benchmark, GeoPE integrated into the DINO-ViTDet framework showed consistent mAP improvements for both ViT-B and ViT-L backbones, indicating better capture of global spatial relationships. GeoPE also improved overall accuracy, mean class accuracy, and mean IoU in 3D point cloud semantic segmentation on the S3DIS dataset, validating its applicability beyond 2D tasks. Critically, analysis with cue-conflict stimuli revealed that GeoPE significantly enhances shape bias, shifting models from texture-dependent sequence learners toward shape-aware geometric learners.
GeoPE offers a principled approach to overcoming the spatial topological disruption caused by flattening structured tensors in Vision Transformers. By explicitly incorporating a geometrically coupled encoding, GeoPE enables more effective spatial reasoning, leading to improved performance across diverse computer vision tasks. The enhanced shape bias observed with GeoPE suggests that it fosters more human-like object recognition, potentially yielding more robust and generalizable models. This work lays a foundation for future research on robust spatial modeling for structured tensor data, with potential applications in multi-modal learning and beyond. While Linear GeoPE currently incurs higher latency due to its implementation, the significant accuracy gains underscore the value of this geometric approach.