AI Summary • Published on Jan 29, 2026
Training large language models (LLMs) demands substantial hardware and computational resources, driving up both energy consumption and cost. AdamW, the de facto standard optimizer, relies on diagonal curvature estimates, ignores the structural properties of parameter matrices, and incurs high memory overhead. Muon, a more recent alternative, applies global spectral normalization but discards curvature information, which can lead to suboptimal performance. Conventional manifold optimization methods have historically been overlooked for LLM training because of their perceived poor performance at scale, leaving a gap for more efficient and effective training approaches.
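To make the contrast concrete, the sketch below juxtaposes the two update directions described here: AdamW's per-element (diagonal) preconditioning, which stores two moment buffers the size of the parameters, and a Muon-style direction that orthogonalizes the momentum matrix via Newton-Schulz iterations, normalizing its spectrum globally but discarding per-coordinate curvature. This is an illustrative sketch, not the paper's code; the actual Muon optimizer uses a tuned polynomial iteration, whereas the classical cubic Newton-Schulz form is shown here.

```python
import numpy as np

def adam_direction(g, m, v, beta1=0.9, beta2=0.999, eps=1e-8):
    """AdamW-style direction: diagonal preconditioning.

    Keeps two moment buffers (m, v) the same shape as the parameters,
    which is the memory overhead the summary refers to.
    """
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    return m / (np.sqrt(v) + eps), m, v

def muon_direction(m, iters=5):
    """Muon-style direction: global spectral normalization (sketch).

    Newton-Schulz iterations push all singular values of the momentum
    matrix toward 1, so per-coordinate curvature scale is discarded.
    Classical cubic iteration shown; real Muon uses tuned coefficients.
    """
    X = m / (np.linalg.norm(m) + 1e-8)   # Frobenius norm => sing. values <= 1
    for _ in range(iters):
        A = X @ X.T
        X = 1.5 * X - 0.5 * A @ X        # s -> 1.5*s - 0.5*s^3 per sing. value
    return X
```

Note that the Muon direction costs matrix multiplications per step, which is the computational overhead Mano is designed to avoid.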
This study proposes Mano, a novel and efficient optimizer that revisits manifold optimization for LLM training. Unlike traditional manifold optimization, which constrains the parameters directly, Mano applies a soft manifold constraint by projecting the momentum onto the tangent space at the model parameters; the projected momentum is then constrained to a rotational Oblique manifold. The parameters themselves are updated by plain Euclidean descent with weight decay rather than by strict retraction onto the manifold surface. A key innovation is the rotational manifold scheme, which alternates between column-wise normalization on odd iterations and row-wise normalization on even iterations, giving the constraint a dynamic orientation. This design exploits the computational efficiency of the Oblique manifold and is empirically shown to align well with the model's learning trajectory, yielding shorter geodesic distances than other manifolds. By avoiding expensive matrix decompositions and matrix multiplications, Mano requires at most 11mn FLOPs per update for an m × n parameter matrix, and it halves the memory footprint relative to Adam-based optimizers. The paper also provides a convergence analysis for a simplified version of Mano, proving convergence guarantees under common assumptions.
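The update described above can be sketched as follows. The exact formulas are not given in this summary, so the tangent-space projection (column-wise, Oblique-manifold-style) and the momentum accumulation form are assumptions; what the sketch does reflect from the text is the soft constraint via projected momentum, the odd/even column/row normalization, and the Euclidean parameter update with weight decay instead of retraction. Note that every operation is elementwise or a reduction, so the cost stays O(mn) with no matrix multiplications or decompositions.

```python
import numpy as np

def mano_step(W, M, G, lr=0.02, beta=0.95, wd=0.01, step=1, eps=1e-8):
    """One hypothetical Mano-style update for a parameter matrix W (m x n).

    Illustrative sketch only: projection and momentum details are
    assumptions, not the paper's exact algorithm.
    """
    # Momentum accumulation (standard heavy-ball form, assumed).
    M = beta * M + G

    # Soft manifold constraint: project momentum onto the tangent space
    # at W -- per column, remove the component along W's (normalized)
    # column, as on the Oblique manifold (assumed form).
    Wn = W / (np.linalg.norm(W, axis=0, keepdims=True) + eps)
    T = M - Wn * np.sum(Wn * M, axis=0, keepdims=True)

    # Rotational scheme from the summary: column-wise normalization on
    # odd iterations, row-wise on even iterations.
    axis = 0 if step % 2 == 1 else 1
    U = T / (np.linalg.norm(T, axis=axis, keepdims=True) + eps)

    # Euclidean descent with decoupled weight decay; no retraction of W
    # back onto the manifold.
    W = W - lr * (U + wd * W)
    return W, M
```

Since only the momentum buffer M persists between steps, the optimizer state is half that of Adam-based methods, matching the memory claim above.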
Extensive experiments on LLaMA (130M, 350M, 1.3B) and Qwen3 (0.6B, 1.7B) models on the C4 and Pile datasets show that Mano consistently and significantly outperforms both AdamW and Muon in test perplexity, while consuming less memory than AdamW and incurring lower computational cost than Muon. An analysis of training dynamics shows that Mano maintains lower gradient variance and a higher signal-to-noise ratio (SNR) than Muon, indicating better training stability. Ablation studies underscore the importance of Mano's reformulation: traditional Riemannian SGD with momentum (RSGD-M) failed to optimize LLMs effectively, highlighting the benefit of Mano's more flexible training dynamics. The rotational Oblique manifold scheme also proved crucial, as a static manifold scaled noticeably worse on larger models. Initial experiments on applying momentum with or without retraction yielded similar results, leaving this design choice open for further investigation.
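For readers unfamiliar with the stability metric mentioned above, a common way to measure gradient SNR is the elementwise ratio of the mean gradient to its standard deviation over a window of recent steps; a higher value means the update direction is dominated by signal rather than noise. The paper's exact definition is not given in this summary, so the form below is an assumed convention for illustration.

```python
import numpy as np

def gradient_snr(grads, eps=1e-12):
    """Gradient signal-to-noise ratio over a window of gradient snapshots.

    Computes |mean| / std per coordinate across the window, then averages
    over all coordinates. Assumed convention, not the paper's definition.
    """
    G = np.stack(grads)          # shape: (steps, *param_shape)
    mu = G.mean(axis=0)          # per-coordinate mean gradient (signal)
    sigma = G.std(axis=0)        # per-coordinate fluctuation (noise)
    return float(np.mean(np.abs(mu) / (sigma + eps)))
```

Under this definition, perfectly consistent gradients give a very large SNR, while gradients that flip sign step to step give an SNR near zero.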
Mano opens a promising direction for efficient LLM training by successfully reformulating manifold optimization. It expands the Pareto frontier of LLM training in both space and time efficiency, suggesting that geometrically aware optimization, combined with modern training strategies, holds significant potential, and it contributes to making LLM training more sustainable and accessible. Future directions include comprehensive hyperparameter tuning, over-training experiments with even larger models, and extending the convergence analysis to fully cover momentum dynamics and the broader optimization regimes present in the full Mano optimizer.