All Tags
Browse through all available tags to find articles on topics that interest you.
Showing 1 result for this tag.
OD-MoE: On-Demand Expert Loading for Cacheless Edge-Distributed MoE Inference
This paper introduces OD-MoE, a distributed Mixture-of-Experts (MoE) inference framework designed for memory-constrained edge devices. It enables fully on-demand expert loading without an expert cache, achieving high decoding speeds and substantially reducing per-device GPU memory requirements while maintaining full model precision.