AI Summary • Published on Dec 2, 2025
Mixture-of-Experts (MoE) models offer significant advantages for scaling Large Language Models (LLMs) but face substantial challenges when deployed on resource-constrained edge devices, primarily due to their large memory footprint. Existing expert-offloading techniques mitigate this by storing expert parameters in CPU memory and caching a subset in GPU memory. However, these methods still underutilize GPU memory and are bottlenecked by the limited bandwidth of the CPU-GPU link, especially on edge devices relying on standard PCIe. Approaches such as quantization or expert skipping can alleviate I/O pressure but may degrade model performance, and maintaining expert caches consumes additional GPU memory, limiting the scale of MoE models that edge devices can serve.
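To make this offloading pattern concrete, here is a minimal sketch of the kind of LRU expert cache such techniques rely on, assuming PyTorch-style tensors with a `.to(device)` method; the class and method names are illustrative rather than taken from any particular system. The miss path is the CPU-GPU transfer that becomes the bottleneck on PCIe-constrained edge devices.

```python
from collections import OrderedDict


class LRUExpertCache:
    """Keep a small set of experts resident on the GPU; the rest stay in CPU RAM."""

    def __init__(self, cpu_experts, capacity, device="cuda"):
        self.cpu_experts = cpu_experts     # expert_id -> dict of CPU-resident tensors
        self.capacity = capacity           # max number of experts kept on the GPU
        self.device = device
        self.gpu_cache = OrderedDict()     # expert_id -> dict of GPU-resident tensors

    def fetch(self, expert_id):
        if expert_id in self.gpu_cache:
            # Cache hit: no PCIe traffic, just mark the expert as recently used.
            self.gpu_cache.move_to_end(expert_id)
            return self.gpu_cache[expert_id]
        if len(self.gpu_cache) >= self.capacity:
            # Evict the least-recently-used expert to make room.
            self.gpu_cache.popitem(last=False)
        # Cache miss: copy the expert's weights over the CPU-GPU link.
        # On edge devices this transfer dominates decoding latency.
        gpu_weights = {name: tensor.to(self.device, non_blocking=True)
                       for name, tensor in self.cpu_experts[expert_id].items()}
        self.gpu_cache[expert_id] = gpu_weights
        return gpu_weights
```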
OD-MoE is a distributed MoE inference framework that eliminates the need for expert caches through fully on-demand expert loading. Its core mechanisms are: 1) parallelizing expert loading and expert computation across distributed edge nodes, with different groups of nodes simultaneously handling computation for the current layer and loading for upcoming layers; and 2) an ultra-accurate Scaled Emulative Prediction (SEP) scheme. SEP runs a low-cost, quantized "shadow" MoE model in parallel with the full-precision model to forecast expert activations multiple layers ahead with up to 99.94% accuracy.

The system comprises a main node for the non-expert components, a shadow node for SEP, and worker nodes for dynamic expert loading and computation, organized through worker-node grouping and round-robin scheduling. To keep predictions accurate over long decoding sequences, SEP incorporates KV cache and token alignment mechanisms that periodically synchronize the shadow model's state with the full-precision model, preventing cumulative errors. For the prefilling stage, OD-MoE loads all experts in parallel across the worker nodes and uses mini-batching to pipeline computation and communication, improving GPU utilization despite the slower edge network.
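As a rough illustration of the decoding-time pipeline described above, the sketch below shows how a quantized shadow model could predict expert activations a few layers ahead while worker groups alternate, round-robin, between loading the predicted experts and computing the current layer. Every name here (`predict_experts`, `load_experts`, `run_layer`, the `LOOKAHEAD` depth) is an assumption for illustration, not OD-MoE's actual API.

```python
from concurrent.futures import ThreadPoolExecutor

LOOKAHEAD = 2  # how many layers ahead SEP predicts; the exact depth is assumed here


def decode_step(token, full_model, shadow_model, worker_groups):
    """One decoding step: a shadow prediction pass, then layer-by-layer expert
    computation overlapped with on-demand loading for upcoming layers."""
    num_layers = full_model.num_layers
    # Cheap quantized "shadow" forward pass that records which experts each
    # MoE layer is expected to activate for this token.
    predicted = shadow_model.predict_experts(token)   # {layer: [expert ids]}

    pending = {}                                      # layer -> in-flight load
    hidden = full_model.embed(token)
    with ThreadPoolExecutor(max_workers=len(worker_groups)) as pool:
        # The first few layers have nothing earlier to overlap with, so their
        # predicted experts are simply loaded up front.
        for layer in range(min(LOOKAHEAD, num_layers)):
            loader = worker_groups[layer % len(worker_groups)]
            pending[layer] = pool.submit(loader.load_experts, predicted[layer])

        for layer in range(num_layers):
            # Prefetch: the group assigned to (layer + LOOKAHEAD) starts loading
            # that layer's predicted experts while the current layer computes.
            target = layer + LOOKAHEAD
            if target < num_layers:
                loader = worker_groups[target % len(worker_groups)]
                pending[target] = pool.submit(loader.load_experts, predicted[target])

            # Compute: block only if this layer's expert load has not finished.
            pending.pop(layer).result()
            group = worker_groups[layer % len(worker_groups)]
            hidden = group.run_layer(layer, hidden, experts=predicted[layer])

    return full_model.lm_head(hidden)
```

In this toy version each layer is loaded and computed by the same round-robin worker group; the point is simply that the load for layer `l + LOOKAHEAD` is already in flight while layer `l` computes, so workers spend little time idle waiting on expert weights.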
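Similarly, the mini-batched prefilling described above might look roughly like the following sketch, where `send_async` and `compute_experts` stand in for hypothetical worker-pool operations; the overlap of transfer and computation is the point, not the specific interface.

```python
def pipelined_prefill(hidden_states, workers, mini_batch_size=64):
    """Split the prompt's hidden states into mini-batches so that shipping one
    mini-batch to the worker nodes overlaps with expert computation on the
    previously shipped one, hiding the slower edge-network transfers."""
    outputs = []
    in_flight = None  # handle to the mini-batch currently being transferred

    for start in range(0, len(hidden_states), mini_batch_size):
        batch = hidden_states[start:start + mini_batch_size]
        # Start the non-blocking transfer of this mini-batch to the workers.
        next_in_flight = workers.send_async(batch)
        # While it is on the wire, compute experts for the previous mini-batch.
        if in_flight is not None:
            outputs.append(workers.compute_experts(in_flight))
        in_flight = next_in_flight

    # Drain the pipeline: compute the final mini-batch.
    if in_flight is not None:
        outputs.append(workers.compute_experts(in_flight))
    return outputs
```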
OD-MoE achieved an expert activation prediction accuracy of 99.94% with FP16 quantization, significantly outperforming existing prediction methods. Benchmarked on a ten-node testbed with Mixtral-8x7B, OD-MoE delivered approximately 75% of the decoding speed of a fully GPU-cached MoE deployment (HuggingFace Transformers) while consuming only one-third of the GPU memory (60 GB total). It also outperformed other expert-offloading baselines in decoding throughput by factors of 1.18x to 5.37x. Importantly, OD-MoE preserved full-precision answer quality across six representative LLM benchmarks (General Knowledge, Math, Reasoning, Coding, Instruction Following, and Anti-Hallucination), consistently surpassing expert-offloading baselines, which often introduce precision losses.
OD-MoE significantly reduces GPU memory requirements, bringing the per-worker GPU memory footprint to under 1 GB. This enables practical MoE inference on low-cost edge GPUs and even IoT-class devices such as Wi-Fi routers or webcams, cutting hardware costs more than threefold compared with fully GPU-cached deployments. Beyond edge scenarios, OD-MoE's SEP scheme and parallel expert-loading mechanism also benefit data-center operations by enabling on-demand expert replication and more cost-effective use of GPUs with smaller memory capacities. The open-sourcing of OD-MoE provides a foundation for further advances in distributed MoE inference systems.