AI Summary • Published on Apr 15, 2026
Large foundation models, such as LLMs, require massive GPU memory both for model parameters and for per‑request KV caches. When these models are served with pipeline parallelism, each inference request must traverse a chain of servers that collectively host all model blocks. Existing serving systems lack a principled way to decide how to place blocks across heterogeneous servers, how much cache to reserve, and how to dispatch jobs, leading to inefficient resource use and high response times.
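To make the memory pressure concrete, here is a back-of-envelope KV-cache sizing sketch. The model shape (layers, grouped-query KV heads, head dimension, fp16 storage) is an illustrative assumption, not a configuration from the paper:

```python
# Back-of-envelope KV-cache sizing for one request.
# All hyperparameters below are assumed for illustration.
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # 2 tensors (key + value) per layer, one vector per KV head per token,
    # bytes_per_elem=2 assumes fp16/bf16 storage.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * seq_len

# A hypothetical 70B-class model with grouped-query attention:
per_request = kv_cache_bytes(num_layers=80, num_kv_heads=8,
                             head_dim=128, seq_len=4096)
print(f"{per_request / 2**30:.2f} GiB per request")  # 1.25 GiB
```

At roughly a gigabyte per in-flight request on top of the parameter shards, it is clear why cache reservation competes directly with block placement for GPU memory.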
The authors formalise the “server‑chain composition” problem as a joint optimisation of block placement, cache allocation, and load balancing under the state‑of‑the‑art Join‑the‑Fastest‑Free‑Server policy. They prove the problem is NP‑hard and propose scalable algorithms: (1) Greedy Block Placement with Cache Reservation (GBP‑CR), which orders servers by speed and places blocks while reserving cache for a chosen concurrency level; (2) Greedy Cache Allocation (GCA), which iteratively builds the fastest feasible server chains via shortest‑path routing; and (3) Join‑the‑Fastest‑Free‑Chain (JFFC), an online dispatcher. Analytical bounds on mean response time are derived, and a one‑dimensional search selects the optimal cache‑reservation parameter.
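The JFFC dispatch rule can be sketched in a few lines: send each arriving job to the fastest currently idle chain, and queue it if every chain is busy. This is a minimal illustration of the policy under assumed data structures (`Chain`, a per-chain latency estimate, a FIFO queue), not the authors' implementation:

```python
from dataclasses import dataclass

@dataclass
class Chain:
    name: str
    est_latency: float   # estimated time to serve one request on this chain
    busy: bool = False

def jffc_dispatch(chains, queue, job):
    """Join-the-Fastest-Free-Chain: assign `job` to the fastest idle chain,
    or append it to the FIFO queue if all chains are busy."""
    free = [c for c in chains if not c.busy]
    if free:
        target = min(free, key=lambda c: c.est_latency)
        target.busy = True
        return target        # job starts immediately on the fastest free chain
    queue.append(job)        # wait until some chain frees up
    return None

def on_complete(chain, chains, queue):
    """When a chain finishes its job, re-dispatch the head-of-line job, if any."""
    chain.busy = False
    if queue:
        jffc_dispatch(chains, queue, queue.pop(0))
```

Note that the chain-level latency estimates fed into this rule are exactly what GBP-CR and GCA shape offline: placement and cache allocation determine which chains exist and how fast each one is.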
Simulation and a real‑world PETALS deployment demonstrate that the proposed stack (GBP‑CR + GCA + JFFC) reduces mean response time by 8 %–83 % compared with the baseline PETALS heuristics and by up to 76.8 % relative to a recent BPRR method. The gains are especially pronounced in memory‑constrained settings with few servers or a low fraction of high‑performance GPUs. Experiments also confirm the theoretical performance bounds and show that the solution stays effective even when real traffic deviates from Poisson/exponential assumptions.
By explicitly composing server chains and allocating cache, large‑model serving systems can dramatically lower latency without additional hardware. The framework provides a foundation for future extensions such as dynamic demand‑aware tuning, multi‑objective optimisation, and integration with emerging model‑parallel serving platforms.