All Tags
Browse through all available tags to find articles on topics that interest you.
Showing 1 result for this tag.
Serving Chain-structured Jobs with Large Memory Footprints with Application to Large Foundation Model Serving
The paper introduces a formal resource‑allocation framework for serving large transformer‑based models with pipeline parallelism, proposing greedy placement, cache allocation, and load‑balancing algorithms that drastically cut inference latency.