All Tags
Browse through all available tags to find articles on topics that interest you.
Showing 1 result for this tag.
AugServe: Adaptive Request Scheduling for Augmented Large Language Model Inference Serving
AugServe is an efficient inference framework that addresses the challenges of serving augmented large language models by introducing a two-stage adaptive request scheduling strategy and a dynamic token-level batching mechanism. It significantly reduces queuing latency and enhances effective throughput, leading to improved user experience in web applications.