AI Summary • Published on Dec 2, 2025
Existing augmented LLM inference systems suffer from two main issues: (i) First-Come-First-Served (FCFS) scheduling causes severe head-of-line (HoL) blocking, leading to excessive queuing delays that violate service-level objectives (SLOs). (ii) Static batch token limits fail to adapt to fluctuating workloads and hardware conditions, resulting in suboptimal throughput and resource waste. Both factors significantly degrade the effective throughput and overall service quality for augmented LLM applications.
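The head-of-line blocking problem can be made concrete with a small queuing sketch (not from the paper; the service times below are hypothetical): under FCFS, a single long-running request at the front of the queue inflates the queuing delay of every request behind it, while reordering by expected service time avoids most of that wait.

```python
# Minimal illustration of head-of-line (HoL) blocking under FCFS serving.
# All numbers are hypothetical; this is not AugServe's scheduler.

def queuing_delays(service_times):
    """Queuing delay of each request when served one at a time in order."""
    delays, elapsed = [], 0.0
    for t in service_times:
        delays.append(elapsed)  # time spent waiting before service begins
        elapsed += t
    return delays

# One long request (30 s) arrives ahead of three short ones (1 s each).
fcfs_order = [30.0, 1.0, 1.0, 1.0]   # FCFS: long request blocks the rest
sjf_order = sorted(fcfs_order)       # shortest-job-first reordering

fcfs_avg = sum(queuing_delays(fcfs_order)) / len(fcfs_order)
sjf_avg = sum(queuing_delays(sjf_order)) / len(sjf_order)
# FCFS average wait: (0 + 30 + 31 + 32) / 4 = 23.25 s
# SJF average wait:  (0 + 1 + 2 + 3) / 4  = 1.5 s
```

The same total work is done in both cases; only the ordering changes, which is why FCFS can violate latency SLOs even when throughput looks healthy.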
AugServe introduces a two-stage adaptive request scheduling strategy combined with a dynamic token-level batching mechanism. The first stage estimates a provisional scheduling value for incoming requests based on predicted output length, API call duration, and context handling policy. The second stage refines this value using runtime information, such as actual API return lengths, and incorporates an anti-starvation mechanism to ensure fairness. Concurrently, AugServe dynamically adjusts the batch token limit by monitoring available GPU memory and the memory held by paused request contexts, applying bounded constraints to maintain system stability and maximize throughput under varying loads.
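The two stages and the batch-limit adjustment described above can be sketched as follows. This is a hypothetical illustration, not AugServe's actual implementation: all function names, field names, weights, and bounds are assumptions.

```python
# Hypothetical sketch of two-stage scheduling values and a dynamic batch
# token limit; names and constants are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Request:
    predicted_output_len: float      # stage 1: predicted decode length
    predicted_api_duration: float    # stage 1: predicted API call time
    context_policy_cost: float       # stage 1: context handling policy cost
    actual_api_return_len: float = 0.0  # stage 2: runtime feedback
    wait_time: float = 0.0              # input to the anti-starvation term

def provisional_value(req: Request) -> float:
    """Stage 1: cheap estimate computed on arrival (lower = scheduled sooner)."""
    return (req.predicted_output_len
            + req.predicted_api_duration
            + req.context_policy_cost)

def refined_value(req: Request, starvation_weight: float = 0.1) -> float:
    """Stage 2: fold in runtime information (actual API return length) and
    subtract an anti-starvation bonus so long-waiting requests are not
    deferred indefinitely."""
    value = provisional_value(req) + req.actual_api_return_len
    return value - starvation_weight * req.wait_time

def adjust_batch_token_limit(limit, free_gpu_mem, paused_ctx_mem,
                             bytes_per_token, lo=256, hi=8192):
    """Grow the batch token limit when memory headroom (after reserving
    space for paused request contexts) allows, shrink it otherwise;
    clamp to [lo, hi] to keep the system stable."""
    headroom_tokens = (free_gpu_mem - paused_ctx_mem) // bytes_per_token
    return max(lo, min(hi, int(limit + 0.5 * (headroom_tokens - limit))))
```

The scheduler would sort the waiting queue by `refined_value` each iteration, while `adjust_batch_token_limit` runs periodically so the limit tracks actual memory pressure rather than a static configuration value.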
Experimental evaluations show that AugServe significantly outperforms baseline systems like vLLM and InferCept across various hardware, models, and datasets. AugServe achieves 4.7-33.1x higher effective throughput than vLLM and 3.3-13.2x higher than InferCept. Furthermore, it reduces the Time-to-First-Token (TTFT) by up to 96.3% compared to vLLM and 95.0% compared to InferCept, indicating significantly lower queuing delays. AugServe also demonstrates higher per-token efficiency with 80.3% and 62.7% lower normalized latency than vLLM and InferCept, respectively, and maintains superior stability under bursty traffic and high loads.
AugServe's innovations in adaptive request scheduling and dynamic batching provide a crucial advancement for augmented LLM inference services. By drastically reducing queuing latency and boosting effective throughput, AugServe directly translates to a superior user experience in web applications powered by augmented LLMs. Its ability to adapt to diverse and fluctuating workloads ensures stable and efficient service quality, positioning it as a key infrastructure component for next-generation web services that rely on complex, tool-augmented language models.