AI Summary • Published on Dec 3, 2025
Large Language Models (LLMs) excel at reasoning tasks with techniques like Chain of Thought (CoT), but their inference incurs substantial computational cost. Speculative Decoding (SD) mitigates this by using a lightweight draft model to propose tokens, which a more powerful target model then verifies. However, traditional token-level SD struggles in complex reasoning: minor token mismatches drive acceptance rates down, leading to premature rejections and wasted computation. Recent step-level SD methods address this by verifying entire reasoning steps, but existing approaches like Reward-guided Speculative Decoding (RSD) still regenerate many steps unnecessarily, paying the target model's full cost for little to no quality improvement. The root cause is that routing decisions rest on an absolute quality threshold rather than on the expected *advantage* the target model would provide over the draft.
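The rule Arbitrage replaces can be made concrete with a minimal sketch of RSD-style threshold routing. This is an illustration, not the paper's implementation: `draft_step`, `target_step`, and `reward` are hypothetical callables standing in for the draft model, target model, and reward model, and the threshold and stop condition are illustrative.

```python
from typing import Callable

def rsd_generate(
    prompt: str,
    draft_step: Callable[[str], str],     # cheap draft model: context -> next step
    target_step: Callable[[str], str],    # expensive target model: context -> next step
    reward: Callable[[str, str], float],  # reward model: (context, step) -> quality score
    threshold: float = 0.7,               # absolute acceptance threshold (illustrative)
    max_steps: int = 32,
) -> str:
    """RSD-style loop: accept a draft step only if its reward clears a fixed bar."""
    context = prompt
    for _ in range(max_steps):
        step = draft_step(context)
        if reward(context, step) < threshold:  # absolute-quality check: below the bar,
            step = target_step(context)        # regenerate even if the target is no better
        context += step
        if "\\boxed" in step:                  # hypothetical stop marker for math answers
            break
    return context
```

The weakness is visible in the branch: the target model is invoked whenever the draft's absolute score is low, even when the target's own step would score just as low.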
Arbitrage introduces a step-level speculative generation framework that dynamically routes between a fast draft model and a more capable target model based on the *relative advantage* of invoking the target. Unlike prior methods that apply a fixed acceptance threshold, Arbitrage employs a lightweight "Arbitrage Router" trained to predict when the target model is likely to produce a significantly better reasoning step. The framework is built around an "Arbitrage Oracle," an idealized policy that always selects the superior step between draft and target and thus establishes a theoretical upper bound for routing efficiency. The Arbitrage Router is a practical, lightweight model that approximates this oracle: it is trained offline on oracle-labeled data to estimate the expected quality difference (advantage) between the target and draft models for a given step, avoiding any costly target-model execution during inference. The training pipeline generates a step-level dataset with oracle labels, applies class-balanced downsampling to address label imbalance, and adds history annotations that supply contextual information. Router quality is evaluated with Spearman's rank correlation, a threshold-invariant metric that measures how well the router's predictions align with the true oracle advantage scores.
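A minimal sketch of how advantage-aware routing could slot into the same step loop, assuming a `router` callable that predicts the target-minus-draft quality gain for the current step; the names, the `margin` value, and the stop condition are illustrative rather than the paper's API:

```python
from typing import Callable

def arbitrage_generate(
    prompt: str,
    draft_step: Callable[[str], str],
    target_step: Callable[[str], str],
    router: Callable[[str, str], float],  # predicts advantage ~ E[q_target - q_draft]
    margin: float = 0.1,                  # invoke target only if predicted gain exceeds this
    max_steps: int = 32,
) -> str:
    """Advantage-aware loop: escalate to the target model only when the router
    predicts it would produce a meaningfully better step than the draft."""
    context = prompt
    for _ in range(max_steps):
        step = draft_step(context)
        if router(context, step) > margin:  # relative-advantage check, not absolute quality
            step = target_step(context)
        context += step
        if "\\boxed" in step:               # hypothetical stop marker
            break
    return context
```

The contrast with the RSD sketch above is the condition alone: the target is called only when it is predicted to *improve on* the draft, not merely when the draft looks weak in isolation.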
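The offline labeling stage might look like the following sketch, assuming hypothetical `draft_step`, `target_step`, and `quality` helpers; the advantage cutoff used for balancing is an assumed illustrative value, not a figure from the paper:

```python
import random
from typing import Callable, Dict, List

def build_router_dataset(
    contexts: List[str],
    draft_step: Callable[[str], str],
    target_step: Callable[[str], str],
    quality: Callable[[str, str], float],  # oracle-side quality score for a step
    adv_cutoff: float = 0.1,               # illustrative boundary between classes
    seed: int = 0,
) -> List[Dict]:
    """Label each context with the oracle advantage, then class-balance the result."""
    rows = []
    for context in contexts:
        d = draft_step(context)
        t = target_step(context)
        adv = quality(context, t) - quality(context, d)  # oracle advantage label
        rows.append({"context": context, "draft": d, "advantage": adv})

    # Class-balanced downsampling: most steps have near-zero advantage, so keep
    # every high-advantage row and subsample the dominant low-advantage class.
    rng = random.Random(seed)
    high = [r for r in rows if r["advantage"] > adv_cutoff]
    low = [r for r in rows if r["advantage"] <= adv_cutoff]
    low = rng.sample(low, min(len(low), len(high)))
    return high + low
```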
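Because Spearman's rank correlation compares only orderings, it is invariant to wherever the routing threshold is later placed. A minimal evaluation check with SciPy, on illustrative values:

```python
from scipy.stats import spearmanr

# predicted_adv: router outputs on held-out steps;
# oracle_adv: true target-minus-draft quality gaps from the labeled dataset.
predicted_adv = [0.32, -0.05, 0.71, 0.10, -0.22]  # illustrative values
oracle_adv    = [0.28,  0.01, 0.64, 0.05, -0.30]

rho, pvalue = spearmanr(predicted_adv, oracle_adv)
print(f"Spearman rho = {rho:.3f} (p = {pvalue:.3g})")
# A rho near 1 means the router ranks steps by true advantage well,
# regardless of where the routing margin is eventually set.
```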
Arbitrage consistently outperforms existing step-level speculative decoding baselines across mathematical reasoning benchmarks, including MATH500 and OlympiadBench, and across diverse model configurations (e.g., LLaMA3 and Qwen2.5-Math with different draft model sizes and quantization levels). In accuracy-versus-acceptance-rate plots, Arbitrage's curve consistently lies above RSD's, delivering higher accuracy at any given target-model usage rate while closely tracking the oracle's theoretical upper bound. Gains are most pronounced when there is a substantial quality gap between the draft and target models. Quantitatively, Arbitrage significantly reduces inference latency: on MATH500 in a quantized-draft regime (Q4-8B/8B) it achieves up to 1.62x lower latency, and on OlympiadBench in a small-draft regime (1B/8B) it reaches up to a 1.97x speedup at accuracy comparable to RSD. Across settings, Arbitrage cuts end-to-end latency by up to roughly 2x over step-level SD baselines at fixed accuracy targets, a more favorable compute-quality trade-off achieved by invoking the target model only when a meaningful quality improvement is predicted.
Arbitrage significantly advances the efficiency of Large Language Model inference for reasoning-intensive tasks, establishing a new baseline for step-level speculative decoding. By replacing absolute, draft-only acceptance rules with expected-advantage estimation, it provides a more robust and efficient routing mechanism: expensive target-model computation is allocated judiciously, only where it is likely to yield a substantial improvement in reasoning quality, minimizing computational waste while preserving or enhancing accuracy. Achieving higher accuracy at lower latency makes advanced LLM reasoning more practical and scalable for real-world applications, and this advantage-aware approach has broad implications for deploying LLMs in scenarios requiring complex, multi-step Chain-of-Thought reasoning.