AI Summary • Published on Dec 3, 2025
Long-context reasoning in Large Language Models (LLMs) commonly relies on Reinforcement Learning with Verifiable Rewards (RLVR), which faces several limitations: sparse rewards, poor sample efficiency, and heavy computational demands during post-training. Because RLVR provides only outcome-level feedback, it scores a nearly correct response containing a minor error the same as a completely wrong one; this sparse signal can encourage reward hacking or narrow the model's underlying reasoning capabilities rather than genuinely enhancing them. Existing alternatives have drawbacks of their own, often still relying on forms of process supervision or complex iterative training.
The authors propose Semantic Soft Bootstrapping (SSB), an RL-free self-distillation framework. In SSB, a single base LLM acts as both teacher and student. For each math problem, the model first generates multiple rollouts, which are then categorized as correct or incorrect based on the final answer. A "teacher" prompt is then constructed, providing the original problem along with a representative correct solution and the most common incorrect solution. The base model, acting as the teacher, synthesizes a single, detailed, robust, and error-aware explanation from this hinted context. This process creates paired teacher-student training data without human intervention. Crucially, the teacher's token-level logits for the answer portion are extracted and stored as soft labels. During the training phase, the student model (the same base model with LoRA adapters) receives only the raw question and is optimized to match the teacher's token distribution via a temperature-scaled KL divergence loss, without using cross-entropy or explicit reinforcement learning. This logit-level supervision allows the student to learn robust, step-by-step reasoning without direct hints during inference.
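The mechanics of the training step can be made concrete with a short sketch. The snippet below is a minimal, hypothetical PyTorch/Hugging Face illustration of the pipeline as described above, not the authors' implementation: the prompt wording, the helper names (`teacher_soft_labels`, `ssb_step`), the LoRA configuration, and the distillation temperature are all assumptions; only the overall scheme (hinted teacher context, answer-span logits as soft labels, temperature-scaled KL loss on the bare question) follows the summary.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

MODEL_NAME = "Qwen/Qwen2.5-3B-Instruct"  # base model reported in the paper
TEMPERATURE = 2.0                        # assumed distillation temperature

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
teacher = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()

# Student: the same base weights, with only LoRA adapters trainable.
student = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
student = get_peft_model(
    student,
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"]),
)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)


def teacher_soft_labels(question, correct_sol, wrong_sol, answer_text):
    """Run the teacher on the hinted prompt; keep logits for the answer span only."""
    hinted = (
        f"Problem: {question}\n"
        f"A correct solution: {correct_sol}\n"
        f"A common incorrect solution: {wrong_sol}\n"
        "Write one robust, error-aware explanation:\n"
    )
    prefix_len = tokenizer(hinted, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(hinted + answer_text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = teacher(ids).logits
    # Positions prefix_len-1 .. T-2 predict the answer tokens prefix_len .. T-1.
    return logits[:, prefix_len - 1 : -1, :]


def ssb_step(question, answer_text, teacher_logits):
    """One student update: match teacher distributions given only the raw question."""
    bare = f"Problem: {question}\n"
    q_len = tokenizer(bare, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(bare + answer_text, return_tensors="pt").input_ids
    student_logits = student(ids).logits[:, q_len - 1 : -1, :]

    # Temperature-scaled KL divergence between teacher and student token
    # distributions over the answer span; no cross-entropy, no RL objective.
    vocab = student_logits.size(-1)
    loss = F.kl_div(
        F.log_softmax(student_logits.reshape(-1, vocab) / TEMPERATURE, dim=-1),
        F.softmax(teacher_logits.reshape(-1, vocab) / TEMPERATURE, dim=-1),
        reduction="batchmean",
    ) * TEMPERATURE**2

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

The sketch also assumes the answer text tokenizes identically after the hinted and bare prefixes so the two logit sequences align; a real implementation would need to enforce that alignment explicitly.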
Experiments fine-tuned the Qwen2.5-3B-Instruct model on a curated set of 256 samples from GSM8K. The SSB-trained model was evaluated against a Group Relative Policy Optimization (GRPO) baseline on the MATH500 and AIME2024 benchmarks, where SSB improved accuracy by 10.6% on MATH500 and 10% on AIME2024 over GRPO. Training dynamics showed a stable decrease in both loss and gradient norm, indicating convergence. Notably, SSB training did not systematically increase completion length, suggesting that the gains in reasoning ability are not tied to longer responses or higher token usage.
Semantic Soft Bootstrapping offers a promising, compute-efficient, RL-free alternative for improving long-context reasoning in LLMs. By leveraging self-distillation and logit-level supervision from semantically rich contexts, SSB overcomes key limitations of traditional RLVR methods, such as sparse rewards and high computational costs, while also mitigating issues like reward hacking and performance collapse. The stable training dynamics and performance gains on challenging math benchmarks suggest that this approach can effectively distill richer internal semantics into model parameters. The authors believe the method can be scaled to larger models and more diverse domains, opening further research into its sample efficiency and scaling behavior relative to modern RLVR pipelines.