AI Summary • Published on Jan 14, 2026
The rapid advancement of AI towards autonomous scientific discovery, or "agentic science," faces a significant challenge: ultra-long-horizon autonomy. Large Language Models (LLMs) excel at short-term reasoning but become overwhelmed by the vast execution details and delayed feedback inherent in real-world research, such as Machine Learning Engineering (MLE). Existing context management approaches often aggregate information linearly or through simple summarization, failing to structurally differentiate between immediate execution traces and the stable, long-term strategic insights necessary to sustain coherence over experimental cycles spanning days or weeks. This limitation prevents LLM-based agents from effectively accumulating and leveraging experience for complex, prolonged scientific exploration.
The authors propose ML-Master 2.0, an autonomous agent designed for ultra-long-horizon Machine Learning Engineering (MLE) tasks, which conceptualizes context management as a process of "cognitive accumulation." At its core is the Hierarchical Cognitive Caching (HCC) architecture, which draws inspiration from computer memory hierarchies. HCC comprises two main elements: hierarchical caching, which structures context into three tiers according to the temporal stability and reusability of their contents, and context migration, which governs the dynamic flow of information between these tiers. The three cache levels are: Evolving Experience (L1) for immediate, high-fidelity execution traces; Refined Knowledge (L2) for intermediate, stabilized insights from completed experimental phases; and Prior Wisdom (L3) for task-agnostic, transferable strategies learned from past tasks. Context migration mechanisms include prefetching relevant prior wisdom from L3 at the outset of a task, a "context hit" policy that prioritizes raw L1 data before falling back to L2 summaries, and a "context promotion" operator. This promotion operator, split into phase-level (P1) and task-level (P2) components, uses LLMs to abstract raw execution traces into refined knowledge (L2) and distill comprehensive wisdom into L3, ensuring efficient and sustained long-horizon exploration.
ML-Master 2.0 was evaluated on OpenAI's MLE-Bench using a 24-hour execution budget, with performance measured by the average medal rate (Bronze, Silver, or Gold). The agent achieved a state-of-the-art overall medal rate of 56.44%, representing a 92.7% relative improvement over its predecessor, ML-Master. Performance gains were observed across all task complexities, with low-complexity tasks improving to 75.76%, medium-complexity to 50.88%, and high-complexity to 42.22%. ML-Master 2.0 also significantly outperformed existing baselines, achieving an 11.2% relative improvement over the best closed-source method (from 50.7% to 56.4%) and a 60.7% improvement over the leading open-source method (from 35.1% to 56.4%). It demonstrated robustness with a 95.6% valid submission rate and surpassed the top 50% of human participants in 63.1% of tasks. Ablation studies confirmed the critical role of each HCC layer: removing Evolving Experience (L1) drastically reduced the medal rate to 22.7%, excluding Refined Knowledge (L2) hindered top-tier performance, and omitting Prior Wisdom (L3) degraded performance by impacting effective exploration. The HCC architecture effectively managed context length, reducing peak token usage from over 200k to approximately 70k.
This research establishes a foundational paradigm for agentic science, demonstrating that ultra-long-horizon autonomy is achievable through structured cognitive accumulation rather than mere context window expansion. ML-Master 2.0's Hierarchical Cognitive Caching (HCC) architecture offers a scalable blueprint for AI agents, enabling them to sustain strategic coherence and iterative correction over prolonged experimental cycles. By dynamically distilling transient experiences into stable knowledge and reusable wisdom, the system effectively manages context saturation, a critical bottleneck in deploying LLM-based agents in the complex, high-dimensional, delayed-feedback environments characteristic of scientific discovery. The findings suggest that this approach can empower autonomous agents to orchestrate the entire scientific lifecycle, facilitating exploration and breakthroughs beyond current human capabilities.