AI Summary • Published on Dec 2, 2025
The increasing integration of large language model (LLM)-based agents into daily digital interactions makes their ability to reason across long interaction histories critically important for personalized and contextually aware assistance. However, the performance of these action-taking WebAgents in long-context scenarios within realistic web environments remains largely unexplored. While existing research has focused on personalizing LLMs for chatbots and information retrieval, there is a clear gap in understanding and evaluating the long-term interaction capabilities of WebAgents.
This paper introduces a novel benchmark for evaluating the long-context reasoning abilities of WebAgents on long-horizon, sequential tasks. The benchmark pairs a dataset of sequentially dependent subtasks with a methodology that simulates extended user interactions by injecting irrelevant task trajectories into the agent's context; the injected context ranges from 25,000 to 150,000 tokens. Tasks are sourced from the WebCanvas dataset and curated to ensure they are executable and satisfy three properties: the second subtask (A2) depends on the first (A1), is underspecified so that a required attribute must be recalled from A1, and is unambiguous so the agent cannot resolve it from the injected distractor tasks. The study evaluates four prominent models: Claude-3.7, GPT-4.1, Llama 4, and o4-mini. Task success is assessed using rule-based "key steps" from WebCanvas rather than potentially fluctuating LLM-based judgments. The paper also proposes an implicit Retrieval-Augmented Generation (iRAG) approach in which the WebAgent generates task-relevant summaries that simplify complex instructions and are appended to its context to aid reasoning.
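To make the setup concrete, below is a minimal Python sketch of the context-injection and iRAG summarization steps as they are described in this summary. The `Trajectory`, `build_agent_context`, and `irag_summary_prompt` names, and the whitespace-based token counter, are illustrative assumptions, not the authors' actual implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Trajectory:
    """A recorded (task instruction, action sequence) pair."""
    instruction: str
    actions: List[str]

    def to_text(self) -> str:
        return self.instruction + "\n" + "\n".join(self.actions)


def build_agent_context(
    subtask_a1: Trajectory,
    subtask_a2_instruction: str,
    irrelevant_trajectories: List[Trajectory],
    target_tokens: int,  # e.g. 25_000 to 150_000 in the benchmark
    count_tokens=lambda text: len(text.split()),  # stand-in tokenizer
) -> str:
    """Insert irrelevant task trajectories between A1 and A2 until the
    injected noise reaches roughly the target token budget."""
    noise_parts: List[str] = []
    noise_tokens = 0
    for traj in irrelevant_trajectories:
        if noise_tokens >= target_tokens:
            break
        text = traj.to_text()
        noise_parts.append(text)
        noise_tokens += count_tokens(text)

    # A2 is deliberately underspecified: the agent must recover the missing
    # attribute from A1, which now sits behind a long block of noise.
    return "\n\n".join([subtask_a1.to_text(), *noise_parts, subtask_a2_instruction])


def irag_summary_prompt(context: str) -> str:
    """Ask the agent itself for a task-relevant summary; the summary is then
    appended to the context before the agent plans its next action."""
    return (
        "Summarize only the information from the interaction history below "
        "that is needed to complete the final task, then restate that task "
        "as a single self-contained instruction.\n\n" + context
    )
```

In this reading, iRAG requires no external retriever: the agent's own summary acts as the retrieved, condensed instruction that is concatenated back into its working context.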
The evaluation revealed a significant performance drop for all tested WebAgents as context length increased. Success rates of 40-50% in the baseline condition (no injected noise) plummeted to under 10% in long-context scenarios. Error analysis showed that agents primarily failed by getting stuck in repetitive loops and losing sight of the original task objectives, and task efficiency also decreased sharply with longer contexts. The o4-mini model performed relatively better, suggesting that reasoning-focused models may have an advantage on these tasks. The proposed implicit RAG approach yielded modest improvements for Claude-3.7, GPT-4.1, and o4-mini at the 150k context length, indicating that decomposing complex instructions through summarization can help. Even so, overall success rates remained low, highlighting persistent fundamental limitations in long-context reasoning. Common failure modes included prematurely ending tasks (False End), hitting step limits due to loops or inefficient progress, and other technical errors such as timeouts.
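The failure-mode taxonomy above can be pictured as a simple labeling function over episode outcomes. The sketch below is hypothetical: the labels mirror the categories reported in the summary, but the detection signals and their priority order are assumptions.

```python
from enum import Enum, auto

class FailureMode(Enum):
    SUCCESS = auto()
    FALSE_END = auto()    # agent declares the task done prematurely
    STEP_LIMIT = auto()   # looping or inefficient progress exhausts the step budget
    OTHER_ERROR = auto()  # timeouts and other technical failures


def classify_episode(
    key_steps_completed: bool,
    agent_declared_done: bool,
    steps_taken: int,
    max_steps: int,
    had_technical_error: bool,
) -> FailureMode:
    """Map an episode's outcome signals onto a single failure-mode label,
    using the rule-based key steps as the success criterion."""
    if key_steps_completed:
        return FailureMode.SUCCESS
    if had_technical_error:
        return FailureMode.OTHER_ERROR
    if agent_declared_done:
        return FailureMode.FALSE_END
    if steps_taken >= max_steps:
        return FailureMode.STEP_LIMIT
    return FailureMode.OTHER_ERROR
```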
These findings expose critical challenges in deploying current LLM-based WebAgents in real-world, long-term user interaction environments. The dramatic performance degradation and prevalent failure modes, such as looping and loss of task objectives in long contexts, underscore an urgent need for advancements in agent architecture. Future research and development should prioritize creating more robust memory architectures, implementing improved context filtering mechanisms, and enhancing planning capabilities. Such improvements are essential to enable WebAgents to maintain coherent task execution and provide reliable assistance across the complex and information-rich extended interaction histories that characterize realistic user scenarios. The study acknowledges limitations including the time-consuming nature of long-context experiments, potential for unrealistic subtask generation, the dynamic nature of the live internet, and the inherent non-determinism of LLMs affecting reproducibility.