AI Summary • Published on Dec 2, 2025
Modern industrial automation operates in dynamic environments that demand flexible control strategies. Large Language Model (LLM) agents show promise for adaptive planning and execution, but a significant obstacle is the lack of standardized benchmarks for systematically comparing their performance. Existing benchmarks typically address static planning or tool execution in isolation and make no provision for real-time execution, plan validation, or adaptation to environmental changes, particularly in dynamic, partially observable, or constrained scenarios.
The authors introduce a benchmark based on the classic Blocksworld domain to evaluate LLM-based agents. It features an executable simulation environment with scenarios organized into five levels of increasing complexity: basic tasks, tasks requiring non-constructive actions, impossible tasks, tasks with additional constraints (such as block size), and tasks under partial observability. The benchmark integrates the Model Context Protocol (MCP) as a standardized tool interface, allowing different LLM agent architectures to interact with the simulation without custom modifications. The simulation exposes a REST API, which an MCP server wraps to provide tools for information retrieval (e.g., get_rules, get_status), plan verification (verify_plan), and execution of primitive Blocksworld actions (pick_up, put_down, stack, unstack). This modular design enables direct comparison between different LLM agents and even classical symbolic planners.
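To make the tool surface concrete, the following is a minimal, hypothetical sketch of such a wrapper built with the MCP Python SDK's FastMCP helper. The REST endpoint paths, base URL, and payload shapes are assumptions for illustration (the summary does not specify them), and only a subset of the listed tools is shown.

```python
# Hypothetical sketch: exposing a Blocksworld simulation's REST API as MCP tools.
# Tool names mirror those in the summary; the endpoints and payloads are assumed.
import requests
from mcp.server.fastmcp import FastMCP

BASE_URL = "http://localhost:8000"  # assumed address of the simulation's REST API

mcp = FastMCP("blocksworld")

@mcp.tool()
def get_status() -> dict:
    """Return the current (possibly partial) Blocksworld state from the simulation."""
    return requests.get(f"{BASE_URL}/status").json()

@mcp.tool()
def verify_plan(plan: list[str]) -> dict:
    """Ask the simulation to validate a candidate plan without executing it."""
    return requests.post(f"{BASE_URL}/verify", json={"plan": plan}).json()

@mcp.tool()
def stack(block: str, target: str) -> dict:
    """Execute the primitive stack(block, target) action and return the outcome."""
    return requests.post(
        f"{BASE_URL}/actions/stack", json={"block": block, "target": target}
    ).json()

if __name__ == "__main__":
    mcp.run()  # stdio transport by default, so any MCP client can attach
```

Because the server speaks plain MCP over stdio, any MCP-capable agent framework or a classical planner wrapped as an MCP client can connect to the same tool set without bespoke glue code, which is what enables the cross-architecture comparisons the authors describe.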
A single-agent system, implemented as a ReAct agent using OpenAI's o3 model via LangGraph, was evaluated on 50 predefined scenarios. The benchmark successfully differentiated the complexity levels: success rates declined from 80% on basic scenarios (Category 1) to 60% on partially observable scenarios (Category 5), while the agent identified impossible scenarios (Category 3) with 100% accuracy. Performance metrics such as execution time, planning attempts, and token consumption increased sharply with complexity: Category 1 tasks averaged 76 seconds and 35,100 tokens, whereas Category 5 tasks averaged 676 seconds, 3.1 planning attempts, and 192,000 tokens. Observed failure modes included generating invalid intermediate steps, violating constraints during execution, and terminating prematurely, particularly in scenarios requiring extensive non-constructive actions or involving partial observability.
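For context, here is a minimal sketch of how such a single-agent setup might be wired together. Only the use of a LangGraph ReAct agent with OpenAI's o3 model comes from the summary; the langchain-mcp-adapters bridge, the server script name blocksworld_mcp_server.py, the prompt, and the exact model identifier are assumptions for illustration.

```python
# Hypothetical sketch: a LangGraph ReAct agent whose tools are loaded from the
# Blocksworld MCP server sketched above. Names marked "assumed" are not from the paper.
import asyncio
from langchain_mcp_adapters.client import MultiServerMCPClient
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

async def main() -> None:
    # Connect to the Blocksworld MCP server over stdio and load its tools
    # (get_rules, get_status, verify_plan, pick_up, put_down, stack, unstack).
    client = MultiServerMCPClient({
        "blocksworld": {
            "command": "python",
            "args": ["blocksworld_mcp_server.py"],  # assumed server entry point
            "transport": "stdio",
        }
    })
    tools = await client.get_tools()

    # ReAct agent driven by the o3 model, as described in the summary.
    agent = create_react_agent(ChatOpenAI(model="o3"), tools)

    result = await agent.ainvoke({
        "messages": [{
            "role": "user",
            "content": "Reach the goal configuration: A on B, B on C, C on the table.",
        }]
    })
    print(result["messages"][-1].content)

if __name__ == "__main__":
    asyncio.run(main())
```

In this setup the agent alternates reasoning steps with tool calls, so metrics such as planning attempts and token consumption fall out naturally from the message trace, which matches the kind of per-scenario statistics reported above.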
The introduced benchmark offers a robust platform for systematically evaluating LLM-based agents in planning and execution within dynamic environments, bridging the gap between static symbolic benchmarks and dynamic agent evaluation frameworks. It provides quantitative metrics for comparing diverse agent architectures and highlights specific areas for future architectural improvements in LLM agent design. Future work includes systematically comparing different LLM agent architectures (single-agent, multi-agent, hybrid), extending scenarios with dynamic events (runtime errors, changing goals), incorporating additional real-world constraints (block weight, material properties), and developing multi-robot scenarios, as well as scenarios with ambiguous or incomplete initial specifications.