All Tags
Browse through all available tags to find articles on topics that interest you.
Showing 2 results for this tag.
Benchmark for Planning and Control with Large Language Model Agents: Blocksworld with Model Context Protocol
This paper introduces a new benchmark for evaluating Large Language Model (LLM) agents in planning and execution tasks within industrial automation. It uses the Blocksworld problem with five complexity categories and integrates the Model Context Protocol (MCP) as a standardized tool interface, enabling systematic comparison of diverse LLM agent architectures.
Evaluating Long-Context Reasoning in LLM-Based WebAgents
This paper introduces a benchmark for evaluating the long-context reasoning capabilities of WebAgents through sequentially dependent subtasks that require retrieving and applying information from extended interaction histories. It observes a dramatic performance degradation as context length increases and proposes an implicit RAG approach that yields modest improvements.