AI Summary • Published on Apr 29, 2026
Classical world models typically rely on flat tensor representations, which brings several drawbacks: high sensitivity to observation noise, error accumulation over long prediction horizons, and limited capacity for complex reasoning, because the models do not capture the laws that govern object interactions. Recent research has begun to use graph structures to address these limitations by modeling environments in a more structured space, but the field has lacked a cohesive definition and a comprehensive survey that unifies these emerging "graph-based world models" as a distinct research paradigm.
This survey formally defines Graph World Models (GWMs) as an extension of traditional world models, where the environment is represented by a graph structure (nodes and edges). GWMs inject relational inductive biases (RIBs) through two core operations: structural abstraction, which constructs graphs from observations, and relational transition, which models dynamic changes within the graph structure. The paper introduces a novel three-layer taxonomy for GWMs, categorizing them based on the specific RIBs they incorporate:
The first category, "Graph as Connector," focuses on spatial RIBs. These models abstract environmental observations into topological frameworks, such as landmarks and the connections between them, to capture reachability and connectivity. This approach compresses high-dimensional trajectory data and turns search over a continuous space into more efficient graph retrieval, primarily addressing noise sensitivity and enabling long-term navigation.
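As a rough illustration of the connector idea, the sketch below compresses a 2D trajectory into landmarks and then plans by breadth-first search over the resulting graph. It is a toy under stated assumptions, not the method of any surveyed model: the function names, the distance threshold `radius`, and the rule that consecutive visits imply reachability are all hypothetical choices made for this example.

```python
from collections import deque
from math import dist


def build_landmark_graph(trajectory, radius):
    """Structural abstraction: compress a trajectory into landmarks and edges."""
    landmarks = []  # landmark coordinates
    edges = {}      # landmark index -> set of neighbouring landmark indices
    prev = None
    for point in trajectory:
        # Reuse the nearest landmark within `radius`, otherwise create a new one.
        near = [i for i, lm in enumerate(landmarks) if dist(lm, point) <= radius]
        if near:
            cur = min(near, key=lambda i: dist(landmarks[i], point))
        else:
            landmarks.append(point)
            cur = len(landmarks) - 1
            edges[cur] = set()
        # Assumption: visiting two different landmarks in a row means they are connected.
        if prev is not None and prev != cur:
            edges[prev].add(cur)
            edges[cur].add(prev)
        prev = cur
    return landmarks, edges


def shortest_path(edges, start, goal):
    """Planning as graph retrieval: breadth-first search over landmark indices."""
    parent = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in sorted(edges[node]):
            if nxt not in parent:
                parent[nxt] = node
                queue.append(nxt)
    return None  # goal is not reachable from start


trajectory = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.1, 0.1),
              (2.0, 0.0), (2.0, 1.0), (1.0, 0.05)]
landmarks, edges = build_landmark_graph(trajectory, radius=0.5)
path = shortest_path(edges, 0, 3)
print(len(landmarks), path)  # 4 [0, 1, 3]
```

Seven raw observations collapse into four landmarks, and the final revisit of the second landmark adds a shortcut edge that the planner then uses. Nearby noisy observations map to the same node, which is one simple way a graph abstraction can reduce sensitivity to observation noise.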
The second category, "Graph as Simulator," injects physical RIBs. These models distill complex physical laws into structured interaction rules to predict how entities interact over time. By focusing on physical interactions like collisions and friction, they aim to reduce error accumulation by abstracting away pixel-level details and modeling dynamic processes at an object or system level.
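To make the simulator idea concrete, here is a minimal sketch of one relational transition on a 1D chain of particles connected by springs. The interaction rule is hand-written rather than learned, and the function name, the parameters `rest_len`, `k`, and `dt`, and the unit-mass assumption are all illustrative; the surveyed models typically learn the edge and node functions from data instead.

```python
def relational_step(pos, vel, edges, rest_len=1.0, k=1.0, dt=0.1):
    """One relational transition: edge messages -> per-node aggregation -> node update.

    Assumes a 1D chain where, for every edge (i, j), node j lies to the right of node i.
    """
    force = [0.0] * len(pos)
    for i, j in edges:
        # Edge function: spring force computed from the relative state of the endpoints.
        stretch = (pos[j] - pos[i]) - rest_len
        f = k * stretch
        # Aggregation: sum messages per node (equal and opposite forces).
        force[i] += f
        force[j] -= f
    # Node function: semi-implicit Euler update with unit mass.
    new_vel = [v + dt * f for v, f in zip(vel, force)]
    new_pos = [p + dt * v for p, v in zip(pos, new_vel)]
    return new_pos, new_vel


pos, vel = [0.0, 1.5, 2.5], [0.0, 0.0, 0.0]
edges = [(0, 1), (1, 2)]

# Roll the model forward; the state stays at the object level throughout.
states = [(pos, vel)]
for _ in range(50):
    pos, vel = relational_step(pos, vel, edges)
    states.append((pos, vel))
```

Because every edge contributes equal and opposite forces, total momentum is conserved at each step by construction. This is the kind of structural constraint that a flat tensor model would have to learn implicitly, and it is one intuition for why object-level rollouts can accumulate error more slowly.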
The third category, "Graph as Reasoner," uses logical RIBs. Models in this category transform environmental dynamic patterns into stable logical structures, where nodes represent cognitive concepts or causal factors and edges denote semantic constraints or causal associations. This enables advanced capabilities such as instruction following, high-level reasoning, and counterfactual prediction by extracting normative semantic protocols or invariant causal skeletons.
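The reasoner idea can be sketched with a tiny causal graph in which each node carries a mechanism over its parents, and a "what if" query is answered by overriding a node and recomputing everything downstream. The variables and mechanisms below are invented for illustration. Because the toy model is deterministic and has no hidden noise terms, an intervention and a counterfactual coincide here; a full counterfactual query would in general first have to infer the unobserved noise.

```python
# Hypothetical causal skeleton: node -> (parents, mechanism over parent values).
MODEL = {
    "rain":      ((), lambda: True),
    "sprinkler": (("rain",), lambda rain: not rain),
    "wet":       (("rain", "sprinkler"), lambda rain, sprinkler: rain or sprinkler),
    "slippery":  (("wet",), lambda wet: wet),
}


def evaluate(model, interventions=None):
    """Resolve every node from its parents; intervened nodes ignore their parents."""
    interventions = interventions or {}
    values = {}

    def value_of(name):
        if name not in values:
            if name in interventions:
                values[name] = interventions[name]
            else:
                parents, mechanism = model[name]
                values[name] = mechanism(*(value_of(p) for p in parents))
        return values[name]

    for name in model:
        value_of(name)
    return values


factual = evaluate(MODEL)
no_rain = evaluate(MODEL, {"rain": False})
no_water = evaluate(MODEL, {"rain": False, "sprinkler": False})

print(factual["slippery"], no_rain["slippery"], no_water["slippery"])  # True True False
```

Removing the rain alone does not make the ground dry in this toy model, because the sprinkler mechanism responds to the change. Answering that correctly requires the graph structure, not just correlations between variables.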
The survey comprehensively reviews representative models within each proposed category of GWMs. For the "Graph as Connector" category, various methods are discussed, including those that construct explicit spatial topologies (e.g., SPTM for path planning, L3P for landmark identification, CityNavAgent for city-scale navigation) and those that build implicit experiential memory (e.g., WGD for hierarchical RL, VMG for long-distance control, G4RL for sub-target mapping). These models demonstrate improved efficiency in navigation and planning by leveraging graph-based spatial relationships.
In the "Graph as Simulator" category, the paper examines object-centric interaction models (e.g., C-SWM and G-SWM for modeling entity interactions, ROCA for improved sample efficiency, Dyn-O for complex environments), which focus on discrete entities. It also covers system-oriented interaction models (e.g., VDFD for multi-agent tasks, RoboPack for robot interactions, HD-VPD for high-fidelity material simulation), which address complex, interconnected environments. Both groups distill physical laws into interaction rules in order to predict dynamic changes and reduce error accumulation.
For the "Graph as Reasoner" category, the survey explores models that develop normative semantic protocols (e.g., Worldformer for textual environments, S3 and YuLan-OneSim for social simulation, GWM for multimodal integration) and those that extract invariant causal skeletons (e.g., FANS-RL for non-stationary properties, VCD for hidden space causal graphs, CCSA for open-world environments). These approaches enhance reasoning capabilities by uncovering and leveraging logical and causal structures.
The paper outlines several critical challenges and promising future research directions for Graph World Models. First, current GWMs struggle to adapt dynamically to non-stationary real-world environments, which calls for research into topological plasticity that lets graph structures adjust online. Second, most simulators are deterministic, which limits their ability to model the stochastic nature of real physical processes; future work should explore probabilistic models such as graph diffusion or variational priors for multi-hypothesis inference. Third, reasoner models, especially those built on Large Language Models, are prone to logical hallucinations, suggesting the need for differentiable verification layers that align generated symbols with geospatial knowledge or physics engines.
Additionally, there is a recognized need for multi-granularity inductive biases, potentially through multi-layer graph architectures, to enable GWMs to jointly model logical, physical, and low-level spatial interactions for more robust long-term planning. Finally, the field requires dedicated benchmarks and evaluation metrics that go beyond simple task-level success rates. These new benchmarks should assess graph-specific properties such as fidelity, relational transition accuracy, long-horizon stability, generalization under distribution shift, and efficiency in construction, update, retrieval, and planning, providing more diagnostic insights into model failures.
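As one hedged illustration of what a graph-specific diagnostic could look like, the sketch below scores predicted edge sets against ground truth with an F1 measure and tracks that score across rollout steps as a crude long-horizon stability curve. These definitions are assumptions made for this example; the survey calls for such metrics but this summary does not specify how they should be computed.

```python
def edge_f1(predicted, target):
    """F1 score between two collections of undirected edges."""
    pred = {frozenset(e) for e in predicted}
    true = {frozenset(e) for e in target}
    if not pred and not true:
        return 1.0  # both graphs are empty: treat as perfect agreement
    tp = len(pred & true)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(true)
    return 2 * precision * recall / (precision + recall)


def stability_curve(predicted_rollout, target_rollout):
    """Per-step edge F1 over a rollout; a falling curve indicates drift."""
    return [edge_f1(p, t) for p, t in zip(predicted_rollout, target_rollout)]


target_rollout = [[(0, 1), (1, 2)], [(0, 1), (1, 2)], [(0, 1), (2, 3)]]
predicted_rollout = [[(0, 1), (1, 2)], [(1, 0)], [(0, 2)]]
curve = stability_curve(predicted_rollout, target_rollout)
```

A curve like this shows at which step the predicted structure starts to diverge, which a single task-level success rate cannot reveal.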