AI Summary • Published on Jan 28, 2026
Current approaches to image generation and editing typically treat the two tasks in isolation, which makes it hard to maintain spatial consistency and semantic coherence between generated content and subsequent edits. A further hurdle is the lack of structured control over how objects relate to one another and how they are arranged within a scene. Scene graph-based methods offer a promising solution by representing objects and their interrelationships in a structured format, but they frequently require additional fine-tuning or training for editing tasks, which raises computational cost. The core problem SimGraph aims to solve is the need for a unified, efficient framework that seamlessly integrates image generation and editing while ensuring high quality and semantic consistency, particularly for complex scene graphs.
SimGraph proposes a unified framework that combines scene graph-based image generation and editing into a single pipeline. The framework begins by extracting a scene graph from an input image with a Multimodal Large Language Model (e.g., Qwen2.5-VL-7B), which identifies objects and their relations. These relations are then refined into a concise, ordered caption: redundant triplets are pruned, and the remainder are sorted by a salience score that emphasizes key entities (a sketch of this step follows below).

For image generation, SimGraph synthesizes new images directly from the refined scene graph caption using a Visual AutoRegressive (VAR) model conditioned on text embeddings from a frozen CLIP encoder.

For image editing, the framework takes an original image and an edited scene graph derived from user instructions. It constructs two complementary prompts: a source prompt that preserves existing content and a target prompt that prioritizes the edits. These prompts drive a diffusion-based editor (instantiated with LEDITS++ and Stable Diffusion) under joint conditioning: latent inversion reconstructs unchanged regions, and a denoising trajectory with classifier-free guidance blends the source and target paths, stabilizing background structure while applying the requested modifications. The entire process is unified under a single formulation driven by graph-derived controls, so the model switches between generation and editing based on which inputs are available.
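The caption-refinement step can be illustrated with a minimal sketch. The exact salience score used by SimGraph is not specified in this summary; the version below hypothetically treats an entity as more salient the more relations it participates in, and the `refine_scene_graph` helper is an illustrative name, not the paper's API.

```python
# Minimal sketch of triplet pruning + salience-ordered caption building.
# The salience definition here is an assumption, not SimGraph's exact score.
from collections import Counter

def refine_scene_graph(triplets: list[tuple[str, str, str]]) -> str:
    """Prune duplicate (subject, relation, object) triplets, sort the
    rest by a salience score, and render a concise, ordered caption."""
    # Deduplicate while preserving first occurrence.
    seen, unique = set(), []
    for t in triplets:
        if t not in seen:
            seen.add(t)
            unique.append(t)

    # Hypothetical salience: entities appearing in many relations are
    # treated as more central to the scene.
    counts = Counter()
    for subj, _, obj in unique:
        counts[subj] += 1
        counts[obj] += 1

    unique.sort(key=lambda t: counts[t[0]] + counts[t[2]], reverse=True)
    return ", ".join(f"{s} {r} {o}" for s, r, o in unique)

print(refine_scene_graph([
    ("bear", "standing in", "forest"),
    ("tree", "next to", "bear"),
    ("bear", "standing in", "forest"),   # redundant triplet, pruned
]))
# -> "bear standing in forest, tree next to bear"
```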
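The joint source/target conditioning in the editing branch can likewise be sketched schematically. This is not the exact LEDITS++ update rule; the guidance weights `w_src` and `w_tgt` and the `model(latent, t, cond)` call signature are hypothetical placeholders for the diffusion model's noise predictions under each prompt.

```python
# Schematic of one denoising step that blends a source path (preserving
# the inverted image's structure) with a target path (driving the edit)
# via classifier-free guidance. Illustrative only; weights and the
# model interface are assumptions.
import torch

def blended_cfg_step(latent: torch.Tensor, t: int, model,
                     uncond, src_prompt, tgt_prompt,
                     w_src: float = 3.0, w_tgt: float = 7.5) -> torch.Tensor:
    # Noise predictions under no prompt, the source prompt, and the
    # edited (target) prompt.
    eps_uncond = model(latent, t, uncond)
    eps_src = model(latent, t, src_prompt)
    eps_tgt = model(latent, t, tgt_prompt)

    # Guidance directions relative to the unconditional prediction.
    dir_src = eps_src - eps_uncond   # pulls toward the original scene
    dir_tgt = eps_tgt - eps_uncond   # pulls toward the edited scene

    # Blend: the source term stabilizes background structure while the
    # more strongly weighted target term applies the modification.
    return eps_uncond + w_src * dir_src + w_tgt * dir_tgt
```

In this reading, latent inversion supplies the starting latent that the source path can reconstruct, so regions untouched by the edit stay close to the original image while the target term drives the change.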
SimGraph demonstrates superior performance compared to existing scene graph-based methods in both image generation and editing tasks. In image editing experiments conducted on the EditVal dataset, SimGraph achieved a fidelity score of 0.87, outperforming SGEdit (0.83) and SIMSG (0.57), indicating its ability to preserve fine details during modifications. Critically, SimGraph significantly improved runtime for editing, completing tasks in 20-30 seconds per image, a substantial improvement over SGEdit's 6-10 minutes. For image generation, leveraging a pre-trained SATURN model within the framework, SimGraph achieved better FID (21.62) and IS (24.78) scores on the Visual Genome dataset compared to SG2IM and SGDiff, showcasing high-quality generation. Competitive CLIP scores (20.98) were also observed on the COCO dataset, confirming strong semantic alignment. Qualitative evaluations further illustrate the framework's success in accurately capturing and editing spatial and relational semantics, such as replacing a "bear" with a "wolf" in a forest scene. However, failure cases were noted where complex scene graph modifications, especially those involving multiple elements or intricate background changes, led to inaccuracies and difficulties in maintaining spatial consistency.
SimGraph marks a significant advance in generative AI by offering a unified, efficient framework for both image generation and editing, built on the structured control of scene graphs. This integration addresses the difficulty of maintaining semantic coherence and spatial consistency when the two tasks are handled separately. Fine-grained control over object interactions and layouts through a single scene graph-driven model yields higher fidelity, better consistency, and lower computational overhead, making the approach suitable for real-time applications. The improvements in both quantitative metrics and qualitative results suggest that SimGraph can enhance creative workflows in design, entertainment, and education by providing a more intuitive and powerful tool for visual content manipulation. Future work will focus on robustness for more complex, dynamic scenes and on integrating multimodal inputs for even more flexible user interaction.