AI Summary • Published on Dec 3, 2025
Existing unified multimodal large language models (MLLMs) and their chain-of-thought (CoT) applications to text-to-image (T2I) generation share notable limitations. Many current approaches either treat the MLLM as a simple generator or rely solely on abstract textual planning, which is too vague and coarse-grained to guide the generation of complex visual details. T2I models also struggle to generate accurate images in a single pass, especially for rare attribute combinations (e.g., "a white orange"). This stems from natural biases in training data, where such unusual combinations are underrepresented, so models fail to bind attributes to objects that lack strong prior associations. Finally, existing models often lack the precise control and flexible editing capabilities needed to correct flaws, modify layouts, or manipulate instances within a generated image.
DraCo proposes an interleaved chain-of-thought (CoT) reasoning paradigm that fully exploits the unified architecture of MLLMs for text-to-image generation. The method comprises three core steps: draft sketching, draft verification, and corrective refinement. First, the model generates a low-resolution draft image (e.g., 384x384) that serves as concrete visual planning, efficiently outlining the basic semantics without committing to high-resolution details upfront. Second, the model's own visual understanding is used for draft verification: it encodes the draft with ViT features only, excluding VAE features so that low-level details do not constrain subsequent changes, compares the draft against the input prompt to identify semantic misalignments, and summarizes the necessary edits. Finally, during corrective refinement, the unified MLLM generates the detailed, high-resolution final image by upscaling the draft and applying the corrections identified during verification.

To train this reasoning process, the authors curated DraCo-240K, a dataset of over 240,000 interleaved reasoning instances. It is designed to teach three atomic correction capabilities: general correction, instance manipulation, and layout reorganization, and was built with a pipeline combining MLLMs, image-editing models, and segmentation models.

DraCo also introduces DraCo-CFG, a specialized classifier-free guidance strategy that explicitly strengthens two critical conditions during final generation: the visual semantics derived from the draft image and the correction instructions produced by the verification step.
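To make the draft-verify-refine loop concrete, here is a minimal Python sketch. The `model` object and its methods (`generate_image`, `encode_vit`, `verify`) are hypothetical placeholders invented for illustration; the paper's actual interfaces and the final output resolution are not specified in this summary.

```python
def draco_generate(prompt: str, model):
    """Hypothetical sketch of DraCo's draft-verify-refine loop."""
    # Step 1: draft sketching -- a cheap low-resolution draft acts as
    # concrete visual planning for the basic scene semantics.
    draft = model.generate_image(prompt, resolution=(384, 384))

    # Step 2: draft verification -- re-encode the draft with ViT features
    # only (no VAE latents), so low-level detail does not over-constrain
    # later corrections, then compare against the prompt and summarize
    # the needed edits as text.
    draft_tokens = model.encode_vit(draft)             # semantic features only
    corrections = model.verify(prompt, draft_tokens)   # textual edit summary

    # Step 3: corrective refinement -- upscale the draft and apply the
    # corrections while decoding the final high-resolution image.
    return model.generate_image(
        prompt,
        resolution=(1024, 1024),           # assumed output resolution
        draft_condition=draft_tokens,
        correction_condition=corrections,
    )
```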
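The summary does not give DraCo-CFG's exact formulation, so the following is a hedged sketch based on standard multi-condition (compositional) classifier-free guidance: one guidance term strengthens the draft's visual semantics, and a second strengthens the correction instructions. The function name, argument layout, and guidance scales are all assumptions, not the paper's published equation.

```python
import torch

def draco_cfg_sketch(eps_uncond: torch.Tensor,
                     eps_draft: torch.Tensor,
                     eps_full: torch.Tensor,
                     w_draft: float = 3.0,
                     w_corr: float = 5.0) -> torch.Tensor:
    """Compositional CFG over two conditions (assumed form).

    eps_uncond: noise prediction with no conditioning
    eps_draft:  prediction conditioned on the draft image (ViT features)
    eps_full:   prediction conditioned on draft + correction instructions
    """
    # The first term pulls the sample toward the draft's visual semantics;
    # the second pulls it toward the verified corrections.
    return (eps_uncond
            + w_draft * (eps_draft - eps_uncond)
            + w_corr * (eps_full - eps_draft))
```

Under this assumed form, setting `w_corr` above `w_draft` would prioritize applying the corrections over merely reproducing the draft; the scales DraCo actually uses are not reported in this summary.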
Extensive experiments demonstrate DraCo's superiority over both generation-only and CoT-powered methods. DraCo achieves an 8% improvement on the GenEval benchmark, outperforming text-CoT-based methods by 4%. On ImagineBench, which targets challenging, unusual object-attribute combinations, DraCo improves by 0.91 points, surpassing text-only reasoning by 0.18 points. It also leads on the more demanding GenEval++ benchmark with an overall score of 0.40. Ablation studies confirm the efficacy of DraCo-CFG, which contributes a 3% increase to the overall GenEval score and yields qualitatively clearer images than standard CFG. The ablations also show that excluding VAE features from the draft during verification is beneficial, yielding a 2% higher score than including them, which suggests that low-level features constrain necessary modifications. A draft resolution of 384x384 proves optimal, balancing efficiency against semantic expressiveness. Qualitatively, DraCo produces high-quality images precisely aligned with their prompts, handling position correction, rare-attribute generation, and accurate object counting.
DraCo opens a promising direction for visual generation by integrating interleaved multimodal reasoning within unified MLLMs. By explicitly incorporating visual planning and verification steps, it overcomes the limitations of abstract textual planning and substantially improves generation of complex, rare attribute combinations that training-data biases previously made difficult. This work paves the way for more robust and controllable text-to-image generation systems. Future research could extend the paradigm to other modalities, such as video or 3D asset generation, where similar consistency and planning challenges arise, and could investigate human-in-the-loop feedback for data curation and training, further refining and aligning the generation and correction methodologies.