AI Summary • Published on Apr 15, 2026
Current AI webpage‑generation tools produce HTML/CSS but treat images, videos, and charts as isolated placeholders or retrieved assets. Because these multimodal elements are not coordinated with the global layout, the result is style inconsistency, geometry mismatches, and overall page incoherence.
MM‑WebAgent introduces a hierarchical agentic framework. First, a global layout plan defines the page's sections, spatial organization, and style attributes, along with placeholders for multimodal components. Then, a local element plan is created for each asset, specifying its functional role, size constraints, and style guidance. Assets are synthesized with AIGC models for images, videos, and charts. After initial synthesis, a three‑level self‑reflection loop iteratively refines (i) individual assets, (ii) the surrounding HTML/CSS context, and (iii) the entire page using rendered screenshots, enforcing coherence at every scale.
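The two-level planning and reflection loop can be sketched as follows. This is a minimal illustration, not the paper's implementation: all class names, fields, the 0–10 quality scale, and the stop threshold are hypothetical stand-ins, and the stubbed functions mark where a real system would call an LLM planner, AIGC generators, and a screenshot-based critic.

```python
from dataclasses import dataclass

@dataclass
class ElementPlan:
    role: str    # functional role, e.g. "hero image"
    size: tuple  # (width, height) constraint in px
    style: str   # style guidance inherited from the global layout plan

@dataclass
class LayoutPlan:
    sections: list  # section order for the page
    style: str      # global style attributes
    elements: dict  # placeholder name -> ElementPlan

def make_layout_plan(brief):
    """Stage 1: global layout plan with multimodal placeholders (stubbed LLM call)."""
    style = "minimal, dark palette"
    return LayoutPlan(
        sections=["hero", "features", "pricing"],
        style=style,
        elements={
            "hero_img": ElementPlan("hero image", (1200, 600), style),
            "trend_chart": ElementPlan("usage chart", (600, 400), style),
        },
    )

def generate_asset(plan):
    """Stage 2: per-element generation (stubbed AIGC call); quality on a 0-10 scale."""
    return {"role": plan.role, "size": plan.size, "style": plan.style, "quality": 5}

def reflect(asset):
    """One element-level self-reflection pass: critique and refine (stubbed)."""
    return {**asset, "quality": min(10, asset["quality"] + 2)}

def build_page(brief, max_rounds=3, target=9):
    layout = make_layout_plan(brief)
    assets = {name: generate_asset(p) for name, p in layout.elements.items()}
    # Element-level loop only; the context-level (HTML/CSS) and page-level
    # (rendered screenshot) reflection passes would follow in the full framework.
    for _ in range(max_rounds):
        if all(a["quality"] >= target for a in assets.values()):
            break
        assets = {name: reflect(a) for name, a in assets.items()}
    return {"sections": layout.sections, "assets": assets}

page = build_page("SaaS landing page")
```

The key design point the sketch captures is that every element plan inherits its style guidance from the global layout plan, so refinement never drifts away from the page-wide design.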
On the newly proposed MM‑WebGEN‑Bench (120 curated webpages), MM‑WebAgent achieves the highest average score of 0.75, surpassing code‑only and other agent baselines on both global metrics (layout, style, aesthetics) and local multimodal metrics (image, video, and chart quality). Ablations show that removing either hierarchical planning or the reflection loop substantially degrades performance, confirming that each component matters.
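The aggregation behind such an average score could look like the sketch below. The metric names follow the summary, but the per-metric values are arbitrary illustrations, not the paper's reported numbers, and the unweighted mean is an assumption; the benchmark's exact weighting may differ.

```python
def average_score(metrics):
    """Unweighted mean over global (layout/style/aesthetics)
    and local multimodal (image/video/chart) metric scores."""
    return sum(metrics.values()) / len(metrics)

# Arbitrary illustrative scores in [0, 1] -- not results from the paper.
scores = {
    "layout": 0.80, "style": 0.70, "aesthetics": 0.75,  # global metrics
    "image": 0.65, "video": 0.60, "chart": 0.70,        # local multimodal metrics
}

avg = average_score(scores)
```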
The work demonstrates that treating multimodal webpage generation as a coordinated, hierarchical design task yields more visually consistent and semantically aligned pages. The benchmark and evaluation protocol also provide a standardized way to assess future multimodal web agents.