AI Summary • Published on Apr 15, 2026
Current AI webpage‑generation tools produce HTML/CSS but treat images, videos, and charts as isolated placeholders or retrieved assets. Because these multimodal elements are not coordinated with the global layout, the result is style inconsistency, geometry mismatches, and overall page incoherence.
MM‑WebAgent introduces a hierarchical agentic framework. First, a global layout plan defines the page's sections, spatial organization, and style attributes, along with placeholders for multimodal components. Then, a local element plan is created for each asset, specifying its functional role, size constraints, and style guidance. Assets are synthesized with AIGC models for images, videos, and charts. After initial synthesis, a three‑level self‑reflection loop iteratively refines (i) individual assets, (ii) the surrounding HTML/CSS context, and (iii) the entire page using rendered screenshots, enforcing coherence at every scale.
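The two-level planning and reflection loop can be sketched as follows. This is a minimal illustration, not the paper's implementation: all class names, fields, the 0–10 quality scale, and the stop threshold are hypothetical stand-ins, and the stubbed functions mark where a real system would call an LLM planner, AIGC generators, and a screenshot-based critic.

```python
from dataclasses import dataclass

@dataclass
class ElementPlan:
    role: str    # functional role, e.g. "hero image"
    size: tuple  # (width, height) constraint in px
    style: str   # style guidance inherited from the global layout plan

@dataclass
class LayoutPlan:
    sections: list  # section order for the page
    style: str      # global style attributes
    elements: dict  # placeholder name -> ElementPlan

def make_layout_plan(brief):
    """Stage 1: global layout plan with multimodal placeholders (stubbed LLM call)."""
    style = "minimal, dark palette"
    return LayoutPlan(
        sections=["hero", "features", "pricing"],
        style=style,
        elements={
            "hero_img": ElementPlan("hero image", (1200, 600), style),
            "trend_chart": ElementPlan("usage chart", (600, 400), style),
        },
    )

def generate_asset(plan):
    """Stage 2: per-element generation (stubbed AIGC call); quality on a 0-10 scale."""
    return {"role": plan.role, "size": plan.size, "style": plan.style, "quality": 5}

def reflect(asset):
    """One element-level self-reflection pass: critique and refine (stubbed)."""
    return {**asset, "quality": min(10, asset["quality"] + 2)}

def build_page(brief, max_rounds=3, target=9):
    layout = make_layout_plan(brief)
    assets = {name: generate_asset(p) for name, p in layout.elements.items()}
    # Element-level loop only; the context-level (HTML/CSS) and page-level
    # (rendered screenshot) reflection passes would follow in the full framework.
    for _ in range(max_rounds):
        if all(a["quality"] >= target for a in assets.values()):
            break
        assets = {name: reflect(a) for name, a in assets.items()}
    return {"sections": layout.sections, "assets": assets}

page = build_page("SaaS landing page")
```

The key design point the sketch captures is that every element plan inherits its style guidance from the global layout plan, so refinement never drifts away from the page-wide design.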
On the newly proposed MM‑WebGEN‑Bench (120 curated webpages), MM‑WebAgent achieves the highest average score of 0.75, surpassing code‑only and other agent baselines on both global metrics (layout, style, aesthetics) and local multimodal metrics (image, video, and chart quality). Ablations show that removing either hierarchical planning or the reflection loop substantially degrades performance, confirming that each component matters.
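The aggregation behind such an average score could look like the sketch below. The metric names follow the summary, but the per-metric values are arbitrary illustrations, not the paper's reported numbers, and the unweighted mean is an assumption; the benchmark's exact weighting may differ.

```python
def average_score(metrics):
    """Unweighted mean over global (layout/style/aesthetics)
    and local multimodal (image/video/chart) metric scores."""
    return sum(metrics.values()) / len(metrics)

# Arbitrary illustrative scores in [0, 1] -- not results from the paper.
scores = {
    "layout": 0.80, "style": 0.70, "aesthetics": 0.75,  # global metrics
    "image": 0.65, "video": 0.60, "chart": 0.70,        # local multimodal metrics
}

avg = average_score(scores)
```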
The work demonstrates that treating multimodal webpage generation as a coordinated, hierarchical design task yields more visually consistent and semantically aligned pages. The benchmark and evaluation protocol also provide a standardized way to assess future multimodal web agents.