AI Summary • Published on Jan 27, 2026
Generative AI is increasingly used to create educational diagrams, particularly in computing education, but the quality of the generated materials is inconsistent and often marred by hallucinations. Manually creating diagrams is time-consuming, which pushes educators toward AI, yet the reliability of AI-generated diagrams remains a major concern. Current in-context learning (ICL) approaches to diagram code generation frequently produce verbose outputs, poor layouts, and factual or faithfulness hallucinations, limiting their usefulness in learning environments.
This study proposes an ICL method for diagram code generation guided by Rhetorical Structure Theory (RST). The approach aims to improve diagram quality and reduce hallucinations by simplifying and tailoring prompt examples: large language models (LLMs) are instructed to perform an RST analysis of the source text and then select relevant in-context examples based on the identified rhetorical structures. Two variants, RST1 and RST2, are developed and benchmarked against a zero-shot generation pipeline. The generation process comprises several stages: RST analysis, selection of a suitable example, diagram code generation in Graphviz's DOT syntax, automated error repair, and a final refinement step (a sketch of this pipeline appears below). Computer science educators evaluate the method on 150 generated diagrams using a rubric that scores logical organization, connectivity, and layout aesthetics, and they also identify instances of factual and faithfulness hallucinations. In addition, the paper explores automated diagram evaluation, comparing implicit learning, explicit learning with instructions, and instruction-based reflection.
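To make the pipeline concrete, here is a minimal Python sketch, assuming a generic `call_llm` wrapper for whichever chat model is used (e.g. GPT-4o or o3) and a small keyword-matched bank of in-context examples; the prompts, stage boundaries, and example-selection logic are illustrative assumptions, not the paper's implementation:

```python
# Hypothetical sketch of the five-stage generation pipeline summarized above.
from __future__ import annotations

import subprocess
from typing import Callable


def render_error(dot_code: str) -> str | None:
    """Return Graphviz's error output if the DOT code fails to render, else None."""
    result = subprocess.run(
        ["dot", "-Tsvg"], input=dot_code, capture_output=True, text=True
    )
    return result.stderr if result.returncode != 0 else None


def generate_diagram(source_text: str,
                     example_bank: dict[str, str],
                     call_llm: Callable[[str], str],
                     max_repairs: int = 3) -> str:
    # Stage 1: RST analysis -- ask the model which coherence relations
    # hold between the segments of the source text.
    rst_analysis = call_llm(
        "Identify the rhetorical (RST) relations in the following text:\n" + source_text
    )

    # Stage 2: select an in-context example whose rhetorical structure matches
    # the analysis (a naive keyword match over a small example bank).
    relation = next(
        (r for r in example_bank if r.lower() in rst_analysis.lower()),
        next(iter(example_bank)),  # fall back to the first example if nothing matches
    )
    example = example_bank[relation]

    # Stage 3: generate diagram code in Graphviz DOT syntax, conditioned on the example.
    dot_code = call_llm(
        f"Example diagram for a '{relation}' structure:\n{example}\n\n"
        f"Now write Graphviz DOT code for a diagram of this text:\n{source_text}"
    )

    # Stage 4: automated error repair -- feed Graphviz syntax errors back to the
    # model until the code renders, with a bounded number of attempts.
    for _ in range(max_repairs):
        error = render_error(dot_code)
        if error is None:
            break
        dot_code = call_llm(f"Fix this Graphviz error:\n{error}\n\nCode:\n{dot_code}")

    # Stage 5: final refinement pass over layout and labels.
    return call_llm("Refine the layout and labels of this DOT code:\n" + dot_code)
```

The repair loop here only checks that the `dot` binary can parse and render the generated code before asking the model to fix any reported errors; the paper's actual repair and refinement prompts, and its criteria for matching examples to RST relations, are not specified in this summary.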
Preliminary findings indicate that the RST-based ICL method, particularly RST1 used with the o3 model, reduced the rate of factual hallucinations and improved the diagrams' faithfulness to the provided context. Across various metrics, o3 consistently outperformed GPT-4o. However, the LLMs struggled to generate high-quality diagrams from more complex or advanced input texts, which tended to yield higher hallucination rates and lower logical-organization scores. The study also found that multi-step generation pipelines, especially RST2, are more susceptible to hallucinations inherited from and propagated through earlier stages. While the LLMs showed limited ability to self-evaluate layout quality from simple prompts, their assessments aligned moderately with human evaluations when given explicit instructions and examples from the evaluation dataset.
This research offers a novel RST-based ICL method that shows promise in reducing factual hallucinations and context inconsistencies in AI-generated educational diagrams. The findings underscore the challenges posed by LLM stochasticity and the models' propensity to hallucinate, particularly on complex textual inputs. The study stresses the need for robust safeguards and careful consideration when deploying LLMs in educational settings, so that AI-generated materials remain reliable and pedagogically useful for both teachers and students. Future work should increase the sample size, expand the set of in-context examples to cover a broader range of coherence relations, and further investigate the limits that LLMs' lack of causal understanding places on automated evaluation.