AI Summary • Published on Apr 15, 2026
Requirements engineering relies heavily on expert judgment to assess quality attributes such as clarity, completeness, and testability. While AI tools, especially large language models, have shown promise for automating parts of this process, it is unclear how well they can replicate the nuanced evaluations performed by experienced systems engineers and how closely their assessments align with INCOSE quality criteria.
The authors conducted a controlled study comparing AI‑assisted evaluation with human expert assessment. Two datasets were used: a real‑world medical inventory system (DR Tool) and the public PROMISE software‑requirements corpus. Prompts were designed for ChatGPT‑4, Claude 3.5 Sonnet, and Llama 3 to assess requirements against seven INCOSE‑aligned criteria and to classify requirements as functional or non‑functional. Human engineers evaluated the same items, and a survey of 21 engineers provided individual judgments for deeper comparison.
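As a rough illustration of this kind of prompt‑driven assessment, a minimal Python sketch is shown below. The criteria list, prompt wording, model name, and example requirement are assumptions for illustration and do not reproduce the study's exact prompts.

```python
# Hypothetical sketch of prompting an LLM to rate one requirement against
# INCOSE-style quality criteria; the criteria list, prompt wording, and
# model name are illustrative assumptions, not the study's exact setup.
from openai import OpenAI

CRITERIA = [
    "necessary", "unambiguous", "complete", "singular",
    "feasible", "verifiable", "correct",
]

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def assess_requirement(requirement: str) -> str:
    prompt = (
        "You are a requirements-quality reviewer. For the requirement below, "
        "rate each criterion as PASS or FAIL with a one-line reason.\n"
        f"Criteria: {', '.join(CRITERIA)}\n"
        "Also classify the requirement as FUNCTIONAL or NON-FUNCTIONAL.\n\n"
        f"Requirement: {requirement}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep assessments repeatable across runs
    )
    return response.choices[0].message.content


print(assess_requirement(
    "The system shall alert staff when stock of any item falls below its reorder threshold."
))
```

In practice, the same prompt template would be run over every item in a requirements set and the PASS/FAIL verdicts compared against the engineers' ratings.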
AI models delivered fast, consistent preliminary assessments and excelled at syntactic and structural checks. Claude 3.5 Sonnet achieved the highest agreement with engineers (≈85% accuracy) on the quality criteria, while GPT‑4 and Llama 3 showed lower and more variable performance. For functional vs. non‑functional classification, GPT‑4o reached about 85% accuracy, but all models displayed systematic biases toward either functional or non‑functional labels. Human experts remained essential for interpreting ambiguity, feasibility, and contextual trade‑offs.
The findings suggest AI can serve as an effective decision‑support layer in RE, handling routine linguistic audits and preliminary classification and thereby reducing engineer workload. However, AI should not replace expert judgment for high‑level reasoning, feasibility assessment, and ambiguity resolution. Integrating AI copilots into a three‑step workflow of AI pre‑audit, human review, and expert validation can improve efficiency while preserving traceability and accountability in systems engineering.
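A minimal sketch of how such a three‑step workflow could be wired together follows; the data structure, the keyword‑based flagging rule standing in for the LLM call, and the step functions are illustrative assumptions rather than a prescribed implementation.

```python
# Minimal sketch of the AI pre-audit -> human review -> expert validation
# workflow; the dataclass fields, flagging rule, and step functions are
# illustrative assumptions, not a prescribed implementation.
from dataclasses import dataclass, field


@dataclass
class Requirement:
    text: str
    ai_flags: list[str] = field(default_factory=list)  # issues raised in the AI pre-audit
    status: str = "pending"                            # pending -> reviewed -> validated


def ai_preaudit(req: Requirement) -> Requirement:
    # Stand-in for an LLM call: flag simple linguistic problems of the kind
    # the study found AI handles well (e.g. vague, untestable wording).
    for vague in ("fast", "user-friendly", "as appropriate"):
        if vague in req.text.lower():
            req.ai_flags.append(f"vague term: '{vague}'")
    return req


def human_review(req: Requirement) -> Requirement:
    # An engineer confirms or dismisses AI flags and resolves ambiguity.
    req.status = "reviewed"
    return req


def expert_validation(req: Requirement) -> Requirement:
    # A senior engineer signs off on feasibility and contextual trade-offs.
    req.status = "validated"
    return req


pipeline = [ai_preaudit, human_review, expert_validation]
req = Requirement("The system shall be fast and user-friendly.")
for step in pipeline:
    req = step(req)
print(req)
```

Keeping the human and expert stages as explicit steps, rather than folding them into the AI call, is what preserves the traceability and accountability the authors emphasize.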