AI Summary • Published on Mar 24, 2026
Deploying agentic artificial intelligence (AI) in organizational workflows presents significant challenges beyond simple task competence, particularly around reliability and the cost of human oversight. When deterministic processes are replaced by stochastic AI policies, the central question shifts from the plausibility of a single action to the statistical support and governability of entire action trajectories. This mismatch, termed the "stochastic gap," has already produced practical failures: incorrect information from chatbots, operational errors, and project cancellations driven by escalating costs or inadequate risk management. Auditing an agent's ability to maintain reliable paths through a workflow, rather than merely to produce locally plausible actions, is therefore a critical pre-deployment requirement.
The authors propose a measure-theoretic Markovian framework to address the stochastic gap by enabling the audit of agentic AI from event-log data. This framework consists of four key elements: First, it defines "state blind-spot mass" and "state-action blind mass" as finite-sample measures of deployment mass that falls outside or at the edge of historical support. Second, it incorporates Shannon entropy and a reproducible risk weighting to establish a deployment-side escalation rule for human intervention. Third, it demonstrates that this reliability gate also defines an expected oversight-cost identity, directly linking reliability and economic burden. Finally, the framework is validated on the Business Process Intelligence Challenge 2019 (BPI 2019) purchase-to-pay log, which contains 251,734 cases and 1.6 million events. A log-driven simulated agent is constructed from an 80/20 chronological split of this data to compare the theoretical reliability surrogates against realized step and case outcomes, using a refined state representation that includes activity, item type, goods receipt flag, value bin, and actor class.
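The blind-mass and entropy quantities above can be illustrated on a toy event log. The sketch below is not the authors' code: the `(state, action)` encoding, the 1,000-example support threshold (echoing the audit's cutoff), and the entropy cap are assumptions chosen for illustration.

```python
from collections import Counter
from math import log2

def audit_log(train_pairs, deploy_pairs, min_support=1000, entropy_cap=1.0):
    """train_pairs / deploy_pairs: lists of (state, action) tuples from event logs.
    Returns the state-action blind mass plus a per-step escalation predicate."""
    support = Counter(train_pairs)  # historical (state, action) counts

    # State-action blind mass: share of deployment transitions whose
    # historical support falls below the chosen threshold.
    blind = sum(1 for sa in deploy_pairs if support[sa] < min_support)
    blind_mass = blind / len(deploy_pairs)

    # Shannon entropy of the empirical next-action distribution in a state.
    def entropy(state):
        counts = [c for (s, _), c in support.items() if s == state]
        total = sum(counts)
        if total == 0:
            return float("inf")  # unseen state: maximally uncertain
        return -sum((c / total) * log2(c / total) for c in counts)

    # Deployment-side escalation rule: hand a step to a human when support
    # is thin or the next-action distribution is too uncertain.
    def escalate(state, action):
        return support[(state, action)] < min_support or entropy(state) > entropy_cap

    return blind_mass, escalate
```

Because escalation is a deterministic function of the audited counts, the expected number of human touches per case follows directly from the same quantities, which is the reliability-cost link the framework's third element describes.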
The descriptive audit on the BPI 2019 log showed that refining the operational state substantially expanded the state space, from 42 to 668 states, and the state-action space, from 498 to 3,262 pairs. While state occupancy appeared well covered, substantial state-action blind mass remained: over 12% of transition mass fell on state-action pairs with fewer than 1,000 historical examples. This highlights that support over next-step decisions (actions), rather than over states alone, is what justifies agentic execution. The highest-entropy states were concentrated in human-handled approval and exception-management contexts. Analysis of the autonomy envelope revealed a significant path-compounding effect: a workflow may appear largely autonomous at the step level yet be considerably less so end-to-end. In the held-out agent study, the theoretical surrogate for step accuracy tracked realized autonomous step accuracy to within an average of 3.4 percentage points. The safe-completion surrogate was conservative but directionally accurate, and human touches per case decreased under more permissive gates. Zero-touch completion, however, remained well below overall case autonomy, quantifying the reliability lost when autonomy is widened without corresponding improvements in local predictive certainty.
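The path-compounding effect admits a back-of-the-envelope illustration. The step count and per-step autonomy below are made-up numbers, not figures from the paper, and the independence assumption is a deliberate simplification:

```python
def end_to_end_autonomy(step_autonomy, n_steps):
    """Probability that every step of an n-step case is handled autonomously,
    assuming step outcomes are independent (a simplifying assumption)."""
    return step_autonomy ** n_steps

# 95% step-level autonomy compounded over a hypothetical 20-step case:
print(round(end_to_end_autonomy(0.95, 20), 3))  # ~0.358
```

This is why a gate that looks generous per step can still leave zero-touch completion far below overall case autonomy: small per-step uncertainties multiply along the trajectory.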
The research demonstrates that reliability and operational cost are interdependent constraints for agentic AI in enterprises. Stricter reliability gates increase safe completion but incur heavier human-oversight costs; more permissive gates reduce the oversight burden but expose the system to a greater risk of autonomous errors in higher-entropy regions. The findings yield a clear recommendation for enterprise agent deployment: organizations should begin with a comprehensive audit of support, entropy, risk, and oversight cost rather than focusing solely on prompt engineering. Such a pre-deployment audit measures the "stochastic gap" and provides an empirical reliability-cost frontier, enabling organizations to determine where full autonomy is justified, where human intervention economically dominates, and where additional training data or workflow redesign would be most impactful.
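The reliability-cost frontier can be sketched as a sweep over gate thresholds. This is a minimal illustration with synthetic per-step entropies, not the paper's computation; in practice the entropies and thresholds would come from the event-log audit itself:

```python
def frontier(step_entropies, thresholds):
    """For each escalation threshold, report (threshold, human-touch rate,
    autonomous-step rate). A step escalates when its entropy exceeds the gate."""
    points = []
    n = len(step_entropies)
    for t in thresholds:
        escalated = sum(1 for h in step_entropies if h > t)
        points.append((t, escalated / n, 1 - escalated / n))
    return points

entropies = [0.0, 0.1, 0.4, 0.9, 1.5, 2.2]  # synthetic per-step entropies
for t, touch, auto in frontier(entropies, [0.5, 1.0, 2.0]):
    print(f"gate={t:.1f}  human-touch={touch:.2f}  autonomous={auto:.2f}")
```

Tightening the gate (a lower threshold) moves mass from the autonomous column to the human-touch column, which is exactly the oversight-cost trade-off the audit is meant to expose before deployment.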