AI Summary • Published on Feb 19, 2026
The core problem addressed is the difficulty in assessing the functional correctness of Artificial Intelligence (AI) systems. Unlike traditional software, AI systems are inherently probabilistic and adaptive, rendering deterministic testing methods largely inadequate. Current quality standards, such as ISO/IEC 25059, provide a quality model but lack practical and statistically robust methods for operationalizing functional correctness evaluation. This gap leads to challenges in defining clear specification limits and selecting representative test cases, with many AI projects reportedly facing high failure rates in production environments.
The paper proposes the Statistical Confidence in Functional Correctness (SCFC) approach, a four-step methodology designed to provide a statistically robust evaluation of AI system performance.
First, defining quantitative specification limits involves translating business needs into objective and measurable acceptance criteria, such as a Lower Specification Limit (LSL) or Upper Specification Limit (USL) for a performance metric (e.g., recall >= 95%).
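As a minimal illustration of this first step, specification limits can be captured as a simple, checkable structure. The `SpecLimits` helper below is hypothetical (the paper defines LSL/USL conceptually, not as code):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpecLimits:
    """Quantitative acceptance criteria for one performance metric.

    Hypothetical helper, not from the paper: it only encodes the idea
    that business needs become an objective LSL and/or USL on a metric.
    """
    metric: str
    lsl: Optional[float] = None  # Lower Specification Limit
    usl: Optional[float] = None  # Upper Specification Limit

    def within_spec(self, value: float) -> bool:
        if self.lsl is not None and value < self.lsl:
            return False
        if self.usl is not None and value > self.usl:
            return False
        return True

# e.g., "recall must be at least 95%"
recall_spec = SpecLimits(metric="recall", lsl=0.95)
print(recall_spec.within_spec(0.96))  # True
print(recall_spec.within_spec(0.93))  # False
```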
Second, performing stratified and probabilistic sampling ensures a statistically robust and representative test sample. This involves identifying stratification variables with domain experts to reflect real-world data distribution and probabilistically selecting elements within each stratum.
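The stratified, probabilistic sampling step can be sketched as follows. All names (`stratum_of`, `fractions`, the transaction data) are illustrative assumptions, and the per-stratum fractions stand in for the distribution agreed with domain experts:

```python
import random
from collections import defaultdict

def stratified_sample(records, stratum_of, fractions, seed=0):
    """Draw a probabilistic sample within each stratum.

    `stratum_of` maps a record to its stratum label; `fractions` gives
    the sampling fraction per stratum, chosen with domain experts to
    mirror the real-world data distribution.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        strata[stratum_of(rec)].append(rec)
    sample = []
    for label, members in strata.items():
        k = max(1, round(fractions[label] * len(members)))
        sample.extend(rng.sample(members, k))  # simple random sample per stratum
    return sample

# Toy data: 100 "low" and 100 "high" value transactions
transactions = [{"amount": a, "type": "high" if a > 100 else "low"}
                for a in range(1, 201)]
test_set = stratified_sample(transactions, lambda r: r["type"],
                             {"high": 0.2, "low": 0.1})
```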
Third, applying bootstrapping to estimate a confidence interval addresses the probabilistic nature of AI performance. This non-parametric resampling technique simulates multiple new samples from the original test set, generating an empirical distribution of performance metrics and a confidence interval (e.g., 95% CI), which quantifies performance variability.
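A generic percentile-bootstrap sketch of this third step is shown below. The paper does not fix the number of resamples or the CI variant, so `n_boot` and the per-example correctness flags are assumptions for illustration:

```python
import random

def bootstrap_ci(outcomes, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a performance metric.

    Resamples the test set with replacement, recomputes the metric on
    each replicate, and returns the empirical (1 - alpha) interval,
    which quantifies performance variability without assuming normality.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    stats = sorted(
        metric([outcomes[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Per-example correctness flags (1 = acceptable prediction), made up for illustration
flags = [1] * 83 + [0] * 17
mean = lambda xs: sum(xs) / len(xs)
ci_low, ci_high = bootstrap_ci(flags, mean)
```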
Finally, calculating a capability index (Cpk) as a final indicator synthesizes the average performance and its variability against the specification limits. The approach adapts the Six Sigma Cpk formulation, using the confidence interval bounds instead of the standard deviation, which makes it robust for the non-normal performance distributions typical of AI systems. A Cpk below 1.0 is considered unacceptable, 1.0 marks the minimum acceptable capability, and values above 2.0 indicate excellent capability.
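The summary does not reproduce the adapted formula verbatim, but the case-study figures reported later are consistent with replacing the 3-sigma term by the distance from the mean to the nearer confidence-interval bound, i.e. Cpk = (mean − LSL) / (mean − CI_lower) for a one-sided lower limit. The sketch below encodes that inferred formulation; treat it as a reconstruction, not the paper's exact equation:

```python
def cpk(mean, ci_low, ci_high, lsl=None, usl=None):
    """Capability index using confidence-interval bounds in place of 3*sigma.

    Inferred from the case-study numbers rather than quoted from the
    paper: each side compares the margin to the spec limit against the
    CI half-width on that side, and the worse (smaller) side is reported.
    """
    sides = []
    if lsl is not None:
        sides.append((mean - lsl) / (mean - ci_low))
    if usl is not None:
        sides.append((usl - mean) / (ci_high - mean))
    return min(sides)

# Cargo deck space estimation: LSL = 0.70, mean = 0.834, 95% CI [0.7143, 0.9286]
print(round(cpk(0.834, 0.7143, 0.9286, lsl=0.70), 2))  # 1.12, matching the paper
# Fraud detection: LSL = 0.98, mean = 0.991, 95% CI [0.9855, 0.9967]
# -> 2.0 here vs. the reported ~1.98, plausibly due to rounding of the inputs
print(round(cpk(0.991, 0.9855, 0.9967, lsl=0.98), 2))
```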
The SCFC approach was evaluated through a case study involving two real-world AI systems: a cargo deck space estimation system and a credit card fraud detection model.
For the cargo deck space estimation system, with an LSL of 70% prediction acceptance, the observed average was 83.4%. The SCFC approach yielded a 95% confidence interval of [0.7143, 0.9286] and a Cpk of 1.12. This Cpk value, while above the minimum, highlighted that the system's lower performance bound was very close to the specification limit, indicating a small safety margin. The recommendation was for deployment with continuous monitoring.
For the credit card fraud detection model, with an LSL of 98% recall, the observed average was 99.1%. The 95% confidence interval for recall was [0.9855, 0.9967], and the calculated Cpk was approximately 1.98. This Cpk indicated robust performance, comfortably exceeding the specification limit, and the model was deemed suitable for production.
Qualitative feedback from semi-structured interviews with four AI experts revealed overall high suitability and acceptance of the SCFC approach. Experts agreed on the necessity of quantitative specification limits and the robustness of bootstrapping and confidence intervals. Stratified sampling was also seen as valuable but with caveats regarding dataset representativeness and the potential need for oversampling minority classes. The capability index was appreciated as a summary metric for comparing models, though its added value in extreme (non-marginal) cases was debated. The approach was perceived as useful, easy to use (despite a potential learning curve for bootstrapping), and experts showed a strong intention to adopt it.
The SCFC approach introduces a crucial paradigm shift from evaluating AI systems based on single point estimates (e.g., average accuracy) to a statistically confident assessment using a Capability Index (Cpk) that incorporates performance variability. This allows development teams to move beyond simple performance measurement to a more mature discussion about statistical confidence and quantified deployment risk. The cargo deck space estimation case study particularly demonstrated how a seemingly good average performance could mask a narrow safety margin when variability is considered. The expert interviews underscored the practical utility, ease of use, and high potential for adoption of SCFC, recognizing its ability to fill a significant gap in current AI product quality evaluation processes, especially in MLOps and continuous monitoring scenarios. While the approach has dependencies on stakeholder input for defining limits and requires careful consideration of sampling strategies based on the problem context, it provides a valuable framework for systematic and robust functional correctness assessment of AI systems.