AI Summary • Published on Feb 23, 2026
Graphics Processing Units (GPUs) are essential for Artificial Intelligence (AI) in safety-critical domains such as autonomous driving, where both computational efficiency and predictable timing are crucial. A GPU task, however, is typically structured as a Directed Acyclic Graph (DAG) of kernels, and the data dependencies and resource contention among these kernels can lead to unpredictable execution delays. Existing scheduling approaches often overlook the varied computational loads and resource requirements of individual kernels, or fail to account for inter-kernel dependency delays. Furthermore, the black-box nature of GPU hardware and the unreliability of kernel-level preemption hinder the development of predictable scheduling and timing analysis methods for complex DAG-structured GPU tasks.
This paper introduces a comprehensive scheduling and timing analysis framework for DAG-structured GPU tasks designed to achieve a reduced and predictable makespan. The method begins by decomposing the GPU task DAG into a sequence of disjoint "balanced groups," where kernels within each group can execute concurrently. A key component is the parallelism scaling mechanism, which adjusts the computing resource allocation of each kernel in a group proportionally to its computation load, aiming to balance execution times and mitigate resource contention. For kernels whose resource demands exceed the available GPU capacity, a node segmentation mechanism divides them into smaller segments that execute sequentially. To ensure predictable overall execution, the framework also constructs extra dependencies that enforce a sequential execution order among the balanced groups. Crucially, the entire method is implemented using standard CUDA APIs, requiring no specialized hardware or software support and avoiding assumptions about unreliable kernel-level priorities.
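The three mechanisms described above can be sketched in a few lines of Python. This is an illustrative toy model, not the authors' implementation: the level-by-level grouping, the proportional-share rule in `scale_parallelism`, and the fixed-size splitting in `segment` are all simplifying assumptions, and "capacity" here stands in abstractly for whatever GPU resource (e.g., SM share) the real framework allocates.

```python
from collections import defaultdict

def balanced_groups(succ, loads):
    """Partition a kernel DAG into disjoint groups of kernels that may
    run concurrently: a simple topological, level-by-level decomposition.
    succ maps kernel -> list of successors; loads maps kernel -> load."""
    indeg = defaultdict(int)
    for u, vs in succ.items():
        for v in vs:
            indeg[v] += 1
    ready = [k for k in loads if indeg[k] == 0]
    groups = []
    while ready:
        groups.append(ready)          # all kernels in a group are independent
        nxt = []
        for u in ready:
            for v in succ.get(u, ()):
                indeg[v] -= 1
                if indeg[v] == 0:
                    nxt.append(v)
        ready = nxt
    return groups

def scale_parallelism(group, loads, capacity):
    """Give each kernel in a group a resource share proportional to its
    load, so per-kernel execution times come out roughly balanced."""
    total = sum(loads[k] for k in group)
    return {k: max(1, round(capacity * loads[k] / total)) for k in group}

def segment(kernel, demand, capacity):
    """Split a kernel whose resource demand exceeds the GPU capacity
    into smaller segments that execute one after another."""
    segs, i = [], 0
    while demand > 0:
        take = min(demand, capacity)
        segs.append((f"{kernel}#seg{i}", take))
        demand -= take
        i += 1
    return segs

# A small diamond-shaped DAG: A feeds B and C, which both feed D.
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
loads = {"A": 4, "B": 2, "C": 6, "D": 4}
groups = balanced_groups(succ, loads)      # [['A'], ['B', 'C'], ['D']]
shares = scale_parallelism(["B", "C"], loads, capacity=8)  # {'B': 2, 'C': 6}
pieces = segment("D", demand=10, capacity=4)  # three sequential segments
```

Executing the groups strictly in list order plays the role of the extra dependencies the paper adds between groups: group *i+1* is launched only after every kernel of group *i* has finished, which is what makes the per-group timing analysis composable into a whole-task makespan bound.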
The proposed scheduling and analysis framework was evaluated on both large-scale synthetic DAGs and real-world benchmark tasks, including Laplace, Gaussian elimination, and Stencil, across NVIDIA RTX 3060 and Jetson Orin Nano GPU platforms. Experimental results demonstrated significant improvements over existing methods: the approach reduced the worst-case makespan by up to 32.8% and the measured task execution time by up to 21.3% compared to baselines such as Greedy and Graham_para. It also consistently yielded more stable task execution times, evidenced by lower standard deviations, particularly in scenarios where kernels could neither fully saturate the GPU nor achieve their maximum parallelism, highlighting the effectiveness of the parallelism scaling mechanism. The performance advantage grew with the depth of the DAG, since deeper DAGs yield more balanced groups, amplifying the method's benefits.
This work provides a significant advancement in real-time scheduling and analysis for GPU tasks, offering a path to the reduced and predictable task makespans essential for safety-critical AI applications. Because the entire framework is built on standard CUDA APIs, it is practical and readily deployable without additional hardware or software. By eliminating assumptions about kernel-level priorities and providing a robust timing analysis, the framework enhances the reliability and predictability of GPU task execution. Future research directions include extending the framework to conditional DAGs on heterogeneous computing platforms, further broadening its applicability and impact.