CUDA Programmatic Dependent Launch (PDL)

This article was last updated on April 23, 2026.

PDL (Programmatic Dependent Launch) is a latency optimization that lets two kernels with a data dependency partially overlap.

Concepts

Under normal CUDA semantics:

  • Kernels in the same stream execute strictly serially: a kernel starts only after the previous one has finished.

For example:

kernel1 → kernel2 → kernel3

Roughly, a kernel's execution can be split into:

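  1. launch overhead (hardware launch)
  2. prolog (initialization) ✅ does not depend on the previous kernel
  3. mainloop (the actual computation) ❗ depends on the previous kernel's data
  4. memory barrier (write-back)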
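The following toy Python model makes the default serial semantics concrete for a stream without PDL. It is an illustration only, not a CUDA API and not measured data. `KernelPhases`, `serial_timeline` and the phase durations are all hypothetical. Each kernel starts only when the previous one ends, so the kernel1 → kernel2 → kernel3 chain takes the sum of every phase of every kernel.

```python
from dataclasses import dataclass


@dataclass
class KernelPhases:
    """Hypothetical per-phase durations of one kernel (arbitrary time units)."""
    name: str
    launch: float     # launch overhead (hardware launch)
    prolog: float     # initialization, independent of the previous kernel
    mainloop: float   # actual computation, needs the previous kernel's output
    writeback: float  # memory barrier / write-back

    def total(self) -> float:
        return self.launch + self.prolog + self.mainloop + self.writeback


def serial_timeline(kernels):
    """Default stream semantics: a kernel starts only after the previous one ends.

    Returns a list of (name, start, end) tuples.
    """
    timeline = []
    t = 0.0
    for k in kernels:
        start = t
        t = start + k.total()
        timeline.append((k.name, start, t))
    return timeline


# Three identical kernels with made-up durations.
kernels = [KernelPhases(f"kernel{i}", launch=2, prolog=3, mainloop=10, writeback=1)
           for i in (1, 2, 3)]
for name, start, end in serial_timeline(kernels):
    print(f"{name}: {start:g} -> {end:g}")
```

With these made-up numbers each kernel takes 16 units. The chain therefore runs 0 → 16, 16 → 32 and 32 → 48, and the launch and prolog of kernel2 and kernel3 sit fully on the critical path.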

Key observation:

Only the mainloop actually depends on the previous kernel.

The core idea of PDL is to overlap kernel2's launch + prolog with kernel1's execution (mainloop + write-back).
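Here is a back-of-the-envelope estimate of the saving under simplifying assumptions. The symbols are introduced here for illustration:

- $T_1$: kernel1's total runtime, measured from its launch.
- $t_{\text{trig}}$: the time at which all of kernel1's blocks have triggered.
- $L_2, P_2, M_2, W_2$: kernel2's launch, prolog, mainloop and write-back durations.

The model also assumes three things. Kernel2 waits on kernel1 right before its mainloop, the wait itself is free, and enough SM resources are free for kernel2's blocks to start early.

$$
\begin{aligned}
T_{\text{serial}} &= T_1 + L_2 + P_2 + M_2 + W_2 \\
T_{\text{PDL}} &= \max\bigl(T_1,\; t_{\text{trig}} + L_2 + P_2\bigr) + M_2 + W_2 \\
T_{\text{serial}} - T_{\text{PDL}} &= \min\bigl(L_2 + P_2,\; T_1 - t_{\text{trig}}\bigr)
\end{aligned}
$$

In this model the saving is capped by kernel2's launch + prolog time, and it shrinks the later kernel1 triggers.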


With PDL, the execution flow becomes:

K1 still running → K2 already starts its "first half" (launch + prolog)

data ready → K2 enters its mainloop
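The same kind of toy model can be extended to PDL. This is a simplified sketch, not how the hardware actually schedules blocks. `Kernel` and `pdl_timeline` are hypothetical names and the durations are made up. The model makes three assumptions:

- Each kernel triggers its dependents right after its prolog.
- Each kernel waits on the previous kernel right before its mainloop.
- The GPU always has enough free SMs for the next kernel's blocks to start early.

```python
from dataclasses import dataclass


@dataclass
class Kernel:
    """Hypothetical phase durations (arbitrary time units) of one kernel."""
    name: str
    launch: float
    prolog: float
    mainloop: float
    writeback: float


def pdl_timeline(kernels):
    """Toy PDL model for a chain K1 -> K2 -> ... in one stream.

    Assumptions (simplifications, not hardware behavior):
      * each kernel triggers its dependents as soon as its prolog is done;
      * each kernel waits for the previous kernel to finish right before
        its mainloop (cudaGridDependencySynchronize / gdc_wait);
      * there are always enough free SMs for the next kernel to start early.
    Returns (name, launch_start, mainloop_start, end) tuples.
    """
    timeline = []
    trigger_time = 0.0  # the first kernel launches at t = 0
    prev_end = 0.0      # nothing to wait for before the first kernel
    for k in kernels:
        start = trigger_time
        prolog_done = start + k.launch + k.prolog
        mainloop_start = max(prolog_done, prev_end)  # dependency wait
        end = mainloop_start + k.mainloop + k.writeback
        timeline.append((k.name, start, mainloop_start, end))
        trigger_time = prolog_done  # the next kernel may launch from here
        prev_end = end
    return timeline


kernels = [Kernel(f"kernel{i}", launch=2, prolog=3, mainloop=10, writeback=1)
           for i in (1, 2, 3)]
for name, start, main_start, end in pdl_timeline(kernels):
    print(f"{name}: launch {start:g}, mainloop {main_start:g}, done {end:g}")
```

With these numbers the chain finishes at 38 time units, versus 3 × 16 = 48 if the three kernels ran strictly back to back. Each dependent kernel hides its 5 units of launch + prolog behind its predecessor. If the previous kernel's blocks occupy every SM, the next kernel cannot actually start early and the saving shrinks.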

Usage Example

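The primary kernel calls cudaTriggerProgrammaticLaunchCompletion once it is ready for the secondary kernel to launch. The secondary kernel can be launched after every thread block of the primary kernel has called it (or exited).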

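The secondary kernel calls cudaGridDependencySynchronize to mark the point after which everything depends on the previous kernel. The call blocks until the primary kernel has fully completed and its results are visible in global memory. The trigger alone does not guarantee that visibility.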

```cuda
// Device-side PDL intrinsics require compute capability 9.0 (Hopper) or newer.
__global__ void primary_kernel() {
    // Initial work that should finish before starting secondary kernel

    // Trigger the secondary kernel
    cudaTriggerProgrammaticLaunchCompletion();

    // Work that can coincide with the secondary kernel
}

__global__ void secondary_kernel()
{
    // Independent work

    // Blocks until all dependent primary kernels have made their results visible in global memory
    cudaGridDependencySynchronize();

    // Dependent work
}

// Host side (grid_dim, block_dim and stream are assumed to be defined elsewhere).
// The PDL attribute goes on the secondary launch, which must use the same stream.
cudaLaunchConfig_t configSecondary = {};
configSecondary.gridDim = grid_dim;
configSecondary.blockDim = block_dim;
configSecondary.dynamicSmemBytes = 0;
configSecondary.stream = stream;

cudaLaunchAttribute attribute[1];
attribute[0].id = cudaLaunchAttributeProgrammaticStreamSerialization;
attribute[0].val.programmaticStreamSerializationAllowed = 1;
configSecondary.attrs = attribute;
configSecondary.numAttrs = 1;

primary_kernel<<<grid_dim, block_dim, 0, stream>>>();
cudaLaunchKernelEx(&configSecondary, secondary_kernel);
```
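The launch and wait rules of the two calls can be summarized with a small Python model of per-block states. This is an illustrative sketch, not a CUDA API; the state names and both functions are hypothetical. It encodes two points:

- The secondary kernel may launch once every primary block has either called the trigger or exited.
- The grid dependency wait only returns after every primary block has exited. Triggering alone does not make the primary's writes visible.

```python
# Hypothetical per-block states of the primary kernel's thread blocks.
RUNNING, TRIGGERED, EXITED = "running", "triggered", "exited"


def secondary_can_launch(primary_blocks):
    """The secondary kernel may be launched once every primary block has either
    called cudaTriggerProgrammaticLaunchCompletion or exited (a block that exits
    without calling it counts as having triggered)."""
    return all(state in (TRIGGERED, EXITED) for state in primary_blocks)


def grid_dependency_satisfied(primary_blocks):
    """cudaGridDependencySynchronize in the secondary returns only after every
    primary block has exited and its global-memory writes are visible; a
    trigger alone is not enough."""
    return all(state == EXITED for state in primary_blocks)


blocks = [TRIGGERED, TRIGGERED, EXITED]
print(secondary_can_launch(blocks))       # True: the secondary may start its independent work
print(grid_dependency_satisfied(blocks))  # False: its dependent work must still wait
```

Because of this split, code placed after the trigger in the primary kernel can still safely produce data that the secondary reads after its dependency wait.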

Triton also supports PDL; see the Triton tutorials:

```python
import triton
import triton.language as tl


@triton.jit
def primary_kernel(...):
    # Work needed by the secondary kernel
    tl.extra.cuda.gdc_launch_dependents()
    # Work that can overlap with the secondary kernel


@triton.jit
def secondary_kernel(...):
    # Work independent of the primary kernel
    tl.extra.cuda.gdc_wait()
    # Work dependent on the primary kernel


primary_kernel[grid](..., launch_pdl=True)
secondary_kernel[grid](..., launch_pdl=True)
```

References:
- https://yang-yifan.github.io/blogs/pdl/pdl.html