vllm - ✅(Solved) Fix [Feature]: Dynamic PDL Enablement [1 pull requests, 6 comments, 5 participants]

GitHub stats
vllm-project/vllm#40543 · Fetched 2026-04-22 07:43:56
Comments: 6 · Participants: 5 · Timeline: 18 · Reactions: 0
Timeline (top): commented ×6, mentioned ×5, subscribed ×5, cross-referenced ×1

Fix Action

Fixed

PR fix notes

PR #40555: Auto enable pdl

Description (problem / solution / changelog)

Purpose

Automatically enable PDL when available.

Unfortunately, there are some downsides here at large batch sizes. See: https://github.com/vllm-project/vllm/issues/40543

Test Plan

Test Result

Changed files

  • vllm/platforms/cuda.py (modified, +5/-0)

Code Example

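import os
from collections.abc import Iterator
from contextlib import contextmanager

# Import paths below are assumed from the vLLM source tree; adjust if they differ.
from vllm.config import CUDAGraphMode
from vllm.logger import init_logger
from vllm.platforms import current_platform

logger = init_logger(__name__)


@contextmanager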
def maybe_enable_pdl_for_full_cudagraph(
    cudagraph_runtime_mode: CUDAGraphMode,
) -> Iterator[None]:
    # PDL reduces kernel launch overhead when kernels are captured into a
    # full CUDA graph on SM90+, but its host-side cost outweighs the win
    # outside of one. Toggle the env vars only while capturing a FULL
    # graph so PIECEWISE / non-graph paths aren't affected, and defer to
    # the user if they have already chosen an explicit setting.
    if (
        cudagraph_runtime_mode != CUDAGraphMode.FULL
        or not current_platform.is_cuda()
        or not current_platform.has_device_capability(90)
        or "TRTLLM_ENABLE_PDL" in os.environ
        or "TORCHINDUCTOR_ENABLE_PDL" in os.environ
    ):
        yield
        return

    os.environ["TRTLLM_ENABLE_PDL"] = "1"
    os.environ["TORCHINDUCTOR_ENABLE_PDL"] = "1"
    logger.info_once(
        "Enabling PDL (TRTLLM_ENABLE_PDL=1, TORCHINDUCTOR_ENABLE_PDL=1) "
        "while capturing full CUDA graphs on SM90+ NVIDIA GPU.",
        scope="local",
    )
    try:
        yield
    finally:
        os.environ.pop("TRTLLM_ENABLE_PDL", None)
        os.environ.pop("TORCHINDUCTOR_ENABLE_PDL", None)

🚀 The feature, motivation and pitch

There are two undocumented flags that improve low-latency performance on Hopper and Blackwell:

  • TRTLLM_ENABLE_PDL: Enables PDL (Programmatic Dependent Launch) in the DeepSeek fused QKV A projection.
  • TORCHINDUCTOR_ENABLE_PDL: Enables PDL in all torch.compile'd Triton kernels.

PDL is a valuable feature, but it adds slight host overhead, which can hurt in the prefill phase. We want to enable it only during FULL CUDA graph execution, not in PIECEWISE or NONE graph modes. This adds a fair bit of complexity to the implementation.

I am not sure this can be implemented by simply decorating set_forward_context or similar, since I don't know whether or how torch.compile caches compiled kernels. I am concerned that the PIECEWISE graph capture might pre-populate torch's caches with non-PDL kernels, which would then be reused during the FULL graph capture. I haven't confirmed this, but it seems probable and I don't see a clear path around it.
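The caching concern can be illustrated with a toy model. The sketch below is not how torch.compile works internally; `toy_compile`, both caches, and the kernel name `rmsnorm` are hypothetical stand-ins. It only shows why a cache whose key ignores the PDL flag would keep serving a kernel built before the flag was toggled, and how keying on the flag avoids that.

```python
import os
from typing import Dict, Tuple

# Toy caches (hypothetical): one keyed by kernel name only, one keyed by
# (name, PDL flag). Neither reflects torch.compile's real cache layout.
_naive_cache: Dict[str, str] = {}
_keyed_cache: Dict[Tuple[str, bool], str] = {}


def pdl_requested() -> bool:
    """Read the flag from the environment at call time."""
    return os.environ.get("TORCHINDUCTOR_ENABLE_PDL") == "1"


def toy_compile(name: str, pdl: bool) -> str:
    """Stand-in for compilation: the result records the flag it saw."""
    return f"{name}[pdl={'on' if pdl else 'off'}]"


def get_kernel_naive(name: str) -> str:
    # The flag is consulted only on a cache miss, so a later toggle is ignored.
    if name not in _naive_cache:
        _naive_cache[name] = toy_compile(name, pdl_requested())
    return _naive_cache[name]


def get_kernel_keyed(name: str) -> str:
    # The flag is part of the key, so each setting gets its own entry.
    key = (name, pdl_requested())
    if key not in _keyed_cache:
        _keyed_cache[key] = toy_compile(*key)
    return _keyed_cache[key]
```

If the real cache behaves like `get_kernel_naive`, toggling the environment variable around FULL graph capture would have no effect on kernels already compiled during PIECEWISE capture. Whether that is the case would need to be checked against torch.compile itself.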


Alternatives

Alternatively, we can just leave it up to the caller to provide these flags or add them to the --performance-mode latency feature set.

Either way, we'll need to document this.
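For the caller-provided alternative, a launcher script could prepare the environment before starting the server. The helper below is a minimal sketch; `env_with_pdl` is a hypothetical name, not part of vLLM. It fills in each flag only when the caller has not set it, and unlike the context manager it enables PDL for every graph mode, so the prefill overhead mentioned above still applies.

```python
import os
from typing import Dict, Mapping

# The two flag names come from the issue text.
PDL_FLAGS = ("TRTLLM_ENABLE_PDL", "TORCHINDUCTOR_ENABLE_PDL")


def env_with_pdl(base: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of `base` with both PDL flags defaulted to "1".

    Values the caller already set (including "0") are left untouched,
    and `base` itself is not modified.
    """
    env = dict(base)
    for name in PDL_FLAGS:
        env.setdefault(name, "1")
    return env


if __name__ == "__main__":
    launch_env = env_with_pdl(os.environ)
    print({name: launch_env[name] for name in PDL_FLAGS})
```

The resulting mapping can be passed as the `env` argument of `subprocess.run` when launching the server process.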

Additional context

No response


extent analysis

TL;DR

Enable PDL only during FULL CUDA graph execution by using the provided maybe_enable_pdl_for_full_cudagraph context manager to toggle the TRTLLM_ENABLE_PDL and TORCHINDUCTOR_ENABLE_PDL environment variables.

Guidance

  • Wrap FULL CUDA graph capture in the maybe_enable_pdl_for_full_cudagraph context manager. It leaves the environment untouched for PIECEWISE or NONE graph modes, on non-CUDA or pre-SM90 devices, and when either variable is already set by the user.
  • Verify that the TRTLLM_ENABLE_PDL and TORCHINDUCTOR_ENABLE_PDL environment variables are being set and unset correctly within the context manager.
  • Consider documenting the usage of these environment variables and their impact on performance.
  • Alternatively, consider adding these flags to the --performance-mode latency feature set or leaving it up to the caller to provide them.
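One way to check the set/unset behavior without a GPU is to exercise the same toggle logic in isolation. The sketch below uses a simplified stand-in, `pdl_env_if`, which is a hypothetical name: it replaces the graph-mode and device checks with a single boolean and keeps the rest of the logic from the example implementation.

```python
import os
from contextlib import contextmanager
from typing import Dict, Iterator, Optional

PDL_ENV_VARS = ("TRTLLM_ENABLE_PDL", "TORCHINDUCTOR_ENABLE_PDL")


@contextmanager
def pdl_env_if(enabled: bool) -> Iterator[None]:
    """Hypothetical stand-in for the vLLM context manager.

    Sets both variables to "1" while active, unless `enabled` is False
    or the user has already set either variable.
    """
    if not enabled or any(name in os.environ for name in PDL_ENV_VARS):
        yield
        return
    for name in PDL_ENV_VARS:
        os.environ[name] = "1"
    try:
        yield
    finally:
        # Runs even if the body raises, so the flags never leak out.
        for name in PDL_ENV_VARS:
            os.environ.pop(name, None)


def snapshot() -> Dict[str, Optional[str]]:
    """Current values of both variables; None means unset."""
    return {name: os.environ.get(name) for name in PDL_ENV_VARS}
```

Comparing `snapshot()` before, inside, and after the `with` block shows whether the variables are restored, including when the body raises or when the user has pre-set one of them.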

Example

The provided maybe_enable_pdl_for_full_cudagraph context manager can be used as follows:

with maybe_enable_pdl_for_full_cudagraph(CUDAGraphMode.FULL):
    # Capture the FULL CUDA graph here.
    ...

Notes

The implementation assumes that the torch.compile caching behavior does not interfere with the context manager's ability to toggle the PDL flags. However, this assumption may not hold, and further testing may be necessary to verify the correctness of the implementation.

Recommendation

Apply the workaround using the maybe_enable_pdl_for_full_cudagraph context manager, as it provides a targeted solution for enabling PDL only during FULL CUDA graph execution.
