vllm - ✅(Solved) Fix [Feature]: Dynamic PDL Enablement [1 pull requests, 6 comments, 5 participants]

vllm • 2026-04-21 19:13:51

GitHub stats

vllm-project/vllm#40543•Fetched 2026-04-22 07:43:56

Fix Action

Fixed

Fixed by PR: Auto enable pdl (https://github.com/vllm-project/vllm/pull/40555)

PR fix notes

PR #40555: Auto enable pdl

Repository: vllm-project/vllm
Author: benchislett
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/40555

Description (problem / solution / changelog)

Purpose

Automatically enable PDL when available.

Unfortunately, there are some downsides here at large batch sizes. See: https://github.com/vllm-project/vllm/issues/40543

Test Plan

Test Result

Changed files

vllm/platforms/cuda.py (modified, +5/-0)

🚀 The feature, motivation and pitch

There are two undocumented flags that improve low-latency performance on Hopper and Blackwell:

TRTLLM_ENABLE_PDL: Enable PDL in the DeepSeek fused QKV A Projection.
TORCHINDUCTOR_ENABLE_PDL: Enable PDL in all torch.compile'd triton kernels.

PDL (Programmatic Dependent Launch) is a valuable feature, but it adds a small amount of host overhead, which can hurt the prefill phase. We want to enable it only during FULL CUDA graph execution, not in the PIECEWISE or NONE graph modes. This adds a fair amount of complexity to the implementation.

I am not sure whether this can be implemented simply by decorating set_forward_context or similar, because I don't know whether or how torch.compile caches compiled kernels. My concern is that the PIECEWISE graph capture might pre-populate torch's caches with non-PDL kernels, which would then be reused during the FULL graph capture. I'm not sure this is the case, but it seems probable and I don't see a clear path around it.
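The caching concern can be illustrated with a toy model. The sketch below is not torch.compile's actual cache (its real keying is exactly what is unknown here); `compile_kernel`, `compile_kernel_keyed`, and both dictionaries are hypothetical names. It only shows why a cache that ignores the flag would make a later toggle ineffective, and how including the flag in the key would avoid that.

```python
import os

FLAG = "TORCHINDUCTOR_ENABLE_PDL"

# Hypothetical cache keyed by kernel name only (assumption, for illustration).
_cache_by_name: dict[str, str] = {}
# Hypothetical cache keyed by (kernel name, PDL setting).
_cache_by_name_and_flag: dict[tuple[str, bool], str] = {}


def pdl_requested() -> bool:
    """Read the flag from the environment at call time."""
    return os.environ.get(FLAG) == "1"


def compile_kernel(name: str) -> str:
    """The flag is consulted only on a cache miss, so later toggles are ignored."""
    if name not in _cache_by_name:
        _cache_by_name[name] = f"{name}[pdl={pdl_requested()}]"
    return _cache_by_name[name]


def compile_kernel_keyed(name: str) -> str:
    """The flag is part of the key, so each setting gets its own entry."""
    key = (name, pdl_requested())
    if key not in _cache_by_name_and_flag:
        _cache_by_name_and_flag[key] = f"{name}[pdl={key[1]}]"
    return _cache_by_name_and_flag[key]


# "PIECEWISE" pass: flag unset, kernels are compiled without PDL.
os.environ.pop(FLAG, None)
piecewise_stale = compile_kernel("rms_norm")
piecewise_keyed = compile_kernel_keyed("rms_norm")

# "FULL" pass: flag set, but the name-only cache returns the old kernel.
os.environ[FLAG] = "1"
full_stale = compile_kernel("rms_norm")
full_keyed = compile_kernel_keyed("rms_norm")
os.environ.pop(FLAG, None)

print(full_stale)  # rms_norm[pdl=False]
print(full_keyed)  # rms_norm[pdl=True]
```

If the real cache behaves like the first variant, toggling the environment variable around the FULL capture alone would not be enough.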

<details><summary>Example Implementation</summary> <p>


import os
from collections.abc import Iterator
from contextlib import contextmanager

from vllm.config import CUDAGraphMode
from vllm.logger import init_logger
from vllm.platforms import current_platform

logger = init_logger(__name__)


@contextmanager
def maybe_enable_pdl_for_full_cudagraph(
    cudagraph_runtime_mode: CUDAGraphMode,
) -> Iterator[None]:
    # PDL reduces kernel launch overhead when kernels are captured into a
    # full CUDA graph on SM90+, but its host-side cost outweighs the win
    # outside of one. Toggle the env vars only while capturing a FULL
    # graph so PIECEWISE / non-graph paths aren't affected, and defer to
    # the user if they have already chosen an explicit setting.
    if (
        cudagraph_runtime_mode != CUDAGraphMode.FULL
        or not current_platform.is_cuda()
        or not current_platform.has_device_capability(90)
        or "TRTLLM_ENABLE_PDL" in os.environ
        or "TORCHINDUCTOR_ENABLE_PDL" in os.environ
    ):
        yield
        return

    os.environ["TRTLLM_ENABLE_PDL"] = "1"
    os.environ["TORCHINDUCTOR_ENABLE_PDL"] = "1"
    logger.info_once(
        "Enabling PDL (TRTLLM_ENABLE_PDL=1, TORCHINDUCTOR_ENABLE_PDL=1) "
        "while capturing full CUDA graphs on SM90+ NVIDIA GPU.",
        scope="local",
    )
    try:
        yield
    finally:
        # Safe to pop unconditionally: this branch is reached only when
        # neither variable was set by the user beforehand.
        os.environ.pop("TRTLLM_ENABLE_PDL", None)
        os.environ.pop("TORCHINDUCTOR_ENABLE_PDL", None)

</p> </details>

Alternatives

Alternatively, we can just leave it up to the caller to provide these flags or add them to the --performance-mode latency feature set.

Either way, we'll need to document this.
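For the caller-provided alternative, a minimal sketch of preparing the environment by hand might look like the following. The helper name `with_pdl_flags` is hypothetical, and unlike the context manager above it fills in each flag independently rather than backing off when either one is already set.

```python
import os
from collections.abc import Mapping

PDL_FLAGS = ("TRTLLM_ENABLE_PDL", "TORCHINDUCTOR_ENABLE_PDL")


def with_pdl_flags(base_env: Mapping[str, str]) -> dict[str, str]:
    """Return a copy of base_env with both PDL flags set to "1".

    A value the caller already chose (including "0") is left untouched.
    """
    env = dict(base_env)
    for flag in PDL_FLAGS:
        env.setdefault(flag, "1")
    return env


if __name__ == "__main__":
    env = with_pdl_flags(os.environ)
    print({flag: env[flag] for flag in PDL_FLAGS})
```

The resulting mapping could be passed as `env=` to `subprocess.run` when launching the server process, so the flags are in place before any kernels are compiled. This applies PDL to every graph mode, which is the prefill trade-off described above.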

extent analysis

TL;DR

Enable PDL only during FULL CUDA graph execution by using the provided maybe_enable_pdl_for_full_cudagraph context manager to toggle the TRTLLM_ENABLE_PDL and TORCHINDUCTOR_ENABLE_PDL environment variables.

Guidance

Use the maybe_enable_pdl_for_full_cudagraph context manager to enable PDL for FULL CUDA graph execution and disable it for PIECEWISE or NONE graph modes.
Verify that the TRTLLM_ENABLE_PDL and TORCHINDUCTOR_ENABLE_PDL environment variables are being set and unset correctly within the context manager.
Consider documenting the usage of these environment variables and their impact on performance.
Alternatively, consider adding these flags to the --performance-mode latency feature set or leaving it up to the caller to provide them.

Example

The provided maybe_enable_pdl_for_full_cudagraph context manager can be used as follows:

with maybe_enable_pdl_for_full_cudagraph(CUDAGraphMode.FULL):
    # Capture the FULL CUDA graph here; the flags are visible only inside
    # this block.
    ...

Notes

The implementation assumes that the torch.compile caching behavior does not interfere with the context manager's ability to toggle the PDL flags. However, this assumption may not hold, and further testing may be necessary to verify the correctness of the implementation.

Recommendation

Apply the workaround using the maybe_enable_pdl_for_full_cudagraph context manager, as it provides a targeted solution for enabling PDL only during FULL CUDA graph execution.
