vllm: [SM120] _dummy_sampler_run hangs indefinitely on RTX 5090 due to top_k=vocab_size-1 triggering an SM120-broken top-k masking kernel (one-line fix)

vllm · 2026-05-18 14:49:48

On physical RTX 5090 hardware (SM120 / compute_120f), vllm serve hangs indefinitely during engine startup. The hang is in _dummy_sampler_run (called from profile_run → _initialize_kv_caches) when it invokes the sampler with top_k = logits.size(1) - 1 (i.e. vocab_size - 1, which is 151935 for Qwen/Qwen2.5-3B-Instruct).

Real production sampling uses small top_k values (e.g. 40, 50) and does not hang. Substituting a small top_k in the dummy run resolves the startup hang with no functional regression — the dummy sampler exists only for KV-cache memory profiling, where realistic top_k values give an identical memory footprint.

Verified workaround

I monkey-patched _dummy_sampler_run to substitute top_k=50 into the dummy SamplingMetadata (via a .pth injection in the venv's site-packages, gated on an env var so it's opt-in). With the patch active, vllm serve reaches Application startup complete in ~5 seconds. No regressions observed in subsequent inference behavior (other independent SM120 issues exist; see "Other notes" below).
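One way such an opt-in patch could be structured is sketched below. It is not the patch used in this issue. The environment variable name, the helper names, and the stand-in classes are all hypothetical, and the sketch assumes that _dummy_sampler_run calls self.sampler(logits=..., sampling_metadata=...) and that the metadata's top_k field can be reassigned. In real vLLM, top_k is a tensor, so make_top_k would need to build a tensor (for example with torch.full_like) rather than return a plain integer.

```python
import functools
import os

ENV_FLAG = "SM120_SMALL_DUMMY_TOPK"  # hypothetical opt-in env var, not a vLLM setting
SMALL_TOP_K = 50


def with_small_dummy_top_k(dummy_run, make_top_k):
    """Wrap a _dummy_sampler_run-like method so its sampler call sees a small top_k.

    make_top_k(old_top_k, k) returns the replacement for the metadata's top_k.
    """

    @functools.wraps(dummy_run)
    def wrapper(self, *args, **kwargs):
        if os.environ.get(ENV_FLAG) != "1":
            return dummy_run(self, *args, **kwargs)  # patch is opt-in

        original_sampler = self.sampler

        def patched_sampler(*, logits, sampling_metadata):
            # Assumption: top_k is a plain, assignable attribute.
            sampling_metadata.top_k = make_top_k(sampling_metadata.top_k, SMALL_TOP_K)
            return original_sampler(logits=logits, sampling_metadata=sampling_metadata)

        self.sampler = patched_sampler
        try:
            return dummy_run(self, *args, **kwargs)
        finally:
            self.sampler = original_sampler  # always restore the real sampler

    return wrapper


# Stand-ins so the sketch runs without vLLM or a GPU (all hypothetical).
class FakeMetadata:
    def __init__(self, top_k):
        self.top_k = top_k


class FakeRunner:
    def __init__(self):
        self.seen_top_k = None
        self.sampler = self._record

    def _record(self, *, logits, sampling_metadata):
        self.seen_top_k = sampling_metadata.top_k
        return "sampled"

    def _dummy_sampler_run(self, vocab_size):
        metadata = FakeMetadata(top_k=vocab_size - 1)
        return self.sampler(logits=None, sampling_metadata=metadata)


FakeRunner._dummy_sampler_run = with_small_dummy_top_k(
    FakeRunner._dummy_sampler_run, make_top_k=lambda old, k: k
)
```

In a real deployment the wrapper would be applied to the runner class in vllm/v1/worker/gpu_model_runner.py before engine startup. Restoring the sampler in a finally block keeps the change confined to the dummy run, so production sampling is untouched.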

One-line proposed fix

vllm/v1/worker/gpu_model_runner.py, in _dummy_sampler_run:

- top_k=dummy_tensors(logits.size(1) - 1),
+ top_k=dummy_tensors(50),

(or any small constant in the typical-production range, e.g. 20–100.)
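A fixed constant of 50 would exceed vocab_size - 1 for a very small test vocabulary. One possible refinement, offered only as a sketch and not part of the proposed patch, is to cap the value at vocab_size - 1. The helper name dummy_top_k and its cap default are hypothetical.

```python
def dummy_top_k(vocab_size: int, cap: int = 50) -> int:
    """Pick a profiling top_k: small and realistic, but never above vocab_size - 1."""
    if vocab_size < 2:
        raise ValueError("vocab_size must be at least 2")
    return min(cap, vocab_size - 1)


print(dummy_top_k(151_936))  # Qwen2.5-3B vocab size from this issue -> 50
print(dummy_top_k(32))       # tiny vocabulary falls back to vocab_size - 1 -> 31
```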

I'd be glad to send a PR. Wanted to surface the diagnosis first in case there's a different preferred remediation (e.g. detect SM120 and short-circuit, or fix the underlying kernel).

Diagnostic evidence (py-spy, captured 2026-05-17 on a real RTX 5090)

Thread MainThread (active): "VLLM::EngineCore"
    apply_top_k_top_p_triton (vllm/v1/sample/ops/topk_topp_triton.py:1000)
    apply_top_k_top_p          (vllm/v1/sample/ops/topk_topp_sampler.py:252)
    forward_native             (vllm/v1/sample/ops/topk_topp_sampler.py:106)
    sample                     (vllm/v1/sample/sampler.py:276)
    forward                    (vllm/v1/sample/sampler.py:97)
    _dummy_sampler_run         (vllm/v1/worker/gpu_model_runner.py:5663)
    profile_run                (vllm/v1/worker/gpu_model_runner.py:5855)
    determine_available_memory (vllm/v1/worker/gpu_worker.py:370)
    _initialize_kv_caches      (vllm/v1/engine/core.py:250)

py-spy dump --native shows the host thread is stuck inside cuLaunchKernel → clock_gettime — a CUDA_LAUNCH_BLOCKING=1 sync poll waiting for a GPU kernel that never returns. The process sits at 100% CPU, GPU at 0% util, allocated memory ~7128 MiB, observed for >9 hours before manual SIGTERM cleanup.

The same hang reproduces with VLLM_USE_FLASHINFER_SAMPLER=1:

top_k_mask_logits          (flashinfer/sampling.py:564)
top_k_mask_logits          (flashinfer/sampling.py:1820)
top_k_top_p_sampling_from_logits (flashinfer/sampling.py:1414)
flashinfer_sample          (vllm/v1/sample/ops/topk_topp_sampler.py:391)
forward_cuda               (vllm/v1/sample/ops/topk_topp_sampler.py:140)
sample                     (vllm/v1/sample/sampler.py:276)
_dummy_sampler_run         (vllm/v1/worker/gpu_model_runner.py:5663)
...

So both the Triton-based and the FlashInfer-based top-k masking kernels hang for this top_k value on SM120. Switching between the two doesn't help; the bug appears to be specific to the codegen/scheduling these kernels emit when top_k approaches vocab_size.
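For anyone who wants to capture comparable evidence on their own machine, the py-spy dumps above can be collected with a small wrapper like the one below. This is a sketch: the function names are hypothetical, it assumes py-spy is installed and on PATH, and attaching to another process may require elevated privileges depending on the system's ptrace settings.

```python
import subprocess


def py_spy_dump_cmd(pid: int, native: bool = True) -> list[str]:
    """Build the `py-spy dump` command line for a given process ID."""
    cmd = ["py-spy", "dump", "--pid", str(pid)]
    if native:
        cmd.append("--native")  # include native (C/CUDA driver) frames
    return cmd


def capture_stack(pid: int, native: bool = True, timeout_s: float = 60.0) -> str:
    """Run py-spy against a (possibly hung) process and return the dump text."""
    result = subprocess.run(
        py_spy_dump_cmd(pid, native),
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,
    )
    return result.stdout
```

Passing the PID of the VLLM::EngineCore process to capture_stack would return a dump in the same format as the traces shown here.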

Why the dummy-run argument is the worst case for these kernels

_dummy_sampler_run at gpu_model_runner.py line 5648 (vLLM 0.20.1) creates:

dummy_metadata = SamplingMetadata(
    temperature=dummy_tensors(0.5),
    all_greedy=False,
    all_random=False,
    top_p=dummy_tensors(0.9),
    top_k=dummy_tensors(logits.size(1) - 1),  # <-- vocab_size - 1
    ...
)

logits.size(1) is the model's vocab size. For Qwen2.5-3B this is 151,936, so top_k becomes 151,935 — essentially "keep top-151,935 of 151,936". This is presumably intended as a worst-case argument for memory profiling, but it exercises a kernel codegen path that doesn't terminate on SM120 in either implementation.
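To make the "keep top-151,935 of 151,936" point concrete, here is a pure-Python reference for what top-k masking computes. It illustrates the semantics only and is not the Triton or FlashInfer kernel; the function name top_k_mask is hypothetical.

```python
import math


def top_k_mask(logits: list[float], k: int) -> list[float]:
    """Reference semantics: keep the k largest logits, set the rest to -inf."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be in [1, len(logits)]")
    threshold = sorted(logits, reverse=True)[k - 1]  # k-th largest value
    return [x if x >= threshold else -math.inf for x in logits]


logits = [2.0, -1.0, 0.5, 3.0, 0.0]
small_k = top_k_mask(logits, 2)                   # masks 3 of the 5 entries
near_vocab = top_k_mask(logits, len(logits) - 1)  # masks only the smallest entry
```

With k = len(logits) - 1 the mask removes exactly one entry, so the dummy run's top_k asks the kernel to search almost the whole vocabulary in order to discard a single token.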

Real production requests pass per-call top_k from the SamplingParams (typically ≤ 100). Substituting a realistic value in the dummy run keeps the memory-profiling intent intact while avoiding the buggy code path.

Hardware and software

Item	Value
GPU	NVIDIA GeForce RTX 5090 (SM120 / compute_120f, 32 GB VRAM)
OS	Ubuntu 24.04.4 LTS, kernel 6.17.0-22-generic
NVIDIA driver	595.58.03 (open kernel modules)
CUDA toolkit	13.2
torch	2.11.0+cu130
vllm	0.20.1 (PyPI wheel)
flashinfer-python	0.6.11 (PyPI wheel, JIT-only)
Env	CUDA_LAUNCH_BLOCKING=1, FLASHINFER_DISABLE_VERSION_CHECK=1, --attention-backend flashinfer

Not WSL2, not DGX Spark / SM121 — physical bare-metal RTX 5090.

Other notes (not blocking this fix)

After applying the workaround, two additional SM120 hangs become visible later in the lifecycle:

1. With --enforce-eager: the first inference request hangs in default_unquantized_gemm at vllm/model_executor/layers/utils.py:98 (the lm-head matmul). Same pattern: the kernel is launched, the host blocks on the CUDA_LAUNCH_BLOCKING=1 sync, and the GPU never returns.
2. With cudagraph_mode=FULL_AND_PIECEWISE: weight loading stalls and the GPU enters [requires reset] at 6752 MiB allocated.

These appear to be independent kernel-scheduling issues on SM120 affecting different kernel families. The dummy_sampler fix is independent and worth landing on its own; the others I plan to file separately (or comment on existing trackers) once I have similar diagnostic evidence.

Reproduction

Start vLLM 0.20.1 (0.21.x is likely affected too, but was not retested with the fix) on an RTX 5090 with default sampling settings:

CUDA_LAUNCH_BLOCKING=1 FLASHINFER_DISABLE_VERSION_CHECK=1 \
vllm serve Qwen/Qwen2.5-3B-Instruct \
  --enforce-eager \
  --attention-backend flashinfer \
  --gpu-memory-utilization 0.85 \
  --max-model-len 2048 \
  --port 8000

Startup hangs indefinitely at the profile_run step. With top_k=50 substituted in _dummy_sampler_run, the server reaches Application startup complete in ~5 seconds.
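Because the failure mode is a silent hang rather than a crash, a reproduction script benefits from a startup deadline. The sketch below polls a readiness probe until a timeout expires. The function names are hypothetical, and the default probe assumes the server's /health endpoint is reachable on port 8000 as in the command above.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str = "http://localhost:8000/health") -> bool:
    """Return True if the server answers the health check with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # not listening yet, or the request timed out


def wait_until_ready(
    probe: Callable[[], bool], timeout_s: float, interval_s: float = 1.0
) -> bool:
    """Poll `probe` until it succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False
```

Calling wait_until_ready(http_probe, timeout_s=300) after launching vllm serve would return False on an affected setup and True once the workaround is active; the 300-second budget is an arbitrary choice.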

Offer

Reliable RTX 5090 reproducer available. Happy to test the official patch (or any alternative) on this hardware with overnight turnaround.
