vllm - 💡(How to fix) Fix [Bug]: --kv-cache-dtype nvfp4 crashes at first request on SM120 instead of failing fast at init

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

Error Message

File ".../vllm/v1/attention/backends/flashinfer.py", line 1170, in build prefill_wrapper.plan(...) File ".../flashinfer/prefill.py", line 1859, in plan kv_data_type = canonicalize_torch_dtype(kv_data_type) File ".../flashinfer/utils.py", line 254, in canonicalize_torch_dtype return getattr(torch, dtype) AttributeError: module 'torch' has no attribute 'nvfp4'

Root Cause

--kv-cache-dtype nvfp4 on an NVFP4 checkpoint starts up clean, captures graphs, then kills the engine on the first request (not at config validation). NVFP4 weights are fine; only the NVFP4 KV-cache attention path fails. Root cause is upstream — trtllm-gen FP4 FMHA has no sm_120 kernel (NVIDIA/TensorRT-LLM#10241, #11799) — but vLLM in front of it accepts the flag and dies cryptically rather than rejecting it.

Fix Action

Fix / Workaround

Workaround: --kv-cache-dtype fp8.

Code Example

GPU          NVIDIA RTX PRO 6000 Blackwell Workstation Edition (sm_120, 96 GB)
Driver       595.71.05
CPU          Ampere Altra - ARM Neoverse-N1, 64 cores (aarch64)
OS           Ubuntu 24.04.4 LTS, kernel 6.17.0-23-generic
vllm         0.21.1rc1.dev269+gb06813e87
torch        2.11.0+cu130 (CUDA 13.0)
flashinfer   0.6.11.post2
transformers 5.9.0 · triton 3.6.0

---

File ".../vllm/v1/attention/backends/flashinfer.py", line 1170, in build
    prefill_wrapper.plan(...)
File ".../flashinfer/prefill.py", line 1859, in plan
    kv_data_type = canonicalize_torch_dtype(kv_data_type)
File ".../flashinfer/utils.py", line 254, in canonicalize_torch_dtype
    return getattr(torch, dtype)
AttributeError: module 'torch' has no attribute 'nvfp4'

---

File ".../flashinfer/prefill.py", line 254, in _paged_run
    op.trtllm_paged_attention_context(...)
RuntimeError: Error in function 'TllmGenFmhaRunner' at .../trtllm/fmha/fmhaRunner.cuh:
Unsupported architecture

---

vllm serve <an-NVFP4-checkpoint> \
    --kv-cache-dtype nvfp4 \
    --max-model-len 8192
# starts fine; first /v1/chat/completions request -> EngineDeadError
RAW_BUFFERClick to expand / collapse

Environment

GPU          NVIDIA RTX PRO 6000 Blackwell Workstation Edition (sm_120, 96 GB)
Driver       595.71.05
CPU          Ampere Altra - ARM Neoverse-N1, 64 cores (aarch64)
OS           Ubuntu 24.04.4 LTS, kernel 6.17.0-23-generic
vllm         0.21.1rc1.dev269+gb06813e87
torch        2.11.0+cu130 (CUDA 13.0)
flashinfer   0.6.11.post2
transformers 5.9.0 · triton 3.6.0

(vllm collect-env can't run here — it shells out to pip and crashes on uv-only envs: get_pip_packages -> 'NoneType' object has no attribute 'splitlines'. Minor, separate.)

Bug

--kv-cache-dtype nvfp4 on an NVFP4 checkpoint starts up clean, captures graphs, then kills the engine on the first request (not at config validation). NVFP4 weights are fine; only the NVFP4 KV-cache attention path fails. Root cause is upstream — trtllm-gen FP4 FMHA has no sm_120 kernel (NVIDIA/TensorRT-LLM#10241, #11799) — but vLLM in front of it accepts the flag and dies cryptically rather than rejecting it.

The first request fails in flashinfer.prefill.plan, because vLLM passes the literal string "nvfp4" as kv_data_type and flashinfer resolves it with getattr(torch, "nvfp4"):

File ".../vllm/v1/attention/backends/flashinfer.py", line 1170, in build
    prefill_wrapper.plan(...)
File ".../flashinfer/prefill.py", line 1859, in plan
    kv_data_type = canonicalize_torch_dtype(kv_data_type)
File ".../flashinfer/utils.py", line 254, in canonicalize_torch_dtype
    return getattr(torch, dtype)
AttributeError: module 'torch' has no attribute 'nvfp4'

Stock torch has no nvfp4 attribute (the packed FP4 dtype is torch.float4_e2m1fn_x2, and the KV buffer is allocated as torch.uint8). Aliasing torch.nvfp4 = torch.uint8 clears plan() and the strict _check_cached_qkv_data_type (buffer is uint8), and reaches the real wall:

File ".../flashinfer/prefill.py", line 254, in _paged_run
    op.trtllm_paged_attention_context(...)
RuntimeError: Error in function 'TllmGenFmhaRunner' at .../trtllm/fmha/fmhaRunner.cuh:
Unsupported architecture

vLLM forces this path (vllm/v1/attention/backends/flashinfer.py:773: backend = "trtllm-gen" if self.is_kvcache_nvfp4 else "auto"), and trtllm-gen FMHA has no sm_120 build.

Reproducer

vllm serve <an-NVFP4-checkpoint> \
    --kv-cache-dtype nvfp4 \
    --max-model-len 8192
# starts fine; first /v1/chat/completions request -> EngineDeadError

Expected behavior

  1. Fail fast at init. If kv_cache_dtype=nvfp4 is selected on an arch without a trtllm-gen FP4 FMHA kernel (sm_120/sm_121), reject it at engine init with a clear, actionable error (e.g. "nvfp4 KV cache requires sm_100/sm_103; use fp8"), instead of an AttributeError on the first token followed by Unsupported architecture.
  2. Resolve the dtype internally. Map the nvfp4 kv-cache-dtype string to the actual storage dtype (torch.uint8) vLLM hands flashinfer, rather than relying on getattr(torch, "nvfp4") resolving in whatever torch/flashinfer happens to be pinned. Today that only works by accident of versions and breaks silently otherwise.

Workaround: --kv-cache-dtype fp8.

Related

  • #32220 — NVFP4 KV Cache Support (the feature this sits behind; closed/implemented)
  • flashinfer-ai/flashinfer#2143 (fp4 KV for trtllm paged attention, P0), #2294 (SM120 nvfp4 KV decode via XQA — done), #2555 (SM120 attention backend validation), #2207 (fp4 KV head_dim 128 gap), #2577 (SM120 fp4 GEMM)
  • NVIDIA/TensorRT-LLM#10241, #11799 (SM120 trtllm-gen FP4 FMHA kernel / cubins)

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

FAQ

Expected behavior

  1. Fail fast at init. If kv_cache_dtype=nvfp4 is selected on an arch without a trtllm-gen FP4 FMHA kernel (sm_120/sm_121), reject it at engine init with a clear, actionable error (e.g. "nvfp4 KV cache requires sm_100/sm_103; use fp8"), instead of an AttributeError on the first token followed by Unsupported architecture.
  2. Resolve the dtype internally. Map the nvfp4 kv-cache-dtype string to the actual storage dtype (torch.uint8) vLLM hands flashinfer, rather than relying on getattr(torch, "nvfp4") resolving in whatever torch/flashinfer happens to be pinned. Today that only works by accident of versions and breaks silently otherwise.

Workaround: --kv-cache-dtype fp8.

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

vllm - 💡(How to fix) Fix [Bug]: --kv-cache-dtype nvfp4 crashes at first request on SM120 instead of failing fast at init