1. **Fail fast at init.** If `kv_cache_dtype=nvfp4` is selected on an arch without a trtllm-gen FP4 FMHA kernel (sm_120/sm_121), reject it at engine init with a clear, actionable error (e.g. *"nvfp4 KV cache requires sm_100/sm_103; use fp8"*), instead of an `AttributeError` on the first token followed by `Unsupported architecture`.
2. **Resolve the dtype internally.** Map the `nvfp4` kv-cache-dtype string to the actual storage dtype (`torch.uint8`) vLLM hands flashinfer, rather than relying on `getattr(torch, "nvfp4")` resolving in whatever torch/flashinfer happens to be pinned. Today that only works by accident of versions and breaks silently otherwise.

Workaround: `--kv-cache-dtype fp8`.

vllm [Bug]: --kv-cache-dtype nvfp4 crashes at first request on SM120 instead of failing fast at init

Error Message

File ".../vllm/v1/attention/backends/flashinfer.py", line 1170, in build
    prefill_wrapper.plan(...)
File ".../flashinfer/prefill.py", line 1859, in plan
    kv_data_type = canonicalize_torch_dtype(kv_data_type)
File ".../flashinfer/utils.py", line 254, in canonicalize_torch_dtype
    return getattr(torch, dtype)
AttributeError: module 'torch' has no attribute 'nvfp4'

Root Cause

--kv-cache-dtype nvfp4 on an NVFP4 checkpoint starts up cleanly, captures graphs, then kills the engine on the first request rather than failing at config validation. NVFP4 weights are fine; only the NVFP4 KV-cache attention path fails. The root cause is upstream: trtllm-gen FP4 FMHA has no sm_120 kernel (NVIDIA/TensorRT-LLM#10241, #11799). vLLM, sitting in front of it, accepts the flag and then dies with a cryptic error instead of rejecting it.

Environment

GPU          NVIDIA RTX PRO 6000 Blackwell Workstation Edition (sm_120, 96 GB)
Driver       595.71.05
CPU          Ampere Altra - ARM Neoverse-N1, 64 cores (aarch64)
OS           Ubuntu 24.04.4 LTS, kernel 6.17.0-23-generic
vllm         0.21.1rc1.dev269+gb06813e87
torch        2.11.0+cu130 (CUDA 13.0)
flashinfer   0.6.11.post2
transformers 5.9.0 · triton 3.6.0

(vllm collect-env can't run here: it shells out to pip and crashes on uv-only envs with get_pip_packages -> 'NoneType' object has no attribute 'splitlines'. Minor, and separate from this bug.)

Bug

The first request fails in flashinfer.prefill.plan, because vLLM passes the literal string "nvfp4" as kv_data_type and flashinfer resolves it with getattr(torch, "nvfp4"):

File ".../vllm/v1/attention/backends/flashinfer.py", line 1170, in build
    prefill_wrapper.plan(...)
File ".../flashinfer/prefill.py", line 1859, in plan
    kv_data_type = canonicalize_torch_dtype(kv_data_type)
File ".../flashinfer/utils.py", line 254, in canonicalize_torch_dtype
    return getattr(torch, dtype)
AttributeError: module 'torch' has no attribute 'nvfp4'

Stock torch has no nvfp4 attribute (the packed FP4 dtype is torch.float4_e2m1fn_x2, and the KV buffer is allocated as torch.uint8). Aliasing torch.nvfp4 = torch.uint8 gets past plan() and the strict _check_cached_qkv_data_type check (the buffer is uint8), and then hits the real wall:

File ".../flashinfer/prefill.py", line 254, in _paged_run
    op.trtllm_paged_attention_context(...)
RuntimeError: Error in function 'TllmGenFmhaRunner' at .../trtllm/fmha/fmhaRunner.cuh:
Unsupported architecture

vLLM forces this path (vllm/v1/attention/backends/flashinfer.py:773: backend = "trtllm-gen" if self.is_kvcache_nvfp4 else "auto"), and trtllm-gen FMHA has no sm_120 build.
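
Since the backend choice is already fixed when the nvfp4 KV cache is selected, the architecture check could in principle run before any request arrives. The following is a minimal sketch of such a check, not vLLM's actual code: the function name `validate_kv_cache_dtype` and the constant `NVFP4_KV_SUPPORTED_CAPABILITIES` are hypothetical, and the supported set (sm_100/sm_103) is taken from the error message proposed in this report rather than from any kernel inventory.

```python
from __future__ import annotations

# Hypothetical constant: compute capabilities assumed to have a trtllm-gen
# FP4 FMHA kernel, per the sm_100/sm_103 wording proposed in this report.
NVFP4_KV_SUPPORTED_CAPABILITIES = frozenset({(10, 0), (10, 3)})


def validate_kv_cache_dtype(kv_cache_dtype: str, capability: tuple[int, int]) -> None:
    """Reject nvfp4 KV cache on unsupported architectures (hypothetical helper).

    `capability` is a (major, minor) compute capability, e.g. (12, 0) for sm_120.
    """
    if kv_cache_dtype != "nvfp4":
        return  # other KV-cache dtypes are not the subject of this check
    if capability not in NVFP4_KV_SUPPORTED_CAPABILITIES:
        major, minor = capability
        raise ValueError(
            f"kv_cache_dtype='nvfp4' is not supported on sm_{major}{minor}: "
            "nvfp4 KV cache requires sm_100/sm_103; use --kv-cache-dtype fp8"
        )
```

In a real engine the capability tuple would presumably come from something like `torch.cuda.get_device_capability()`, which returns a `(major, minor)` pair; where exactly such a check belongs in vLLM's init path is left open here.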

Reproducer

vllm serve <an-NVFP4-checkpoint> \
    --kv-cache-dtype nvfp4 \
    --max-model-len 8192
# starts fine; first /v1/chat/completions request -> EngineDeadError

Expected behavior

1. Fail fast at init. If kv_cache_dtype=nvfp4 is selected on an arch without a trtllm-gen FP4 FMHA kernel (sm_120/sm_121), reject it at engine init with a clear, actionable error (e.g. "nvfp4 KV cache requires sm_100/sm_103; use fp8"), instead of an AttributeError on the first token followed by Unsupported architecture.
2. Resolve the dtype internally. Map the nvfp4 kv-cache-dtype string to the actual storage dtype (torch.uint8) vLLM hands flashinfer, rather than relying on getattr(torch, "nvfp4") resolving in whatever torch/flashinfer happens to be pinned. Today that only works by accident of versions and breaks silently otherwise.

Workaround: --kv-cache-dtype fp8.
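
The "resolve the dtype internally" point can be illustrated with a small lookup that runs before the name is handed to flashinfer. This is only a sketch under stated assumptions: `KV_CACHE_STORAGE_DTYPE_NAMES` and `kv_storage_dtype_name` are hypothetical names, and the single nvfp4 -> uint8 entry is the only mapping this report establishes.

```python
# Hypothetical mapping from kv-cache-dtype strings to the name of the torch
# dtype that actually backs the KV buffer. Only the nvfp4 entry is taken
# from this report (the buffer is allocated as torch.uint8).
KV_CACHE_STORAGE_DTYPE_NAMES = {"nvfp4": "uint8"}


def kv_storage_dtype_name(kv_cache_dtype: str) -> str:
    """Return the torch attribute name to resolve for a kv-cache-dtype string.

    Strings without an explicit entry are returned unchanged, so their
    existing handling is not altered by this sketch.
    """
    return KV_CACHE_STORAGE_DTYPE_NAMES.get(kv_cache_dtype, kv_cache_dtype)
```

With this in place, `getattr(torch, kv_storage_dtype_name("nvfp4"))` would resolve to `torch.uint8` regardless of whether the installed torch exposes an `nvfp4` attribute. It does not address the missing sm_120 kernel, which still needs the init-time rejection.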

Related

#32220: NVFP4 KV Cache Support (the feature this sits behind; closed/implemented)
flashinfer-ai/flashinfer#2143 (fp4 KV for trtllm paged attention, P0), #2294 (SM120 nvfp4 KV decode via XQA, done), #2555 (SM120 attention backend validation), #2207 (fp4 KV head_dim 128 gap), #2577 (SM120 fp4 GEMM)
NVIDIA/TensorRT-LLM#10241, #11799 (SM120 trtllm-gen FP4 FMHA kernel / cubins)
