vllm - 💡(How to fix) Fix [Bug]: CUTLASS block FP8 kernel failure on Blackwell GB300 [5 comments, 2 participants]

vllm · 2026-04-08 23:42:31


GitHub stats

vllm-project/vllm#39367 • Fetched 2026-04-09 07:51:33


Author

qizzzh

Participants

mgoin

qizzzh

Timeline (top)

commented ×5, mentioned ×4, subscribed ×4, closed ×1

Error Message

cutlass_scaled_mm_supports_block_fp8() incorrectly reports support on Blackwell GB300 (cc = 10.3, so to_int() = 103 ≥ 100), but the kernel raises "RuntimeError: Error Internal" during the profile/dummy run. The patch forces the Triton fallback for all Blackwell devices. This works correctly but carries a performance penalty.
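
To make the "cc = 10.3, to_int() = 103" arithmetic concrete, here is a minimal sketch. The DeviceCapability class below is a stand-in written for illustration, not copied from vLLM; it assumes the integer encoding is major * 10 + minor, which matches the values quoted in the issue.

```python
from typing import NamedTuple, Optional


class DeviceCapability(NamedTuple):
    """Illustrative stand-in for the capability tuple (assumed definition)."""

    major: int
    minor: int

    def to_int(self) -> int:
        # Assumed encoding: major * 10 + minor, so (10, 3) becomes 103.
        return self.major * 10 + self.minor


def capability_as_int(cap: Optional[DeviceCapability]) -> int:
    # Mirrors the "-1 if None" convention used in the patched function.
    return -1 if cap is None else cap.to_int()


gb300 = DeviceCapability(major=10, minor=3)   # cc 10.3, as reported in the issue
hopper = DeviceCapability(major=9, minor=0)   # cc 9.0, shown for contrast
```

Under this encoding the GB300 value lands at or above the 100 threshold that the patch checks, while a cc 9.0 device does not.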

Fix Action

Fix / Workaround

For the patch below, vLLM was built from source.


Your current environment

I initially installed vLLM following the instructions at https://docs.vllm.ai/en/latest/getting_started/installation/gpu/#nvidia-cuda.

For the patch below, vLLM was built from source.

Both installations are version 0.19.0.

🐛 Describe the bug

diff --git a/vllm/model_executor/layers/quantization/utils/w8a8_utils.py b/vllm/model_executor/layers/quantization/utils/w8a8_utils.py
--- a/vllm/model_executor/layers/quantization/utils/w8a8_utils.py
+++ b/vllm/model_executor/layers/quantization/utils/w8a8_utils.py
@@ -23,6 +23,11 @@ def cutlass_block_fp8_supported() -> bool:
     capability_tuple = current_platform.get_device_capability()
     capability = -1 if capability_tuple is None else capability_tuple.to_int()
 
+    # CUTLASS block FP8 kernels fail at runtime on Blackwell (B300, cc >= 100).
+    # Fall back to Triton instead.
+    if capability >= 100:
+        return False
+
     return ops.cutlass_scaled_mm_supports_block_fp8(capability)

The failure can be reproduced by starting the server with the following command and sending any request:

vllm serve Qwen/Qwen3.5-397B-A17B-FP8 \
  --seed 42 \
  --max-model-len 10240 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 512 \
  --no-enable-prefix-caching \
  --no-enable-log-requests \
  --tensor-parallel-size 4 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.95
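
For "any request", a minimal sketch of a client call follows. It assumes the server is reachable at http://localhost:8000 (the base URL and the max_tokens value are assumptions, not taken from the issue) and uses the OpenAI-compatible /v1/completions route. The helper names are hypothetical.

```python
import json
import urllib.request

MODEL = "Qwen/Qwen3.5-397B-A17B-FP8"  # model name from the serve command


def build_completion_request(
    prompt: str,
    model: str = MODEL,
    base_url: str = "http://localhost:8000",  # assumed host and port
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible completions route."""
    payload = {"model": model, "prompt": prompt, "max_tokens": 16}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_completion(prompt: str) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt), timeout=60) as resp:
        return json.loads(resp.read())["choices"][0]["text"]
```

Calling send_completion("Hello") against the running server is enough to exercise the serving path. Since the error is reported during the profile/dummy run, the server may fail before it accepts any request.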


extent analysis

TL;DR

The issue can be mitigated by modifying cutlass_block_fp8_supported to return False when the integer compute capability is 100 or higher (Blackwell), which forces a fallback to Triton.

Guidance

The cutlass_scaled_mm_supports_block_fp8 function incorrectly reports support for Blackwell devices, causing a runtime error.
The provided patch forces the Triton fallback for all Blackwell devices, which works correctly but has a performance penalty.
To verify the issue, run the provided vllm serve command with the specified parameters.
The modified cutlass_block_fp8_supported function can be used as a temporary workaround to avoid the runtime error.
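
The decision logic in the guidance above can be sketched in isolation, without a GPU. The function name choose_block_fp8_backend and the injected cutlass_reports_support callable are hypothetical; the callable stands in for ops.cutlass_scaled_mm_supports_block_fp8 so the logic can be exercised on any machine.

```python
from typing import Callable, Optional

BLACKWELL_MIN_CAPABILITY = 100  # threshold used by the patch in the issue


def choose_block_fp8_backend(
    capability: Optional[int],
    cutlass_reports_support: Callable[[int], bool],
) -> str:
    """Return "cutlass" or "triton", following the workaround's logic."""
    cap = -1 if capability is None else capability
    # Workaround: never pick CUTLASS on Blackwell, whatever the check reports.
    if cap >= BLACKWELL_MIN_CAPABILITY:
        return "triton"
    return "cutlass" if cutlass_reports_support(cap) else "triton"


def fake_support_check(cap: int) -> bool:
    # Hypothetical stand-in for the real check; reports support from cc 9.0 up.
    return cap >= 90
```

With the fake check, a capability of 103 resolves to "triton" even though the check itself reports support, which is the behavior the patch is meant to enforce.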

Example

# Imports assumed to match those already present in w8a8_utils.py.
from vllm import _custom_ops as ops
from vllm.platforms import current_platform


def cutlass_block_fp8_supported() -> bool:
    capability_tuple = current_platform.get_device_capability()
    capability = -1 if capability_tuple is None else capability_tuple.to_int()

    # CUTLASS block FP8 kernels fail at runtime on Blackwell (B300, cc >= 100).
    # Fall back to Triton instead.
    if capability >= 100:
        return False

    return ops.cutlass_scaled_mm_supports_block_fp8(capability)

Notes

The provided patch is a temporary workaround, and a more permanent solution may be required to avoid the performance penalty.
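
One possible direction for a less blunt guard, which is not proposed in the issue, is to probe the kernel once and fall back only if the trial call fails. The sketch below uses hypothetical names and simulated kernels. It assumes the failure surfaces as a catchable RuntimeError that leaves the process usable, which would need to be confirmed on real hardware.

```python
from typing import Callable


def kernel_runs(run_once: Callable[[], object]) -> bool:
    """Return True if one trial invocation completes without RuntimeError."""
    try:
        run_once()
    except RuntimeError:
        return False
    return True


def select_backend(reports_support: bool, run_once: Callable[[], object]) -> str:
    # Use CUTLASS only if it is both reported as supported and survives a trial.
    if reports_support and kernel_runs(run_once):
        return "cutlass"
    return "triton"


def failing_kernel() -> None:
    # Simulates the error message quoted in the issue.
    raise RuntimeError("Error Internal")


def working_kernel() -> None:
    # Simulates a kernel that completes normally.
    return None
```

A probe like this would keep the CUTLASS path on any device where the kernel does work, at the cost of one trial launch during startup.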

Recommendation

Apply the workaround: use the modified cutlass_block_fp8_supported function as a temporary measure to avoid the runtime error, accepting the performance penalty of the Triton fallback.


#installation #latency issue #model loading #dependency error #configuration error
