vllm - 💡(How to fix) Fix [Bug]: CUTLASS block FP8 kernel failure on Blackwell GB300 [5 comments, 2 participants]

GitHub stats
vllm-project/vllm#39367 · Fetched 2026-04-09 07:51:33
Comments: 5 · Participants: 2 · Timeline events: 16 · Reactions: 0
Timeline (top): commented ×5, mentioned ×4, subscribed ×4, closed ×1

Error Message

cutlass_scaled_mm_supports_block_fp8() incorrectly reports support for Blackwell (cc=10.3, to_int() = 103 ≥ 100), but the kernel raises RuntimeError: Error Internal at runtime during the profile/dummy run. The patch forces the Triton fallback for all Blackwell devices, which works correctly. However it has a perf penalty.
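
The comparison in the parenthetical can be made concrete with a small sketch. It assumes the capability integer is formed as major * 10 + minor, which matches the cc=10.3 → 103 figure quoted above; DeviceCapability here is a stand-in defined for illustration, not an import from vLLM.

```python
from typing import NamedTuple


class DeviceCapability(NamedTuple):
    """Illustrative stand-in for a (major, minor) compute capability."""

    major: int
    minor: int

    def to_int(self) -> int:
        # Assumed encoding: 10.3 -> 103, 9.0 -> 90.
        return self.major * 10 + self.minor


gb300 = DeviceCapability(major=10, minor=3)
hopper = DeviceCapability(major=9, minor=0)

# The workaround's threshold check (capability >= 100) on each device.
print(gb300.to_int(), gb300.to_int() >= 100)    # 103 True
print(hopper.to_int(), hopper.to_int() >= 100)  # 90 False
```

Under this encoding, every device with major version 10 or above passes the >= 100 test, which is why the patch disables the CUTLASS path for all Blackwell parts rather than for cc=10.3 alone.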

Fix Action

Fix / Workaround

The patch below was applied to a vLLM build from source. It makes cutlass_block_fp8_supported() return False on all Blackwell devices (capability ≥ 100), forcing the Triton fallback. This avoids the runtime error at the cost of a performance penalty.

Code Example

diff --git a/vllm/model_executor/layers/quantization/utils/w8a8_utils.py b/vllm/model_executor/layers/quantization/utils/w8a8_utils.py
--- a/vllm/model_executor/layers/quantization/utils/w8a8_utils.py
+++ b/vllm/model_executor/layers/quantization/utils/w8a8_utils.py
@@ -23,6 +23,11 @@ def cutlass_block_fp8_supported() -> bool:
     capability_tuple = current_platform.get_device_capability()
     capability = -1 if capability_tuple is None else capability_tuple.to_int()
 
+    # CUTLASS block FP8 kernels fail at runtime on Blackwell (B300, cc >= 100).
+    # Fall back to Triton instead.
+    if capability >= 100:
+        return False
+
     return ops.cutlass_scaled_mm_supports_block_fp8(capability)

---

vllm serve Qwen/Qwen3.5-397B-A17B-FP8 \
  --seed 42 \
  --max-model-len 10240 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 512 \
  --no-enable-prefix-caching \
  --no-enable-log-requests \
  --tensor-parallel-size 4 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.95

Your current environment

Initially followed the instructions in https://docs.vllm.ai/en/latest/getting_started/installation/gpu/#nvidia-cuda.

For the patch below, vLLM was built from source.

Both the initial install and the source build are version 0.19.0.

🐛 Describe the bug

cutlass_scaled_mm_supports_block_fp8() incorrectly reports support for Blackwell (cc=10.3, to_int() = 103 ≥ 100), but the kernel raises RuntimeError: Error Internal at runtime during the profile/dummy run. The patch forces the Triton fallback for all Blackwell devices, which works correctly. However it has a perf penalty.

diff --git a/vllm/model_executor/layers/quantization/utils/w8a8_utils.py b/vllm/model_executor/layers/quantization/utils/w8a8_utils.py
--- a/vllm/model_executor/layers/quantization/utils/w8a8_utils.py
+++ b/vllm/model_executor/layers/quantization/utils/w8a8_utils.py
@@ -23,6 +23,11 @@ def cutlass_block_fp8_supported() -> bool:
     capability_tuple = current_platform.get_device_capability()
     capability = -1 if capability_tuple is None else capability_tuple.to_int()
 
+    # CUTLASS block FP8 kernels fail at runtime on Blackwell (B300, cc >= 100).
+    # Fall back to Triton instead.
+    if capability >= 100:
+        return False
+
     return ops.cutlass_scaled_mm_supports_block_fp8(capability)

This can be reproduced with the following command, using any request:

vllm serve Qwen/Qwen3.5-397B-A17B-FP8 \
  --seed 42 \
  --max-model-len 10240 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 512 \
  --no-enable-prefix-caching \
  --no-enable-log-requests \
  --tensor-parallel-size 4 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.95


extent analysis

TL;DR

The issue can be mitigated by patching cutlass_block_fp8_supported() to return False when the compute capability integer is 100 or higher (Blackwell), which forces the Triton fallback.

Guidance

  • The cutlass_scaled_mm_supports_block_fp8 function incorrectly reports support for Blackwell devices, causing a runtime error.
  • The provided patch forces the Triton fallback for all Blackwell devices, which works correctly but has a performance penalty.
  • To reproduce the issue, run the vllm serve command from the report with the listed parameters and send any request.
  • The modified cutlass_block_fp8_supported function can be used as a temporary workaround to avoid the runtime error.
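
The report says any request triggers the failure, so a minimal client is enough for the reproduction step. The sketch below assumes the server is reachable on localhost at vLLM's default port 8000 and uses the OpenAI-compatible /v1/completions route; build_request and send are illustrative helper names, not part of vLLM.

```python
import json
import urllib.request

# Assumptions: default host/port, and the model name from the serve command.
BASE_URL = "http://localhost:8000"
MODEL = "Qwen/Qwen3.5-397B-A17B-FP8"


def build_request(prompt: str, max_tokens: int = 16) -> urllib.request.Request:
    """Build (but do not send) a completion request for the served model."""
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


# Usage (requires the server from the serve command to be running):
#   print(send(build_request("Hello")))
```

Because the error is reported during the profile/dummy run, the server may fail before it accepts any request on an unpatched build; in that case the client is only useful for confirming that the patched build serves normally.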

Example

def cutlass_block_fp8_supported() -> bool:
    capability_tuple = current_platform.get_device_capability()
    capability = -1 if capability_tuple is None else capability_tuple.to_int()
    
    # CUTLASS block FP8 kernels fail at runtime on Blackwell (B300, cc >= 100).
    # Fall back to Triton instead.
    if capability >= 100:
        return False
    
    return ops.cutlass_scaled_mm_supports_block_fp8(capability)

Notes

The provided patch is a temporary workaround, and a more permanent solution may be required to avoid the performance penalty.
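
One possible direction for reducing the penalty is to disable CUTLASS only where the failure has actually been observed. The sketch below is hypothetical: KNOWN_BAD_CAPABILITIES and block_fp8_supported are illustrative names, the cutlass_check argument stands in for ops.cutlass_scaled_mm_supports_block_fp8, and the report only confirms the failure on cc=10.3, so whether other Blackwell parts are unaffected remains an open assumption.

```python
from typing import Callable, Optional

# Only cc=10.3 (to_int() == 103) is reported as failing in the issue.
KNOWN_BAD_CAPABILITIES = frozenset({103})


def block_fp8_supported(
    capability: Optional[int],
    cutlass_check: Callable[[int], bool],
) -> bool:
    """Return True if the CUTLASS block FP8 path should be used."""
    cap = -1 if capability is None else capability
    if cap in KNOWN_BAD_CAPABILITIES:
        # Force the Triton fallback on the known-bad device only.
        return False
    return cutlass_check(cap)


def reports_support(cap: int) -> bool:
    # Hypothetical stand-in for the real support query; it simply
    # claims support for every capability of 100 or higher.
    return cap >= 100


print(block_fp8_supported(103, reports_support))   # False
print(block_fp8_supported(100, reports_support))   # True
print(block_fp8_supported(None, reports_support))  # False
```

A deny-list keeps the CUTLASS path available on devices where it works, but it needs to be extended whenever another failing capability is found, so the blanket check in the patch remains the safer default until the kernel itself is fixed.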

Recommendation

Apply workaround: The modified cutlass_block_fp8_supported function can be used as a temporary workaround to avoid the runtime error, although it may have a performance penalty.
