vllm - ✅(Solved) Fix [Bug] DeepGemm E8M0 scale format causes accuracy degradation for Qwen3.5 FP8 on Blackwell [1 pull requests, 4 comments, 1 participants]

vllm · 2026-03-22 12:41:32


vllm-project/vllm#37804•Fetched 2026-04-08 01:12:52


Author

vadiklyutiy

Participants

vadiklyutiy

Assignees

vadiklyutiy

Timeline (top)

commented ×4 · closed ×2 · cross-referenced ×2 · subscribed ×2

Qwen/Qwen3.5-35B-A3B-FP8 fails the GSM8K accuracy CI check on Blackwell (B200) GPUs. Investigation in #37618 identified the root cause as DeepGemm's E8M0 scale format on the dense FP8 block-quantized linear layers.

Error Message

DeepGemm error is 1.44-1.65x CUTLASS error per layer (vs BF16 reference)

Root Cause

On Blackwell (SM100+), DeepGemm requires E8M0 scale format (power-of-2 ceiling). This loses ~0.4-0.5 bits of precision per layer, compounding across ~80 FP8 block-quantized linear layers (attention projections + shared expert FFN) in Qwen3.5.

Key observations:

DeepGemm error is 1.44-1.65x CUTLASS error per layer (vs BF16 reference)
VLLM_USE_DEEP_GEMM_E8M0=0 is not viable: Blackwell's fp8_gemm_nt kernel hard-fails with Unsupported architecture or scaling factor types when disable_ue8m0_cast=True
MoE layers are unaffected (already use FLASHINFER_TRTLLM, not DeepGemm)
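The "power-of-2 ceiling" above can be illustrated with a small sketch. It is not DeepGemm's code: `e8m0_ceil` and `mean_inflation_bits` are hypothetical helpers, the 8-bit exponent range of E8M0 is ignored, and the log-uniform distribution of scales is an assumption made only for illustration. It shows how much a scale grows when it is rounded up to the next power of two, which is the source of the coarser quantization step.

```python
import math
import random


def e8m0_ceil(scale: float) -> float:
    """Round a positive scale up to the nearest power of two (hypothetical helper)."""
    if scale <= 0.0:
        raise ValueError("scale must be positive")
    # frexp returns (m, e) with scale == m * 2**e and 0.5 <= m < 1.
    mantissa, exponent = math.frexp(scale)
    if mantissa == 0.5:
        return scale  # already an exact power of two
    return math.ldexp(1.0, exponent)  # 2**exponent, the next power of two above scale


def mean_inflation_bits(num_samples: int = 100_000, seed: int = 0) -> float:
    """Average log2(rounded / original) for scales drawn log-uniformly (assumed)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_samples):
        scale = 2.0 ** rng.uniform(-8.0, 8.0)
        total += math.log2(e8m0_ceil(scale) / scale)
    return total / num_samples


if __name__ == "__main__":
    for s in (0.25, 0.3, 3.0):
        print(f"scale={s} -> e8m0_ceil={e8m0_ceil(s)}")
    print(f"mean inflation: {mean_inflation_bits():.3f} bits")
```

Under that assumed distribution the scale grows by about half a bit on average, the same order as the ~0.4-0.5 bits quoted above. Treat this as intuition rather than a derivation of the measured per-layer error.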

Fix Action

Fixed

Fixed by PR: [Bugfix] Auto-disable DeepGemm for Qwen3.5 on Blackwell to fix FP8 accuracy degradation (https://github.com/vllm-project/vllm/pull/37806)

PR fix notes

PR #37806: [Bugfix] Auto-disable DeepGemm for Qwen3.5 on Blackwell to fix FP8 accuracy degradation

Repository: vllm-project/vllm
Author: vadiklyutiy
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/37806

Description (problem / solution / changelog)

Problem

Issue: #37804 (root cause identified in #37618)

On Blackwell GPUs (SM100+), DeepGemm requires E8M0 scale format (power-of-2 ceiling), which loses ~0.4-0.5 bits of precision per layer. This precision loss compounds across ~80 FP8 block-quantized linear layers in Qwen3.5 models, causing significant accuracy degradation on GSM8K:

DeepGemm ON (default): mean accuracy 0.7397 (3 runs: 0.7301, 0.7566, 0.7324)
DeepGemm OFF (env var): mean accuracy 0.7824 (3 runs: 0.7847, 0.7832, 0.7794)

Setting VLLM_USE_DEEP_GEMM_E8M0=0 is not viable: Blackwell's fp8_gemm_nt kernel hard-fails with Unsupported architecture or scaling factor types when disable_ue8m0_cast=True.

Fix

Add architecture-based auto-detection that disables DeepGemm for known-problematic model types on Blackwell. A model type exclusion set in vllm/utils/deep_gemm.py is checked during VllmConfig.__post_init__ (vllm/config/vllm.py). When matched, a new use_deep_gemm flag on Fp8Config is set to False, which propagates through Fp8LinearMethod and W8A8BlockFp8LinearOp (vllm/model_executor/layers/quantization/fp8.py, fp8_utils.py) to skip both the irreversible E8M0 weight requantization at load time and the DeepGemm kernel at runtime, falling back to CUTLASS instead. CI config updated in tests/evals/gsm8k/configs/models-qwen35-blackwell.txt.
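The control flow described above can be sketched as follows. This is a simplified illustration, not the PR's code: `Fp8ConfigSketch`, `DEEP_GEMM_EXCLUDED_MODEL_TYPES`, `apply_deep_gemm_exclusion`, and `select_linear_backend` are hypothetical names, and the only model type listed is the one that appears in the server log.

```python
from dataclasses import dataclass

# Hypothetical constants; the real exclusion set lives in vllm/utils/deep_gemm.py.
BLACKWELL_CAPABILITY_MAJOR = 10  # SM100 family
DEEP_GEMM_EXCLUDED_MODEL_TYPES = frozenset({"qwen3_5_moe_text"})


@dataclass
class Fp8ConfigSketch:
    """Stand-in for the quantization config that carries the new flag."""
    use_deep_gemm: bool = True


def apply_deep_gemm_exclusion(
    cfg: Fp8ConfigSketch, model_type: str, capability_major: int
) -> bool:
    """Turn the flag off for excluded model types on Blackwell; return True if changed."""
    if (
        capability_major == BLACKWELL_CAPABILITY_MAJOR
        and model_type in DEEP_GEMM_EXCLUDED_MODEL_TYPES
    ):
        cfg.use_deep_gemm = False
        return True
    return False


def select_linear_backend(cfg: Fp8ConfigSketch) -> str:
    """Pick the dense FP8 linear backend from the flag."""
    return "deep_gemm" if cfg.use_deep_gemm else "cutlass"


if __name__ == "__main__":
    cfg = Fp8ConfigSketch()
    apply_deep_gemm_exclusion(cfg, "qwen3_5_moe_text", capability_major=10)
    print(select_linear_backend(cfg))
```

Keeping the decision on the config object rather than in a global switch means that other FP8 models and Hopper GPUs keep their existing path, consistent with the "What does NOT change" list.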

What does NOT change

MoE path (already uses FLASHINFER_TRTLLM, not DeepGemm)
Other FP8 models (not in the exclusion set)
Hopper behavior (gated by is_device_capability_family(100))
Global is_deep_gemm_supported() function (not modified)
NVFP4 models (use ModelOptFp4Config, not Fp8Config)

Results

All measurements on NVIDIA B200, Qwen/Qwen3.5-35B-A3B-FP8, TP=4, --kv-cache-dtype fp8, GSM8K 1319 questions 5-shot.

Before Fix (DeepGemm ON, default)

Run	Accuracy	Invalid Rate
1	0.7301	0.0%
2	0.7566	0.0%
3	0.7324	0.0%
Mean	0.7397	0.0%

After Fix (auto-disable triggers)

Run	Accuracy	Invalid Rate
1	0.7877	0.08%
2	0.7854	0.0%
3	0.7923	0.0%
Mean	0.7885	0.03%

Summary

Metric	Before	After	Delta
Mean accuracy	0.7397	0.7885	+0.0488
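The means and the delta in the tables above can be reproduced from the per-run accuracies. The variable and function names below are illustrative only.

```python
from statistics import mean

# Per-run GSM8K accuracies copied from the tables above.
before_runs = [0.7301, 0.7566, 0.7324]  # DeepGemm ON (default)
after_runs = [0.7877, 0.7854, 0.7923]   # auto-disable triggers


def summarize(before: list[float], after: list[float]) -> tuple[float, float, float]:
    """Return (mean before, mean after, delta), each rounded to 4 decimals."""
    mean_before = mean(before)
    mean_after = mean(after)
    return (
        round(mean_before, 4),
        round(mean_after, 4),
        round(mean_after - mean_before, 4),
    )


if __name__ == "__main__":
    mean_before, mean_after, delta = summarize(before_runs, after_runs)
    print(f"before={mean_before} after={mean_after} delta={delta:+.4f}")
```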

Server log confirms auto-disable:

WARNING Auto-disabled DeepGemm for model_type=qwen3_5_moe_text on Blackwell.
DeepGemm E8M0 scale format causes accuracy degradation for this architecture.
Falling back to CUTLASS. To disable DeepGemm globally, set VLLM_USE_DEEP_GEMM=0.

Changed files

tests/evals/gsm8k/configs/Qwen3.5-35B-A3B-DEP2.yaml (modified, +1/-1)
tests/evals/gsm8k/configs/Qwen3.5-35B-A3B-FP8-DEP2.yaml (modified, +1/-1)
tests/evals/gsm8k/configs/models-qwen35-blackwell.txt (modified, +1/-0)
vllm/config/vllm.py (modified, +19/-0)
vllm/model_executor/layers/quantization/fp8.py (modified, +7/-2)
vllm/model_executor/layers/quantization/utils/fp8_utils.py (modified, +5/-1)
vllm/utils/deep_gemm.py (modified, +18/-0)


Summary

Environment

GPU: NVIDIA B200 (Blackwell, SM100+)
Model: Qwen/Qwen3.5-35B-A3B-FP8 (FP8 block-quantized MoE, weight_block_size [128, 128])
vLLM version: 0.18.1rc1.dev18
CI config: Qwen3.5-35B-A3B-FP8-DEP2.yaml (--kv-cache-dtype fp8 --data-parallel-size 2 --enable-expert-parallel --max-model-len 4096, accuracy threshold 0.86)

Reproduction

# Start server (Blackwell GPU)
vllm serve Qwen/Qwen3.5-35B-A3B-FP8 \
  --tensor-parallel-size 4 --max-model-len 4096 --kv-cache-dtype fp8 --port 8000

# Run GSM8K eval
python3 tests/evals/gsm8k/gsm8k_eval.py --port 8000
# Accuracy: ~0.73 (FAIL, threshold is 0.78)

Disabling DeepGemm restores accuracy:

VLLM_USE_DEEP_GEMM=0 vllm serve Qwen/Qwen3.5-35B-A3B-FP8 \
  --tensor-parallel-size 4 --max-model-len 4096 --kv-cache-dtype fp8 --port 8000

# Accuracy: ~0.79 (PASS)

Root Cause

Key observations:

DeepGemm error is 1.44-1.65x CUTLASS error per layer (vs BF16 reference)
VLLM_USE_DEEP_GEMM_E8M0=0 is not viable: Blackwell's fp8_gemm_nt kernel hard-fails with Unsupported architecture or scaling factor types when disable_ue8m0_cast=True
MoE layers are unaffected (already use FLASHINFER_TRTLLM, not DeepGemm)

Measured Accuracy (GSM8K, 1319 questions, 5-shot, B200 TP=4)

Config	Run 1	Run 2	Run 3	Mean
DeepGemm ON (default)	0.7301	0.7566	0.7324	0.7397
DeepGemm OFF (`VLLM_USE_DEEP_GEMM=0`)	0.7847	0.7832	0.7794	0.7824

extent analysis

Fix Plan

To address the accuracy loss from DeepGemm's E8M0 scale format, conditionally disable DeepGemm for the FP8 block-quantized linear layers on Blackwell GPUs.

Add a check in the vLLM configuration for the GPU architecture and the model's quantization scheme.
Fall back to the CUTLASS FP8 kernels when DeepGemm is disabled.

Example code changes:

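import os

import torch

# Hypothetical flag: set it from your own model or quantization config.
uses_fp8_block_quant = True

# Blackwell data-center GPUs such as B200 report compute capability 10.x (SM100).
is_blackwell = torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] == 10

if is_blackwell and uses_fp8_block_quant:
    # Workaround only: set this before vLLM is imported or the server is launched.
    os.environ["VLLM_USE_DEEP_GEMM"] = "0"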

Verification

To verify the fix, run the GSM8K evaluation with the modified configuration:

VLLM_USE_DEEP_GEMM=0 vllm serve Qwen/Qwen3.5-35B-A3B-FP8 \
  --tensor-parallel-size 4 --max-model-len 4096 --kv-cache-dtype fp8 --port 8000

python3 tests/evals/gsm8k/gsm8k_eval.py --port 8000

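Accuracy should recover to roughly 0.78-0.79 (the DeepGemm OFF runs above averaged 0.7824), which clears the 0.78 threshold quoted in the reproduction. It stays below the 0.86 threshold listed for the CI config under Environment.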

Extra Tips

Ensure that the VLLM_USE_DEEP_GEMM environment variable is properly set before running the vllm server.
Consider adding additional logging or debugging statements to monitor the performance and accuracy of the model with DeepGemm disabled.
If further issues arise, investigate alternative quantization schemes or Gemm implementations that may provide better accuracy and performance on Blackwell GPUs.
