vllm - ✅ (Solved) Fix [Bug] DeepGemm E8M0 scale format causes accuracy degradation for Qwen3.5 FP8 on Blackwell [1 pull request, 4 comments, 1 participant]

Source: vllm-project/vllm#37804 (fetched 2026-04-08 01:12:52)

Qwen/Qwen3.5-35B-A3B-FP8 fails the GSM8K accuracy CI check on Blackwell (B200) GPUs. Investigation in #37618 identified the root cause as DeepGemm's E8M0 scale format on the dense FP8 block-quantized linear layers.

Root Cause

On Blackwell (SM100+), DeepGemm requires E8M0 scale format (power-of-2 ceiling). This loses ~0.4-0.5 bits of precision per layer, compounding across ~80 FP8 block-quantized linear layers (attention projections + shared expert FFN) in Qwen3.5.

Key observations:

  • DeepGemm error is 1.44-1.65x CUTLASS error per layer (vs BF16 reference)
  • VLLM_USE_DEEP_GEMM_E8M0=0 is not viable: Blackwell's fp8_gemm_nt kernel hard-fails with "Unsupported architecture or scaling factor types" when disable_ue8m0_cast=True
  • MoE layers are unaffected (already use FLASHINFER_TRTLLM, not DeepGemm)
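
To make the "power-of-2 ceiling" concrete, here is a minimal Python sketch; it is not DeepGemm's code. E8M0 keeps only an 8-bit exponent, so each block scale is rounded up to the next power of two. The helper names e8m0_ceil and bits_lost are hypothetical, and the log-uniform spread of scales is an assumption, not a measurement of Qwen3.5's weights.

```python
import math


def e8m0_ceil(scale: float) -> float:
    """Round a positive scale up to the nearest power of two (hypothetical helper)."""
    if scale <= 0.0:
        raise ValueError("scale must be positive")
    # frexp returns (m, e) with scale == m * 2**e and 0.5 <= m < 1.
    mantissa, exponent = math.frexp(scale)
    if mantissa == 0.5:
        return scale  # already an exact power of two
    return math.ldexp(1.0, exponent)


def bits_lost(scale: float) -> float:
    """Resolution given up by the coarser scale, in bits: log2(rounded / original)."""
    return math.log2(e8m0_ceil(scale) / scale)


# Assumption: scales are spread log-uniformly across one octave [1, 2).
n = 1000
samples = [2.0 ** ((i + 0.5) / n) for i in range(n)]
mean_bits = sum(bits_lost(s) for s in samples) / n

print(e8m0_ceil(0.3), e8m0_ceil(3.0))
print(f"mean bits lost per scale: {mean_bits:.3f}")
```

Under that assumption the average comes out at about half a bit per scale, which is in line with the ~0.4-0.5 bits quoted in the issue; real scale distributions may differ.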

Fix Action

Fixed

PR fix notes

PR #37806: [Bugfix] Auto-disable DeepGemm for Qwen3.5 on Blackwell to fix FP8 accuracy degradation

Description (problem / solution / changelog)

Problem

Issue: #37804 (root cause identified in #37618)

On Blackwell GPUs (SM100+), DeepGemm requires E8M0 scale format (power-of-2 ceiling), which loses ~0.4-0.5 bits of precision per layer. This precision loss compounds across ~80 FP8 block-quantized linear layers in Qwen3.5 models, causing significant accuracy degradation on GSM8K:

  • DeepGemm ON (default): mean accuracy 0.7397 (3 runs: 0.7301, 0.7566, 0.7324)
  • DeepGemm OFF (env var): mean accuracy 0.7824 (3 runs: 0.7847, 0.7832, 0.7794)

Setting VLLM_USE_DEEP_GEMM_E8M0=0 is not viable: Blackwell's fp8_gemm_nt kernel hard-fails with "Unsupported architecture or scaling factor types" when disable_ue8m0_cast=True.

Fix

Add architecture-based auto-detection that disables DeepGemm for known-problematic model types on Blackwell. A model type exclusion set in vllm/utils/deep_gemm.py is checked during VllmConfig.__post_init__ (vllm/config/vllm.py). When matched, a new use_deep_gemm flag on Fp8Config is set to False, which propagates through Fp8LinearMethod and W8A8BlockFp8LinearOp (vllm/model_executor/layers/quantization/fp8.py, fp8_utils.py) to skip both the irreversible E8M0 weight requantization at load time and the DeepGemm kernel at runtime, falling back to CUTLASS instead. CI config updated in tests/evals/gsm8k/configs/models-qwen35-blackwell.txt.
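
The gating described above could look roughly like the sketch below. This is an illustration under assumptions, not the code merged in the PR: Fp8ConfigSketch, EXCLUDED_MODEL_TYPES, is_blackwell_family, apply_auto_disable and pick_linear_kernel are hypothetical names, and the integer capability encoding (100 for SM100) is assumed. Only the use_deep_gemm flag name and the qwen3_5_moe_text model type come from the PR text and its warning log.

```python
from dataclasses import dataclass

# Hypothetical exclusion set; the PR keeps its real set in vllm/utils/deep_gemm.py.
# "qwen3_5_moe_text" is the model_type shown in the PR's warning log.
EXCLUDED_MODEL_TYPES = frozenset({"qwen3_5_moe_text"})


@dataclass
class Fp8ConfigSketch:
    """Stand-in for the FP8 quantization config; only the flag matters here."""
    use_deep_gemm: bool = True


def is_blackwell_family(capability: int) -> bool:
    """Treat capability values 100-109 as the SM100 family (assumed encoding)."""
    return capability // 10 == 10


def apply_auto_disable(config: Fp8ConfigSketch, model_type: str, capability: int) -> bool:
    """Clear use_deep_gemm for excluded model types on Blackwell; return True if cleared."""
    if is_blackwell_family(capability) and model_type in EXCLUDED_MODEL_TYPES:
        config.use_deep_gemm = False
        return True
    return False


def pick_linear_kernel(config: Fp8ConfigSketch) -> str:
    """Downstream linear layers read the flag to choose a kernel."""
    return "deep_gemm" if config.use_deep_gemm else "cutlass"


blackwell_cfg = Fp8ConfigSketch()
apply_auto_disable(blackwell_cfg, "qwen3_5_moe_text", capability=100)
hopper_cfg = Fp8ConfigSketch()
apply_auto_disable(hopper_cfg, "qwen3_5_moe_text", capability=90)
print(pick_linear_kernel(blackwell_cfg), pick_linear_kernel(hopper_cfg))
```

Carrying the decision as a flag on the config, rather than changing a global support check, is what lets the same setting skip both the load-time requantization and the runtime kernel while leaving other models and Hopper untouched.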

What does NOT change

  • MoE path (already uses FLASHINFER_TRTLLM, not DeepGemm)
  • Other FP8 models (not in the exclusion set)
  • Hopper behavior (gated by is_device_capability_family(100))
  • Global is_deep_gemm_supported() function (not modified)
  • NVFP4 models (use ModelOptFp4Config, not Fp8Config)

Results

All measurements on NVIDIA B200, Qwen/Qwen3.5-35B-A3B-FP8, TP=4, --kv-cache-dtype fp8, GSM8K 1319 questions 5-shot.

Before Fix (DeepGemm ON, default)

Run   Accuracy  Invalid Rate
1     0.7301    0.0%
2     0.7566    0.0%
3     0.7324    0.0%
Mean  0.7397    0.0%

After Fix (auto-disable triggers)

Run   Accuracy  Invalid Rate
1     0.7877    0.08%
2     0.7854    0.0%
3     0.7923    0.0%
Mean  0.7885    0.03%

Summary

Metric         Before  After   Delta
Mean accuracy  0.7397  0.7885  +0.0488
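
The means and the delta can be re-derived from the reported per-run accuracies. The short Python check below uses only the run values quoted in this PR; the variable names are illustrative.

```python
from statistics import mean

before = [0.7301, 0.7566, 0.7324]  # DeepGemm ON (default)
after = [0.7877, 0.7854, 0.7923]   # auto-disable triggers

mean_before = round(mean(before), 4)
mean_after = round(mean(after), 4)
delta = round(mean_after - mean_before, 4)
relative_gain = delta / mean_before

print(mean_before, mean_after, delta)
print(f"relative gain: {relative_gain:.1%}")
```

The absolute gain of about 0.049 corresponds to a relative improvement of roughly 6.6% over the DeepGemm ON baseline.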

Server log confirms auto-disable:

WARNING Auto-disabled DeepGemm for model_type=qwen3_5_moe_text on Blackwell.
DeepGemm E8M0 scale format causes accuracy degradation for this architecture.
Falling back to CUTLASS. To disable DeepGemm globally, set VLLM_USE_DEEP_GEMM=0.

Changed files

  • tests/evals/gsm8k/configs/Qwen3.5-35B-A3B-DEP2.yaml (modified, +1/-1)
  • tests/evals/gsm8k/configs/Qwen3.5-35B-A3B-FP8-DEP2.yaml (modified, +1/-1)
  • tests/evals/gsm8k/configs/models-qwen35-blackwell.txt (modified, +1/-0)
  • vllm/config/vllm.py (modified, +19/-0)
  • vllm/model_executor/layers/quantization/fp8.py (modified, +7/-2)
  • vllm/model_executor/layers/quantization/utils/fp8_utils.py (modified, +5/-1)
  • vllm/utils/deep_gemm.py (modified, +18/-0)

Code Example

# Start server (Blackwell GPU)
vllm serve Qwen/Qwen3.5-35B-A3B-FP8 \
  --tensor-parallel-size 4 --max-model-len 4096 --kv-cache-dtype fp8 --port 8000

# Run GSM8K eval
python3 tests/evals/gsm8k/gsm8k_eval.py --port 8000
# Accuracy: ~0.73 (FAIL, threshold is 0.78)

# Disabling DeepGemm restores accuracy:

VLLM_USE_DEEP_GEMM=0 vllm serve Qwen/Qwen3.5-35B-A3B-FP8 \
  --tensor-parallel-size 4 --max-model-len 4096 --kv-cache-dtype fp8 --port 8000

# Accuracy: ~0.79 (PASS)

Environment

  • GPU: NVIDIA B200 (Blackwell, SM100+)
  • Model: Qwen/Qwen3.5-35B-A3B-FP8 (FP8 block-quantized MoE, weight_block_size [128, 128])
  • vLLM version: 0.18.1rc1.dev18
  • CI config: Qwen3.5-35B-A3B-FP8-DEP2.yaml (--kv-cache-dtype fp8 --data-parallel-size 2 --enable-expert-parallel --max-model-len 4096, accuracy threshold 0.86)

Measured Accuracy (GSM8K, 1319 questions, 5-shot, B200 TP=4)

Config                               Run 1   Run 2   Run 3   Mean
DeepGemm ON (default)                0.7301  0.7566  0.7324  0.7397
DeepGemm OFF (VLLM_USE_DEEP_GEMM=0)  0.7847  0.7832  0.7794  0.7824

Fix Plan

To address the accuracy loss caused by DeepGemm's E8M0 scale format, DeepGemm is conditionally disabled for the dense FP8 block-quantized linear layers of affected model types on Blackwell GPUs.

  • Check the GPU architecture and the model type when the vLLM configuration is built.
  • Fall back to the CUTLASS kernel when DeepGemm is disabled.

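Example user-side workaround (a sketch, not the merged change):

import os

import torch

# Assumption: the model is FP8 block-quantized; set this from your own config.
uses_fp8 = True
# B200 (SM100) reports compute capability major version 10.
on_blackwell = torch.cuda.is_available() and torch.cuda.get_device_capability(0)[0] == 10
if on_blackwell and uses_fp8:
    # Set this before vLLM starts so the engine reads it at startup.
    os.environ["VLLM_USE_DEEP_GEMM"] = "0"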

Verification

To verify the fix, run the GSM8K evaluation with the modified configuration:

VLLM_USE_DEEP_GEMM=0 vllm serve Qwen/Qwen3.5-35B-A3B-FP8 \
  --tensor-parallel-size 4 --max-model-len 4096 --kv-cache-dtype fp8 --port 8000

python3 tests/evals/gsm8k/gsm8k_eval.py --port 8000

The accuracy should return to roughly 0.78-0.79, in line with the DeepGemm OFF measurements, and pass the 0.78 threshold quoted in the reproduction.

Extra Tips

  • Set the VLLM_USE_DEEP_GEMM environment variable before starting the vLLM server; it has no effect on a server that is already running.
  • Consider logging which GEMM backend each FP8 linear layer uses so that accuracy and performance can be compared with DeepGemm disabled.
  • If further issues arise, investigate alternative quantization schemes or GEMM implementations that may give better accuracy and performance on Blackwell GPUs.
