vllm - ✅(Solved) Fix [Bug] All models hang on GB300 (SM103) with FlashInfer 0.6.7 [1 pull requests, 1 comments, 1 participants]

GitHub stats
vllm-project/vllm#38729 · Fetched 2026-04-08 02:23:09
Comments: 1 · Participants: 1 · Timeline events: 4 · Reactions: 0
Timeline (top): closed ×1, commented ×1, cross-referenced ×1, referenced ×1

Fix Action

PR fix notes

PR #38730: [Bugfix] Restrict TRTLLM attention to SM100, fixing GB300 (SM103) hang

Description (problem / solution / changelog)

Summary

All models hang indefinitely on GB300 (SM103, CC 10.3) during inference with large batch sizes. The GPU shows 99% SM utilization and 0% memory bandwidth. This is a regression introduced by the FlashInfer 0.6.6 to 0.6.7 upgrade, where TRTLLM attention kernels are no longer forward-compatible with SM103.

GB200 (SM100) is not affected.

Related FlashInfer issue: flashinfer-ai/flashinfer#2939 (fix ongoing on the FlashInfer side)

Fixes #38729

Change

Restrict supports_trtllm_attention() to exact SM100 (CC 10.0) instead of the CC 10.x family. SM103 falls back to the FlashInfer default attention backend, which works correctly.
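
The difference between a family match and an exact match can be sketched in plain Python. This is only an illustration of the idea described above, not the code from the PR: the helper names `in_cc_family` and `is_exact_cc` are hypothetical, and compute capabilities are modeled as `(major, minor)` tuples.

```python
from typing import Tuple

Capability = Tuple[int, int]  # (major, minor), e.g. (10, 3) for SM103


def in_cc_family(cc: Capability, major: int) -> bool:
    """Family match (hypothetical helper): any CC with this major version."""
    return cc[0] == major


def is_exact_cc(cc: Capability, target: Capability) -> bool:
    """Exact match (hypothetical helper): major and minor must both agree."""
    return cc == target


SM100: Capability = (10, 0)  # GB200 / B200
SM103: Capability = (10, 3)  # GB300

# Before the change: the whole CC 10.x family qualified for TRTLLM attention.
before = {cc: in_cc_family(cc, 10) for cc in (SM100, SM103)}

# After the change: only exact CC 10.0 qualifies, so SM103 falls back
# to the FlashInfer default attention backend.
after = {cc: is_exact_cc(cc, SM100) for cc in (SM100, SM103)}
```

Under the family check both devices qualify; under the exact check only SM100 does, which is the behavior the change aims for.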

Test plan

  • Qwen3-8B-FP8 throughput benchmark on GB300 (SM103): previously hung, now passes
  • Qwen2.5-7B BF16 throughput benchmark on GB300 (SM103): previously hung, now passes
  • Verify no regression on GB200/B200 (SM100): these should still use TRTLLM attention

Changed files

  • docs/design/attention_backends.md (modified, +1/-1)
  • tools/pre_commit/generate_attention_backend_docs.py (modified, +9/-5)
  • vllm/utils/flashinfer.py (modified, +4/-4)


Bug Description

Multiple models hang indefinitely on GB300 (SM103, CC 10.3) during inference with large batch sizes. The GPU shows 99% SM utilization and 0% memory bandwidth. This is a regression introduced by the FlashInfer 0.6.6 to 0.6.7 upgrade, where TRTLLM attention kernels are no longer forward-compatible with SM103.

Correctness tests with small batches (4 prompts) pass. Throughput benchmarks with 768 prompts hang. GB200 (SM100) is not affected.

Related FlashInfer issue: flashinfer-ai/flashinfer#2939 (fix in progress on the FlashInfer side)

Reproduction

# GB300 (SM103), FlashInfer 0.6.7
# Hangs at "Processed prompts: 0%"
vllm bench throughput \
  --tensor-parallel-size=1 --model=nvidia/Qwen3-8B-FP8 \
  --load-format=dummy --num-prompts=768 --output-len=256 --input-len=256 \
  --kv-cache-dtype=auto --gpu-memory-utilization=0.90 \
  --max-num-batched-tokens=2048 --max-num-seqs=768 --max-model-len=2048 \
  --trust-remote-code --quantization=modelopt
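
Because the command above never returns on an affected system, it can help to run it under a timeout when scripting the reproduction. The following is a minimal sketch using only the Python standard library; `run_with_timeout` and the timeout value are hypothetical choices, not part of vLLM.

```python
import subprocess
import sys
from typing import List, Optional


def run_with_timeout(cmd: List[str], timeout_s: float) -> Optional[int]:
    """Run cmd and return its exit code, or None if it exceeded timeout_s."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before re-raising.
        return None


# The reproduction command from the issue, as an argument list.
REPRO_CMD = [
    "vllm", "bench", "throughput",
    "--tensor-parallel-size=1", "--model=nvidia/Qwen3-8B-FP8",
    "--load-format=dummy", "--num-prompts=768",
    "--output-len=256", "--input-len=256",
    "--kv-cache-dtype=auto", "--gpu-memory-utilization=0.90",
    "--max-num-batched-tokens=2048", "--max-num-seqs=768",
    "--max-model-len=2048", "--trust-remote-code",
    "--quantization=modelopt",
]
```

Calling `run_with_timeout(REPRO_CMD, 600)` would return `None` if the benchmark is still running after ten minutes (an arbitrary limit). Only the direct child process is killed on timeout, so worker processes spawned by the benchmark may need separate cleanup.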

Affected Configuration

  • Hardware: GB300 (SM103, CC 10.3). GB200 (SM100, CC 10.0) is not affected.
  • Models: multiple tested models hang, including FP8 (Qwen3-8B-FP8, Nemotron-Nano-9B-v2-FP8), FP4, and BF16 (Qwen2.5-7B).
  • The hang persists with --enforce-eager, so it is not a CUDA graph issue.
  • --attention-config.use_trtllm_attention=0 resolves the hang.
  • On multi-GPU nodes, the hang only occurs when TP < number of GPUs. TP equal to the full GPU count does not hang.
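
The conditions listed above can be collected into a small predicate for triage. This is a sketch under stated assumptions: `matches_reported_hang` is a hypothetical helper, only FlashInfer 0.6.7 is treated as affected (the report does not cover other versions), and single-GPU machines are assumed to match because the reproduction uses TP=1.

```python
from typing import Tuple


def matches_reported_hang(
    cc: Tuple[int, int],
    flashinfer_version: str,
    tp_size: int,
    num_gpus: int,
    use_trtllm_attention: bool = True,
) -> bool:
    """Return True if a setup matches the configuration reported to hang."""
    if cc != (10, 3):  # only GB300 (SM103, CC 10.3) is reported as affected
        return False
    if flashinfer_version != "0.6.7":  # assumption: other versions unknown
        return False
    if not use_trtllm_attention:  # disabling TRTLLM attention resolves it
        return False
    # On multi-GPU nodes the hang occurs only when TP < number of GPUs.
    # Assumption: single-GPU machines are treated as matching.
    if num_gpus > 1 and tp_size >= num_gpus:
        return False
    return True
```

The inputs could come from `torch.cuda.get_device_capability()` and `torch.cuda.device_count()`, with the FlashInfer version read from the installed package metadata.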

Proposed Workaround

Until this is fixed on the FlashInfer side, restrict supports_trtllm_attention() to exact SM100 (CC 10.0) instead of the CC 10.x family. SM103 falls back to the FlashInfer default attention backend, which works correctly. Verified on GB300 hardware with Qwen3-8B-FP8 and Qwen2.5-7B BF16.

extent analysis

TL;DR

Restrict supports_trtllm_attention() to exact SM100 (CC 10.0) to work around the hang on GB300 (SM103) hardware.

Guidance

  • Confirm that the setup matches the affected configuration (GB300, SM103, CC 10.3, with FlashInfer 0.6.7), so that the hang can be attributed to the FlashInfer 0.6.6 to 0.6.7 upgrade and TRTLLM attention kernel compatibility.
  • Verify that setting --attention-config.use_trtllm_attention=0 resolves the hang; if it does, the TRTLLM attention backend is the cause.
  • Restrict supports_trtllm_attention() to exact SM100 (CC 10.0) so that SM103 falls back to the FlashInfer default attention backend, which works correctly on that hardware.
  • Test the workaround with multiple models, including FP8, FP4, and BF16, to ensure the issue is fully mitigated.
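
To check the second point above systematically, the same benchmark command can be run once as-is and once with TRTLLM attention disabled. The sketch below only builds the two argument lists; `ab_variants` is a hypothetical helper, and the flag is the one quoted in the issue.

```python
from typing import Dict, List

# Flag quoted in the issue as resolving the hang.
TRTLLM_OFF_FLAG = "--attention-config.use_trtllm_attention=0"
_FLAG_PREFIX = "--attention-config.use_trtllm_attention"


def ab_variants(base_cmd: List[str]) -> Dict[str, List[str]]:
    """Return the command unchanged and with TRTLLM attention disabled."""
    # Drop any existing setting of the flag so it is not passed twice.
    stripped = [arg for arg in base_cmd if not arg.startswith(_FLAG_PREFIX)]
    return {
        "default": list(base_cmd),
        "trtllm_off": stripped + [TRTLLM_OFF_FLAG],
    }
```

If the `default` variant hangs and the `trtllm_off` variant completes, that points at the TRTLLM attention backend. Each variant is best run under a timeout, since the affected configuration does not terminate on its own.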

Example

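# Restrict supports_trtllm_attention() to exact SM100 (CC 10.0).
# Illustrative sketch, not the literal diff; reading the capability via
# torch.cuda.get_device_capability() is an assumption.
import torch

def supports_trtllm_attention() -> bool:
    # ... existing code ...
    cc_major, cc_minor = torch.cuda.get_device_capability()
    if cc_major == 10 and cc_minor == 0:  # exact SM100 (CC 10.0)
        return True
    # ... existing code ...
    return False  # SM103 (CC 10.3) and all other devices fall back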

Notes

This workaround is temporary until the issue is fixed on the FlashInfer side. Restricting TRTLLM attention to exact SM100 (CC 10.0) may not be optimal for SM103, but it routes SM103 hardware to the FlashInfer default attention backend, which works correctly.

Recommendation

Apply the proposed workaround by restricting the supports_trtllm_attention() function to exact SM100 (CC 10.0), as this has been verified to resolve the hang issue on GB300 (SM103) hardware.
