vllm - ✅(Solved) Fix [Bug] All models hang on GB300 (SM103) with FlashInfer 0.6.7 [1 pull requests, 1 comments, 1 participants]

vllm · 2026-04-01 15:31:18


GitHub stats

vllm-project/vllm#38729 • Fetched 2026-04-08 02:23:09


Author

stecasta

Participants

stecasta

Timeline (top)

closed ×1, commented ×1, cross-referenced ×1, referenced ×1

Fix Action

Fix / Workaround

Proposed Workaround

PR fix notes

PR #38730: [Bugfix] Restrict TRTLLM attention to SM100, fixing GB300 (SM103) hang

Repository: vllm-project/vllm
Author: stecasta
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/38730

Description (problem / solution / changelog)

Summary

All models hang indefinitely on GB300 (SM103, CC 10.3) during inference with large batch sizes. The GPU shows 99% SM utilization and 0% memory bandwidth. This is a regression introduced by the FlashInfer 0.6.6 to 0.6.7 upgrade, where TRTLLM attention kernels are no longer forward-compatible with SM103.

GB200 (SM100) is not affected.

Related FlashInfer issue: flashinfer-ai/flashinfer#2939 (fix ongoing on the FlashInfer side)

Fixes #38729

Change

Restrict supports_trtllm_attention() to exact SM100 (CC 10.0) instead of the CC 10.x family. SM103 falls back to the FlashInfer default attention backend, which works correctly.
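The difference between a family check and an exact check can be sketched as follows. This is a minimal illustration, not the code from the PR: the helper names `is_cc_family` and `is_cc_exact` are hypothetical, and compute capabilities are passed in as `(major, minor)` tuples instead of being queried from the GPU.

```python
from typing import Tuple

Capability = Tuple[int, int]  # (major, minor), e.g. (10, 3) for SM103

SM100: Capability = (10, 0)  # GB200 / B200
SM103: Capability = (10, 3)  # GB300


def is_cc_family(cc: Capability, major: int) -> bool:
    """Family check: matches every capability with the given major version."""
    return cc[0] == major


def is_cc_exact(cc: Capability, major: int, minor: int) -> bool:
    """Exact check: matches one specific capability only."""
    return cc == (major, minor)


# Before the change (family check): both SM100 and SM103 select TRTLLM attention.
assert is_cc_family(SM100, 10)
assert is_cc_family(SM103, 10)

# After the change (exact check): only SM100 selects TRTLLM attention,
# so SM103 falls through to the default attention backend.
assert is_cc_exact(SM100, 10, 0)
assert not is_cc_exact(SM103, 10, 0)
```

The exact check is the more conservative choice: any future CC 10.x part stays on the default backend until it is explicitly allowed.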

Test plan

Qwen3-8B-FP8 throughput benchmark on GB300 (SM103): previously hung, now passes
Qwen2.5-7B BF16 throughput benchmark on GB300 (SM103): previously hung, now passes
Verify no regression on GB200/B200 (SM100): should still use TRTLLM attention

Changed files

docs/design/attention_backends.md (modified, +1/-1)
tools/pre_commit/generate_attention_backend_docs.py (modified, +9/-5)
vllm/utils/flashinfer.py (modified, +4/-4)


Bug Description

Multiple models hang indefinitely on GB300 (SM103, CC 10.3) during inference with large batch sizes. The GPU shows 99% SM utilization and 0% memory bandwidth. This is a regression introduced by the FlashInfer 0.6.6 to 0.6.7 upgrade, where TRTLLM attention kernels are no longer forward-compatible with SM103.

Correctness tests with small batches (4 prompts) pass. Throughput benchmarks with 768 prompts hang. GB200 (SM100) is not affected.

Related FlashInfer issue: flashinfer-ai/flashinfer#2939 (fix in progress on the FlashInfer side)

Reproduction

# GB300 (SM103), FlashInfer 0.6.7
# Hangs at "Processed prompts: 0%"
vllm bench throughput \
  --tensor-parallel-size=1 --model=nvidia/Qwen3-8B-FP8 \
  --load-format=dummy --num-prompts=768 --output-len=256 --input-len=256 \
  --kv-cache-dtype=auto --gpu-memory-utilization=0.90 \
  --max-num-batched-tokens=2048 --max-num-seqs=768 --max-model-len=2048 \
  --trust-remote-code --quantization=modelopt
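Because the benchmark hangs indefinitely instead of failing, it can help to run it under a timeout when reproducing. The following is a minimal sketch using only the Python standard library; the function name `run_with_timeout` and the status strings are illustrative assumptions, not part of vLLM.

```python
import subprocess
import sys
from typing import Sequence


def run_with_timeout(cmd: Sequence[str], timeout_s: float) -> str:
    """Run cmd and return "ok", "failed", or "hang" if it exceeds timeout_s."""
    try:
        result = subprocess.run(cmd, timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child before re-raising TimeoutExpired.
        return "hang"
    return "ok" if result.returncode == 0 else "failed"


if __name__ == "__main__":
    # Stand-in command; replace it with the `vllm bench throughput ...`
    # argument list and a timeout suited to the benchmark.
    print(run_with_timeout([sys.executable, "-c", "print('done')"], timeout_s=30.0))
```

Only the direct child process is terminated on timeout; worker processes started by the command may need separate cleanup.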

Affected Configuration

Hardware: GB300 (SM103, CC 10.3). GB200 (SM100, CC 10.0) is not affected.
Models: Multiple tested models hang, including FP8 (Qwen3-8B-FP8, Nemotron-Nano-9B-v2-FP8), FP4, and BF16 (Qwen2.5-7B) variants.
--enforce-eager still hangs (not a CUDA graph issue).
--attention-config.use_trtllm_attention=0 resolves the hang.
On multi-GPU nodes, the hang only occurs when TP < number of GPUs. TP equal to the full GPU count does not hang.
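For launch scripts that run on both GB200 and GB300, the flag above could be added only where it is needed. This is a sketch under stated assumptions: the helper `with_sm103_workaround` is hypothetical, and the capability tuple is supplied by the caller (for example from `torch.cuda.get_device_capability()`).

```python
from typing import List, Tuple

# Flag quoted from the issue; it disables the TRTLLM attention backend.
TRTLLM_OFF_FLAG = "--attention-config.use_trtllm_attention=0"


def with_sm103_workaround(args: List[str], cc: Tuple[int, int]) -> List[str]:
    """Return a copy of args with TRTLLM attention disabled on SM103 (CC 10.3)."""
    if cc == (10, 3) and TRTLLM_OFF_FLAG not in args:
        return [*args, TRTLLM_OFF_FLAG]
    return list(args)


base = ["vllm", "bench", "throughput", "--model=nvidia/Qwen3-8B-FP8"]
print(with_sm103_workaround(base, (10, 3)))  # flag appended for GB300
print(with_sm103_workaround(base, (10, 0)))  # unchanged for GB200
```

The helper leaves SM100 untouched, so GB200 and B200 keep using TRTLLM attention.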

Proposed Workaround

Until this is fixed on the FlashInfer side, restrict supports_trtllm_attention() to exact SM100 (CC 10.0) instead of the CC 10.x family. SM103 falls back to the FlashInfer default attention backend, which works correctly. Verified on GB300 hardware with Qwen3-8B-FP8 and Qwen2.5-7B BF16.

extent analysis

TL;DR

Restrict supports_trtllm_attention() to exact SM100 (CC 10.0) to work around the hang on GB300 (SM103) hardware.

Guidance

Identify the affected hardware and models to confirm that the hang stems from the FlashInfer 0.6.6 to 0.6.7 upgrade and the TRTLLM attention kernels' incompatibility with SM103.
Verify that setting --attention-config.use_trtllm_attention=0 resolves the hang; if it does, the TRTLLM attention backend is the cause.
Restrict supports_trtllm_attention() to exact SM100 (CC 10.0) so that SM103 falls back to the FlashInfer default attention backend, which works correctly on that hardware.
Test the workaround with multiple models, including FP8, FP4, and BF16, to confirm that the hang is fully mitigated.

Example

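# Restrict supports_trtllm_attention() to exact SM100 (CC 10.0).
# Illustrative sketch only, not the actual diff from PR #38730.
import torch

def supports_trtllm_attention() -> bool:
    # ... existing code ...
    # Assumption: the capability is read via torch for illustration.
    cc_major, cc_minor = torch.cuda.get_device_capability()
    if cc_major == 10 and cc_minor == 0:  # exact SM100 (CC 10.0)
        return True
    # ... existing code ...
    return False  # SM103 (CC 10.3) uses the default attention backend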

Notes

This workaround is temporary until the issue is fixed on the FlashInfer side. The restriction to exact SM100 (CC 10.0) may not be optimal, but it allows the FlashInfer default attention backend to be used on SM103 hardware, which works correctly.

Recommendation

Apply the proposed workaround by restricting the supports_trtllm_attention() function to exact SM100 (CC 10.0), as this has been verified to resolve the hang issue on GB300 (SM103) hardware.
