vLLM [Bug]: b12x NSA+MTP speculative decoding hangs on PCIe TP=8 (NCCL topology-aware scheduling fix)



Your current environment

8× RTX PRO 6000 Blackwell, PCIe-only TP=8, Ubuntu 24.04, vLLM main + b12x

🐛 Describe the bug

Bug: b12x NSA + MTP speculative decoding hangs on PCIe TP=8

When running GLM-5.1 or Kimi-K2.6 with the b12x B12X_MLA_SPARSE backend and MTP speculative decoding on a PCIe-only 8-GPU setup (TP=8, no NVLink), vLLM always hangs during decode. The hang is deterministic, not intermittent.

Root Cause

Three layers of conflict:

  1. NCCL on PCIe topology: NCCL's default ring/tree communication pattern conflicts with b12x's barrier sync. During MTP verify, the all-reduce for draft-token scoring and b12x's cross-GPU KV sync compete for the same PCIe bandwidth. The resulting timing skew between ranks leads to a deadlock.

  2. b12x scheduling vs. MTP dynamic batching: b12x's CUDA Graph path requires fixed input shapes and a static page_table_1 layout. MTP verify accepts a variable number of tokens per step, so nsa_cache_seqlens diverges from the graph-captured state at replay time.

  3. NSA indexer dangling indices: Draft tokens write temporary KV entries. If verify rejects them, the KV cache rolls back, but the indexer's topk_indices still reference the now-invalid entries. The result is garbage attention output (e.g. XML tag leakage in tool calls).
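The dangling-index problem in item 3 can be illustrated with a minimal sketch. It is not the b12x or vLLM implementation: the function name `mask_stale_topk`, the `PAD` sentinel, and the plain-list data layout are hypothetical and chosen only to show the idea of dropping top-k indices that point past a sequence's committed KV length after a rollback.

```python
from typing import List

PAD = -1  # hypothetical sentinel meaning "no KV entry selected"


def mask_stale_topk(
    topk_indices: List[List[int]],
    committed_lens: List[int],
    pad: int = PAD,
) -> List[List[int]]:
    """Replace top-k indices that fall outside each sequence's committed KV range.

    topk_indices[i] holds the KV positions selected for sequence i.
    committed_lens[i] is the KV length of sequence i after rejected draft
    tokens have been rolled back.
    """
    if len(topk_indices) != len(committed_lens):
        raise ValueError("one committed length is required per sequence")
    masked = []
    for row, committed in zip(topk_indices, committed_lens):
        # Keep only positions that still exist in the cache after rollback.
        masked.append([idx if 0 <= idx < committed else pad for idx in row])
    return masked


# Sequence 0 drafted tokens at positions 10..12; verify accepted only
# position 10, so its committed length is 11 and positions 11, 12 are stale.
topk = [[2, 10, 11, 12], [0, 5, 7, 3]]
committed = [11, 8]
print(mask_stale_topk(topk, committed))  # [[2, 10, -1, -1], [0, 5, 7, 3]]
```

In a real attention kernel the padded slots would also have to be excluded from the softmax; the sketch only shows where the stale references come from.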

Fix

  1. NCCL patch: Force NCCL to respect the PCIe topology by routing collective operations along the actual PCIe switch hierarchy instead of assuming a uniform interconnect. This prevents barrier misalignment between the MTP verify all-reduce and the b12x KV sync.

  2. Rewrite b12x scheduling logic: Replace the static, CUDA-Graph-only decode path with a topology-aware scheduler that selects the execution path dynamically based on the current phase (draft / verify / committed). The scheduler knows which GPU pairs share a PCIe switch and which cross NUMA boundaries, and adjusts sync barriers accordingly.

  3. Topology awareness: On the W790E-SAGE SE with 8 GPUs, the PCIe switch topology and NUMA affinity determine the latency distribution of TP communication. Without topology-aware scheduling, NCCL picks worst-case paths, the barriers never align, and the run hangs every time.
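To make "which GPU pairs share PCIe switches" concrete, here is a rough sketch of grouping GPUs by switch from a link matrix and ordering a ring so that same-switch GPUs stay adjacent. It is not the NCCL patch or the b12x scheduler described above. The link labels (PIX, PXB, PHB, NODE, SYS) follow the legend printed by `nvidia-smi topo -m`; the 4-GPU matrix, the function names, and the "flatten the groups" ring heuristic are assumptions made for illustration.

```python
from typing import Dict, List

# PIX and PXB paths stay below the PCIe host bridge (one or more PCIe
# bridges); PHB, NODE and SYS paths go through the host bridge or further.
SAME_SWITCH = {"PIX", "PXB"}


def group_by_switch(links: List[List[str]]) -> List[List[int]]:
    """Group GPU indices that reach each other without crossing the host bridge."""
    n = len(links)
    parent = list(range(n))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a in range(n):
        for b in range(a + 1, n):
            if links[a][b] in SAME_SWITCH:
                ra, rb = find(a), find(b)
                if ra != rb:
                    parent[max(ra, rb)] = min(ra, rb)

    groups: Dict[int, List[int]] = {}
    for gpu in range(n):
        groups.setdefault(find(gpu), []).append(gpu)
    return sorted(groups.values())


def cross_switch_hops(ring: List[int], links: List[List[str]]) -> int:
    """Count ring edges (including the wrap-around) that leave a switch."""
    n = len(ring)
    return sum(links[ring[i]][ring[(i + 1) % n]] not in SAME_SWITCH for i in range(n))


# Hypothetical 4-GPU layout: GPUs 0 and 2 share one switch, 1 and 3 another.
links = [
    ["X", "SYS", "PIX", "SYS"],
    ["SYS", "X", "SYS", "PIX"],
    ["PIX", "SYS", "X", "SYS"],
    ["SYS", "PIX", "SYS", "X"],
]
groups = group_by_switch(links)
ring = [gpu for group in groups for gpu in group]
print(groups, ring)                             # [[0, 2], [1, 3]] [0, 2, 1, 3]
print(cross_switch_hops([0, 1, 2, 3], links))   # 4 (naive rank order)
print(cross_switch_hops(ring, links))           # 2 (switch-aware order)
```

In this toy layout the switch-aware order halves the number of ring edges that leave a switch. NCCL performs its own topology detection, so the sketch only shows the kind of information a topology-aware scheduler would work from.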

Result

After the fix, TTFT is below 0.3 s, no hangs occur, and MTP and b12x NSA run together correctly.

