vllm - 💡(How to fix) Fix [Bug]: b12x NSA+MTP speculative decoding hangs on PCIe TP=8

StepCodex · 2026-05-21T12:34:21Z

Your current environment

8× RTX PRO 6000 Blackwell, PCIe-only TP=8, Ubuntu 24.04, vLLM main + b12x

🐛 Describe the bug

Bug: b12x NSA + MTP speculative decoding hangs on PCIe TP=8

When running GLM-5.1 or Kimi-K2.6 with the b12x B12X_MLA_SPARSE backend and MTP speculative decoding on a PCIe-only 8-GPU setup (TP=8, no NVLink), vLLM always hangs during decode. The hang is deterministic, not intermittent.

Root Cause

Three layers of conflict:

1. NCCL on PCIe topology: NCCL's default ring/tree communication pattern conflicts with b12x's barrier sync. During MTP verify, the all-reduce for draft-token scoring and b12x's cross-GPU KV sync compete for the same PCIe bandwidth. The resulting timing skew leads to deadlock.
2. b12x scheduling vs. MTP dynamic batch: b12x's CUDA Graph requires fixed input shapes and a static page_table_1 layout. MTP verify produces a variable number of accepted tokens, so nsa_cache_seqlens diverges from the graph-captured state at replay time.
3. NSA indexer dangling indices: Draft tokens write temporary KV entries. If verify rejects them, the KV cache rolls back, but topk_indices from the indexer still reference the now-invalid entries, producing garbage attention output (e.g. XML tag leakage in tool calls).
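
The second and third conflicts can be illustrated with a toy model. The sketch below is not b12x or vLLM code: the function names, the flat integer slot indices, and the -1 padding value are all assumptions made for illustration. It shows how a sequence length captured before verify differs from the committed length afterwards, and how top-k indices selected during drafting can point past the committed KV length once rejected tokens are rolled back.

```python
# Toy model of the rollback problem. Every name here is hypothetical and
# does not correspond to a real b12x or vLLM function.

def committed_len(len_before_draft: int, num_draft: int, num_accepted: int) -> int:
    """KV length after verify keeps only the accepted prefix of the draft."""
    if not 0 <= num_accepted <= num_draft:
        raise ValueError("num_accepted must be in [0, num_draft]")
    return len_before_draft + num_accepted


def find_dangling(topk_indices: list[int], kv_len: int) -> list[int]:
    """Indices that reference KV slots at or beyond the committed length."""
    return [i for i in topk_indices if i >= kv_len]


def drop_dangling(topk_indices: list[int], kv_len: int, pad: int = -1) -> list[int]:
    """Replace stale indices with a pad value so the list keeps a fixed size.

    Using -1 as the pad value is an assumption of this sketch.
    """
    return [i if i < kv_len else pad for i in topk_indices]


# 100 committed tokens, 4 draft tokens written to slots 100..103.
captured_len = 100 + 4            # length seen while the draft was in flight
topk = [3, 57, 99, 101, 103]      # indexer output computed during drafting

# Verify accepts only the first draft token, so slots 101..103 are rolled back.
kv_len = committed_len(100, num_draft=4, num_accepted=1)

print(captured_len, kv_len)            # the two lengths no longer agree
print(find_dangling(topk, kv_len))     # stale references into rolled-back slots
print(drop_dangling(topk, kv_len))     # fixed-size list with stale entries padded
```

Padding rather than shrinking the index list keeps the shape fixed, which matters for any path that replays a captured graph with static input shapes.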

Fix

1. NCCL patch: Force NCCL to respect the PCIe topology by routing collective operations along the actual PCIe switch hierarchy instead of assuming a uniform interconnect. This prevents barrier misalignment between the MTP verify all-reduce and the b12x KV sync.
2. Rewrite b12x scheduling logic: Replace the static CUDA-Graph-only decode path with a topology-aware scheduler that dynamically selects the execution path based on the current phase (draft / verify / committed). The scheduler knows which GPU pairs share a PCIe switch and which cross NUMA boundaries, and adjusts sync barriers accordingly.
3. Topology awareness: On the W790E-SAGE SE with 8 GPUs, PCIe switch topology and NUMA affinity determine the distribution of TP communication latency. Without topology-aware scheduling, NCCL picks worst-case paths, the barriers never align, and the hang is guaranteed.
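
What "topology-aware" means for rank ordering can be sketched as follows. This is a minimal illustration, not the patch described above: the GPU-to-switch layout in TOPOLOGY is hypothetical (it is not the measured layout of the W790E-SAGE SE), and the cost weights are arbitrary. The idea is to classify each GPU pair by how far apart the two devices sit, then prefer a ring in which neighbours share a PCIe switch.

```python
from enum import IntEnum


class Link(IntEnum):
    """Relative cost of a GPU pair; the numeric weights are arbitrary."""
    SAME_SWITCH = 0   # both GPUs sit behind one PCIe switch
    SAME_NUMA = 1     # different switches, same NUMA node
    CROSS_NUMA = 2    # traffic crosses a NUMA boundary


# Hypothetical layout: gpu id -> (numa_node, pcie_switch).
TOPOLOGY = {
    0: (0, 0), 1: (0, 0), 2: (0, 1), 3: (0, 1),
    4: (1, 2), 5: (1, 2), 6: (1, 3), 7: (1, 3),
}


def classify(a: int, b: int, topo: dict = TOPOLOGY) -> Link:
    """Classify the path between two GPUs from their (numa, switch) tuples."""
    numa_a, switch_a = topo[a]
    numa_b, switch_b = topo[b]
    if numa_a != numa_b:
        return Link.CROSS_NUMA
    if switch_a != switch_b:
        return Link.SAME_NUMA
    return Link.SAME_SWITCH


def ring_order(topo: dict = TOPOLOGY) -> list[int]:
    """Order GPUs so that ring neighbours share a switch where possible."""
    return sorted(topo, key=lambda gpu: topo[gpu])


def ring_cost(order: list[int], topo: dict = TOPOLOGY) -> int:
    """Sum the link weights over consecutive ring neighbours, with wrap-around."""
    pairs = zip(order, order[1:] + order[:1])
    return sum(int(classify(a, b, topo)) for a, b in pairs)


good = ring_order()
bad = [0, 4, 1, 5, 2, 6, 3, 7]   # every hop crosses the NUMA boundary

print(good, ring_cost(good))
print(bad, ring_cost(bad))
```

In this toy layout the sorted ring crosses the NUMA boundary twice, while the interleaved ring crosses it on every hop. A real implementation would read the layout from the system rather than hard-coding it.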

Result

After fix: TTFT < 0.3s, no hangs, MTP + b12x NSA coexist correctly.
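
One way to check a TTFT figure like the one above is to time the first item of a streaming response. The helper below is a generic sketch and not part of vLLM: measure_ttft and fake_stream are hypothetical names, and it assumes the stream is lazy, so the request only starts when the first item is pulled.

```python
import time
from typing import Iterable, Iterator, Tuple


def measure_ttft(stream: Iterable[str]) -> Tuple[float, str]:
    """Return (seconds until the first item, the first item) for a lazy stream."""
    start = time.perf_counter()
    first = next(iter(stream))
    return time.perf_counter() - start, first


def fake_stream(delay_s: float) -> Iterator[str]:
    """Stand-in for a streaming client: waits, then yields tokens."""
    time.sleep(delay_s)   # simulated prefill latency
    yield "hello"
    yield "world"


ttft, token = measure_ttft(fake_stream(0.05))
print(f"TTFT: {ttft:.3f}s, first token: {token!r}")
```

This helper blocks forever if the stream never yields, so for a hang of the kind described in this issue it needs an external timeout around the call.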
