[RFC] NSA is architecturally incompatible with NVIDIA consumer/workstation GPUs: MLA is the only viable sparse attention path


Motivation.

After extensive production testing on SM120 (RTX PRO 6000 Blackwell, 8×96GB, PCIe Gen5), we found that NSA (Native Sparse Attention) cannot be made to work reliably for production serving on NVIDIA consumer/workstation GPUs. This is not a bug; it is an architectural incompatibility.

The fundamental problems:

  1. NSA requires dynamic sparse index computation at every decode step: a per-layer, per-token top-k selection over the full KV cache. On SM100 (datacenter Blackwell), dedicated FP8 block-scaled GEMM units handle this efficiently. On SM120 these units do not exist, so the sparse indexer falls back to software paths that are both slow and prone to correctness errors.

  2. PCIe topology makes NSA+MTP catastrophically unreliable. The b12x PCIe oneshot allreduce has a buffer reuse race under CUDA graph/no-copy mode. The only known fix (a completion barrier in pcie_oneshot.cu) costs ~1 tok/s and still does not guarantee correctness under concurrent serving. Combining MTP with TP>1 and more than one concurrent request produces deadlocks (see #41402 and #41404, both closed without resolution).

  3. SM120 is not SM100. FP8 block-scaled GEMM kernels written for SM100 crash or produce garbage on SM120 (#26211, #35566). FP8 KV Cache is slower than BF16 on SM120 because there are no native FP8 block-scaled compute units. NVFP4 KV Cache PRs (#21601 in SGLang) remain unmerged with critical bugs.

  4. NSA's correctness is unacceptably fragile. The voipmonitor/GLM-5.1 reference image achieves only 93.3% accuracy (28/30), and requires a custom b12x overlay with three patched source files plus a custom vLLM tree. This is not production-grade — it is a fragile experiment.

  5. Every community Docker image for NSA+MTP is broken. The rtx6kpro wiki images hang or produce incoherent output. Later versions are worse than earlier ones because upstream changes in both vLLM and b12x break the fragile integration points.
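
To make point 1 concrete, here is a minimal CPU sketch of what "per-layer, per-token top-k selection over the full KV cache" means. It is a toy under stated assumptions, not the b12x or vLLM indexer: the function names, the block-level granularity, and the random scores are all hypothetical, and a real NSA indexer computes its scores from learned projections inside fused GPU kernels.

```python
import heapq
import random


def select_topk_blocks(scores, k):
    """Return the indices of the k highest-scoring KV blocks, in ascending order."""
    k = min(k, len(scores))
    top = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return sorted(top)


def decode_step_selection(block_scores_per_layer, k):
    """One decode step: every layer runs its own top-k over all cached blocks."""
    return [select_topk_blocks(layer_scores, k) for layer_scores in block_scores_per_layer]


# Toy setup (hypothetical sizes): 4 layers, 16 cached KV blocks, keep 3 per layer.
random.seed(0)
num_layers, num_blocks, k = 4, 16, 3
scores = [[random.random() for _ in range(num_blocks)] for _ in range(num_layers)]

# This selection is repeated for every generated token, which is why the cost
# of the indexer sits on the critical path of decode.
selected = decode_step_selection(scores, k)
for layer, blocks in enumerate(selected):
    print(f"layer {layer}: attend to KV blocks {blocks}")
```

The selection work grows with both the number of layers and the length of the cache, and it cannot be precomputed because the scores depend on the current query.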

MLA avoids all of these problems:

  • MLA compresses the KV cache structurally (low-rank projection), so no per-step sparse indexing is needed
  • MLA decode is a standard attention path: no custom indexer, no top-k, no CuTe DSL kernel instability
  • MLA works with existing FP8/BF16 KV Cache backends without SM120-specific kernel requirements
  • MLA + MTP integration is straightforward because MLA does not introduce sparse index metadata into the speculative decode path
  • MLA gives comparable or better KV cache compression for most workloads without the correctness fragility of NSA
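
The structural compression claimed in the bullets above can be illustrated with simple arithmetic. The sketch below compares per-token KV cache size for conventional multi-head attention against an MLA-style cache that stores one low-rank latent plus a small decoupled RoPE key per layer. All dimensions are assumptions borrowed from a DeepSeek-V2-style configuration for illustration; they are not taken from this issue or from the GLM model it mentions, and the helper names are hypothetical.

```python
def mha_kv_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem):
    """Keys and values each store num_kv_heads * head_dim elements per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def mla_kv_bytes_per_token(num_layers, kv_lora_rank, rope_dim, bytes_per_elem):
    """MLA caches one compressed latent plus a decoupled RoPE key per layer."""
    return num_layers * (kv_lora_rank + rope_dim) * bytes_per_elem


# Illustrative, assumed dimensions (not from the issue).
NUM_LAYERS = 60
NUM_KV_HEADS = 128
HEAD_DIM = 128
KV_LORA_RANK = 512
ROPE_DIM = 64
BF16_BYTES = 2

mha = mha_kv_bytes_per_token(NUM_LAYERS, NUM_KV_HEADS, HEAD_DIM, BF16_BYTES)
mla = mla_kv_bytes_per_token(NUM_LAYERS, KV_LORA_RANK, ROPE_DIM, BF16_BYTES)
ratio = mha / mla

print(f"MHA: {mha} bytes/token, MLA: {mla} bytes/token, ratio: {ratio:.1f}x")
```

The reduction is fixed by the model's architecture, so it needs no per-step index computation and adds no metadata to the decode path.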

The empirical evidence is clear: after 7+ days of testing every available NSA+MTP Docker image on 8×RTX PRO 6000, none are production-viable. NSA on SM120 is a dead end. Community effort should pivot to MLA as the only sparse attention path that can reach production reliability on consumer/workstation NVIDIA GPUs.

Proposed Change.

  1. Stop investing engineering effort in NSA kernel development for SM120 GPUs. The architectural gap between SM100 and SM120 makes NSA fundamentally unviable on consumer/workstation hardware.

  2. Prioritize MLA attention backend as the primary sparse/efficient attention path for vLLM, with first-class SM120 support:

    • Native FP8 KV Cache for MLA decode on SM120
    • MLA + MTP speculative decode integration
    • MLA prefix caching and chunked prefill

  3. Document that NSA support on SM120 is experimental/best-effort only, not production-targeted. This saves other teams from repeating the same 7-day dead-end exploration.

  4. Re-evaluate whether the b12x NSA indexer integration should remain in vLLM at all, or whether it should be moved to a separate experimental package to avoid destabilizing the main branch.
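
One way the proposal could surface in code is a capability gate that defaults SM120 devices to MLA and requires an explicit opt-in for NSA. The sketch below is purely illustrative: `pick_attention_backend`, the backend strings, and the `allow_experimental` flag are hypothetical names and do not correspond to any existing vLLM API or option. The capability tuple is assumed to come from something like `torch.cuda.get_device_capability()`, which returns `(major, minor)`. The fallback to MLA on other architectures is likewise an assumption of this sketch.

```python
from typing import Tuple

SM100 = (10, 0)  # datacenter Blackwell
SM120 = (12, 0)  # consumer/workstation Blackwell, e.g. RTX PRO 6000


def pick_attention_backend(capability: Tuple[int, int],
                           allow_experimental: bool = False) -> str:
    """Choose an attention path from the GPU compute capability (hypothetical policy)."""
    if capability == SM120:
        # NSA on SM120 is treated as best-effort and must be requested explicitly.
        return "nsa-experimental" if allow_experimental else "mla"
    if capability == SM100:
        return "nsa"
    # Assumed default for any other architecture.
    return "mla"


print(pick_attention_backend(SM120))
print(pick_attention_backend(SM120, allow_experimental=True))
print(pick_attention_backend(SM100))
```

Keeping the opt-in explicit means a default deployment on workstation GPUs never reaches the NSA indexer by accident.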

Feedback Period.

2 weeks

CC List.

No response

Any Other Things.

No response
