vllm - [Bug]: GDN attention `mamba_get_block_table_tensor` torch.gather index out of bounds when prefix caching + num_speculative_tokens>=10 (DFlash, DGX Spark sm_121a, Qwen3.6 hybrid)



Summary

Running the official docker-compose.spark-xs.yml from AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-DFlash (file-byte-identical, no edits) on a DGX Spark, the container starts cleanly (autotuner + cudagraph capture + warmup all complete in ~10 min, "Application startup complete") but the very first inference request triggers a CUDA device-side assert and crashes the engine.

Reducing num_speculative_tokens from 15 to 5 stabilizes it (~31 tok/s). Disabling prefix caching also works (~31 tok/s, with worse drafter utilization). The bug only manifests when prefix caching is on AND num_speculative_tokens >= ~10; applying either workaround alone is enough.

Repro

docker-compose.spark-xs.yml from AEON-7/Qwen3.6-27B-AEON-Ultimate-Uncensored-DFlash@main (commit on 2026-04-28), no edits.

docker compose -f docker-compose.spark-xs.yml up -d
# wait 10 min for startup complete
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"aeon-ultimate","messages":[{"role":"user","content":"Hello"}]}'
# → engine crashes with CUDA assert; container restarts; same crash on next request

Key config bits:

  • --enable-prefix-caching
  • --speculative-config '{"method":"dflash","model":"/models/dflash-drafter","num_speculative_tokens":15}'
  • --tensor-parallel-size 1
  • --quantization modelopt (NVFP4 + DFlash drafter)
  • --attention-backend flash_attn

Hardware

  • DGX Spark, NVIDIA GB10 (sm_121a), 128 GB unified memory
  • Driver 580.82.09 / CUDA 13.0
  • TP=1 (note: this is distinct from #41190 which is TP=2-specific)

Image digest

ghcr.io/aeon-7/vllm-aeon-ultimate-dflash@sha256:6506ebcb79b1bd0d48f8afca127984791f32345333be1be0fef334eaa5a9e23a (qwen36-v3 tag, AEON-7 build of vLLM v0.20.1.dev0+g88d34c640.d20260428 with FlashInfer v0.6.9 stable + 5 sm_121a patches).

Assertion

inductor_cache/36/c36og33myocrdktrph3vromvxbnvzez3o4gxzjlk3zs2fhfitfcz.py:41: unknown:
  block: [0,0,0], thread: [N,0,0]
  Assertion `index out of bounds: 0 <= tmp5 < 248320` failed.
  (repeated for thread 112..255)

Stacktrace

File "vllm/v1/worker/gpu_model_runner.py", line 4001, in execute_model
  self._build_attention_metadata(...)
File "vllm/v1/worker/gpu_model_runner.py", line 2328, in _build_attention_metadata
  _build_attn_group_metadata(kv_cache_gid, attn_gid, cm)
File "vllm/v1/worker/gpu_model_runner.py", line 2279, in _build_attn_group_metadata
  attn_metadata_i = builder.build(...)
File "vllm/v1/attention/backends/gdn_attn.py", line 170, in build
  block_table_tensor = mamba_get_block_table_tensor(...)
File "vllm/v1/attention/backends/utils.py", line 898, in mamba_get_block_table_tensor
  return torch.gather(block_table, 1, indices_to_gather)
torch.AcceleratorError: CUDA error: device-side assert triggered

Scheduler dump (corroborating)

SchedulerOutput(
  scheduled_cached_reqs=CachedRequestData(req_ids=["chatcmpl-..."],
    new_token_ids_lens=[], all_token_ids_lens={}, new_block_ids=[None],
    num_computed_tokens=[48], num_output_tokens=[33]),
  num_scheduled_tokens={...: 16}, total_num_scheduled_tokens=16,
  scheduled_spec_decode_tokens={...: [-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]},
  num_common_prefix_blocks=[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
  ...
)

num_common_prefix_blocks is inconsistent across the 15 speculative tokens: the first 10 see 0 prefix blocks, the last 5 see 1. This appears to be the bookkeeping mismatch that drives indices_to_gather out of range when GDN attention builds its mamba block table for the speculative decoding step.

Workarounds

Reducing num_speculative_tokens from 15 to 5 makes the bug disappear:

--speculative-config '{"method":"dflash","model":"/models/dflash-drafter","num_speculative_tokens":5}'

Also stable: setting --no-enable-prefix-caching (keeps num_speculative_tokens=15).

The bug only manifests when prefix caching = on AND num_speculative_tokens >= ~10.
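The two workarounds differ only in two launch flags, which the following minimal Python sketch makes explicit. It is an illustration, not part of the compose file: `build_flags` is a hypothetical helper, and the flag names and drafter path are the ones quoted in this report.

```python
import json
import shlex


def build_flags(num_speculative_tokens: int, prefix_caching: bool) -> list:
    """Return the two vLLM flags that decide whether the crash is triggered.

    Hypothetical helper for illustration; flag names are taken from this report.
    """
    spec_config = {
        "method": "dflash",
        "model": "/models/dflash-drafter",
        "num_speculative_tokens": num_speculative_tokens,
    }
    flags = ["--speculative-config", json.dumps(spec_config, separators=(",", ":"))]
    flags.append(
        "--enable-prefix-caching" if prefix_caching else "--no-enable-prefix-caching"
    )
    return flags


# Workaround A: keep prefix caching, shorten the speculation length.
workaround_a = build_flags(num_speculative_tokens=5, prefix_caching=True)
# Workaround B: keep k=15, turn prefix caching off.
workaround_b = build_flags(num_speculative_tokens=15, prefix_caching=False)

print(shlex.join(workaround_a))
print(shlex.join(workaround_b))
```

Either printed flag set would replace the corresponding flags in the container's launch command; the remaining options stay unchanged.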

Performance after workaround (note for context, not a perf complaint)

| Config | tok/s (thinking-on, 400 tok, avg of 10) |
| --- | --- |
| Official k=15 + APC | crashes |
| k=5 + APC | 31.0 tok/s |
| no-APC + k=15 | 31.0 tok/s (more drafter waste) |

SpecDecoding metrics with k=5: mean acceptance length 3.45–3.65, per-position acceptance rate 0.85 / 0.65 / 0.45 / 0.35 / 0.25, avg draft acceptance rate ~52%. Both workarounds give the same throughput, which suggests that throughput on Spark is bounded by something other than speculation length here. This issue is about the crash, though, not the throughput.
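As a consistency check on the k=5 metrics, the reported numbers can be related to each other with a few lines of Python. This sketch assumes the usual definitions (the mean acceptance length counts one token from the target model per step plus the accepted draft tokens, and the draft acceptance rate is accepted draft tokens divided by drafted tokens); the per-position rates are the rounded figures quoted above, so small deviations from the logged values are expected.

```python
# Rounded per-position acceptance rates reported for k=5.
per_position = [0.85, 0.65, 0.45, 0.35, 0.25]
k = len(per_position)

# Expected number of accepted draft tokens per speculation step.
expected_accepted = sum(per_position)

# Assumed definition: one target-model token per step plus accepted drafts.
mean_acceptance_length = 1 + expected_accepted

# Assumed definition: accepted draft tokens / drafted tokens.
draft_acceptance_rate = expected_accepted / k

print(f"expected accepted drafts per step: {expected_accepted:.2f}")
print(f"mean acceptance length:            {mean_acceptance_length:.2f}")
print(f"draft acceptance rate:             {draft_acceptance_rate:.0%}")
```

Under these assumptions the rates give a mean acceptance length of about 3.55, inside the reported 3.45–3.65 range, and a draft acceptance rate of about 51%, close to the reported ~52%.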

Likely root cause

The likely cause is the interaction between enable_prefix_caching and num_speculative_tokens=15 in the GDN attention metadata builder for hybrid Mamba+attention models (Qwen3.6 architecture). Specifically, mamba_get_block_table_tensor computes indices_to_gather for the speculation block layout, but the per-spec-token prefix-block bookkeeping is not consistent across all 15 speculation positions, which leads to out-of-range gather indices when the speculation length is large.
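The failure mode itself can be shown with a minimal sketch. This is not vLLM code: `gather_rows` is a hypothetical pure-Python stand-in for `torch.gather(block_table, 1, indices_to_gather)`, and the block IDs are made up. It shows how a single index that points past the end of a request's block-table row is rejected; on the GPU the same condition surfaces as a device-side assert instead of a Python exception.

```python
from typing import List


def gather_rows(block_table: List[List[int]], indices: List[List[int]]) -> List[List[int]]:
    """Pure-Python analogue of torch.gather(block_table, 1, indices).

    out[i][j] = block_table[i][indices[i][j]], with an explicit bounds check
    that reports the offending row and position.
    """
    out = []
    for row, (blocks, idx_row) in enumerate(zip(block_table, indices)):
        width = len(blocks)
        for col, idx in enumerate(idx_row):
            if not 0 <= idx < width:
                raise IndexError(
                    f"row {row}, position {col}: index {idx} outside [0, {width})"
                )
        out.append([blocks[idx] for idx in idx_row])
    return out


# Hypothetical block table: one request holding four block IDs.
block_table = [[7, 8, 9, 10]]

# In-range indices gather normally.
ok = gather_rows(block_table, [[0, 1, 3]])
print(ok)

# One index past the end of the row is enough to fail the whole gather.
bad_indices = [[0, 1, 4]]
try:
    gather_rows(block_table, bad_indices)
except IndexError as exc:
    print("rejected:", exc)
```

A host-side check of this kind, placed before the gather in a debugging build, would be one way to capture the exact offending index instead of the opaque device-side assert.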

I haven't dug deeper yet — happy to provide additional logs or run targeted patches. Full crash log preserved (766 lines).

Related issues (for triage cross-linking)

  • #36917 — GDN backend assert on mixed decode/spec_decode batch (Qwen3.5 + MTP). Different assert site (gdn_attn.py:310, batch heterogeneity), reporter also on Spark. May share underlying scheduler/GDN bookkeeping issue.
  • #34948 — Qwen3.5 CUDA Illegal Memory Access in GDN Kernel (general).
  • #34993 — GDN backend assertion failure with MTP.
  • #37035 — gdn_attn.py:237 cudaErrorIllegalAddress with qwen3_next_mtp num_spec=5 under load. Different trigger (load, k=5) but same backend.
  • #40756 — Qwen3.6-27B-FP8 + MTP + long sequences (>26K) cudaErrorIllegalAddress. Different trigger (long context, MTP not DFlash) but same model family.
  • #41190 — TP=2 DFlash/qwen3_next_mtp on Qwen3.6 hybrid GDN cudaErrorIllegalAddress. TP=2 specific — this issue is TP=1, distinct.
  • PR #26807 — GatedDeltaNet APC all-mode. Switching to --mamba-cache-mode all may avoid the align-mode code path that this issue hits.
  • PR #38020 — Perf optimization fusing mamba_get_block_table_tensor in align mode — touches the exact function but for perf, not correctness.
