vllm - 💡(How to fix) Fix [Bug]: Eagle3 speculative decoding CUDA device-side assert crash with gpt-oss-120b under concurrent requests (TP=8, H20) [1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
vllm-project/vllm#39295Fetched 2026-04-09 07:52:05
View on GitHub
Comments
0
Participants
1
Timeline
2
Reactions
0
Author
Participants
Timeline (top)
cross-referenced ×1renamed ×1

Error Message

(Worker_TP0 pid=483) torch.AcceleratorError: CUDA error: device-side assert triggered (Worker_TP0 pid=483) CUDA kernel errors might be asynchronously reported at some other API call, (Worker_TP0 pid=483) so the stacktrace below might be incorrect.

Root Cause

  • #27626 — same model + Eagle3 combination, "awful benchmarks" (CLOSED but likely same root cause)
  • #35288 — MTP speculative decoding corrupted output at concurrency >= 4 (OPEN, V1 engine)
  • #24392 — Eagle3 out-of-range index fix (MERGED, may be related)

Code Example

vLLM: v0.18.0rc2+dev97507aeb
PyTorch: 2.10.0a0+gitb0eb5f7
CUDA: 12.9
GPUs: 8x NVIDIA H20
OS: Linux-5.15.0-124-generic-x86_64-with-glibc2.39

---

vllm serve openai/gpt-oss-120b \
  --tensor-parallel-size 8 \
  --max-num-seqs 256 \
  --gpu-memory-utilization 0.96 \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 65536 \
  --enable-prefix-caching \
  --async-scheduling \
  --speculative-config '{"method":"eagle3","model":"nvidia/gpt-oss-120b-Eagle3-long-context","num_speculative_tokens":5}'

---

(Worker_TP0 pid=483) torch.AcceleratorError: CUDA error: device-side assert triggered
(Worker_TP0 pid=483) CUDA kernel errors might be asynchronously reported at some other API call,
(Worker_TP0 pid=483) so the stacktrace below might be incorrect.

---

(Worker_TP0 pid=483)   File ".../vllm/v1/executor/multiproc_executor.py", line 893, in enqueue_output
(Worker_TP0 pid=483)     output = output.get_output()
(Worker_TP0 pid=483)   File ".../vllm/v1/worker/gpu_model_runner.py", line 261, in get_output
(Worker_TP0 pid=483)     self.async_copy_ready_event.synchronize()
(Worker_TP0 pid=483) torch.AcceleratorError: CUDA error: device-side assert triggered

---

(EngineCore pid=474) ERROR [multiproc_executor.py:273] Worker proc VllmWorker-3 died unexpectedly, shutting down executor.

---

SchedulerOutput(
  scheduled_cached_reqs=CachedRequestData(
    req_ids=['cmpl-8c3ff249c427f5c2-0-a37cc43d'],
    num_computed_tokens=[2695],
    num_output_tokens=[246]
  ),
  num_scheduled_tokens={...: 6},
  total_num_scheduled_tokens=6,
  scheduled_spec_decode_tokens={...: [-1, -1, -1, -1, -1]}
)
RAW_BUFFERClick to expand / collapse

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
vLLM: v0.18.0rc2+dev97507aeb
PyTorch: 2.10.0a0+gitb0eb5f7
CUDA: 12.9
GPUs: 8x NVIDIA H20
OS: Linux-5.15.0-124-generic-x86_64-with-glibc2.39
</details>

🐛 Describe the bug

Eagle3 speculative decoding with openai/gpt-oss-120b crashes with CUDA error: device-side assert triggered after processing a few concurrent requests.

Reproduction

vllm serve openai/gpt-oss-120b \
  --tensor-parallel-size 8 \
  --max-num-seqs 256 \
  --gpu-memory-utilization 0.96 \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 65536 \
  --enable-prefix-caching \
  --async-scheduling \
  --speculative-config '{"method":"eagle3","model":"nvidia/gpt-oss-120b-Eagle3-long-context","num_speculative_tokens":5}'

Then send ~40 concurrent completion requests. The server starts fine, serves a handful of requests successfully, then crashes:

(Worker_TP0 pid=483) torch.AcceleratorError: CUDA error: device-side assert triggered
(Worker_TP0 pid=483) CUDA kernel errors might be asynchronously reported at some other API call,
(Worker_TP0 pid=483) so the stacktrace below might be incorrect.
(Worker_TP0 pid=483)   File ".../vllm/v1/executor/multiproc_executor.py", line 893, in enqueue_output
(Worker_TP0 pid=483)     output = output.get_output()
(Worker_TP0 pid=483)   File ".../vllm/v1/worker/gpu_model_runner.py", line 261, in get_output
(Worker_TP0 pid=483)     self.async_copy_ready_event.synchronize()
(Worker_TP0 pid=483) torch.AcceleratorError: CUDA error: device-side assert triggered
(EngineCore pid=474) ERROR [multiproc_executor.py:273] Worker proc VllmWorker-3 died unexpectedly, shutting down executor.

After this, the engine is dead and all subsequent requests return 500.

Scheduler output at crash time

SchedulerOutput(
  scheduled_cached_reqs=CachedRequestData(
    req_ids=['cmpl-8c3ff249c427f5c2-0-a37cc43d'],
    num_computed_tokens=[2695],
    num_output_tokens=[246]
  ),
  num_scheduled_tokens={...: 6},
  total_num_scheduled_tokens=6,
  scheduled_spec_decode_tokens={...: [-1, -1, -1, -1, -1]}
)

Note: scheduled_spec_decode_tokens contains [-1, -1, -1, -1, -1] (all rejected), which may be relevant to the index out-of-bounds.

Additional observations

  1. Single request works fine — the crash only happens under concurrent load (multiple completion streams in parallel).
  2. min_p sampling is not supported — vLLM returns {"error": {"message": "The min_p and logit_bias sampling parameters are not yet supported with speculative decoding."}} inside an SSE stream with HTTP 200. This is a separate issue but worth noting: returning an error inside a 200 SSE stream means raise_for_status() won't catch it.
  3. Without Eagle3, the same model + config works reliably at much higher concurrency.

Expected behavior

Eagle3 speculative decoding should handle concurrent requests without crashing.

Actual behavior

CUDA device-side assert after a few concurrent requests, killing the engine.

Related issues

  • #27626 — same model + Eagle3 combination, "awful benchmarks" (CLOSED but likely same root cause)
  • #35288 — MTP speculative decoding corrupted output at concurrency >= 4 (OPEN, V1 engine)
  • #24392 — Eagle3 out-of-range index fix (MERGED, may be related)

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

The most likely fix for the CUDA device-side assert triggered error with Eagle3 speculative decoding is to adjust the speculative configuration, specifically the num_speculative_tokens parameter, to reduce the concurrency load on the GPUs.

Guidance

  • Review the speculative-config parameter, particularly the num_speculative_tokens value, to ensure it is not causing excessive concurrency that leads to the device-side assert.
  • Consider reducing the --max-num-seqs or --gpu-memory-utilization to decrease the load on the GPUs and prevent the crash.
  • Investigate the scheduled_spec_decode_tokens containing all rejected tokens ([-1, -1, -1, -1, -1]) as it may be related to the index out-of-bounds issue, potentially requiring an update to the speculative decoding logic.
  • Refer to related issues, such as #24392, which may provide insight into fixing the out-of-range index issue with Eagle3.

Example

No specific code example is provided due to the complexity of the issue and the need for further investigation into the speculative decoding configuration and logic.

Notes

The provided information suggests that the issue is related to the concurrency and load on the GPUs when using Eagle3 speculative decoding. Adjusting the speculative configuration and reducing the load on the GPUs may help mitigate the issue. However, a thorough investigation into the scheduled_spec_decode_tokens and the speculative decoding logic is necessary to determine the root cause and provide a definitive fix.

Recommendation

Apply a workaround by adjusting the speculative configuration, specifically reducing the num_speculative_tokens value, to decrease the concurrency load on the GPUs and prevent the crash. This workaround may help stabilize the system until a more permanent fix can be implemented.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING