vllm - [Bug]: Eagle3 speculative decoding CUDA device-side assert crash with gpt-oss-120b under concurrent requests (TP=8, H20)

Your current environment

The output of python collect_env.py:

vLLM: v0.18.0rc2+dev97507aeb
PyTorch: 2.10.0a0+gitb0eb5f7
CUDA: 12.9
GPUs: 8x NVIDIA H20
OS: Linux-5.15.0-124-generic-x86_64-with-glibc2.39

🐛 Describe the bug

Eagle3 speculative decoding with openai/gpt-oss-120b crashes with CUDA error: device-side assert triggered after processing a few concurrent requests.

Reproduction

vllm serve openai/gpt-oss-120b \
  --tensor-parallel-size 8 \
  --max-num-seqs 256 \
  --gpu-memory-utilization 0.96 \
  --kv-cache-dtype fp8_e4m3 \
  --max-model-len 65536 \
  --enable-prefix-caching \
  --async-scheduling \
  --speculative-config '{"method":"eagle3","model":"nvidia/gpt-oss-120b-Eagle3-long-context","num_speculative_tokens":5}'

Then send ~40 concurrent completion requests. The server starts fine, serves a handful of requests successfully, then crashes:

(Worker_TP0 pid=483) torch.AcceleratorError: CUDA error: device-side assert triggered
(Worker_TP0 pid=483) CUDA kernel errors might be asynchronously reported at some other API call,
(Worker_TP0 pid=483) so the stacktrace below might be incorrect.

(Worker_TP0 pid=483)   File ".../vllm/v1/executor/multiproc_executor.py", line 893, in enqueue_output
(Worker_TP0 pid=483)     output = output.get_output()
(Worker_TP0 pid=483)   File ".../vllm/v1/worker/gpu_model_runner.py", line 261, in get_output
(Worker_TP0 pid=483)     self.async_copy_ready_event.synchronize()
(Worker_TP0 pid=483) torch.AcceleratorError: CUDA error: device-side assert triggered

(EngineCore pid=474) ERROR [multiproc_executor.py:273] Worker proc VllmWorker-3 died unexpectedly, shutting down executor.

After this, the engine is dead and all subsequent requests return 500.
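
The concurrent load described in the reproduction can be sketched roughly as follows. This is a minimal illustration, not the reporter's actual client: the address `http://localhost:8000`, the prompt text, and `max_tokens` are assumptions, and the helper names `post_completion` and `run_load` are hypothetical.

```python
import json
import urllib.error
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumed values, not taken from the report.
BASE_URL = "http://localhost:8000"
MODEL = "openai/gpt-oss-120b"


def post_completion(prompt, max_tokens=256):
    """Send one non-streaming completion request and return the HTTP status."""
    body = json.dumps(
        {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    ).encode()
    req = urllib.request.Request(
        f"{BASE_URL}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=600) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # e.g. 500 once the engine has died
    except urllib.error.URLError:
        return 0  # server unreachable


def run_load(send, num_requests=40):
    """Call send() for num_requests prompts in parallel and count statuses."""
    prompts = [f"Request {i}: write a short story." for i in range(num_requests)]
    counts = {}
    with ThreadPoolExecutor(max_workers=num_requests) as pool:
        for status in pool.map(send, prompts):
            counts[status] = counts.get(status, 0) + 1
    return counts

# Usage against a running server: print(run_load(post_completion))
```

If the crash reproduces, the status counts would be expected to shift from 200 to 500 partway through the run, matching the behavior described above.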

Scheduler output at crash time

SchedulerOutput(
  scheduled_cached_reqs=CachedRequestData(
    req_ids=['cmpl-8c3ff249c427f5c2-0-a37cc43d'],
    num_computed_tokens=[2695],
    num_output_tokens=[246]
  ),
  num_scheduled_tokens={...: 6},
  total_num_scheduled_tokens=6,
  scheduled_spec_decode_tokens={...: [-1, -1, -1, -1, -1]}
)

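Note: scheduled_spec_decode_tokens contains [-1, -1, -1, -1, -1] for this request. These look like rejected or placeholder draft tokens (an interpretation, not confirmed). A -1 that reaches a kernel as an index would be consistent with the out-of-bounds access that a device-side assert typically signals.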
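
To make the suspicion concrete, here is a small sketch of the kind of range check that would flag such values before they are used as indices. It is illustrative only and is not vLLM code: the helper name, the request ID key, and the vocabulary size are hypothetical, since the report elides the dictionary key and does not state the vocabulary size.

```python
def find_out_of_range_tokens(spec_tokens_by_request, vocab_size):
    """Map each request ID to the positions whose token ID is outside [0, vocab_size)."""
    bad = {}
    for req_id, token_ids in spec_tokens_by_request.items():
        positions = [
            i for i, tok in enumerate(token_ids) if not 0 <= tok < vocab_size
        ]
        if positions:
            bad[req_id] = positions
    return bad


# Hypothetical request ID and vocabulary size, shaped like the dump above.
scheduled = {"req-A": [-1, -1, -1, -1, -1], "req-B": [17, 42, 7, 3, 99]}
flagged = find_out_of_range_tokens(scheduled, vocab_size=1000)
```

Whether the -1 values are ever consumed as indices in the crashing code path is not established by the report; the sketch only shows why they stand out.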

Additional observations

A single request works fine. The crash happens only under concurrent load (multiple completion streams in parallel).
min_p sampling is not supported: vLLM returns {"error": {"message": "The min_p and logit_bias sampling parameters are not yet supported with speculative decoding."}} inside an SSE stream with HTTP 200. This is a separate issue, but it means raise_for_status() will not catch the error, so clients have to inspect the stream payloads themselves.
Without Eagle3, the same model and config work reliably at much higher concurrency.
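
Because the min_p error arrives inside a 200 response, a client has to look at the stream itself. The sketch below shows one way to do that. It assumes the error is framed as an ordinary `data:` line and that the stream ends with `data: [DONE]`; `StreamError` and `iter_sse_payloads` are hypothetical names.

```python
import json


class StreamError(RuntimeError):
    """Raised when an SSE data payload carries an error object."""


def iter_sse_payloads(lines):
    """Yield decoded JSON payloads from SSE lines; raise on an embedded error."""
    for raw in lines:
        line = raw.decode() if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            return
        payload = json.loads(data)
        if isinstance(payload, dict) and "error" in payload:
            err = payload["error"]
            message = err.get("message", str(err)) if isinstance(err, dict) else str(err)
            raise StreamError(message)
        yield payload
```

With the `requests` library, this could be driven by `iter_sse_payloads(resp.iter_lines())` on a response opened with `stream=True`, in addition to the usual `raise_for_status()` call.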

Expected behavior

Eagle3 speculative decoding should handle concurrent requests without crashing.

Actual behavior

CUDA device-side assert after a few concurrent requests, killing the engine.

Related issues

#27626 — same model + Eagle3 combination, "awful benchmarks" (CLOSED but likely same root cause)
#35288 — MTP speculative decoding corrupted output at concurrency >= 4 (OPEN, V1 engine)
#24392 — Eagle3 out-of-range index fix (MERGED, may be related)


extent analysis

TL;DR

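The root cause is not established. A CUDA device-side assert often indicates an out-of-range index in a kernel rather than GPU overload, and here it appears only when Eagle3 speculative decoding runs under concurrent load. As a workaround, try lowering num_speculative_tokens or --max-num-seqs, or run without the speculative config, which the report says works reliably at much higher concurrency.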

Guidance

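Review the speculative-config value, particularly num_speculative_tokens. It sets how many draft tokens are proposed per step, not how many requests run concurrently, so lowering it tests the speculative path rather than reducing load.
Consider reducing --max-num-seqs to limit concurrency, since the crash appears only under concurrent load. Lowering --gpu-memory-utilization is unlikely to help, because a device-side assert is a kernel-level check and not an out-of-memory error.
CUDA errors are reported asynchronously, so the traceback through async_copy_ready_event.synchronize() may not identify the failing kernel. Rerunning with CUDA_LAUNCH_BLOCKING=1 can give a more accurate trace.
Investigate why scheduled_spec_decode_tokens contains [-1, -1, -1, -1, -1]. If these values reach a kernel as indices, that would be consistent with an out-of-bounds assert and would point to a bug in the speculative decoding logic.
Check related issues such as #24392, which fixed an Eagle3 out-of-range index and may involve a similar code path.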
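
One way to narrow the problem down is to rerun the reported command with one setting changed at a time. The sketch below builds such variants from the flags in the reproduction and sets CUDA_LAUNCH_BLOCKING=1 so that kernel launches run synchronously and the traceback is more likely to point at the failing kernel. The helper `build_variant` is hypothetical. Note also that omitting a flag falls back to that version's default, which may still enable the feature, so check `vllm serve --help` before drawing conclusions.

```python
import json
import os
import shlex

# Flags copied from the reported command.
BASE_ARGS = [
    "vllm", "serve", "openai/gpt-oss-120b",
    "--tensor-parallel-size", "8",
    "--max-num-seqs", "256",
    "--gpu-memory-utilization", "0.96",
    "--kv-cache-dtype", "fp8_e4m3",
    "--max-model-len", "65536",
]
DRAFT_MODEL = "nvidia/gpt-oss-120b-Eagle3-long-context"


def build_variant(async_scheduling=True, prefix_caching=True,
                  num_speculative_tokens=5):
    """Return (argv, env) for one variant of the reported command."""
    argv = list(BASE_ARGS)
    if prefix_caching:
        argv.append("--enable-prefix-caching")
    if async_scheduling:
        argv.append("--async-scheduling")
    if num_speculative_tokens > 0:
        spec = {
            "method": "eagle3",
            "model": DRAFT_MODEL,
            "num_speculative_tokens": num_speculative_tokens,
        }
        argv += ["--speculative-config", json.dumps(spec)]
    # Synchronous launches slow the server down but localize the error.
    env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")
    return argv, env


# Example: same command without the --async-scheduling flag and with one draft token.
argv, env = build_variant(async_scheduling=False, num_speculative_tokens=1)
command_line = shlex.join(argv)
```

The resulting `argv` and `env` can be passed to `subprocess.run(argv, env=env)`. If the crash disappears for one variant only, that setting is a reasonable place to look first, though this does not by itself prove the cause.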

Example

No code fix is available yet. The root cause still needs investigation in the speculative decoding configuration and logic.

Notes

The report ties the crash to Eagle3 speculative decoding under concurrent load: a single request works, and the same model and configuration without Eagle3 run reliably at much higher concurrency. Lowering concurrency or the draft length may mitigate the crash, but determining the root cause requires a closer look at scheduled_spec_decode_tokens and the speculative decoding logic.

Recommendation

As a workaround, reduce num_speculative_tokens or --max-num-seqs, or run without the speculative config until a permanent fix is available. None of these is confirmed to prevent the crash; they are mitigations to try while the root cause is investigated.
