vllm - ✅(Solved) Fix [Bug]: EAGLE3 speculative decoding + multimodal crash under high concurrency [2 pull requests, 9 comments, 2 participants]

Source: vllm-project/vllm#36906, fetched 2026-04-08 00:43:43

Error Message

On v0.17.1, the server log shows:

RuntimeError: CUDA driver error: device-side assert triggered
...
EngineCore encountered a fatal error.
...
vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue.

The crash occurs in compiled model execution (Inductor/Triton), after which the API returns 500.

Fix Action

Fix / Workaround


Current workaround

Disable async scheduling entirely by passing --no-async-scheduling to vllm serve.
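
As a rough sketch, the workaround amounts to appending that flag to the launch command from the report. The snippet below only assembles and prints the command line. The helper name `build_serve_command` is hypothetical; the model names and flags are the ones quoted in the issue.

```python
import json
import shlex


def build_serve_command(disable_async_scheduling: bool = True) -> list[str]:
    """Assemble the `vllm serve` argv from the report (helper name is hypothetical)."""
    spec_config = {
        "model": "staghado/LightOnOCR-2-1B-speculator-eagle3-bug-report",
        "num_speculative_tokens": 3,
        "method": "eagle3",
    }
    cmd = [
        "vllm", "serve", "lightonai/LightOnOCR-2-1B",
        "--port", "8040",
        "--no-enable-prefix-caching",
        "--mm-processor-cache-gb", "0",
        "--limit-mm-per-prompt", json.dumps({"image": 1}),
        "--gpu-memory-utilization", "0.96",
        "--speculative-config", json.dumps(spec_config),
    ]
    if disable_async_scheduling:
        # Workaround from the issue: turn async scheduling off entirely.
        cmd.append("--no-async-scheduling")
    return cmd


if __name__ == "__main__":
    # Print a shell-quoted command line that can be pasted into a terminal.
    print(shlex.join(build_serve_command()))
```

Building the argv as a list and quoting it with `shlex.join` avoids hand-escaping the JSON arguments.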

PR fix notes

PR #37092: [WIP][Bugfix] Clamp -1 async placeholders to fix CUDA assert in multimodal+EAGLE3

Description (problem / solution / changelog)

Purpose

Fixes #36906.

When async scheduling is used with EAGLE3 speculative decoding and multimodal models (e.g., lightonai/LightOnOCR-2-1B), -1 placeholder token IDs can leak into embedding layers, triggering CUDA error: device-side assert triggered.

Root cause

Async scheduling uses -1 as placeholder for speculative tokens. These -1 values can reach F.embedding() via two paths:

Path 1 (target model): When requests transition from prefill to decode under high concurrency, they may not be in prev_req_id_to_index. Their -1 spec token placeholders in token_ids_cpu survive copy_to_gpu because the scatter in _prepare_input_ids only covers common requests. These reach _preprocess -> model.embed_input_ids.

Path 2 (draft model): get_token_id(seq_lens[i]) returns -1 when async scheduling's seq_lens exceeds known tokens. This propagates through prepare_next_token_ids_padded() backup tokens -> Triton kernel -> set_inputs_first_pass() -> EAGLE proposer's self.input_ids -> propose() -> embed_input_ids.
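
Path 1 can be illustrated with a toy model in plain Python. This is not vLLM's code: the data layout and the function `prepare_input_ids` below are simplified, hypothetical stand-ins, and only the names `token_ids_cpu` and `prev_req_id_to_index` are borrowed from the description. The point is that a scatter which only covers requests present in the previous step leaves the -1 placeholders of a newly decoding request untouched.

```python
PLACEHOLDER = -1  # async scheduling's stand-in for a not-yet-known spec token


def prepare_input_ids(token_ids_cpu, prev_req_id_to_index, prev_sampled):
    """Toy scatter: replace placeholders only for requests seen in the previous step.

    token_ids_cpu:        {req_id: [token ids, possibly containing -1]}
    prev_req_id_to_index: {req_id: row in prev_sampled}
    prev_sampled:         rows of real token ids produced by the previous step
    """
    input_ids = []
    for req_id, ids in token_ids_cpu.items():
        row = prev_req_id_to_index.get(req_id)
        if row is None:
            # Not a "common" request: nothing overwrites its placeholders.
            input_ids.extend(ids)
            continue
        real_tokens = list(prev_sampled[row])
        for token in ids:
            if token == PLACEHOLDER:
                input_ids.append(real_tokens.pop(0))
            else:
                input_ids.append(token)
    return input_ids


token_ids_cpu = {"a": [11, -1, -1], "b": [22, -1, -1]}
prev_req_id_to_index = {"a": 0}  # "b" just moved from prefill to decode
prev_sampled = [[101, 102]]

ids = prepare_input_ids(token_ids_cpu, prev_req_id_to_index, prev_sampled)
leaked = [pos for pos, token in enumerate(ids) if token < 0]
print(ids)     # request "b" still carries its two placeholders
print(leaked)  # positions that would reach the embedding lookup as -1
```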

Fix

Two defensive clamps, one for each path:

  1. gpu_model_runner.py:_prepare_input_ids: self.input_ids.gpu[:total_num_scheduled_tokens].clamp_(min=0) after copy_to_gpu in the async path
  2. eagle.py:propose(): self.input_ids[:num_tokens].clamp_(min=0) before embed_input_ids

Both clamp -1 to 0 (always a valid embedding index). Token 0 at placeholder positions is harmless — the rejection sampler handles wrong spec tokens, and discarded request outputs are ignored.
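
The effect of the clamp can be sketched without a GPU. The snippet below is a pure-Python stand-in, not the PR's code: `clamp_min` plays the role of `clamp_(min=0)`, and `embed` is a hypothetical lookup that rejects out-of-range IDs the way the device-side assert does.

```python
def clamp_min(ids, minimum=0):
    """Return ids with every value below `minimum` raised to `minimum`."""
    return [max(token, minimum) for token in ids]


def embed(ids, table):
    """Toy embedding lookup that rejects out-of-range ids."""
    for token in ids:
        if not 0 <= token < len(table):
            raise IndexError(f"token id {token} out of range [0, {len(table)})")
    return [table[token] for token in ids]


table = [[0.0, 0.0], [0.1, 0.2], [0.3, 0.4]]  # vocab size 3, hidden size 2
ids = [2, -1, -1]                             # one real token, two placeholders

try:
    embed(ids, table)
    crashed = False
except IndexError:
    crashed = True  # the unclamped placeholders are rejected

safe = clamp_min(ids)        # placeholders become token 0
vectors = embed(safe, table)  # lookup now succeeds
```

The clamp trades a crash for a wrong but valid token at placeholder positions, which is only safe because, as the PR notes, those positions are later rejected or discarded.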

Changed files

  • tests/v1/worker/test_prepare_input_ids_clamp.py (added, +111/-0)
  • vllm/v1/spec_decode/eagle.py (modified, +6/-0)
  • vllm/v1/worker/gpu_model_runner.py (modified, +7/-0)

PR #37629: [Bugfix] Fix EAGLE3+async crash: clear stale spec_token_ids for unscheduled requests

Description (problem / solution / changelog)

Purpose

Fixes #36906.

EAGLE3 speculative decoding with async scheduling crashes under high concurrency: CUDA error: device-side assert triggered in F.embedding().

Root cause: In async scheduling, _update_after_schedule sets spec_token_ids = [-1, -1, -1] as placeholders for every decode request. These are cleared when the request is successfully scheduled. However, under high concurrency (e.g., 256 multimodal requests), the scheduler's token budget can be exhausted before visiting all running requests. Unvisited requests retain stale spec_token_ids = [-1, -1, -1], creating a thrashing cycle:

  1. Scheduler skips the request (budget exhausted before reaching it)
  2. Worker removes it from the persistent batch (unscheduled)
  3. Next step: scheduler re-schedules it with stale spec_token_ids = [-1, -1, -1]
  4. Worker re-adds it — writes -1 to token_ids_cpu via update_req_spec_token_ids()
  5. Scatter in _prepare_input_ids skips it (not in prev_req_id_to_index)
  6. -1 reaches F.embedding() → CUDA device-side assert
  7. Back to step 1

Debug log showing the thrashing — request generates 1 token per cycle but is removed/re-added every step:

[sched]: req=8e65: not_visited budget=0 loop_exited_at=65/255  ← not reached by scheduler
[remove]: req=8e65 output_len=6                                ← removed from batch
[add+spec]: req=8e65 spec=[-1,-1,-1] num_output=6              ← re-added with stale -1
[sched]: req=8e65: not_visited budget=0 loop_exited_at=66/256  ← not reached again
[remove]: req=8e65 output_len=7                                ← removed again

Multimodal models trigger this because large image prompts (~1500-2400 tokens) consume disproportionate token budget per prefill chunk, leaving fewer slots for decode requests.
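
A back-of-the-envelope sketch shows the scale of the effect. All numbers are illustrative assumptions: the 8192-token budget and the 200-token text prompts are hypothetical, the image prompt size is picked from the range quoted above, and the model charges prefill chunks before decodes, which simplifies the real scheduler.

```python
def decode_slots_left(token_budget, prefill_chunks, num_spec_tokens=3):
    """Return how many decode requests still fit after the given prefill chunks.

    Each decode request is assumed to need 1 + num_spec_tokens tokens per step.
    """
    remaining = token_budget - sum(prefill_chunks)
    if remaining < 0:
        raise ValueError("prefill chunks exceed the token budget")
    return remaining // (1 + num_spec_tokens)


budget = 8192  # hypothetical per-step token budget, not taken from the issue

text_only = decode_slots_left(budget, [200] * 4)    # four short text prompts
multimodal = decode_slots_left(budget, [2000] * 4)  # four image prompts

print(text_only, multimodal)
```

Under these assumptions, four image prefills leave room for only a few dozen decode requests, so at a concurrency of 256 most running requests would go unvisited in that step.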

Fix: Clear spec_token_ids for running requests not scheduled in the current step, preventing stale -1 placeholders from persisting across scheduling steps.
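
The stale-placeholder condition and the fix can be sketched with a toy scheduling step. This is an illustration under simplifying assumptions, not the change made in scheduler.py: requests are plain dicts, and `schedule_step` and `make_running` are hypothetical names.

```python
PLACEHOLDER = -1


def make_running(n, num_spec=3):
    """Create n running decode requests that already carry placeholders."""
    return [{"id": f"req{i}", "spec_token_ids": [PLACEHOLDER] * num_spec}
            for i in range(n)]


def schedule_step(running, token_budget, clear_unscheduled):
    """Visit requests in order until the budget runs out.

    Returns the ids of requests still holding placeholders after the step.
    """
    visited = set()
    for req in running:
        cost = 1 + len(req["spec_token_ids"])
        if cost > token_budget:
            break  # budget exhausted: later requests are never visited
        token_budget -= cost
        req["spec_token_ids"] = []  # consumed when the request is scheduled
        visited.add(req["id"])
    if clear_unscheduled:
        # The fix: drop placeholders of running requests that were skipped.
        for req in running:
            if req["id"] not in visited:
                req["spec_token_ids"] = []
    return [req["id"] for req in running if PLACEHOLDER in req["spec_token_ids"]]


stale_without_fix = schedule_step(make_running(4), token_budget=8,
                                  clear_unscheduled=False)
stale_with_fix = schedule_step(make_running(4), token_budget=8,
                               clear_unscheduled=True)
print(stale_without_fix, stale_with_fix)
```

With a budget of 8 tokens and a cost of 4 per request, only the first two requests are visited. Without the clearing pass, the other two keep their -1 placeholders into the next step.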

Test Plan

  • Serve lightonai/LightOnOCR-2-1B with EAGLE3 speculator (staghado/LightOnOCR-2-1B-speculator-eagle3-bug-report)
  • Send all 1403 images from staghado/olmo-ocr dataset at concurrency 256
  • Verify no CUDA errors and all requests succeed

Test Result

  • Without fix: crashes at ~959/1403 requests (CUDA device-side assert)
  • With fix: 1403/1403 succeed, 0 errors

Changed files

  • tests/v1/core/test_scheduler_spec_stale_placeholders.py (added, +145/-0)
  • vllm/v1/core/sched/scheduler.py (modified, +13/-0)


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Collecting environment information...
uv is set
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version                : Could not collect
CMake version                : Could not collect
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.10.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.9 (main, Mar 17 2025, 21:01:58) [Clang 20.1.0 ] (64-bit runtime)
Python platform              : Linux-6.2.0-1015-nvidia-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : Could not collect
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : 
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3

Nvidia driver version        : 550.90.07
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      52 bits physical, 57 bits virtual
Byte Order:                         Little Endian
CPU(s):                             144
On-line CPU(s) list:                0-143
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Xeon(R) Platinum 8452Y
CPU family:                         6
Model:                              143
Thread(s) per core:                 2
Core(s) per socket:                 36
Socket(s):                          2
Stepping:                           8
CPU max MHz:                        3200,0000
CPU min MHz:                        800,0000
BogoMIPS:                           4000.00
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req hfi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr ibt amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization:                     VT-x
L1d cache:                          3,4 MiB (72 instances)
L1i cache:                          2,3 MiB (72 instances)
L2 cache:                           144 MiB (72 instances)
L3 cache:                           135 MiB (2 instances)
NUMA node(s):                       2
NUMA node0 CPU(s):                  0-35,72-107
NUMA node1 CPU(s):                  36-71,108-143
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.4
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.4.1
[pip3] nvidia-cutlass-dsl-libs-base==4.4.1
[pip3] nvidia-ml-py==13.590.48
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvshmem-cu12==3.4.5
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] pyzmq==27.1.0
[pip3] torch==2.10.0
[pip3] torch-c-dlpack-ext==0.1.5
[pip3] torchaudio==2.10.0
[pip3] torchvision==0.25.0
[pip3] transformers==4.57.6
[pip3] triton==3.6.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.17.1
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    NIC9    NIC10   CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    PIX     PXB     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    0-35,72-107     0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    PXB     PIX     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    0-35,72-107     0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    NODE    NODE    PIX     PXB     NODE    NODE    SYS     SYS     SYS     SYS     NODE    0-35,72-107     0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    NODE    NODE    PXB     PIX     NODE    NODE    SYS     SYS     SYS     SYS     NODE    0-35,72-107     0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     PIX     PXB     NODE    NODE    SYS     36-71,108-143   1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     PXB     PIX     NODE    NODE    SYS     36-71,108-143   1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    PIX     PXB     SYS     36-71,108-143   1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    PXB     PIX     SYS     36-71,108-143   1               N/A
NIC0    PIX     PXB     NODE    NODE    SYS     SYS     SYS     SYS      X      PXB     NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE
NIC1    PXB     PIX     NODE    NODE    SYS     SYS     SYS     SYS     PXB      X      NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE
NIC2    NODE    NODE    PIX     PXB     SYS     SYS     SYS     SYS     NODE    NODE     X      PXB     NODE    NODE    SYS     SYS     SYS     SYS     NODE
NIC3    NODE    NODE    PXB     PIX     SYS     SYS     SYS     SYS     NODE    NODE    PXB      X      NODE    NODE    SYS     SYS     SYS     SYS     NODE
NIC4    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE     X      PIX     SYS     SYS     SYS     SYS     NODE
NIC5    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    PIX      X      SYS     SYS     SYS     SYS     NODE
NIC6    SYS     SYS     SYS     SYS     PIX     PXB     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS      X      PXB     NODE    NODE    SYS
NIC7    SYS     SYS     SYS     SYS     PXB     PIX     NODE    NODE    SYS     SYS     SYS     SYS     SYS     SYS     PXB      X      NODE    NODE    SYS
NIC8    SYS     SYS     SYS     SYS     NODE    NODE    PIX     PXB     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE     X      PXB     SYS
NIC9    SYS     SYS     SYS     SYS     NODE    NODE    PXB     PIX     SYS     SYS     SYS     SYS     SYS     SYS     NODE    NODE    PXB      X      SYS
NIC10   NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS     NODE    NODE    NODE    NODE    NODE    NODE    SYS     SYS     SYS     SYS      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_8
  NIC7: mlx5_9
  NIC8: mlx5_10
  NIC9: mlx5_11
  NIC10: mlx5_bond_0

==============================
     Environment Variables
==============================
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_staghado
</details>

🐛 Describe the bug

vllm serve crashes when serving lightonai/LightOnOCR-2-1B with EAGLE3 speculative decoding and default async scheduling.

Some info from debugging:

  • some request orders crash, some do not
  • concurrency 1 is stable
  • --no-async-scheduling avoided the crash in all tested cases so far
  • a dense batch of old_scans images crashes immediately on the first batch

Minimal repro

Server

vllm serve lightonai/LightOnOCR-2-1B \
    --port 8040 \
    --no-enable-prefix-caching \
    --mm-processor-cache-gb 0 \
    --limit-mm-per-prompt '{"image": 1}' \
    --gpu-memory-utilization 0.96 \
    --speculative-config '{"model": "staghado/LightOnOCR-2-1B-speculator-eagle3-bug-report", "num_speculative_tokens": 3, "method": "eagle3"}'

Client

Use 98 old_scans images from staghado/olmo-ocr and send them concurrently to /v1/chat/completions:

import base64, json, urllib.request
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from huggingface_hub import snapshot_download

url = "http://localhost:8040/v1/chat/completions"
root = Path(snapshot_download("staghado/olmo-ocr", repo_type="dataset"))
imgs = sorted((root / "images" / "old_scans").glob("*.png"))

def run(p):
    body = json.dumps({"model":"lightonai/LightOnOCR-2-1B","messages":[{"role":"user","content":[{"type":"image_url","image_url":{"url":"data:image/png;base64," + base64.b64encode(p.read_bytes()).decode()}}]}],"temperature":0.2,"max_tokens":4096}).encode()
    try: urllib.request.urlopen(urllib.request.Request(url, data=body, headers={"Content-Type":"application/json"}), timeout=600).read(); return True
    except Exception: return False

with ThreadPoolExecutor(max_workers=len(imgs)) as ex:
    print(sum(f.result() for f in [ex.submit(run, p) for p in imgs]), "/", len(imgs))

Observed:

  • concurrency 1 with async scheduling: OK
  • concurrency 98 with async scheduling: crash in ~25s, 0 successful requests
  • concurrency 98 with --no-async-scheduling: 98/98 OK

Broader behavior

With all 1403 images at concurrency 256:

  • sorted order: crash at ~880 OK
  • shuffled order (seed=123): crash at ~822 OK
  • shuffled order (seed=42): 1403/1403 OK

So this looks sequence-dependent under async load, not purely cumulative and not a single malformed input.
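To replay a particular ordering from the runs above, the shuffle can be made deterministic. This is a minimal sketch that assumes the orders were produced with Python's `random` module (the report does not say which RNG was used); `ordered` is a hypothetical helper name.

```python
import random

def ordered(items, seed=None):
    """Return items sorted, or shuffled deterministically when a seed is given."""
    out = sorted(items)
    if seed is not None:
        # A private Random instance keeps the order reproducible without
        # touching the global RNG state.
        random.Random(seed).shuffle(out)
    return out

# Example: the three orderings tried above, applied to placeholder names.
names = [f"img_{i:04d}.png" for i in range(10)]
for seed in (None, 123, 42):
    print(seed, ordered(names, seed)[:3])
```

Passing the resulting list to the client in place of `imgs` keeps the submission order fixed between runs, although thread scheduling can still change the order in which requests reach the server.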

Error

On v0.17.1, the server log shows:

RuntimeError: CUDA driver error: device-side assert triggered
...
EngineCore encountered a fatal error.
...
vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue.

The crash occurs in compiled model execution (Inductor/Triton), after which the API returns 500.
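Device-side asserts are reported asynchronously, so the Python frame shown in the traceback is not necessarily the one that launched the failing kernel. The sketch below builds a debugging variant of the server command, assuming that `CUDA_LAUNCH_BLOCKING=1` (a standard CUDA environment variable) and vLLM's `--enforce-eager` flag make the trace easier to read here. Whether the crash still reproduces in eager mode is not established by this report, and `debug_launch` is a hypothetical helper.

```python
import os

def debug_launch(base_cmd, env=None):
    """Return (cmd, env) for a debugging run of the server.

    base_cmd: the original `vllm serve ...` command as a list of arguments.
    """
    env = dict(os.environ if env is None else env)
    # Make CUDA kernel launches synchronous so errors surface at the call site.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    cmd = list(base_cmd)
    if "--enforce-eager" not in cmd:
        # Run the model in eager mode to get plain PyTorch stack frames.
        cmd.append("--enforce-eager")
    return cmd, env

cmd, env = debug_launch(["vllm", "serve", "lightonai/LightOnOCR-2-1B", "--port", "8040"])
print(" ".join(cmd))
# To actually start the server: subprocess.run(cmd, env=env, check=True)
```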

Likely cause

The crash trace points to an out-of-bounds token ID reaching the target model embedding path during preprocessing:

  • gpu_model_runner._preprocess -> model.embed_input_ids
  • qwen3 -> qwen2 -> embed_tokens
  • F.embedding(...)
  • CUDA error: device-side assert triggered

So the current hypothesis is that async multimodal + EAGLE3 batching is constructing or propagating an invalid token ID into the target model input preparation path.
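To test this hypothesis, it helps to know which request in a batch contributed the invalid ID. The following is a minimal sketch in plain Python, assuming the batch is available as a flat list of token IDs plus a per-request token count; `find_bad_requests`, `num_tokens_per_req`, and `vocab_size` are illustrative names, not vLLM internals.

```python
def find_bad_requests(token_ids, num_tokens_per_req, vocab_size):
    """Map out-of-range token IDs in a flat batch back to request indices.

    Returns a list of (request_index, offset_in_request, token_id).
    """
    if sum(num_tokens_per_req) != len(token_ids):
        raise ValueError("token counts do not match the flat batch length")
    bad = []
    start = 0
    for req_idx, n in enumerate(num_tokens_per_req):
        for offset, tok in enumerate(token_ids[start:start + n]):
            # Valid embedding indices lie in [0, vocab_size).
            if not 0 <= tok < vocab_size:
                bad.append((req_idx, offset, tok))
        start += n
    return bad

# Toy batch: three requests with 3, 2, and 4 tokens; vocab_size is made up.
flat = [5, 17, 2, 9, -1, 3, 120, 4, 8]
print(find_bad_requests(flat, [3, 2, 4], vocab_size=100))
# -> [(1, 1, -1), (2, 1, 120)]
```

Logging the request index alongside the offending value shows whether the bad IDs cluster in speculative (draft) positions or in multimodal placeholder positions.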

Current workaround

Disable async scheduling entirely by passing --no-async-scheduling to vllm serve.

extent analysis

Fix Plan

This plan targets the vllm serve crash seen when serving lightonai/LightOnOCR-2-1B with EAGLE3 speculative decoding and default async scheduling. It follows the working hypothesis that async multimodal + EAGLE3 batching constructs or propagates an invalid token ID into the target model's input preparation path.

Step 1: Validate Token IDs

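Before passing token IDs to the model, check that every ID lies in [0, vocab_size), where vocab_size is the size of the target model's embedding table. Adding this check in the preprocessing stage turns an opaque device-side assert into a Python exception that names the offending IDs.

def validate_token_ids(token_ids, vocab_size):
    # Assumes token_ids is a Python sequence of ints; a tensor would need
    # a vectorized check instead.
    # Valid IDs lie in [0, vocab_size); take vocab_size from the target
    # model's configuration rather than hard-coding it.
    bad = [t for t in token_ids if t < 0 or t >= vocab_size]
    if bad:
        raise ValueError(f"Invalid token IDs (vocab_size={vocab_size}): {bad[:8]}")
    return token_ids

# Hypothetical call site in gpu_model_runner._preprocess;
# `model` and `vocab_size` are assumed to be in scope.
token_ids = validate_token_ids(token_ids, vocab_size)
model_input = model.embed_input_ids(token_ids)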

Step 2: Synchronize Async Operations

If the invalid ID comes from a race between host threads that read and write the same token ID buffer, serializing access with a lock removes that race. A lock does not order GPU work, so treat this step as a way to test the race hypothesis rather than as a complete fix.

import threading

token_id_lock = threading.Lock()

# Sketch of gpu_model_runner._preprocess; `model` is assumed to be in scope.
def _preprocess(token_ids):
    with token_id_lock:
        # Preprocessing code here
        model_input = model.embed_input_ids(token_ids)
        return model_input

Step 3: Implement Retry Mechanism

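Retry preprocessing when it fails with a recoverable error. A CUDA device-side assert is not recoverable inside the same process, because the CUDA context stays in an error state once the assert fires. A retry therefore only helps if the validation in Step 1 rejects the bad IDs before they reach the GPU, so the decorator below retries on the ValueError raised by that check.

import functools

def retry_on_failure(max_retries=3, exceptions=(ValueError,)):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    # Re-raise on the final attempt, otherwise try again.
                    if attempt == max_retries - 1:
                        raise
        return wrapper
    return decorator

# Sketch of gpu_model_runner._preprocess; `model` is assumed to be in scope.
@retry_on_failure(max_retries=3)
def _preprocess(token_ids):
    # Preprocessing code here (including the Step 1 validation)
    model_input = model.embed_input_ids(token_ids)
    return model_input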

Verification

To verify the fix, rerun the repro client against the patched server with async scheduling enabled and confirm that all 98 requests succeed at concurrency 98. Then repeat with all 1403 images at concurrency 256 in the sorted and seed=123 shuffled orders, both of which crashed before the fix.
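The verification runs can be scripted as a small matrix over concurrency levels. This is a minimal sketch that assumes a `send(item) -> bool` callable which returns False on failure instead of raising, like the `run` function in the repro client; `run_matrix` and the chosen concurrency levels are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

def run_matrix(send, items, concurrencies=(1, 16, 98)):
    """Send every item at each concurrency level; return {level: successes}."""
    results = {}
    for workers in concurrencies:
        with ThreadPoolExecutor(max_workers=workers) as ex:
            # ex.map preserves input order and waits for all requests.
            results[workers] = sum(bool(ok) for ok in ex.map(send, items))
    return results

# Dry run with a stand-in sender so the harness can be checked without a server.
def fake_send(item):
    return item % 7 != 0  # pretend every 7th request fails

print(run_matrix(fake_send, list(range(1, 50))))
```

Against the real server, pass `run` and `imgs` from the client script. A working fix should give the full count at every level with async scheduling enabled.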

Extra Tips

  • Read the maximum valid token ID from the model configuration rather than hard-coding it, so that the check stays correct for new or updated models.
  • Log token ID validation failures together with the request IDs involved, so that a failure can be traced back to a specific batch.
  • If the issue persists, investigate other potential causes, such as model-specific bugs or environment-related issues.
