vllm - ✅(Solved) Fix [Bug]: Qwen3-VL-235B OOM with multi-image long multiturn inputs [2 pull requests, 10 comments, 4 participants]

GitHub stats
vllm-project/vllm#38257 (fetched 2026-04-08 01:36:56)
Comments: 10
Participants: 4
Timeline: 22
Reactions: 1
Timeline (top): commented ×10, subscribed ×5, mentioned ×3, cross-referenced ×2

Error Message

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 48.00 MiB. GPU 7 has a total capacity of 79.19 GiB of which 43.00 MiB is free. Including non-PyTorch memory, this process has 79.13 GiB memory in use.

The exception is raised in _merge_multimodal_embeddings (vllm/model_executor/models/utils.py, line 476) at the inputs_embeds.masked_scatter_( call. The full environment report and worker traceback are listed under Code Example.

Root Cause

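The server ran out of GPU memory on the final turn; the worker log shows the OOM being raised inside _merge_multimodal_embeddings at the inputs_embeds.masked_scatter_( call. The client received the following APIError: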

...

openai.APIError: EngineCore encountered an issue. See stack trace (above) for the root cause.
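The figures in the worker's OOM message can be turned into a quick headroom check. The sketch below is illustrative only: it parses an excerpt of the message copied from the worker log of this issue, and the helper names (`to_mib`, `parse_oom`) are hypothetical. Treating everything beyond PyTorch's allocated and reserved memory as "non-PyTorch" usage is an assumption about how to read the message, not something the issue states.

```python
import re

# Excerpt of the OOM message from the worker log of this issue.
OOM_MESSAGE = (
    "CUDA out of memory. Tried to allocate 48.00 MiB. "
    "GPU 7 has a total capacity of 79.19 GiB of which 43.00 MiB is free. "
    "Including non-PyTorch memory, this process has 79.13 GiB memory in use. "
    "Of the allocated memory 72.09 GiB is allocated by PyTorch, "
    "with 24.00 MiB allocated in private pools (e.g. CUDA Graphs), "
    "and 371.74 MiB is reserved by PyTorch but unallocated."
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def to_mib(value: str, unit: str) -> float:
    """Convert a '<number> <unit>' pair from the message to MiB."""
    return float(value) * UNIT_TO_MIB[unit]


def parse_oom(message: str) -> dict:
    """Extract the memory figures from a PyTorch CUDA OOM message (hypothetical helper)."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"total capacity of ([\d.]+) (MiB|GiB)",
        "free": r"of which ([\d.]+) (MiB|GiB) is free",
        "in_use": r"this process has ([\d.]+) (MiB|GiB) memory in use",
        "torch_allocated": r"([\d.]+) (MiB|GiB) is allocated by PyTorch",
        "torch_reserved": r"([\d.]+) (MiB|GiB) is reserved by PyTorch",
    }
    return {
        name: to_mib(*re.search(pattern, message).groups())
        for name, pattern in patterns.items()
    }


stats = parse_oom(OOM_MESSAGE)

# How far the failed allocation overshoots the free memory.
shortfall_mib = stats["requested"] - stats["free"]

# Assumption: memory in use that PyTorch neither allocated nor reserved.
non_torch_mib = stats["in_use"] - stats["torch_allocated"] - stats["torch_reserved"]

print(f"shortfall: {shortfall_mib:.2f} MiB")
print(f"non-PyTorch memory (estimated): {non_torch_mib / 1024:.2f} GiB")
```

Read this way, the failed 48 MiB request missed by only a few MiB, which suggests the GPU was already essentially full before the multimodal merge step ran.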

Fix Action

Fix / Workaround


PR fix notes

PR #34246: [Core] Simplify multimodal masking

Description (problem / solution / changelog)

Purpose

Since PyTorch 2.9.0 (https://github.com/pytorch/pytorch/pull/156384), target[mask] = src no longer causes a cudaStreamSynchronize when mask is a CPU tensor.

This PR simplifies _merge_multimodal_embeddings by removing the need for masked_scatter_ without re-introducing a CPU/GPU sync. This also simplifies the model runner since the mask doesn't need to be transferred to the GPU anymore.
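To make the two merge styles concrete, here is a minimal CPU-only sketch. It is not vLLM's code: the function names and tensor shapes are hypothetical, and it only shows that masked_scatter_ and boolean-mask assignment produce the same merged tensor. It does not measure memory use or synchronization behaviour.

```python
import torch


def merge_with_masked_scatter(inputs_embeds, mm_embeds, is_multimodal):
    """Scatter rows of mm_embeds into the masked positions (hypothetical helper)."""
    merged = inputs_embeds.clone()
    # The mask is broadcast over the hidden dimension and must live on the
    # same device as the target tensor.
    mask = is_multimodal.to(merged.device).unsqueeze(-1)
    merged.masked_scatter_(mask, mm_embeds)
    return merged


def merge_with_index_assign(inputs_embeds, mm_embeds, is_multimodal):
    """Assign rows of mm_embeds through a boolean mask (hypothetical helper)."""
    merged = inputs_embeds.clone()
    merged[is_multimodal] = mm_embeds
    return merged


# Five token positions with hidden size 3; positions 1 and 3 are multimodal.
inputs_embeds = torch.zeros(5, 3)
mm_embeds = torch.arange(1, 7, dtype=torch.float32).reshape(2, 3)
is_multimodal = torch.tensor([False, True, False, True, False])

scattered = merge_with_masked_scatter(inputs_embeds, mm_embeds, is_multimodal)
assigned = merge_with_index_assign(inputs_embeds, mm_embeds, is_multimodal)
print(torch.equal(scattered, assigned))
```

Both helpers work on a copy so the comparison is side-effect free; an in-place merge on the real embeddings tensor would skip the clone.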

Test Plan

I verified that for Qwen3VL no cudaStreamSynchronize ops are visible in the torch profile.

I'm keeping this as a draft for now to see what @DarkLight1337 or @ywang96 think of this approach. I'm happy to then also double-check some other multimodal models.

vllm serve Qwen/Qwen3-VL-2B-Instruct-FP8 --limit-mm-per-prompt.video 0 --gpu-memory-utilization 0.96

vllm bench serve --backend openai-chat --model Qwen/Qwen3-VL-2B-Instruct-FP8 --endpoint /v1/chat/completions --dataset-name hf --dataset-path lmarena-ai/VisionArena-Chat --hf-split train --num-prompts 1000

Test Result

Inter-token latency seems to improve, though I'm not sure whether this is just noise. Tested on a single L40S GPU.

Main

============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Benchmark duration (s):                  56.35
Total input tokens:                      94327
Total generated tokens:                  120513
Request throughput (req/s):              17.74
Output token throughput (tok/s):         2138.50
Peak output token throughput (tok/s):    2713.00
Peak concurrent requests:                1000.00
Total token throughput (tok/s):          3812.33
---------------Time to First Token----------------
Mean TTFT (ms):                          28636.18
Median TTFT (ms):                        28416.69
P99 TTFT (ms):                           53127.91
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          32.46
Median TPOT (ms):                        31.65
P99 TPOT (ms):                           46.73
---------------Inter-token Latency----------------
Mean ITL (ms):                           119.67
Median ITL (ms):                         93.25
P99 ITL (ms):                            447.23
==================================================

This PR

============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Benchmark duration (s):                  56.63
Total input tokens:                      94327
Total generated tokens:                  120954
Request throughput (req/s):              17.66
Output token throughput (tok/s):         2135.88
Peak output token throughput (tok/s):    4022.00
Peak concurrent requests:                1000.00
Total token throughput (tok/s):          3801.57
---------------Time to First Token----------------
Mean TTFT (ms):                          28940.85
Median TTFT (ms):                        28812.05
P99 TTFT (ms):                           53501.34
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          34.93
Median TPOT (ms):                        30.43
P99 TPOT (ms):                           61.43
---------------Inter-token Latency----------------
Mean ITL (ms):                           98.06
Median ITL (ms):                         34.30
P99 ITL (ms):                            400.73
==================================================
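The two result tables can be compared directly by computing relative changes. The sketch below copies the latency figures from the tables above; the dictionary keys and the `pct_change` helper are hypothetical names chosen for illustration. Each table reflects a single benchmark run, so the percentages say nothing about run-to-run noise.

```python
# Latency figures (ms) copied from the "Main" and "This PR" tables.
MAIN = {
    "mean_tpot": 32.46, "median_tpot": 31.65, "p99_tpot": 46.73,
    "mean_itl": 119.67, "median_itl": 93.25, "p99_itl": 447.23,
}
THIS_PR = {
    "mean_tpot": 34.93, "median_tpot": 30.43, "p99_tpot": 61.43,
    "mean_itl": 98.06, "median_itl": 34.30, "p99_itl": 400.73,
}


def pct_change(before: float, after: float) -> float:
    """Relative change in percent; negative means lower latency."""
    return (after - before) / before * 100.0


changes = {name: pct_change(MAIN[name], THIS_PR[name]) for name in MAIN}

for name, change in changes.items():
    print(f"{name:12s} {MAIN[name]:8.2f} -> {THIS_PR[name]:8.2f} ms ({change:+.1f}%)")
```

The ITL metrics all drop, while mean and P99 TPOT rise, which is consistent with the author's caution that the difference may be noise.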

Changed files

  • tests/models/test_utils.py (modified, +29/-3)
  • vllm/model_executor/models/interfaces.py (modified, +3/-1)
  • vllm/model_executor/models/nano_nemotron_vl.py (modified, +2/-3)
  • vllm/model_executor/models/qwen2_5_omni_thinker.py (modified, +6/-8)
  • vllm/model_executor/models/qwen3_omni_moe_thinker.py (modified, +3/-2)
  • vllm/model_executor/models/qwen3_vl.py (modified, +4/-3)
  • vllm/model_executor/models/utils.py (modified, +3/-9)
  • vllm/v1/worker/gpu/mm/encoder_runner.py (modified, +1/-3)
  • vllm/v1/worker/gpu_model_runner.py (modified, +3/-19)

Code Example

==============================
        System Info
==============================
OS                           : Ubuntu 24.04.3 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version                : Could not collect
CMake version                : Could not collect
Libc version                 : glibc-2.39

==============================
       PyTorch Info
==============================
PyTorch version              : 2.10.0+cu129
Is debug build               : False
CUDA used to build PyTorch   : 12.9
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.3 (main, Nov  6 2025, 13:44:16) [GCC 13.3.0] (64-bit runtime)
Python platform              : Linux-5.14.0-284.118.1.el9_2.x86_64-x86_64-with-glibc2.39

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.9.86
CUDA_MODULE_LOADING set to   : 
GPU models and configuration :
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3

Nvidia driver version        : 570.148.08
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        52 bits physical, 57 bits virtual
Byte Order:                           Little Endian
CPU(s):                               64
On-line CPU(s) list:                  0-63
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Xeon(R) Gold 6430
CPU family:                           6
Model:                                143
Thread(s) per core:                   1
Core(s) per socket:                   32
Socket(s):                            2
Stepping:                             8
CPU(s) scaling MHz:                   76%
CPU max MHz:                          3400.0000
CPU min MHz:                          800.0000
BogoMIPS:                             4200.00
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr ibt amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
L1d cache:                            3 MiB (64 instances)
L1i cache:                            2 MiB (64 instances)
L2 cache:                             128 MiB (64 instances)
L3 cache:                             120 MiB (2 instances)
NUMA node(s):                         8
NUMA node0 CPU(s):                    0-7
NUMA node1 CPU(s):                    8-15
NUMA node2 CPU(s):                    16-23
NUMA node3 CPU(s):                    24-31
NUMA node4 CPU(s):                    32-39
NUMA node5 CPU(s):                    40-47
NUMA node6 CPU(s):                    48-55
NUMA node7 CPU(s):                    56-63
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI BHI_DIS_S
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

==============================
Versions of relevant libraries
==============================
[pip3] ema-pytorch==0.7.9
[pip3] flashinfer-python==0.6.7
[pip3] helion==0.3.2
[pip3] mypy-extensions==1.1.0
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.9.1.4
[pip3] nvidia-cuda-cupti-cu12==12.9.79
[pip3] nvidia-cuda-nvrtc-cu12==12.9.86
[pip3] nvidia-cuda-runtime-cu12==12.9.79
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cufft-cu12==11.4.1.4
[pip3] nvidia-cufile-cu12==1.14.1.1
[pip3] nvidia-curand-cu12==10.3.10.19
[pip3] nvidia-cusolver-cu12==11.7.5.82
[pip3] nvidia-cusparse-cu12==12.5.10.65
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.4.2
[pip3] nvidia-cutlass-dsl-libs-base==4.4.2
[pip3] nvidia-ml-py==13.595.45
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.9.86
[pip3] nvidia-nvshmem-cu12==3.4.5
[pip3] nvidia-nvtx-cu12==12.9.79
[pip3] onnxruntime==1.24.4
[pip3] pyzmq==27.1.0
[pip3] torch==2.10.0+cu129
[pip3] torchaudio==2.10.0+cu129
[pip3] torchsde==0.2.6
[pip3] torchvision==0.25.0+cu129
[pip3] transformers==5.3.0
[pip3] triton==3.6.0
[pip3] x-transformers==2.17.7
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version : Could not collect
vLLM Version : 0.18.1rc1.dev149+38de82231 (git sha: 38de82231)
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    CPU Affinity    NUMA Affinity    GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    PIX     SYS     SYS     SYS     SYS     0-7     0    N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    PXB     SYS     SYS     SYS     SYS     0-7     0    N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    SYS     PXB     NODE    SYS     SYS     16-23   2    N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    SYS     PIX     NODE    SYS     SYS     16-23   2    N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    SYS     SYS     SYS     PXB     SYS     32-39   4    N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    SYS     SYS     SYS     PIX     SYS     32-39   4    N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    SYS     SYS     SYS     SYS     PXB     48-55   6    N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      SYS     SYS     SYS     SYS     PIX     48-55   6    N/A
NIC0    PIX     PXB     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS
NIC1    SYS     SYS     PXB     PIX     SYS     SYS     SYS     SYS     SYS      X      NODE    SYS     SYS
NIC2    SYS     SYS     NODE    NODE    SYS     SYS     SYS     SYS     SYS     NODE     X      SYS     SYS
NIC3    SYS     SYS     SYS     SYS     PXB     PIX     SYS     SYS     SYS     SYS     SYS      X      SYS
NIC4    SYS     SYS     SYS     SYS     SYS     SYS     PXB     PIX     SYS     SYS     SYS     SYS      X

Legend:

  X = Self
  SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX = Connection traversing at most a single PCIe bridge
  NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4

==============================
     Environment Variables
==============================
NVIDIA_VISIBLE_DEVICES=/var/run/nvidia-container-devices
CUDA_COREDUMP_SHOW_PROGRESS=1
VLLM_USE_DEEP_GEMM=1
VLLM_CACHE_ROOT=/tmp/cache/vllm
VLLM_LOG_MODEL_INSPECTION=1
CUDA_COREDUMP_GENERATION_FLAGS=skip_nonrelocated_elf_images,skip_global_memory,skip_shared_memory,skip_local_memory,skip_constbank_memory
NVIDIA_GDRCOPY=enabled
VLLM_FLOAT32_MATMUL_PRECISION=high
NCCL_IB_HCA=mlx5
TORCHINDUCTOR_CACHE_DIR=/tmp/cache/torchinductor_root
VLLM_USE_FLASHINFER_SAMPLER=1
VLLM_ALLREDUCE_USE_SYMM_MEM=1
CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1
LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/lib/x86_64-linux-gnu:/usr/lib
VLLM_NO_USAGE_STATS=1
CUDA_COREDUMP_FILE=/mnt/models/logs/vllm.%h/cuda_coredump_%p.%t
VLLM_USE_FLASHINFER_MOE_FP8=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
VLLM_WORKER_MULTIPROC_METHOD=spawn

---

# On an H100 x 8 worker node
vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
  --port 8080 \
  --gpu-memory-utilization 0.93 \
  --tensor-parallel-size 8 \
  --decode-context-parallel-size 2 \
  --enable-expert-parallel \
  --mm-encoder-tp-mode data \
  --mm-processor-cache-type shm \
  --mm-processor-cache-gb 0 \
  --max-model-len 262144 \
  --max-num-batched-tokens 16384 \
  --mm-processor-kwargs.size '{"longest_edge":4194304,"shortest_edge":16384}'

---

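import json
import random
import uuid

from openai import OpenAI
from transformers import AutoTokenizer

# Assumption: the server started above is reachable on localhost:8080; adjust base_url as needed.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-VL-235B-A22B-Instruct")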

# generate niah-type long question
uuid_puzzle = {str(uuid.uuid4()): str(uuid.uuid4())}
puzzle_template = 'JSON data:\n{uuid_puzzle}\nQ: \nKey: "{uuid_q}"\nThe value associated with the specified key is: '
token_count = 0
while token_count < 120 * 1204:
    for _ in range(8):
        uuid_puzzle[str(uuid.uuid4())] = str(uuid.uuid4())
    uuid_q, uuid_a = random.choice(list(uuid_puzzle.items()))
    puzzle_text = {"role": "user", "content": puzzle_template.format(uuid_q=uuid_q, uuid_puzzle=json.dumps(uuid_puzzle))}
    # apply_chat_template expects a list of messages, so wrap the single message
    token_count = len(tokenizer.apply_chat_template([puzzle_text]))

# some large images to query - chosen from today's featured article on wikipedia
queries = {
    "https://upload.wikimedia.org/wikipedia/commons/5/5f/Neotype_skeleton_of_Massospondylus_carinatus.jpg": "Explain the image.",
    "https://upload.wikimedia.org/wikipedia/commons/f/f4/Massospondylus_type_material_seeley_1995.png": "Explain the image.",
    "https://upload.wikimedia.org/wikipedia/commons/9/9c/Massospondylus_syntype_series.jpg": "Explain the image."
}

# simulate a long-range multiturn conversation
# First 3 turns query with images, final turn queries with long text
a = None
messages = []
for image_url, q in queries.items():
    if a is not None:
        messages.append({"role": "assistant", "content": [{"type": "text", "text": a}]})
    a = ""
    messages.append({"role": "user", "content": [{"type": "text", "text": q}, {"type": "image_url", "image_url": {"url": image_url}}]})
    for t in client.chat.completions.create(model="Qwen/Qwen3-VL-235B-A22B-Instruct", messages=messages, stream=True):
        if t.choices:
            d = t.choices[0].delta
            print(d.content or "", end="", flush=True)
            # accumulate inside the guard so a chunk without choices cannot reuse a stale delta
            if d.content:
                a += d.content
    print("\n========\n")
a = ""
messages.append(puzzle_text)
for t in client.chat.completions.create(model="Qwen/Qwen3-VL-235B-A22B-Instruct", messages=messages, stream=True):
    if t.choices:
        d = t.choices[0].delta
        print(d.content or "", end="", flush=True)
        # accumulate inside the guard so a chunk without choices cannot reuse a stale delta
        if d.content:
            a += d.content

# server OOM on the final turn, client got the following APIError:
# ...
# openai.APIError: EngineCore encountered an issue. See stack trace (above) for the root cause.

---

(APIServer pid=1) INFO:     172.16.5.205:48768 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949] WorkerProc hit an exception.
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949] Traceback (most recent call last):
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]   File "/app/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 476, in _merge_multimodal_embeddings
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]     inputs_embeds.masked_scatter_(
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 48.00 MiB. GPU 7 has a total capacity of 79.19 GiB of which 43.00 MiB is free. Including non-PyTorch memory, this process has 79.13 GiB memory in use. Of the allocated memory 72.09 GiB is allocated by PyTorch, with 24.00 MiB allocated in private pools (e.g. CUDA Graphs), and 371.74 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949] 
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949] The above exception was the direct cause of the following exception:
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949] 
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949] Traceback (most recent call last):
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]   File "/app/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 944, in worker_busy_loop
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]     output = func(*args, **kwargs)
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]              ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]   File "/app/.venv/lib/python3.12/site-packages/vllm/v1/worker/worker_base.py", line 332, in execute_model
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]     return self.worker.execute_model(scheduler_output)
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]   File "/app/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]     return func(*args, **kwargs)
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]   File "/app/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 803, in execute_model
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]     output = self.model_runner.execute_model(
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]   File "/app/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]     return func(*args, **kwargs)
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]   File "/app/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3977, in execute_model
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]     ) = self._preprocess(
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]         ^^^^^^^^^^^^^^^^^
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]   File "/app/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3219, in _preprocess
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]     inputs_embeds_scheduled = self.model.embed_input_ids(
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]   File "/app/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_vl.py", line 2477, in embed_input_ids
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]     ) = self._compute_deepstack_embeds(
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]   File "/app/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_vl.py", line 2443, in _compute_deepstack_embeds
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]     deepstack_input_embeds = _merge_multimodal_embeddings(
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]   File "/app/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 491, in _merge_multimodal_embeddings
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]     raise ValueError("Error during masked scatter operation") from e
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949] ValueError: Error during masked scatter operation
...

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
==============================
        System Info
==============================
OS                           : Ubuntu 24.04.3 LTS (x86_64)
GCC version                  : (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version                : Could not collect
CMake version                : Could not collect
Libc version                 : glibc-2.39

==============================
       PyTorch Info
==============================
PyTorch version              : 2.10.0+cu129
Is debug build               : False
CUDA used to build PyTorch   : 12.9
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.3 (main, Nov  6 2025, 13:44:16) [GCC 13.3.0] (64-bit runtime)
Python platform              : Linux-5.14.0-284.118.1.el9_2.x86_64-x86_64-with-glibc2.39

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.9.86
CUDA_MODULE_LOADING set to   : 
GPU models and configuration :
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3

Nvidia driver version        : 570.148.08
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        52 bits physical, 57 bits virtual
Byte Order:                           Little Endian
CPU(s):                               64
On-line CPU(s) list:                  0-63
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Xeon(R) Gold 6430
CPU family:                           6
Model:                                143
Thread(s) per core:                   1
Core(s) per socket:                   32
Socket(s):                            2
Stepping:                             8
CPU(s) scaling MHz:                   76%
CPU max MHz:                          3400.0000
CPU min MHz:                          800.0000
BogoMIPS:                             4200.00
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr ibt amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
L1d cache:                            3 MiB (64 instances)
L1i cache:                            2 MiB (64 instances)
L2 cache:                             128 MiB (64 instances)
L3 cache:                             120 MiB (2 instances)
NUMA node(s):                         8
NUMA node0 CPU(s):                    0-7
NUMA node1 CPU(s):                    8-15
NUMA node2 CPU(s):                    16-23
NUMA node3 CPU(s):                    24-31
NUMA node4 CPU(s):                    32-39
NUMA node5 CPU(s):                    40-47
NUMA node6 CPU(s):                    48-55
NUMA node7 CPU(s):                    56-63
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI BHI_DIS_S
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

==============================
Versions of relevant libraries
==============================
[pip3] ema-pytorch==0.7.9
[pip3] flashinfer-python==0.6.7
[pip3] helion==0.3.2
[pip3] mypy-extensions==1.1.0
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.9.1.4
[pip3] nvidia-cuda-cupti-cu12==12.9.79
[pip3] nvidia-cuda-nvrtc-cu12==12.9.86
[pip3] nvidia-cuda-runtime-cu12==12.9.79
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cufft-cu12==11.4.1.4
[pip3] nvidia-cufile-cu12==1.14.1.1
[pip3] nvidia-curand-cu12==10.3.10.19
[pip3] nvidia-cusolver-cu12==11.7.5.82
[pip3] nvidia-cusparse-cu12==12.5.10.65
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.4.2
[pip3] nvidia-cutlass-dsl-libs-base==4.4.2
[pip3] nvidia-ml-py==13.595.45
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.9.86
[pip3] nvidia-nvshmem-cu12==3.4.5
[pip3] nvidia-nvtx-cu12==12.9.79
[pip3] onnxruntime==1.24.4
[pip3] pyzmq==27.1.0
[pip3] torch==2.10.0+cu129
[pip3] torchaudio==2.10.0+cu129
[pip3] torchsde==0.2.6
[pip3] torchvision==0.25.0+cu129
[pip3] transformers==5.3.0
[pip3] triton==3.6.0
[pip3] x-transformers==2.17.7
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version : Could not collect
vLLM Version : 0.18.1rc1.dev149+38de82231 (git sha: 38de82231)
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    CPU Affinity    NUMA Affinity    GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    PIX     SYS     SYS     SYS     SYS     0-7     0    N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    PXB     SYS     SYS     SYS     SYS     0-7     0    N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    SYS     PXB     NODE    SYS     SYS     16-23   2    N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    SYS     PIX     NODE    SYS     SYS     16-23   2    N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    SYS     SYS     SYS     PXB     SYS     32-39   4    N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    SYS     SYS     SYS     PIX     SYS     32-39   4    N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    SYS     SYS     SYS     SYS     PXB     48-55   6    N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      SYS     SYS     SYS     SYS     PIX     48-55   6    N/A
NIC0    PIX     PXB     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS
NIC1    SYS     SYS     PXB     PIX     SYS     SYS     SYS     SYS     SYS      X      NODE    SYS     SYS
NIC2    SYS     SYS     NODE    NODE    SYS     SYS     SYS     SYS     SYS     NODE     X      SYS     SYS
NIC3    SYS     SYS     SYS     SYS     PXB     PIX     SYS     SYS     SYS     SYS     SYS      X      SYS
NIC4    SYS     SYS     SYS     SYS     SYS     SYS     PXB     PIX     SYS     SYS     SYS     SYS      X

Legend:

  X = Self
  SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX = Connection traversing at most a single PCIe bridge
  NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4

==============================
     Environment Variables
==============================
NVIDIA_VISIBLE_DEVICES=/var/run/nvidia-container-devices
CUDA_COREDUMP_SHOW_PROGRESS=1
VLLM_USE_DEEP_GEMM=1
VLLM_CACHE_ROOT=/tmp/cache/vllm
VLLM_LOG_MODEL_INSPECTION=1
CUDA_COREDUMP_GENERATION_FLAGS=skip_nonrelocated_elf_images,skip_global_memory,skip_shared_memory,skip_local_memory,skip_constbank_memory
NVIDIA_GDRCOPY=enabled
VLLM_FLOAT32_MATMUL_PRECISION=high
NCCL_IB_HCA=mlx5
TORCHINDUCTOR_CACHE_DIR=/tmp/cache/torchinductor_root
VLLM_USE_FLASHINFER_SAMPLER=1
VLLM_ALLREDUCE_USE_SYMM_MEM=1
CUDA_ENABLE_COREDUMP_ON_EXCEPTION=1
LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/lib/x86_64-linux-gnu:/usr/lib
VLLM_NO_USAGE_STATS=1
CUDA_COREDUMP_FILE=/mnt/models/logs/vllm.%h/cuda_coredump_%p.%t
VLLM_USE_FLASHINFER_MOE_FP8=0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
VLLM_WORKER_MULTIPROC_METHOD=spawn
</details>

🐛 Describe the bug

Qwen3 VL models hit OOM on a single large multiturn input with multiple images and long text. The following server setup and test script reproduce it 100% of the time on my end:

vLLM deployment spec

# On an 8x H100 worker node
vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
  --port 8080 \
  --gpu-memory-utilization 0.93 \
  --tensor-parallel-size 8 \
  --decode-context-parallel-size 2 \
  --enable-expert-parallel \
  --mm-encoder-tp-mode data \
  --mm-processor-cache-type shm \
  --mm-processor-cache-gb 0 \
  --max-model-len 262144 \
  --max-num-batched-tokens 16384 \
  --mm-processor-kwargs.size '{"longest_edge":4194304,"shortest_edge":16384}'
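
To get a feel for how many vision tokens each image can contribute under the --mm-processor-kwargs.size setting above, here is a minimal sketch. It assumes, without confirmation from the issue, that longest_edge and shortest_edge act as maximum and minimum pixel-count budgets and that one vision token covers a 32x32 pixel area (16-pixel patches merged 2x2). The helper vision_tokens is hypothetical and not part of vLLM.

```python
# Assumptions (not confirmed by the issue): 16-pixel patches, merged 2x2,
# so a single vision token covers a 32x32 pixel area.
PATCH_SIZE = 16
MERGE_SIZE = 2


def vision_tokens(pixel_budget: int,
                  patch_size: int = PATCH_SIZE,
                  merge_size: int = MERGE_SIZE) -> int:
    """Estimate how many vision tokens fit in a given pixel budget."""
    pixels_per_token = (patch_size * merge_size) ** 2
    return pixel_budget // pixels_per_token


# Values taken from the server command above.
size = {"longest_edge": 4194304, "shortest_edge": 16384}

max_tokens = vision_tokens(size["longest_edge"])   # 4096
min_tokens = vision_tokens(size["shortest_edge"])  # 16
print(f"per-image vision tokens: min={min_tokens}, max={max_tokens}")
```

If these assumptions hold, the three images add at most a few thousand tokens each, which is small next to the long text turn; the estimate is only meant to help size --max-num-batched-tokens against the image budget.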

client request

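import json
import random
import uuid

from openai import OpenAI
from transformers import AutoTokenizer

# Assumed endpoint: the server above listens on port 8080; adjust host and API key as needed.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-VL-235B-A22B-Instruct")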

# generate niah-type long question
uuid_puzzle = {str(uuid.uuid4()): str(uuid.uuid4())}
puzzle_template = 'JSON data:\n{uuid_puzzle}\nQ: \nKey: "{uuid_q}"\nThe value associated with the specified key is: '
token_count = 0
while token_count < 120 * 1204:
    for _ in range(8):
        uuid_puzzle[str(uuid.uuid4())] = str(uuid.uuid4())
    uuid_q, uuid_a = random.choice(list(uuid_puzzle.items()))
    puzzle_text = {"role": "user", "content": puzzle_template.format(uuid_q=uuid_q, uuid_puzzle=json.dumps(uuid_puzzle))}
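    # apply_chat_template expects a list of messages; return_dict=False yields a plain list of token ids
    token_count = len(tokenizer.apply_chat_template([puzzle_text], tokenize=True, return_dict=False))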

# some large images to query - chosen from today's featured article on wikipedia
queries = {
    "https://upload.wikimedia.org/wikipedia/commons/5/5f/Neotype_skeleton_of_Massospondylus_carinatus.jpg": "Explain the image.",
    "https://upload.wikimedia.org/wikipedia/commons/f/f4/Massospondylus_type_material_seeley_1995.png": "Explain the image.",
    "https://upload.wikimedia.org/wikipedia/commons/9/9c/Massospondylus_syntype_series.jpg": "Explain the image."
}

# simulate a long-range multiturn conversation
# First 3 turns query with images, final turn queries with long text
a = None
messages = []
for image_url, q in queries.items():
    if a is not None:
        messages.append({"role": "assistant", "content": [{"type": "text", "text": a}]})
    a = ""
    messages.append({"role": "user", "content": [{"type": "text", "text": q}, {"type": "image_url", "image_url": {"url": image_url}}]})
    for t in client.chat.completions.create(model="Qwen/Qwen3-VL-235B-A22B-Instruct", messages=messages, stream=True):
        if t.choices:
            d = t.choices[0].delta
            print(d.content or "", end="", flush=True)
            # accumulate only when this chunk actually carried a delta
            if d.content:
                a += d.content
    print("\n========\n")
a = ""
messages.append(puzzle_text)
for t in client.chat.completions.create(model="Qwen/Qwen3-VL-235B-A22B-Instruct", messages=messages, stream=True):
    if t.choices:
        d = t.choices[0].delta
        print(d.content or "", end="", flush=True)
        # accumulate only when this chunk actually carried a delta
        if d.content:
            a += d.content

# server OOM on the final turn, client got the following APIError:
# ...
# openai.APIError: EngineCore encountered an issue. See stack trace (above) for the root cause.

Error log

(APIServer pid=1) INFO:     172.16.5.205:48768 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949] WorkerProc hit an exception.
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949] Traceback (most recent call last):
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]   File "/app/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 476, in _merge_multimodal_embeddings
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]     inputs_embeds.masked_scatter_(
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949] torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 48.00 MiB. GPU 7 has a total capacity of 79.19 GiB of which 43.00 MiB is free. Including non-PyTorch memory, this process has 79.13 GiB memory in use. Of the allocated memory 72.09 GiB is allocated by PyTorch, with 24.00 MiB allocated in private pools (e.g. CUDA Graphs), and 371.74 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949] 
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949] The above exception was the direct cause of the following exception:
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949] 
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949] Traceback (most recent call last):
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]   File "/app/.venv/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 944, in worker_busy_loop
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]     output = func(*args, **kwargs)
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]              ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]   File "/app/.venv/lib/python3.12/site-packages/vllm/v1/worker/worker_base.py", line 332, in execute_model
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]     return self.worker.execute_model(scheduler_output)
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]   File "/app/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]     return func(*args, **kwargs)
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]   File "/app/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 803, in execute_model
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]     output = self.model_runner.execute_model(
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]   File "/app/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]     return func(*args, **kwargs)
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]   File "/app/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3977, in execute_model
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]     ) = self._preprocess(
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]         ^^^^^^^^^^^^^^^^^
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]   File "/app/.venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3219, in _preprocess
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]     inputs_embeds_scheduled = self.model.embed_input_ids(
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]   File "/app/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_vl.py", line 2477, in embed_input_ids
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]     ) = self._compute_deepstack_embeds(
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]   File "/app/.venv/lib/python3.12/site-packages/vllm/model_executor/models/qwen3_vl.py", line 2443, in _compute_deepstack_embeds
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]     deepstack_input_embeds = _merge_multimodal_embeddings(
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]   File "/app/.venv/lib/python3.12/site-packages/vllm/model_executor/models/utils.py", line 491, in _merge_multimodal_embeddings
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949]     raise ValueError("Error during masked scatter operation") from e
(Worker_TP7_DCP1_EP7 pid=641) ERROR 03-26 21:14:47 [multiproc_executor.py:949] ValueError: Error during masked scatter operation
...
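
The memory figures in the OOM line above are easier to compare once extracted. The following is a minimal sketch that pulls them out with regular expressions; parse_oom and to_mib are hypothetical helpers written for this illustration, and the patterns only target the message wording shown in this log.

```python
import re

# Text copied from the torch.OutOfMemoryError line in the log above.
OOM_LINE = (
    "CUDA out of memory. Tried to allocate 48.00 MiB. GPU 7 has a total "
    "capacity of 79.19 GiB of which 43.00 MiB is free. Including non-PyTorch "
    "memory, this process has 79.13 GiB memory in use. Of the allocated memory "
    "72.09 GiB is allocated by PyTorch, with 24.00 MiB allocated in private "
    "pools (e.g. CUDA Graphs), and 371.74 MiB is reserved by PyTorch but "
    "unallocated."
)

UNIT_TO_MIB = {"MiB": 1.0, "GiB": 1024.0}


def to_mib(value: str, unit: str) -> float:
    """Convert a (number, unit) pair from the message into MiB."""
    return float(value) * UNIT_TO_MIB[unit]


def parse_oom(message: str) -> dict[str, float]:
    """Extract the main memory figures (in MiB) from a PyTorch OOM message."""
    patterns = {
        "requested": r"Tried to allocate ([\d.]+) (MiB|GiB)",
        "total": r"total capacity of ([\d.]+) (MiB|GiB)",
        "free": r"of which ([\d.]+) (MiB|GiB) is free",
        "allocated": r"([\d.]+) (MiB|GiB) is allocated by PyTorch",
        "reserved_unallocated":
            r"([\d.]+) (MiB|GiB) is reserved by PyTorch but unallocated",
    }
    figures = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, message)
        if match:
            figures[name] = to_mib(*match.groups())
    return figures


info = parse_oom(OOM_LINE)
for name, mib in info.items():
    print(f"{name:>22}: {mib:10.2f} MiB")
```

In this log the 48 MiB request exceeds the 43 MiB reported free, while about 372 MiB is reserved by PyTorch but unallocated, which is the situation the expandable_segments hint in the message refers to.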

There are some OOM / VRAM leakage issues (e.g. #28230) reported on Qwen3 VL, but I am not sure whether my reproducer is directly related to them. At the very least, I think the VRAM budget for the Qwen3 vision tower is incorrect or not set.

I have tried several "folk remedies" such as --mm-processor-cache-gb 0 / --mm-encoder-tp-mode weights with no luck. Reducing --gpu-memory-utilization can "delay" the OOM, but the server still crashes after a few iterations.
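
As a rough illustration of why lowering --gpu-memory-utilization can only buy headroom, the sketch below computes the memory left outside the budget at a few settings. It assumes the flag is the fraction of each GPU's capacity that vLLM budgets for itself, and it uses the 79.19 GiB total capacity reported in the error log; budget_gib is a hypothetical helper.

```python
def budget_gib(total_gib: float, utilization: float) -> tuple[float, float]:
    """Return (budgeted GiB, headroom GiB) for one GPU."""
    budget = total_gib * utilization
    return budget, total_gib - budget


TOTAL_GIB = 79.19  # total capacity reported for GPU 7 in the error log

for utilization in (0.93, 0.90, 0.80):
    budget, headroom = budget_gib(TOTAL_GIB, utilization)
    print(f"utilization {utilization:.2f}: "
          f"budget {budget:.2f} GiB, headroom {headroom:.2f} GiB")
```

A fixed overshoot would disappear once the headroom exceeds it. An OOM that merely arrives later at lower settings, as reported here, is more consistent with memory use that grows across iterations.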

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

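The crash is a CUDA out-of-memory error raised inside a server worker while merging multimodal embeddings, so the relevant settings are all on the serving side. This page lists the issue as solved with two linked pull requests; upgrading to a vLLM build that includes them is likely the most direct fix. Until then, the options below may reduce peak GPU memory.

  1. Reduce GPU Memory Utilization: Decrease --gpu-memory-utilization (e.g., from 0.93 to 0.8) to leave more memory outside vLLM's budget. The reporter found that this delays the OOM but does not eliminate it.
  2. Keep the Processor Cache Disabled: The reproducer already sets --mm-processor-cache-gb 0, and the reporter saw no improvement from it. Raising the value enlarges the cache and is unlikely to reduce memory use.
  3. Keep Tensor Parallel Size at 8: Reducing --tensor-parallel-size gives each GPU a larger shard of the weights, which increases per-GPU memory use.
  4. Skip Gradient Checkpointing: It is a training technique that recomputes activations during the backward pass and does not apply to inference serving.
  5. Reduce Per-Step Work: Increasing the batch size raises peak activation memory. Lowering --max-num-batched-tokens or the image pixel budget in --mm-processor-kwargs.size may instead shrink the tensors handled in each step.
  6. Try the Allocator Hint: The PyTorch error message suggests setting PYTORCH_ALLOC_CONF=expandable_segments:True on the server to reduce fragmentation; whether it helps here is untested.

Example configuration changes (only --gpu-memory-utilization differs from the reproducer):

vllm serve Qwen/Qwen3-VL-235B-A22B-Instruct \
  --port 8080 \
  --gpu-memory-utilization 0.8 \
  --tensor-parallel-size 8 \
  --decode-context-parallel-size 2 \
  --enable-expert-parallel \
  --mm-encoder-tp-mode data \
  --mm-processor-cache-type shm \
  --mm-processor-cache-gb 0 \
  --max-model-len 262144 \
  --max-num-batched-tokens 16384 \
  --mm-processor-kwargs.size '{"longest_edge":4194304,"shortest_edge":16384}'

No client-side change is needed. The client runs in a separate process from the vLLM workers, so calling torch.cuda.set_per_process_memory_fraction(0.8) in the client script does not affect server memory; that call caps a process's CUDA allocator and is unrelated to gradient checkpointing.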

Verification

To verify the fix, run the client request script with the updated configuration and monitor the GPU memory usage. If the OOM issue persists, further adjustments to the configuration settings may be necessary.
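
One way to monitor GPU memory during the run is to poll nvidia-smi on the server host. The sketch below parses its CSV output; parse_gpu_memory and read_gpu_memory are hypothetical helpers, and the sample text uses made-up numbers purely to show the format.

```python
import subprocess

# Query per-GPU used and total memory in MiB, without header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text: str) -> list[tuple[int, int, int]]:
    """Parse 'index, used, total' rows into (index, used_mib, total_mib)."""
    rows = []
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field.strip()) for field in line.split(","))
        rows.append((index, used, total))
    return rows


def read_gpu_memory() -> list[tuple[int, int, int]]:
    """Run nvidia-smi once and return the parsed rows."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


# Made-up sample output, used here only to illustrate the format.
sample = "0, 74210, 81559\n7, 81020, 81559\n"
for index, used, total in parse_gpu_memory(sample):
    print(f"GPU {index}: {used}/{total} MiB ({100 * used / total:.1f}%)")
```

Calling read_gpu_memory() in a loop while the client script runs shows whether usage keeps climbing across turns or stays flat after the configuration change.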

Extra Tips

  • Cleaning up temporary files and on-disk cache directories frees disk space but does not affect GPU memory, so it will not resolve this OOM.
  • If the workload allows, consider a smaller model variant, which leaves more memory headroom on each GPU.
  • The reproducer already spreads the model across eight 80 GB H100 GPUs, so adding capacity means moving to GPUs with more memory or to more GPUs.
