vllm - [Bug]: TP=2 DP=2 Broken for Qwen3-Next W4A16 [1 comment, 2 participants]

GitHub stats
vllm-project/vllm#36793 (fetched 2026-04-08 00:34:39)
Comments: 1
Participants: 2
Timeline: 4
Reactions: 0
Timeline (top): renamed ×2, commented ×1, labeled ×1

Error Message

(EngineCore_DP2 pid=290892) ERROR 03-11 13:25:11 [core.py:948] torch.AcceleratorError: CUDA error: an illegal memory access was encountered
(EngineCore_DP2 pid=290892) ERROR 03-11 13:25:11 [core.py:948] Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore_DP2 pid=290892) ERROR 03-11 13:25:11 [core.py:948] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP2 pid=290892) ERROR 03-11 13:25:11 [core.py:948] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_DP2 pid=290892) ERROR 03-11 13:25:11 [core.py:948] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.


Code Example

Collecting environment information...
uv is set
==============================
        System Info
==============================
OS                           : Red Hat Enterprise Linux 9.6 (Plow) (x86_64)
GCC version                  : (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5)
Clang version                : Could not collect
CMake version                : version 3.26.5
Libc version                 : glibc-2.34

==============================
       PyTorch Info
==============================
PyTorch version              : 2.10.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.9 (main, Aug 14 2025, 00:00:00) [GCC 11.5.0 20240719 (Red Hat 11.5.0-5)] (64-bit runtime)
Python platform              : Linux-5.14.0-570.58.1.el9_6.x86_64-x86_64-with-glibc2.34

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.9.86
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : 
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB

Nvidia driver version        : 575.57.08
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           46 bits physical, 57 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  80
On-line CPU(s) list:                     0-79
Vendor ID:                               GenuineIntel
Model name:                              Intel Xeon Processor (Icelake)
CPU family:                              6
Model:                                   134
Thread(s) per core:                      2
Core(s) per socket:                      20
Socket(s):                               2
Stepping:                                0
BogoMIPS:                                5600.03
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid fsrm md_clear arch_capabilities indirect_thunk_its
Virtualization:                          VT-x
Hypervisor vendor:                       KVM
Virtualization type:                     full
L1d cache:                               2.5 MiB (80 instances)
L1i cache:                               2.5 MiB (80 instances)
L2 cache:                                160 MiB (40 instances)
L3 cache:                                32 MiB (2 instances)
NUMA node(s):                            2
NUMA node0 CPU(s):                       0-39
NUMA node1 CPU(s):                       40-79
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Mitigation; Aligned branch/return thunks
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Reg file data sampling:    Vulnerable: No microcode
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS Not affected; BHI SW loop, KVM SW loop
Vulnerability Srbds:                     Not affected
Vulnerability Tsx async abort:           Not affected

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.4
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.19.0
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.4.1
[pip3] nvidia-cutlass-dsl-libs-base==4.4.1
[pip3] nvidia-ml-py==13.590.48
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvshmem-cu12==3.4.5
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] pyzmq==27.1.0
[pip3] torch==2.10.0
[pip3] torch-c-dlpack-ext==0.1.5
[pip3] torchaudio==2.10.0
[pip3] torchvision==0.25.0
[pip3] transformers==4.57.6
[pip3] triton==3.6.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.17.0
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
  	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	NIC0	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV12	NV12	NV12	NV12	NV12	NV12	NV12	PIX	0-39	0		N/A
GPU1	NV12	 X 	NV12	NV12	NV12	NV12	NV12	NV12	PIX	0-39	0		N/A
GPU2	NV12	NV12	 X 	NV12	NV12	NV12	NV12	NV12	NODE	0-39	0		N/A
GPU3	NV12	NV12	NV12	 X 	NV12	NV12	NV12	NV12	NODE	0-39	0		N/A
GPU4	NV12	NV12	NV12	NV12	 X 	NV12	NV12	NV12	SYS	40-79	1		N/A
GPU5	NV12	NV12	NV12	NV12	NV12	 X 	NV12	NV12	SYS	40-79	1		N/A
GPU6	NV12	NV12	NV12	NV12	NV12	NV12	 X 	NV12	SYS	40-79	1		N/A
GPU7	NV12	NV12	NV12	NV12	NV12	NV12	NV12	 X 	SYS	40-79	1		N/A
NIC0	PIX	PIX	NODE	NODE	SYS	SYS	SYS	SYS	 X 				

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0

==============================
     Environment Variables
==============================
LD_LIBRARY_PATH=:/usr/local/cuda-12.9/lib64
CUDA_HOME=/usr/local/cuda-12.9
CUDA_HOME=/usr/local/cuda-12.9
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_Chibukach

---

Collecting environment information...
uv is set
==============================
        System Info
==============================
OS                           : Red Hat Enterprise Linux 9.6 (Plow) (x86_64)
GCC version                  : (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5)
Clang version                : Could not collect
CMake version                : version 3.26.5
Libc version                 : glibc-2.34

==============================
       PyTorch Info
==============================
PyTorch version              : 2.9.1+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.9 (main, Aug 14 2025, 00:00:00) [GCC 11.5.0 20240719 (Red Hat 11.5.0-5)] (64-bit runtime)
Python platform              : Linux-5.14.0-570.58.1.el9_6.x86_64-x86_64-with-glibc2.34

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.9.86
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : 
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB

Nvidia driver version        : 575.57.08
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           46 bits physical, 57 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  80
On-line CPU(s) list:                     0-79
Vendor ID:                               GenuineIntel
Model name:                              Intel Xeon Processor (Icelake)
CPU family:                              6
Model:                                   134
Thread(s) per core:                      2
Core(s) per socket:                      20
Socket(s):                               2
Stepping:                                0
BogoMIPS:                                5600.03
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid fsrm md_clear arch_capabilities indirect_thunk_its
Virtualization:                          VT-x
Hypervisor vendor:                       KVM
Virtualization type:                     full
L1d cache:                               2.5 MiB (80 instances)
L1i cache:                               2.5 MiB (80 instances)
L2 cache:                                160 MiB (40 instances)
L3 cache:                                32 MiB (2 instances)
NUMA node(s):                            2
NUMA node0 CPU(s):                       0-39
NUMA node1 CPU(s):                       40-79
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Mitigation; Aligned branch/return thunks
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Reg file data sampling:    Vulnerable: No microcode
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS Not affected; BHI SW loop, KVM SW loop
Vulnerability Srbds:                     Not affected
Vulnerability Tsx async abort:           Not affected

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.1
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.4.0
[pip3] nvidia-cutlass-dsl-libs-base==4.4.0
[pip3] nvidia-ml-py==13.590.48
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvshmem-cu12==3.3.20
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] pyzmq==27.1.0
[pip3] torch==2.9.1
[pip3] torchaudio==2.9.1
[pip3] torchvision==0.24.1
[pip3] transformers==5.1.0
[pip3] triton==3.5.1
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.15.0
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
  	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	NIC0	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV12	NV12	NV12	NV12	NV12	NV12	NV12	PIX	0-39	0		N/A
GPU1	NV12	 X 	NV12	NV12	NV12	NV12	NV12	NV12	PIX	0-39	0		N/A
GPU2	NV12	NV12	 X 	NV12	NV12	NV12	NV12	NV12	NODE	0-39	0		N/A
GPU3	NV12	NV12	NV12	 X 	NV12	NV12	NV12	NV12	NODE	0-39	0		N/A
GPU4	NV12	NV12	NV12	NV12	 X 	NV12	NV12	NV12	SYS	40-79	1		N/A
GPU5	NV12	NV12	NV12	NV12	NV12	 X 	NV12	NV12	SYS	40-79	1		N/A
GPU6	NV12	NV12	NV12	NV12	NV12	NV12	 X 	NV12	SYS	40-79	1		N/A
GPU7	NV12	NV12	NV12	NV12	NV12	NV12	NV12	 X 	SYS	40-79	1		N/A
NIC0	PIX	PIX	NODE	NODE	SYS	SYS	SYS	SYS	 X 				

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0

==============================
     Environment Variables
==============================
LD_LIBRARY_PATH=:/usr/local/cuda-12.9/lib64
CUDA_HOME=/usr/local/cuda-12.9
CUDA_HOME=/usr/local/cuda-12.9
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
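
The two environment reports above (vLLM 0.17.0 and vLLM 0.15.0) differ in several package versions, which is easy to miss when scanning them by eye. The following is a minimal sketch, not part of the original issue, that parses the `[pip3]` lines of two `collect_env.py` reports and lists the packages whose versions differ. The function names and the shortened sample reports are illustrative assumptions; the version strings in the samples are copied from the reports above.

```python
import re

# Matches lines such as "[pip3] torch==2.10.0" from collect_env.py output.
PIP_LINE = re.compile(r"^\[pip3\]\s+(?P<name>[^=\s]+)==(?P<version>\S+)$")


def parse_pip_versions(report: str) -> dict:
    """Return {package: version} for every [pip3] line in a report."""
    versions = {}
    for line in report.splitlines():
        match = PIP_LINE.match(line.strip())
        if match:
            versions[match.group("name")] = match.group("version")
    return versions


def diff_versions(first: dict, second: dict) -> dict:
    """Return {package: (version_in_first, version_in_second)} where they differ.

    A package missing from one report is reported with None on that side.
    """
    changed = {}
    for name in sorted(set(first) | set(second)):
        if first.get(name) != second.get(name):
            changed[name] = (first.get(name), second.get(name))
    return changed


# Shortened excerpts of the two reports (hypothetical variable names).
REPORT_A = """\
[pip3] flashinfer-python==0.6.4
[pip3] numpy==2.2.6
[pip3] torch==2.10.0
[pip3] torch-c-dlpack-ext==0.1.5
"""

REPORT_B = """\
[pip3] flashinfer-python==0.6.1
[pip3] numpy==2.2.6
[pip3] torch==2.9.1
"""

for package, (a, b) in diff_versions(
    parse_pip_versions(REPORT_A), parse_pip_versions(REPORT_B)
).items():
    print(f"{package}: {a} vs {b}")
```

Running the same functions on the full reports would also surface the differences in `transformers`, `triton`, `nvidia-nvshmem-cu12`, and the other packages listed above.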

---

chg run --gpus 4 -- vllm serve inference-optimization/Qwen3-Coder-Next.w4a16 --enable-auto-tool-choice --tool-call-parser qwen3_coder --data-parallel-size 2 --tensor-parallel-size 2

---

(EngineCore_DP2 pid=290892) ERROR 03-11 13:25:11 [core.py:948] torch.AcceleratorError: CUDA error: an illegal memory access was encountered
(EngineCore_DP2 pid=290892) ERROR 03-11 13:25:11 [core.py:948] Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore_DP2 pid=290892) ERROR 03-11 13:25:11 [core.py:948] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP2 pid=290892) ERROR 03-11 13:25:11 [core.py:948] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_DP2 pid=290892) ERROR 03-11 13:25:11 [core.py:948] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
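
The log above warns that CUDA errors may be reported asynchronously and suggests `CUDA_LAUNCH_BLOCKING=1` for debugging. The following is a minimal sketch, not taken from the issue, of preparing the same `vllm serve` invocation with that variable set so the reported stack trace is more likely to point at the failing kernel. The helper name `build_debug_launch` is hypothetical, and the reporter's `chg run --gpus 4 --` wrapper is left out because its behaviour is not described here.

```python
import os
import shlex

# The serve command from the report, without the "chg run --gpus 4 --" wrapper.
SERVE_CMD = (
    "vllm serve inference-optimization/Qwen3-Coder-Next.w4a16 "
    "--enable-auto-tool-choice --tool-call-parser qwen3_coder "
    "--data-parallel-size 2 --tensor-parallel-size 2"
)


def build_debug_launch(cmd: str, base_env=None):
    """Return (argv, env) for running cmd with synchronous CUDA kernel launches.

    base_env defaults to a copy of the current process environment.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Synchronous launches make the Python stack trace line up with the
    # kernel that actually faulted, at the cost of much slower execution.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return shlex.split(cmd), env


argv, env = build_debug_launch(SERVE_CMD)
print("CUDA_LAUNCH_BLOCKING=" + env["CUDA_LAUNCH_BLOCKING"], shlex.join(argv))
# To actually launch the server:
#   import subprocess
#   subprocess.run(argv, env=env, check=False)
```

This setting is meant for debugging only; it serializes kernel launches and is not something to leave on in a serving deployment.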

---

(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880] WorkerProc hit an exception.
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880] Traceback (most recent call last):
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 875, in worker_busy_loop
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     output = func(*args, **kwargs)
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]              ^^^^^^^^^^^^^^^^^^^^^
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     return func(*args, **kwargs)
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]            ^^^^^^^^^^^^^^^^^^^^^
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 390, in determine_available_memory
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     self.model_runner.profile_run()
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5282, in profile_run
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     hidden_states, last_hidden_states = self._dummy_run(
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]                                         ^^^^^^^^^^^^^^^^
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     return func(*args, **kwargs)
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]            ^^^^^^^^^^^^^^^^^^^^^
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4976, in _dummy_run
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     outputs = self.model(
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]               ^^^^^^^^^^^
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/vllm/compilation/cuda_graph.py", line 223, in __call__
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     return self.runnable(*args, **kwargs)
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     return self._call_impl(*args, **kwargs)
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     return forward_call(*args, **kwargs)
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 1377, in forward
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     hidden_states = self.model(
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]                     ^^^^^^^^^^^
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/vllm/compilation/decorators.py", line 563, in __call__
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     output = self.aot_compiled_fn(self, *args, **kwargs)
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/torch/_dynamo/aot_compile.py", line 124, in __call__
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     return self.fn(*args, **kwargs)
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]            ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 1132, in forward
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     def forward(
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/vllm/compilation/caching.py", line 198, in __call__
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     return self.optimized_call(*args, **kwargs)
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/torch/fx/graph_module.py", line 936, in call_wrapped
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     return self._wrapped_call(self, *args, **kwargs)
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/torch/fx/graph_module.py", line 455, in __call__
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     raise e
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/torch/fx/graph_module.py", line 442, in __call__
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     return self._call_impl(*args, **kwargs)
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     return forward_call(*args, **kwargs)
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "<eval_with_key>.99", line 418, in forward
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     submod_0 = self.submod_0(l_input_ids_, s72, l_self_modules_embed_tokens_parameters_weight_, l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_, l_self_modules_layers_modules_0_modules_linear_attn_modules_in_proj_qkvz_parameters_weight_, l_self_modules_layers_modules_0_modules_linear_attn_modules_in_proj_ba_parameters_weight_);  l_input_ids_ = l_self_modules_embed_tokens_parameters_weight_ = l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_0_modules_linear_attn_modules_in_proj_qkvz_parameters_weight_ = l_self_modules_layers_modules_0_modules_linear_attn_modules_in_proj_ba_parameters_weight_ = None

Your current environment

<details> VLLM 16 <summary>The output of <code>python collect_env.py</code></summary>
Collecting environment information...
uv is set
==============================
        System Info
==============================
OS                           : Red Hat Enterprise Linux 9.6 (Plow) (x86_64)
GCC version                  : (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5)
Clang version                : Could not collect
CMake version                : version 3.26.5
Libc version                 : glibc-2.34

==============================
       PyTorch Info
==============================
PyTorch version              : 2.10.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.9 (main, Aug 14 2025, 00:00:00) [GCC 11.5.0 20240719 (Red Hat 11.5.0-5)] (64-bit runtime)
Python platform              : Linux-5.14.0-570.58.1.el9_6.x86_64-x86_64-with-glibc2.34

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.9.86
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : 
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB

Nvidia driver version        : 575.57.08
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           46 bits physical, 57 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  80
On-line CPU(s) list:                     0-79
Vendor ID:                               GenuineIntel
Model name:                              Intel Xeon Processor (Icelake)
CPU family:                              6
Model:                                   134
Thread(s) per core:                      2
Core(s) per socket:                      20
Socket(s):                               2
Stepping:                                0
BogoMIPS:                                5600.03
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid fsrm md_clear arch_capabilities indirect_thunk_its
Virtualization:                          VT-x
Hypervisor vendor:                       KVM
Virtualization type:                     full
L1d cache:                               2.5 MiB (80 instances)
L1i cache:                               2.5 MiB (80 instances)
L2 cache:                                160 MiB (40 instances)
L3 cache:                                32 MiB (2 instances)
NUMA node(s):                            2
NUMA node0 CPU(s):                       0-39
NUMA node1 CPU(s):                       40-79
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Mitigation; Aligned branch/return thunks
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Reg file data sampling:    Vulnerable: No microcode
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS Not affected; BHI SW loop, KVM SW loop
Vulnerability Srbds:                     Not affected
Vulnerability Tsx async abort:           Not affected

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.4
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.19.0
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.4.1
[pip3] nvidia-cutlass-dsl-libs-base==4.4.1
[pip3] nvidia-ml-py==13.590.48
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvshmem-cu12==3.4.5
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] pyzmq==27.1.0
[pip3] torch==2.10.0
[pip3] torch-c-dlpack-ext==0.1.5
[pip3] torchaudio==2.10.0
[pip3] torchvision==0.25.0
[pip3] transformers==4.57.6
[pip3] triton==3.6.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.17.0
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
  	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	NIC0	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV12	NV12	NV12	NV12	NV12	NV12	NV12	PIX	0-39	0		N/A
GPU1	NV12	 X 	NV12	NV12	NV12	NV12	NV12	NV12	PIX	0-39	0		N/A
GPU2	NV12	NV12	 X 	NV12	NV12	NV12	NV12	NV12	NODE	0-39	0		N/A
GPU3	NV12	NV12	NV12	 X 	NV12	NV12	NV12	NV12	NODE	0-39	0		N/A
GPU4	NV12	NV12	NV12	NV12	 X 	NV12	NV12	NV12	SYS	40-79	1		N/A
GPU5	NV12	NV12	NV12	NV12	NV12	 X 	NV12	NV12	SYS	40-79	1		N/A
GPU6	NV12	NV12	NV12	NV12	NV12	NV12	 X 	NV12	SYS	40-79	1		N/A
GPU7	NV12	NV12	NV12	NV12	NV12	NV12	NV12	 X 	SYS	40-79	1		N/A
NIC0	PIX	PIX	NODE	NODE	SYS	SYS	SYS	SYS	 X 				

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0

==============================
     Environment Variables
==============================
LD_LIBRARY_PATH=:/usr/local/cuda-12.9/lib64
CUDA_HOME=/usr/local/cuda-12.9
CUDA_HOME=/usr/local/cuda-12.9
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_Chibukach
</details> <details> VLLM 15 <summary>The output of <code>python collect_env.py</code></summary>
Collecting environment information...
uv is set
==============================
        System Info
==============================
OS                           : Red Hat Enterprise Linux 9.6 (Plow) (x86_64)
GCC version                  : (GCC) 11.5.0 20240719 (Red Hat 11.5.0-5)
Clang version                : Could not collect
CMake version                : version 3.26.5
Libc version                 : glibc-2.34

==============================
       PyTorch Info
==============================
PyTorch version              : 2.9.1+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.9 (main, Aug 14 2025, 00:00:00) [GCC 11.5.0 20240719 (Red Hat 11.5.0-5)] (64-bit runtime)
Python platform              : Linux-5.14.0-570.58.1.el9_6.x86_64-x86_64-with-glibc2.34

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.9.86
CUDA_MODULE_LOADING set to   : 
GPU models and configuration : 
GPU 0: NVIDIA A100-SXM4-80GB
GPU 1: NVIDIA A100-SXM4-80GB
GPU 2: NVIDIA A100-SXM4-80GB
GPU 3: NVIDIA A100-SXM4-80GB
GPU 4: NVIDIA A100-SXM4-80GB
GPU 5: NVIDIA A100-SXM4-80GB
GPU 6: NVIDIA A100-SXM4-80GB
GPU 7: NVIDIA A100-SXM4-80GB

Nvidia driver version        : 575.57.08
cuDNN version                : Could not collect
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           46 bits physical, 57 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  80
On-line CPU(s) list:                     0-79
Vendor ID:                               GenuineIntel
Model name:                              Intel Xeon Processor (Icelake)
CPU family:                              6
Model:                                   134
Thread(s) per core:                      2
Core(s) per socket:                      20
Socket(s):                               2
Stepping:                                0
BogoMIPS:                                5600.03
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid fsrm md_clear arch_capabilities indirect_thunk_its
Virtualization:                          VT-x
Hypervisor vendor:                       KVM
Virtualization type:                     full
L1d cache:                               2.5 MiB (80 instances)
L1i cache:                               2.5 MiB (80 instances)
L2 cache:                                160 MiB (40 instances)
L3 cache:                                32 MiB (2 instances)
NUMA node(s):                            2
NUMA node0 CPU(s):                       0-39
NUMA node1 CPU(s):                       40-79
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Mitigation; Aligned branch/return thunks
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Reg file data sampling:    Vulnerable: No microcode
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS Not affected; BHI SW loop, KVM SW loop
Vulnerability Srbds:                     Not affected
Vulnerability Tsx async abort:           Not affected

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.1
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.4.0
[pip3] nvidia-cutlass-dsl-libs-base==4.4.0
[pip3] nvidia-ml-py==13.590.48
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvshmem-cu12==3.3.20
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] pyzmq==27.1.0
[pip3] torch==2.9.1
[pip3] torchaudio==2.9.1
[pip3] torchvision==0.24.1
[pip3] transformers==5.1.0
[pip3] triton==3.5.1
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.15.0
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
  	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	NIC0	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	NV12	NV12	NV12	NV12	NV12	NV12	NV12	PIX	0-39	0		N/A
GPU1	NV12	 X 	NV12	NV12	NV12	NV12	NV12	NV12	PIX	0-39	0		N/A
GPU2	NV12	NV12	 X 	NV12	NV12	NV12	NV12	NV12	NODE	0-39	0		N/A
GPU3	NV12	NV12	NV12	 X 	NV12	NV12	NV12	NV12	NODE	0-39	0		N/A
GPU4	NV12	NV12	NV12	NV12	 X 	NV12	NV12	NV12	SYS	40-79	1		N/A
GPU5	NV12	NV12	NV12	NV12	NV12	 X 	NV12	NV12	SYS	40-79	1		N/A
GPU6	NV12	NV12	NV12	NV12	NV12	NV12	 X 	NV12	SYS	40-79	1		N/A
GPU7	NV12	NV12	NV12	NV12	NV12	NV12	NV12	 X 	SYS	40-79	1		N/A
NIC0	PIX	PIX	NODE	NODE	SYS	SYS	SYS	SYS	 X 				

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0

==============================
     Environment Variables
==============================
LD_LIBRARY_PATH=:/usr/local/cuda-12.9/lib64
CUDA_HOME=/usr/local/cuda-12.9
CUDA_HOME=/usr/local/cuda-12.9
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
</details>

🐛 Describe the bug

I am unable to use the --data-parallel-size flag with vllm serve. The command is:

chg run --gpus 4 --  vllm  serve inference-optimization/Qwen3-Coder-Next.w4a16 --enable-auto-tool-choice --tool-call-parser qwen3_coder --data-parallel-size 2 --tensor-parallel-size 2

With vLLM 15, vllm serve starts, but it fails on inference:

(EngineCore_DP2 pid=290892) ERROR 03-11 13:25:11 [core.py:948] torch.AcceleratorError: CUDA error: an illegal memory access was encountered
(EngineCore_DP2 pid=290892) ERROR 03-11 13:25:11 [core.py:948] Search for `cudaErrorIllegalAddress' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
(EngineCore_DP2 pid=290892) ERROR 03-11 13:25:11 [core.py:948] CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
(EngineCore_DP2 pid=290892) ERROR 03-11 13:25:11 [core.py:948] For debugging consider passing CUDA_LAUNCH_BLOCKING=1
(EngineCore_DP2 pid=290892) ERROR 03-11 13:25:11 [core.py:948] Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

With vLLM 16, vllm serve does not start at all, for both MoE and non-MoE models, and fails with CUDA errors:

(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880] WorkerProc hit an exception.
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880] Traceback (most recent call last):
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 875, in worker_busy_loop
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     output = func(*args, **kwargs)
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]              ^^^^^^^^^^^^^^^^^^^^^
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     return func(*args, **kwargs)
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]            ^^^^^^^^^^^^^^^^^^^^^
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 390, in determine_available_memory
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     self.model_runner.profile_run()
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5282, in profile_run
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     hidden_states, last_hidden_states = self._dummy_run(
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]                                         ^^^^^^^^^^^^^^^^
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     return func(*args, **kwargs)
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]            ^^^^^^^^^^^^^^^^^^^^^
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4976, in _dummy_run
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     outputs = self.model(
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]               ^^^^^^^^^^^
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/vllm/compilation/cuda_graph.py", line 223, in __call__
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     return self.runnable(*args, **kwargs)
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     return self._call_impl(*args, **kwargs)
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     return forward_call(*args, **kwargs)
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 1377, in forward
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     hidden_states = self.model(
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]                     ^^^^^^^^^^^
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/vllm/compilation/decorators.py", line 563, in __call__
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     output = self.aot_compiled_fn(self, *args, **kwargs)
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/torch/_dynamo/aot_compile.py", line 124, in __call__
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     return self.fn(*args, **kwargs)
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]            ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/vllm/model_executor/models/qwen3_next.py", line 1132, in forward
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     def forward(
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/vllm/compilation/caching.py", line 198, in __call__
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     return self.optimized_call(*args, **kwargs)
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/torch/fx/graph_module.py", line 936, in call_wrapped
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     return self._wrapped_call(self, *args, **kwargs)
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/torch/fx/graph_module.py", line 455, in __call__
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     raise e
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/torch/fx/graph_module.py", line 442, in __call__
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     return self._call_impl(*args, **kwargs)
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "/home/Chibukach/16VLLM/.venv/lib64/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     return forward_call(*args, **kwargs)
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]   File "<eval_with_key>.99", line 418, in forward
(Worker pid=307623) (Worker_DP0_TP0 pid=307623) ERROR 03-11 13:50:40 [multiproc_executor.py:880]     submod_0 = self.submod_0(l_input_ids_, s72, l_self_modules_embed_tokens_parameters_weight_, l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_, l_self_modules_layers_modules_0_modules_linear_attn_modules_in_proj_qkvz_parameters_weight_, l_self_modules_layers_modules_0_modules_linear_attn_modules_in_proj_ba_parameters_weight_);  l_input_ids_ = l_self_modules_embed_tokens_parameters_weight_ = l_self_modules_layers_modules_0_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_0_modules_linear_attn_modules_in_proj_qkvz_parameters_weight_ = l_self_modules_layers_modules_0_modules_linear_attn_modules_in_proj_ba_parameters_weight_ = None

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

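The failure appears when --data-parallel-size 2 is combined with --tensor-parallel-size 2 in vllm serve. The steps below are diagnostic: they help narrow down the trigger and are not a confirmed fix.

Here are the steps to follow:

  • Check the GPU topology and confirm that enough GPUs are allocated. The run needs --data-parallel-size × --tensor-parallel-size GPUs (2 × 2 = 4 here, which matches --gpus 4).
  • Ensure that the --tensor-parallel-size flag is set correctly. Running once without --data-parallel-size helps confirm whether data parallelism is the trigger.
  • If using vLLM 16, try setting the CUDA_LAUNCH_BLOCKING environment variable to 1, as the CUDA error message suggests. This makes kernel launches synchronous so the error is reported at the failing call; it is a debugging aid, not a fix.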
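
The GPU-count check in the first step can be sketched as follows. This is a minimal illustration, assuming each data-parallel replica gets its own tensor-parallel group, so the run needs DP × TP GPUs. The helper names `required_gpus` and `fits_allocation` are hypothetical and are not part of vLLM.

```python
def required_gpus(data_parallel_size: int, tensor_parallel_size: int) -> int:
    """GPUs needed when every data-parallel replica has its own tensor-parallel group."""
    if data_parallel_size < 1 or tensor_parallel_size < 1:
        raise ValueError("parallel sizes must be positive integers")
    return data_parallel_size * tensor_parallel_size


def fits_allocation(
    data_parallel_size: int, tensor_parallel_size: int, allocated_gpus: int
) -> bool:
    """True if the allocated GPU count covers the requested parallel layout."""
    return required_gpus(data_parallel_size, tensor_parallel_size) <= allocated_gpus


if __name__ == "__main__":
    # Values from the report: DP=2, TP=2, launched with --gpus 4.
    print(required_gpus(2, 2), fits_allocation(2, 2, 4))
```

With the values from the report, the required and allocated GPU counts agree, so the flags are at least consistent with the allocation.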

Example code:

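import os
import subprocess

# CUDA_LAUNCH_BLOCKING=1 is a debugging aid, not a fix: it makes kernel launches
# synchronous so the traceback points at the failing call.
env = dict(os.environ, CUDA_LAUNCH_BLOCKING="1")

# Same command as in the report, with the flags unchanged.
command = [
    "chg", "run", "--gpus", "4", "--", "vllm", "serve",
    "inference-optimization/Qwen3-Coder-Next.w4a16",
    "--enable-auto-tool-choice", "--tool-call-parser", "qwen3_coder",
    "--data-parallel-size", "2", "--tensor-parallel-size", "2",
]
subprocess.run(command, env=env, check=False)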

Note: The above code is just an example and may need to be adjusted based on the specific use case.

Verification

To verify a change, run the vllm serve command and confirm that it starts without errors, then send at least one inference request. In the report, vLLM 15 started successfully and only failed on inference, so a clean startup alone is not sufficient.
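
A request-level check could look like the sketch below. It assumes the server exposes the OpenAI-compatible `/v1/completions` route at `http://localhost:8000`; the host and port in the reporter's setup are not stated and may differ. The function names are hypothetical, and only the Python standard library is used.

```python
import json
import urllib.request


def build_completion_payload(model: str, prompt: str, max_tokens: int = 16) -> dict:
    """Minimal body for an OpenAI-style completion request."""
    return {"model": model, "prompt": prompt, "max_tokens": max_tokens}


def build_request(base_url: str, payload: dict) -> urllib.request.Request:
    """Prepare (but do not send) a POST request to the completions route."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_completion(base_url: str, payload: dict, timeout: float = 60.0) -> dict:
    """Send the request and return the decoded JSON response."""
    request = build_request(base_url, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `send_completion("http://localhost:8000", build_completion_payload("inference-optimization/Qwen3-Coder-Next.w4a16", "Hello"))` against a running server exercises the inference path. If the illegal memory access persists, the call is expected to fail or time out rather than return a completion.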

Extra Tips

  • The GPU topology in the report shows all eight A100s connected via NV12 NVLink, with GPUs 0-3 on NUMA node 0 and GPUs 4-7 on NUMA node 1. Check which four GPUs the job is actually given.
  • CUDA_LAUNCH_BLOCKING=1 slows execution, so use it only while debugging and unset it afterwards.
  • Refer to the vLLM documentation for more information on data parallelism and tensor parallelism settings.
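
To narrow down the trigger, it can help to run a small set of command variants and compare the results. The sketch below only builds the argument lists and does not launch anything. It omits the reporter's `chg run --gpus 4 --` launcher prefix, and the choice of variants is an assumption rather than something taken from the issue thread. `--enforce-eager` is a vLLM flag that disables CUDA graph capture; whether it avoids this particular crash is untested here. The helper names are hypothetical.

```python
BASE_COMMAND = [
    "vllm", "serve", "inference-optimization/Qwen3-Coder-Next.w4a16",
    "--enable-auto-tool-choice", "--tool-call-parser", "qwen3_coder",
]


def serve_command(
    data_parallel_size: int, tensor_parallel_size: int, enforce_eager: bool = False
) -> list[str]:
    """Build a vllm serve argument list for one parallel layout."""
    command = list(BASE_COMMAND)
    if data_parallel_size > 1:
        command += ["--data-parallel-size", str(data_parallel_size)]
    if tensor_parallel_size > 1:
        command += ["--tensor-parallel-size", str(tensor_parallel_size)]
    if enforce_eager:
        command.append("--enforce-eager")
    return command


def isolation_variants() -> dict[str, list[str]]:
    """Variants to compare: the reported layout, TP only, and eager mode."""
    return {
        "reported (TP=2, DP=2)": serve_command(2, 2),
        "tensor parallel only (TP=2)": serve_command(1, 2),
        "reported, eager mode": serve_command(2, 2, enforce_eager=True),
    }


if __name__ == "__main__":
    for label, command in isolation_variants().items():
        print(f"{label}: {' '.join(command)}")
```

If only the first variant fails, that points at the interaction between data parallelism and the compiled or graph-captured path, which is the part of the stack the traceback passes through (vllm/compilation/cuda_graph.py and torch/_dynamo/aot_compile.py).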
