vllm - 💡(How to fix) Fix [Bug]: Qwen3.5 397B GPTQ model outputs all exclamation points on ROCM [11 comments, 5 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
vllm-project/vllm#37996Fetched 2026-04-08 01:22:04
View on GitHub
Comments
11
Participants
5
Timeline
23
Reactions
0
Author
Timeline (top)
commented ×11mentioned ×4subscribed ×4labeled ×2

Error Message

I tried smaller models like the Qwen 122B model, and no decode error found. The two models have the same structure, so it is confusing.

Fix Action

Fix / Workaround

============================== CPU Info

Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Address sizes: 43 bits physical, 48 bits virtual Byte Order: Little Endian CPU(s): 32 On-line CPU(s) list: 0-31 Vendor ID: AuthenticAMD Model name: AMD EPYC 7282 16-Core Processor CPU family: 23 Model: 49 Thread(s) per core: 1 Core(s) per socket: 16 Socket(s): 2 Stepping: 0 Frequency boost: enabled CPU max MHz: 2800.0000 CPU min MHz: 1500.0000 BogoMIPS: 5599.86 Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sev sev_es ibpb_exit_to_user Virtualization: AMD-V L1d cache: 1 MiB (32 instances) L1i cache: 1 MiB (32 instances) L2 cache: 16 MiB (32 instances) L3 cache: 128 MiB (8 instances) NUMA node(s): 2 NUMA node0 CPU(s): 0-15 NUMA node1 CPU(s): 16-31 Vulnerability Gather data sampling: Not affected Vulnerability Itlb multihit: Not affected Vulnerability L1tf: Not affected Vulnerability Mds: Not affected Vulnerability Meltdown: Not affected Vulnerability Mmio stale data: Not affected Vulnerability Reg file data sampling: Not affected Vulnerability Retbleed: Mitigation; untrained return thunk; SMT disabled Vulnerability Spec rstack overflow: Mitigation; SMT disabled Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization Vulnerability Spectre v2: Mitigation; Retpolines; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected Vulnerability Srbds: Not affected Vulnerability Tsx async abort: Not affected Vulnerability Vmscape: Mitigation; IBPB before exit to userspace

Code Example

Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0
Clang version                : 20.0.0git (https://github.com/RadeonOpenCompute/llvm-project roc-7.0.0 25314 f4087f6b428f0e6f575ebac8a8a724dab123d06e)
CMake version                : version 3.31.10
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.9.1+git8907517
Is debug build               : False
CUDA used to build PyTorch   : N/A
ROCM used to build PyTorch   : 7.0.51831-a3e329ad8

==============================
      Python Environment
==============================
Python version               : 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0] (64-bit runtime)
Python platform              : Linux-6.8.0-87-generic-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : Could not collect
CUDA_MODULE_LOADING set to   : 
GPU models and configuration :  (gfx908:sramecc+:xnack-)
Nvidia driver version        : Could not collect
cuDNN version                : Could not collect
HIP runtime version          : 7.0.51831
MIOpen runtime version       : 3.5.0
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        43 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               32
On-line CPU(s) list:                  0-31
Vendor ID:                            AuthenticAMD
Model name:                           AMD EPYC 7282 16-Core Processor
CPU family:                           23
Model:                                49
Thread(s) per core:                   1
Core(s) per socket:                   16
Socket(s):                            2
Stepping:                             0
Frequency boost:                      enabled
CPU max MHz:                          2800.0000
CPU min MHz:                          1500.0000
BogoMIPS:                             5599.86
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sev sev_es ibpb_exit_to_user
Virtualization:                       AMD-V
L1d cache:                            1 MiB (32 instances)
L1i cache:                            1 MiB (32 instances)
L2 cache:                             16 MiB (32 instances)
L3 cache:                             128 MiB (8 instances)
NUMA node(s):                         2
NUMA node0 CPU(s):                    0-15
NUMA node1 CPU(s):                    16-31
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Mitigation; untrained return thunk; SMT disabled
Vulnerability Spec rstack overflow:   Mitigation; SMT disabled
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected
Vulnerability Vmscape:                Mitigation; IBPB before exit to userspace

==============================
Versions of relevant libraries
==============================
[pip3] conch-triton-kernels==1.2.1
[pip3] numpy==2.2.6
[pip3] onnx==1.19.0
[pip3] onnx-ir==0.2.0
[pip3] onnxscript==0.6.2
[pip3] onnxslim==0.1.86
[pip3] pyzmq==27.1.0
[pip3] torch==2.9.1+git8907517
[pip3] torchaudio==2.9.0+eaa9e4e
[pip3] torchvision==0.24.1+d801a34
[pip3] transformers==5.2.0
[pip3] triton==3.4.0
[pip3] triton_kernels==1.0.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : 7.0.51831-a3e329ad8
vLLM Version                 : 0.16.1rc1.dev151+gd3bab5eb0 (git sha: d3bab5eb0)
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
  ============================ ROCm System Management Interface ============================
================================ Weight between two GPUs =================================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            40           15           15           72           72           72           15           
GPU1   40           0            40           40           15           15           15           72           
GPU2   15           40           0            15           72           72           72           15           
GPU3   15           40           15           0            72           72           72           15           
GPU4   72           15           72           72           0            15           15           40           
GPU5   72           15           72           72           15           0            15           40           
GPU6   72           15           72           72           15           15           0            40           
GPU7   15           72           15           15           40           40           40           0            

================================= Hops between two GPUs ==================================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            2            1            1            3            3            3            1            
GPU1   2            0            2            2            1            1            1            3            
GPU2   1            2            0            1            3            3            3            1            
GPU3   1            2            1            0            3            3            3            1            
GPU4   3            1            3            3            0            1            1            2            
GPU5   3            1            3            3            1            0            1            2            
GPU6   3            1            3            3            1            1            0            2            
GPU7   1            3            1            1            2            2            2            0            

=============================== Link Type between two GPUs ===============================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            PCIE         XGMI         XGMI         PCIE         PCIE         PCIE         XGMI         
GPU1   PCIE         0            PCIE         PCIE         XGMI         XGMI         XGMI         PCIE         
GPU2   XGMI         PCIE         0            XGMI         PCIE         PCIE         PCIE         XGMI         
GPU3   XGMI         PCIE         XGMI         0            PCIE         PCIE         PCIE         XGMI         
GPU4   PCIE         XGMI         PCIE         PCIE         0            XGMI         XGMI         PCIE         
GPU5   PCIE         XGMI         PCIE         PCIE         XGMI         0            XGMI         PCIE         
GPU6   PCIE         XGMI         PCIE         PCIE         XGMI         XGMI         0            PCIE         
GPU7   XGMI         PCIE         XGMI         XGMI         PCIE         PCIE         PCIE         0            

======================================= Numa Nodes =======================================
GPU[0]		: (Topology) Numa Node: 0
GPU[0]		: (Topology) Numa Affinity: 0
GPU[1]		: (Topology) Numa Node: 0
GPU[1]		: (Topology) Numa Affinity: 0
GPU[2]		: (Topology) Numa Node: 0
GPU[2]		: (Topology) Numa Affinity: 0
GPU[3]		: (Topology) Numa Node: 0
GPU[3]		: (Topology) Numa Affinity: 0
GPU[4]		: (Topology) Numa Node: 1
GPU[4]		: (Topology) Numa Affinity: 1
GPU[5]		: (Topology) Numa Node: 1
GPU[5]		: (Topology) Numa Affinity: 1
GPU[6]		: (Topology) Numa Node: 1
GPU[6]		: (Topology) Numa Affinity: 1
GPU[7]		: (Topology) Numa Node: 1
GPU[7]		: (Topology) Numa Affinity: 1
================================== End of ROCm SMI Log ===================================

==============================
     Environment Variables
==============================
VLLM_ROCM_USE_AITER=1
CUDA_VISIBLE_DEVICES=1,4,5,6,0,2,3,7
CUDA_VISIBLE_DEVICES=1,4,5,6,0,2,3,7
PYTORCH_ROCM_ARCH=gfx908
MAX_JOBS=8
LD_LIBRARY_PATH=/opt/rocm/lib:/usr/local/lib:
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
RAW_BUFFERClick to expand / collapse

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0
Clang version                : 20.0.0git (https://github.com/RadeonOpenCompute/llvm-project roc-7.0.0 25314 f4087f6b428f0e6f575ebac8a8a724dab123d06e)
CMake version                : version 3.31.10
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.9.1+git8907517
Is debug build               : False
CUDA used to build PyTorch   : N/A
ROCM used to build PyTorch   : 7.0.51831-a3e329ad8

==============================
      Python Environment
==============================
Python version               : 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0] (64-bit runtime)
Python platform              : Linux-6.8.0-87-generic-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : Could not collect
CUDA_MODULE_LOADING set to   : 
GPU models and configuration :  (gfx908:sramecc+:xnack-)
Nvidia driver version        : Could not collect
cuDNN version                : Could not collect
HIP runtime version          : 7.0.51831
MIOpen runtime version       : 3.5.0
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        43 bits physical, 48 bits virtual
Byte Order:                           Little Endian
CPU(s):                               32
On-line CPU(s) list:                  0-31
Vendor ID:                            AuthenticAMD
Model name:                           AMD EPYC 7282 16-Core Processor
CPU family:                           23
Model:                                49
Thread(s) per core:                   1
Core(s) per socket:                   16
Socket(s):                            2
Stepping:                             0
Frequency boost:                      enabled
CPU max MHz:                          2800.0000
CPU min MHz:                          1500.0000
BogoMIPS:                             5599.86
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sev sev_es ibpb_exit_to_user
Virtualization:                       AMD-V
L1d cache:                            1 MiB (32 instances)
L1i cache:                            1 MiB (32 instances)
L2 cache:                             16 MiB (32 instances)
L3 cache:                             128 MiB (8 instances)
NUMA node(s):                         2
NUMA node0 CPU(s):                    0-15
NUMA node1 CPU(s):                    16-31
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Mitigation; untrained return thunk; SMT disabled
Vulnerability Spec rstack overflow:   Mitigation; SMT disabled
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Retpolines; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected
Vulnerability Vmscape:                Mitigation; IBPB before exit to userspace

==============================
Versions of relevant libraries
==============================
[pip3] conch-triton-kernels==1.2.1
[pip3] numpy==2.2.6
[pip3] onnx==1.19.0
[pip3] onnx-ir==0.2.0
[pip3] onnxscript==0.6.2
[pip3] onnxslim==0.1.86
[pip3] pyzmq==27.1.0
[pip3] torch==2.9.1+git8907517
[pip3] torchaudio==2.9.0+eaa9e4e
[pip3] torchvision==0.24.1+d801a34
[pip3] transformers==5.2.0
[pip3] triton==3.4.0
[pip3] triton_kernels==1.0.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : 7.0.51831-a3e329ad8
vLLM Version                 : 0.16.1rc1.dev151+gd3bab5eb0 (git sha: d3bab5eb0)
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
  ============================ ROCm System Management Interface ============================
================================ Weight between two GPUs =================================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            40           15           15           72           72           72           15           
GPU1   40           0            40           40           15           15           15           72           
GPU2   15           40           0            15           72           72           72           15           
GPU3   15           40           15           0            72           72           72           15           
GPU4   72           15           72           72           0            15           15           40           
GPU5   72           15           72           72           15           0            15           40           
GPU6   72           15           72           72           15           15           0            40           
GPU7   15           72           15           15           40           40           40           0            

================================= Hops between two GPUs ==================================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            2            1            1            3            3            3            1            
GPU1   2            0            2            2            1            1            1            3            
GPU2   1            2            0            1            3            3            3            1            
GPU3   1            2            1            0            3            3            3            1            
GPU4   3            1            3            3            0            1            1            2            
GPU5   3            1            3            3            1            0            1            2            
GPU6   3            1            3            3            1            1            0            2            
GPU7   1            3            1            1            2            2            2            0            

=============================== Link Type between two GPUs ===============================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            PCIE         XGMI         XGMI         PCIE         PCIE         PCIE         XGMI         
GPU1   PCIE         0            PCIE         PCIE         XGMI         XGMI         XGMI         PCIE         
GPU2   XGMI         PCIE         0            XGMI         PCIE         PCIE         PCIE         XGMI         
GPU3   XGMI         PCIE         XGMI         0            PCIE         PCIE         PCIE         XGMI         
GPU4   PCIE         XGMI         PCIE         PCIE         0            XGMI         XGMI         PCIE         
GPU5   PCIE         XGMI         PCIE         PCIE         XGMI         0            XGMI         PCIE         
GPU6   PCIE         XGMI         PCIE         PCIE         XGMI         XGMI         0            PCIE         
GPU7   XGMI         PCIE         XGMI         XGMI         PCIE         PCIE         PCIE         0            

======================================= Numa Nodes =======================================
GPU[0]		: (Topology) Numa Node: 0
GPU[0]		: (Topology) Numa Affinity: 0
GPU[1]		: (Topology) Numa Node: 0
GPU[1]		: (Topology) Numa Affinity: 0
GPU[2]		: (Topology) Numa Node: 0
GPU[2]		: (Topology) Numa Affinity: 0
GPU[3]		: (Topology) Numa Node: 0
GPU[3]		: (Topology) Numa Affinity: 0
GPU[4]		: (Topology) Numa Node: 1
GPU[4]		: (Topology) Numa Affinity: 1
GPU[5]		: (Topology) Numa Node: 1
GPU[5]		: (Topology) Numa Affinity: 1
GPU[6]		: (Topology) Numa Node: 1
GPU[6]		: (Topology) Numa Affinity: 1
GPU[7]		: (Topology) Numa Node: 1
GPU[7]		: (Topology) Numa Affinity: 1
================================== End of ROCm SMI Log ===================================

==============================
     Environment Variables
==============================
VLLM_ROCM_USE_AITER=1
CUDA_VISIBLE_DEVICES=1,4,5,6,0,2,3,7
CUDA_VISIBLE_DEVICES=1,4,5,6,0,2,3,7
PYTORCH_ROCM_ARCH=gfx908
MAX_JOBS=8
LD_LIBRARY_PATH=/opt/rocm/lib:/usr/local/lib:
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
</details>

🐛 Describe the bug

vllm only outputs exclamation points for the Qwen3.5 397B GPTQ model. Following is the log.

I tried smaller models like the Qwen 122B model, and no decode error found. The two models have the same structure, so it is confusing.

./vllm_397.sh WARNING 03-24 10:45:08 [gpt_oss_triton_kernels_moe.py:56] Using legacy triton_kernels on ROCm (APIServer pid=1) INFO 03-24 10:45:12 [utils.py:293] (APIServer pid=1) INFO 03-24 10:45:12 [utils.py:293] █ █ █▄ ▄█ (APIServer pid=1) INFO 03-24 10:45:12 [utils.py:293] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.16.1rc1.dev151+gd3bab5eb0 (APIServer pid=1) INFO 03-24 10:45:12 [utils.py:293] █▄█▀ █ █ █ █ model /models/Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/hash (APIServer pid=1) INFO 03-24 10:45:12 [utils.py:293] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀ (APIServer pid=1) INFO 03-24 10:45:12 [utils.py:293] (APIServer pid=1) INFO 03-24 10:45:12 [utils.py:229] non-default args: {'model_tag': '/models/Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/hash', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'api_key': ['xxxxxx'], 'model': '/models/Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/hash', 'dtype': 'float16', 'max_model_len': 8000, 'enforce_eager': True, 'served_model_name': ['qwen3.5'], 'pipeline_parallel_size': 2, 'tensor_parallel_size': 4, 'limit_mm_per_prompt': {'image': 2}, 'skip_mm_profiling': True} (APIServer pid=1) Unrecognized keys in rope_parameters for 'rope_type'='default': {'mrope_section', 'mrope_interleaved'} (APIServer pid=1) Unrecognized keys in rope_parameters for 'rope_type'='default': {'mrope_section', 'mrope_interleaved'} (APIServer pid=1) INFO 03-24 10:45:22 [model.py:530] Resolved architecture: Qwen3_5MoeForConditionalGeneration (APIServer pid=1) WARNING 03-24 10:45:22 [model.py:1891] Casting torch.bfloat16 to torch.float16. (APIServer pid=1) INFO 03-24 10:45:22 [model.py:1553] Using max model len 8000 (APIServer pid=1) [aiter] WARNING: NUMA balancing is enabled, which may cause errors. It is recommended to disable NUMA balancing by running "sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'" for more details: https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html#disable-numa-auto-balancing (APIServer pid=1) [2026-03-24 10:45:22] WARNING core.py:479: WARNING: NUMA balancing is enabled, which may cause errors. It is recommended to disable NUMA balancing by running "sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'" for more details: https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html#disable-numa-auto-balancing (APIServer pid=1) [aiter] start build [module_aiter_enum] under /workspace/aiter/aiter/jit/build/module_aiter_enum (APIServer pid=1) [2026-03-24 10:45:22] INFO core.py:550: start build [module_aiter_enum] under /workspace/aiter/aiter/jit/build/module_aiter_enum (APIServer pid=1) [aiter] finish build [module_aiter_enum], cost 17.6s (APIServer pid=1) [2026-03-24 10:45:40] INFO core.py:700: finish build [module_aiter_enum], cost 17.6s (APIServer pid=1) [aiter] import [module_aiter_enum] under /workspace/aiter/aiter/jit/module_aiter_enum.so (APIServer pid=1) [2026-03-24 10:45:40] INFO core.py:502: import [module_aiter_enum] under /workspace/aiter/aiter/jit/module_aiter_enum.so (APIServer pid=1) INFO 03-24 10:45:42 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048. (APIServer pid=1) INFO 03-24 10:45:42 [config.py:536] Setting attention block size to 1056 tokens to ensure that attention page size is >= mamba page size. (APIServer pid=1) INFO 03-24 10:45:42 [config.py:567] Padding mamba page size by 1.34% to ensure that mamba page size and attention page size are exactly equal. (APIServer pid=1) WARNING 03-24 10:45:42 [gptq.py:99] Currently, the 4-bit gptq_gemm kernel for GPTQ is buggy. Please switch to gptq_marlin. (APIServer pid=1) INFO 03-24 10:45:43 [vllm.py:747] Asynchronous scheduling is enabled. (APIServer pid=1) WARNING 03-24 10:45:43 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none (APIServer pid=1) WARNING 03-24 10:45:43 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored. (APIServer pid=1) INFO 03-24 10:45:43 [vllm.py:930] Cudagraph is disabled under eager mode WARNING 03-24 10:46:05 [gpt_oss_triton_kernels_moe.py:56] Using legacy triton_kernels on ROCm (EngineCore_DP0 pid=495) INFO 03-24 10:46:06 [core.py:101] Initializing a V1 LLM engine (v0.16.1rc1.dev151+gd3bab5eb0) with config: model='/models/Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/hash', speculative_config=None, tokenizer='/models/Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/hash', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=8000, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=2, data_parallel_size=1, disable_custom_all_reduce=True, quantization=gptq, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=qwen3.5, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+rotary_embedding', '+sparse_attn_indexer', 'all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []} (EngineCore_DP0 pid=495) WARNING 03-24 10:46:06 [multiproc_executor.py:945] Reducing Torch parallelism from 32 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed. (EngineCore_DP0 pid=495) INFO 03-24 10:46:06 [multiproc_executor.py:134] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=10.176.26.34 (local), world_size=8, local_world_size=8 WARNING 03-24 10:46:14 [gpt_oss_triton_kernels_moe.py:56] Using legacy triton_kernels on ROCm (Worker pid=597) INFO 03-24 10:46:17 [parallel_state.py:1392] world_size=8 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:36793 backend=nccl WARNING 03-24 10:46:22 [gpt_oss_triton_kernels_moe.py:56] Using legacy triton_kernels on ROCm (Worker pid=603) INFO 03-24 10:46:25 [parallel_state.py:1392] world_size=8 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:36793 backend=nccl WARNING 03-24 10:46:30 [gpt_oss_triton_kernels_moe.py:56] Using legacy triton_kernels on ROCm (Worker pid=619) INFO 03-24 10:46:33 [parallel_state.py:1392] world_size=8 rank=2 local_rank=2 distributed_init_method=tcp://127.0.0.1:36793 backend=nccl WARNING 03-24 10:46:38 [gpt_oss_triton_kernels_moe.py:56] Using legacy triton_kernels on ROCm (Worker pid=639) INFO 03-24 10:46:41 [parallel_state.py:1392] world_size=8 rank=3 local_rank=3 distributed_init_method=tcp://127.0.0.1:36793 backend=nccl WARNING 03-24 10:46:46 [gpt_oss_triton_kernels_moe.py:56] Using legacy triton_kernels on ROCm (Worker pid=659) INFO 03-24 10:46:49 [parallel_state.py:1392] world_size=8 rank=4 local_rank=4 distributed_init_method=tcp://127.0.0.1:36793 backend=nccl WARNING 03-24 10:46:54 [gpt_oss_triton_kernels_moe.py:56] Using legacy triton_kernels on ROCm (Worker pid=679) INFO 03-24 10:46:57 [parallel_state.py:1392] world_size=8 rank=5 local_rank=5 distributed_init_method=tcp://127.0.0.1:36793 backend=nccl WARNING 03-24 10:47:02 [gpt_oss_triton_kernels_moe.py:56] Using legacy triton_kernels on ROCm (Worker pid=699) INFO 03-24 10:47:05 [parallel_state.py:1392] world_size=8 rank=6 local_rank=6 distributed_init_method=tcp://127.0.0.1:36793 backend=nccl WARNING 03-24 10:47:10 [gpt_oss_triton_kernels_moe.py:56] Using legacy triton_kernels on ROCm (Worker pid=719) INFO 03-24 10:47:13 [parallel_state.py:1392] world_size=8 rank=7 local_rank=7 distributed_init_method=tcp://127.0.0.1:36793 backend=nccl (Worker pid=597) INFO 03-24 10:47:13 [pynccl.py:111] vLLM is using nccl==2.26.6 (Worker pid=597) INFO 03-24 10:47:23 [parallel_state.py:1714] rank 0 in world size 8 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A (Worker pid=639) INFO 03-24 10:47:23 [parallel_state.py:1714] rank 3 in world size 8 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 3, EP rank 3, EPLB rank N/A (Worker pid=619) INFO 03-24 10:47:23 [parallel_state.py:1714] rank 2 in world size 8 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 2, EP rank 2, EPLB rank N/A (Worker pid=603) INFO 03-24 10:47:23 [parallel_state.py:1714] rank 1 in world size 8 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 1, EP rank 1, EPLB rank N/A (Worker pid=659) INFO 03-24 10:47:23 [parallel_state.py:1714] rank 4 in world size 8 is assigned as DP rank 0, PP rank 1, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A (Worker pid=679) INFO 03-24 10:47:23 [parallel_state.py:1714] rank 5 in world size 8 is assigned as DP rank 0, PP rank 1, PCP rank 0, TP rank 1, EP rank 1, EPLB rank N/A (Worker pid=699) INFO 03-24 10:47:23 [parallel_state.py:1714] rank 6 in world size 8 is assigned as DP rank 0, PP rank 1, PCP rank 0, TP rank 2, EP rank 2, EPLB rank N/A (Worker pid=719) INFO 03-24 10:47:23 [parallel_state.py:1714] rank 7 in world size 8 is assigned as DP rank 0, PP rank 1, PCP rank 0, TP rank 3, EP rank 3, EPLB rank N/A (Worker pid=659) [aiter] WARNING: NUMA balancing is enabled, which may cause errors. It is recommended to disable NUMA balancing by running "sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'" for more details: https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html#disable-numa-auto-balancing (Worker pid=659) [2026-03-24 10:47:24] WARNING core.py:479: WARNING: NUMA balancing is enabled, which may cause errors. It is recommended to disable NUMA balancing by running "sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'" for more details: https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html#disable-numa-auto-balancing (Worker pid=659) [aiter] import [module_aiter_enum] under /workspace/aiter/aiter/jit/module_aiter_enum.so (Worker pid=659) [2026-03-24 10:47:24] INFO core.py:502: import [module_aiter_enum] under /workspace/aiter/aiter/jit/module_aiter_enum.so (Worker pid=597) [aiter] WARNING: NUMA balancing is enabled, which may cause errors. It is recommended to disable NUMA balancing by running "sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'" for more details: https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html#disable-numa-auto-balancing (Worker pid=597) [2026-03-24 10:47:24] WARNING core.py:479: WARNING: NUMA balancing is enabled, which may cause errors. It is recommended to disable NUMA balancing by running "sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'" for more details: https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html#disable-numa-auto-balancing (Worker pid=597) [aiter] import [module_aiter_enum] under /workspace/aiter/aiter/jit/module_aiter_enum.so (Worker pid=597) [2026-03-24 10:47:24] INFO core.py:502: import [module_aiter_enum] under /workspace/aiter/aiter/jit/module_aiter_enum.so (Worker pid=719) [aiter] WARNING: NUMA balancing is enabled, which may cause errors. It is recommended to disable NUMA balancing by running "sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'" for more details: https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html#disable-numa-auto-balancing (Worker pid=719) [2026-03-24 10:47:24] WARNING core.py:479: WARNING: NUMA balancing is enabled, which may cause errors. It is recommended to disable NUMA balancing by running "sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'" for more details: https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html#disable-numa-auto-balancing (Worker pid=719) [aiter] import [module_aiter_enum] under /workspace/aiter/aiter/jit/module_aiter_enum.so (Worker pid=719) [2026-03-24 10:47:24] INFO core.py:502: import [module_aiter_enum] under /workspace/aiter/aiter/jit/module_aiter_enum.so (Worker pid=639) [aiter] WARNING: NUMA balancing is enabled, which may cause errors. It is recommended to disable NUMA balancing by running "sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'" for more details: https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html#disable-numa-auto-balancing (Worker pid=639) [2026-03-24 10:47:24] WARNING core.py:479: WARNING: NUMA balancing is enabled, which may cause errors. It is recommended to disable NUMA balancing by running "sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'" for more details: https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html#disable-numa-auto-balancing (Worker pid=639) [aiter] import [module_aiter_enum] under /workspace/aiter/aiter/jit/module_aiter_enum.so (Worker pid=639) [2026-03-24 10:47:24] INFO core.py:502: import [module_aiter_enum] under /workspace/aiter/aiter/jit/module_aiter_enum.so (Worker pid=619) [aiter] WARNING: NUMA balancing is enabled, which may cause errors. It is recommended to disable NUMA balancing by running "sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'" for more details: https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html#disable-numa-auto-balancing (Worker pid=619) [2026-03-24 10:47:24] WARNING core.py:479: WARNING: NUMA balancing is enabled, which may cause errors. It is recommended to disable NUMA balancing by running "sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'" for more details: https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html#disable-numa-auto-balancing (Worker pid=619) [aiter] import [module_aiter_enum] under /workspace/aiter/aiter/jit/module_aiter_enum.so (Worker pid=619) [2026-03-24 10:47:24] INFO core.py:502: import [module_aiter_enum] under /workspace/aiter/aiter/jit/module_aiter_enum.so (Worker pid=679) [aiter] WARNING: NUMA balancing is enabled, which may cause errors. It is recommended to disable NUMA balancing by running "sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'" for more details: https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html#disable-numa-auto-balancing (Worker pid=679) [2026-03-24 10:47:24] WARNING core.py:479: WARNING: NUMA balancing is enabled, which may cause errors. It is recommended to disable NUMA balancing by running "sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'" for more details: https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html#disable-numa-auto-balancing (Worker pid=679) [aiter] import [module_aiter_enum] under /workspace/aiter/aiter/jit/module_aiter_enum.so (Worker pid=679) [2026-03-24 10:47:24] INFO core.py:502: import [module_aiter_enum] under /workspace/aiter/aiter/jit/module_aiter_enum.so (Worker pid=659) INFO 03-24 10:47:24 [topk_topp_sampler.py:81] Using aiter sampler on ROCm (lazy import, sampling-only). (Worker pid=597) INFO 03-24 10:47:24 [topk_topp_sampler.py:81] Using aiter sampler on ROCm (lazy import, sampling-only). (Worker pid=719) INFO 03-24 10:47:24 [topk_topp_sampler.py:81] Using aiter sampler on ROCm (lazy import, sampling-only). (Worker pid=619) INFO 03-24 10:47:24 [topk_topp_sampler.py:81] Using aiter sampler on ROCm (lazy import, sampling-only). (Worker pid=699) [aiter] WARNING: NUMA balancing is enabled, which may cause errors. It is recommended to disable NUMA balancing by running "sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'" for more details: https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html#disable-numa-auto-balancing (Worker pid=699) [2026-03-24 10:47:24] WARNING core.py:479: WARNING: NUMA balancing is enabled, which may cause errors. It is recommended to disable NUMA balancing by running "sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'" for more details: https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html#disable-numa-auto-balancing (Worker pid=699) [aiter] import [module_aiter_enum] under /workspace/aiter/aiter/jit/module_aiter_enum.so (Worker pid=699) [2026-03-24 10:47:24] INFO core.py:502: import [module_aiter_enum] under /workspace/aiter/aiter/jit/module_aiter_enum.so (Worker pid=603) [aiter] WARNING: NUMA balancing is enabled, which may cause errors. It is recommended to disable NUMA balancing by running "sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'" for more details: https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html#disable-numa-auto-balancing (Worker pid=603) [2026-03-24 10:47:24] WARNING core.py:479: WARNING: NUMA balancing is enabled, which may cause errors. It is recommended to disable NUMA balancing by running "sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'" for more details: https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html#disable-numa-auto-balancing (Worker pid=603) [aiter] import [module_aiter_enum] under /workspace/aiter/aiter/jit/module_aiter_enum.so (Worker pid=603) [2026-03-24 10:47:24] INFO core.py:502: import [module_aiter_enum] under /workspace/aiter/aiter/jit/module_aiter_enum.so (Worker pid=639) INFO 03-24 10:47:24 [topk_topp_sampler.py:81] Using aiter sampler on ROCm (lazy import, sampling-only). (Worker pid=679) INFO 03-24 10:47:24 [topk_topp_sampler.py:81] Using aiter sampler on ROCm (lazy import, sampling-only). (Worker pid=699) INFO 03-24 10:47:24 [topk_topp_sampler.py:81] Using aiter sampler on ROCm (lazy import, sampling-only). (Worker pid=603) INFO 03-24 10:47:24 [topk_topp_sampler.py:81] Using aiter sampler on ROCm (lazy import, sampling-only). (Worker pid=619) INFO 03-24 10:47:33 [base.py:106] Offloader set to NoopOffloader (Worker pid=679) INFO 03-24 10:47:34 [base.py:106] Offloader set to NoopOffloader (Worker pid=719) INFO 03-24 10:47:34 [base.py:106] Offloader set to NoopOffloader (Worker pid=639) INFO 03-24 10:47:34 [base.py:106] Offloader set to NoopOffloader (Worker pid=659) INFO 03-24 10:47:34 [base.py:106] Offloader set to NoopOffloader (Worker pid=699) INFO 03-24 10:47:34 [base.py:106] Offloader set to NoopOffloader (Worker pid=597) INFO 03-24 10:47:34 [base.py:106] Offloader set to NoopOffloader (Worker pid=597) (Worker_PP0_TP0 pid=597) INFO 03-24 10:47:34 [gpu_model_runner.py:4202] Starting to load model /models/Qwen3.5-397B-A17B-GPTQ-Int4/snapshots/hash... (Worker pid=619) (Worker_PP0_TP2 pid=619) INFO 03-24 10:47:35 [rocm.py:556] Using Flash Attention backend for ViT model. (Worker pid=619) (Worker_PP0_TP2 pid=619) WARNING 03-24 10:47:35 [activation.py:643] [ROCm] PyTorch's native GELU with tanh approximation is unstable. Falling back to GELU(approximate='none'). (Worker pid=619) (Worker_PP0_TP2 pid=619) INFO 03-24 10:47:35 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention. (Worker pid=679) (Worker_PP1_TP1 pid=679) INFO 03-24 10:47:35 [rocm.py:556] Using Flash Attention backend for ViT model. (Worker pid=679) (Worker_PP1_TP1 pid=679) WARNING 03-24 10:47:35 [activation.py:643] [ROCm] PyTorch's native GELU with tanh approximation is unstable. Falling back to GELU(approximate='none'). (Worker pid=679) (Worker_PP1_TP1 pid=679) INFO 03-24 10:47:35 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention. (Worker pid=719) (Worker_PP1_TP3 pid=719) INFO 03-24 10:47:35 [rocm.py:556] Using Flash Attention backend for ViT model. (Worker pid=719) (Worker_PP1_TP3 pid=719) WARNING 03-24 10:47:35 [activation.py:643] [ROCm] PyTorch's native GELU with tanh approximation is unstable. Falling back to GELU(approximate='none'). (Worker pid=639) (Worker_PP0_TP3 pid=639) INFO 03-24 10:47:35 [rocm.py:556] Using Flash Attention backend for ViT model. (Worker pid=639) (Worker_PP0_TP3 pid=639) WARNING 03-24 10:47:35 [activation.py:643] [ROCm] PyTorch's native GELU with tanh approximation is unstable. Falling back to GELU(approximate='none'). (Worker pid=719) (Worker_PP1_TP3 pid=719) INFO 03-24 10:47:35 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention. (Worker pid=639) (Worker_PP0_TP3 pid=639) INFO 03-24 10:47:35 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention. (Worker pid=603) INFO 03-24 10:47:35 [base.py:106] Offloader set to NoopOffloader (Worker pid=659) (Worker_PP1_TP0 pid=659) INFO 03-24 10:47:35 [rocm.py:556] Using Flash Attention backend for ViT model. (Worker pid=659) (Worker_PP1_TP0 pid=659) WARNING 03-24 10:47:35 [activation.py:643] [ROCm] PyTorch's native GELU with tanh approximation is unstable. Falling back to GELU(approximate='none'). (Worker pid=699) (Worker_PP1_TP2 pid=699) INFO 03-24 10:47:35 [rocm.py:556] Using Flash Attention backend for ViT model. (Worker pid=699) (Worker_PP1_TP2 pid=699) WARNING 03-24 10:47:35 [activation.py:643] [ROCm] PyTorch's native GELU with tanh approximation is unstable. Falling back to GELU(approximate='none'). (Worker pid=659) (Worker_PP1_TP0 pid=659) INFO 03-24 10:47:35 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention. (Worker pid=597) (Worker_PP0_TP0 pid=597) INFO 03-24 10:47:35 [rocm.py:556] Using Flash Attention backend for ViT model. (Worker pid=597) (Worker_PP0_TP0 pid=597) WARNING 03-24 10:47:35 [activation.py:643] [ROCm] PyTorch's native GELU with tanh approximation is unstable. Falling back to GELU(approximate='none'). (Worker pid=699) (Worker_PP1_TP2 pid=699) INFO 03-24 10:47:35 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention. (Worker pid=597) (Worker_PP0_TP0 pid=597) INFO 03-24 10:47:35 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention. (Worker pid=603) (Worker_PP0_TP1 pid=603) INFO 03-24 10:47:36 [rocm.py:556] Using Flash Attention backend for ViT model. (Worker pid=603) (Worker_PP0_TP1 pid=603) WARNING 03-24 10:47:36 [activation.py:643] [ROCm] PyTorch's native GELU with tanh approximation is unstable. Falling back to GELU(approximate='none'). (Worker pid=603) (Worker_PP0_TP1 pid=603) INFO 03-24 10:47:36 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention. (Worker pid=679) (Worker_PP1_TP1 pid=679) INFO 03-24 10:47:38 [rocm.py:510] Using Triton Attention backend. (Worker pid=619) (Worker_PP0_TP2 pid=619) INFO 03-24 10:47:38 [rocm.py:510] Using Triton Attention backend. (Worker pid=719) (Worker_PP1_TP3 pid=719) INFO 03-24 10:47:38 [rocm.py:510] Using Triton Attention backend. (Worker pid=659) (Worker_PP1_TP0 pid=659) INFO 03-24 10:47:38 [rocm.py:510] Using Triton Attention backend. (Worker pid=699) (Worker_PP1_TP2 pid=699) INFO 03-24 10:47:39 [rocm.py:510] Using Triton Attention backend. (Worker pid=639) (Worker_PP0_TP3 pid=639) INFO 03-24 10:47:39 [rocm.py:510] Using Triton Attention backend. (Worker pid=597) (Worker_PP0_TP0 pid=597) INFO 03-24 10:47:39 [rocm.py:510] Using Triton Attention backend. (Worker pid=603) (Worker_PP0_TP1 pid=603) INFO 03-24 10:47:40 [rocm.py:510] Using Triton Attention backend. (Worker pid=619) (Worker_PP0_TP2 pid=619) WARNING 03-24 10:47:43 [compilation.py:1130] Op 'sparse_attn_indexer' not present in model, enabling with '+sparse_attn_indexer' has no effect (Worker pid=679) (Worker_PP1_TP1 pid=679) WARNING 03-24 10:47:43 [compilation.py:1130] Op 'sparse_attn_indexer' not present in model, enabling with '+sparse_attn_indexer' has no effect (Worker pid=597) (Worker_PP0_TP0 pid=597) WARNING 03-24 10:47:43 [compilation.py:1130] Op 'sparse_attn_indexer' not present in model, enabling with '+sparse_attn_indexer' has no effect (Worker pid=639) (Worker_PP0_TP3 pid=639) WARNING 03-24 10:47:43 [compilation.py:1130] Op 'sparse_attn_indexer' not present in model, enabling with '+sparse_attn_indexer' has no effect (Worker pid=719) (Worker_PP1_TP3 pid=719) WARNING 03-24 10:47:43 [compilation.py:1130] Op 'sparse_attn_indexer' not present in model, enabling with '+sparse_attn_indexer' has no effect (Worker pid=659) (Worker_PP1_TP0 pid=659) WARNING 03-24 10:47:43 [compilation.py:1130] Op 'sparse_attn_indexer' not present in model, enabling with '+sparse_attn_indexer' has no effect (Worker pid=699) (Worker_PP1_TP2 pid=699) WARNING 03-24 10:47:43 [compilation.py:1130] Op 'sparse_attn_indexer' not present in model, enabling with '+sparse_attn_indexer' has no effect Loading safetensors checkpoint shards: 0% Completed | 0/94 [00:00<?, ?it/s] Loading safetensors checkpoint shards: 1% Completed | 1/94 [00:00<00:48, 1.92it/s] (Worker pid=603) (Worker_PP0_TP1 pid=603) WARNING 03-24 10:47:44 [compilation.py:1130] Op 'sparse_attn_indexer' not present in model, enabling with '+sparse_attn_indexer' has no effect Loading safetensors checkpoint shards: 2% Completed | 2/94 [00:01<00:47, 1.94it/s] Loading safetensors checkpoint shards: 3% Completed | 3/94 [00:02<01:29, 1.01it/s] Loading safetensors checkpoint shards: 4% Completed | 4/94 [00:04<01:52, 1.25s/it] Loading safetensors checkpoint shards: 5% Completed | 5/94 [00:05<02:04, 1.40s/it] Loading safetensors checkpoint shards: 6% Completed | 6/94 [00:06<01:39, 1.14s/it] Loading safetensors checkpoint shards: 7% Completed | 7/94 [00:08<01:50, 1.27s/it] Loading safetensors checkpoint shards: 9% Completed | 8/94 [00:09<02:01, 1.42s/it] Loading safetensors checkpoint shards: 10% Completed | 9/94 [00:11<02:08, 1.51s/it] Loading safetensors checkpoint shards: 11% Completed | 10/94 [00:12<01:43, 1.24s/it] Loading safetensors checkpoint shards: 12% Completed | 11/94 [00:13<01:50, 1.33s/it] Loading safetensors checkpoint shards: 13% Completed | 12/94 [00:15<01:58, 1.45s/it] Loading safetensors checkpoint shards: 14% Completed | 13/94 [00:17<02:03, 1.52s/it] Loading safetensors checkpoint shards: 15% Completed | 14/94 [00:17<01:39, 1.25s/it] Loading safetensors checkpoint shards: 16% Completed | 15/94 [00:19<01:45, 1.34s/it] Loading safetensors checkpoint shards: 17% Completed | 16/94 [00:20<01:53, 1.45s/it] Loading safetensors checkpoint shards: 18% Completed | 17/94 [00:21<01:32, 1.20s/it] Loading safetensors checkpoint shards: 19% Completed | 18/94 [00:23<01:39, 1.31s/it] Loading safetensors checkpoint shards: 20% Completed | 19/94 [00:24<01:47, 1.43s/it] Loading safetensors checkpoint shards: 21% Completed | 20/94 [00:26<01:51, 1.51s/it] Loading safetensors checkpoint shards: 22% Completed | 21/94 [00:27<01:30, 1.24s/it] Loading safetensors checkpoint shards: 23% Completed | 22/94 [00:27<01:13, 1.02s/it] Loading safetensors checkpoint shards: 24% Completed | 23/94 [00:28<01:01, 1.15it/s] Loading safetensors checkpoint shards: 26% Completed | 24/94 [00:28<00:53, 1.31it/s] Loading safetensors checkpoint shards: 27% Completed | 25/94 [00:30<01:09, 1.00s/it] Loading safetensors checkpoint shards: 28% Completed | 26/94 [00:30<01:00, 1.12it/s] Loading safetensors checkpoint shards: 29% Completed | 27/94 [00:32<01:13, 1.09s/it] Loading safetensors checkpoint shards: 30% Completed | 28/94 [00:34<01:23, 1.27s/it] Loading safetensors checkpoint shards: 31% Completed | 29/94 [00:35<01:30, 1.39s/it] Loading safetensors checkpoint shards: 32% Completed | 30/94 [00:36<01:14, 1.16s/it] Loading safetensors checkpoint shards: 33% Completed | 31/94 [00:38<01:21, 1.29s/it] Loading safetensors checkpoint shards: 34% Completed | 32/94 [00:38<01:07, 1.09s/it] Loading safetensors checkpoint shards: 35% Completed | 33/94 [00:39<00:55, 1.09it/s] Loading safetensors checkpoint shards: 36% Completed | 34/94 [00:40<01:06, 1.11s/it] Loading safetensors checkpoint shards: 37% Completed | 35/94 [00:42<01:15, 1.29s/it] Loading safetensors checkpoint shards: 38% Completed | 36/94 [00:43<01:03, 1.09s/it] Loading safetensors checkpoint shards: 39% Completed | 37/94 [00:44<01:10, 1.23s/it] Loading safetensors checkpoint shards: 40% Completed | 38/94 [00:46<01:17, 1.38s/it] Loading safetensors checkpoint shards: 41% Completed | 39/94 [00:46<01:03, 1.16s/it] Loading safetensors checkpoint shards: 43% Completed | 40/94 [00:48<01:09, 1.28s/it] Loading safetensors checkpoint shards: 44% Completed | 41/94 [00:50<01:15, 1.42s/it] Loading safetensors checkpoint shards: 45% Completed | 42/94 [00:52<01:18, 1.51s/it] Loading safetensors checkpoint shards: 46% Completed | 43/94 [00:53<01:20, 1.58s/it] Loading safetensors checkpoint shards: 47% Completed | 44/94 [00:54<01:04, 1.29s/it] Loading safetensors checkpoint shards: 48% Completed | 45/94 [00:55<01:07, 1.37s/it] Loading safetensors checkpoint shards: 49% Completed | 46/94 [00:57<01:10, 1.48s/it] Loading safetensors checkpoint shards: 50% Completed | 47/94 [00:59<01:12, 1.55s/it] Loading safetensors checkpoint shards: 51% Completed | 48/94 [01:00<00:58, 1.27s/it] Loading safetensors checkpoint shards: 52% Completed | 49/94 [01:00<00:47, 1.04s/it] Loading safetensors checkpoint shards: 53% Completed | 50/94 [01:01<00:39, 1.13it/s] Loading safetensors checkpoint shards: 54% Completed | 51/94 [01:01<00:33, 1.29it/s] Loading safetensors checkpoint shards: 55% Completed | 52/94 [01:02<00:29, 1.43it/s] Loading safetensors checkpoint shards: 56% Completed | 53/94 [01:02<00:26, 1.55it/s] Loading safetensors checkpoint shards: 57% Completed | 54/94 [01:03<00:24, 1.65it/s] Loading safetensors checkpoint shards: 59% Completed | 55/94 [01:03<00:22, 1.73it/s] Loading safetensors checkpoint shards: 60% Completed | 56/94 [01:04<00:21, 1.79it/s] Loading safetensors checkpoint shards: 61% Completed | 57/94 [01:04<00:20, 1.83it/s] Loading safetensors checkpoint shards: 62% Completed | 58/94 [01:05<00:19, 1.86it/s] Loading safetensors checkpoint shards: 63% Completed | 59/94 [01:05<00:18, 1.88it/s] Loading safetensors checkpoint shards: 64% Completed | 60/94 [01:06<00:17, 1.90it/s] Loading safetensors checkpoint shards: 65% Completed | 61/94 [01:06<00:17, 1.91it/s] Loading safetensors checkpoint shards: 66% Completed | 62/94 [01:08<00:26, 1.21it/s] Loading safetensors checkpoint shards: 67% Completed | 63/94 [01:09<00:28, 1.09it/s] Loading safetensors checkpoint shards: 68% Completed | 64/94 [01:11<00:34, 1.13s/it] Loading safetensors checkpoint shards: 69% Completed | 65/94 [01:12<00:33, 1.15s/it] Loading safetensors checkpoint shards: 70% Completed | 66/94 [01:13<00:36, 1.29s/it] Loading safetensors checkpoint shards: 71% Completed | 67/94 [01:15<00:34, 1.26s/it] Loading safetensors checkpoint shards: 72% Completed | 68/94 [01:16<00:35, 1.37s/it] Loading safetensors checkpoint shards: 73% Completed | 69/94 [01:17<00:32, 1.31s/it] Loading safetensors checkpoint shards: 74% Completed | 70/94 [01:19<00:33, 1.41s/it] Loading safetensors checkpoint shards: 76% Completed | 71/94 [01:20<00:27, 1.18s/it] Loading safetensors checkpoint shards: 77% Completed | 72/94 [01:20<00:21, 1.02it/s] Loading safetensors checkpoint shards: 78% Completed | 73/94 [01:21<00:20, 1.00it/s] Loading safetensors checkpoint shards: 79% Completed | 74/94 [01:23<00:23, 1.19s/it] Loading safetensors checkpoint shards: 80% Completed | 75/94 [01:24<00:22, 1.18s/it] Loading safetensors checkpoint shards: 81% Completed | 76/94 [01:25<00:20, 1.15s/it] Loading safetensors checkpoint shards: 82% Completed | 77/94 [01:26<00:19, 1.14s/it] Loading safetensors checkpoint shards: 83% Completed | 78/94 [01:27<00:18, 1.13s/it] Loading safetensors checkpoint shards: 84% Completed | 79/94 [01:29<00:19, 1.28s/it] Loading safetensors checkpoint shards: 85% Completed | 80/94 [01:30<00:17, 1.26s/it] Loading safetensors checkpoint shards: 86% Completed | 81/94 [01:32<00:17, 1.38s/it] Loading safetensors checkpoint shards: 87% Completed | 82/94 [01:33<00:15, 1.33s/it] Loading safetensors checkpoint shards: 88% Completed | 83/94 [01:35<00:15, 1.42s/it] Loading safetensors checkpoint shards: 89% Completed | 84/94 [01:36<00:13, 1.36s/it] Loading safetensors checkpoint shards: 90% Completed | 85/94 [01:36<00:10, 1.13s/it] Loading safetensors checkpoint shards: 91% Completed | 86/94 [01:37<00:07, 1.06it/s] Loading safetensors checkpoint shards: 93% Completed | 87/94 [01:37<00:05, 1.22it/s] Loading safetensors checkpoint shards: 94% Completed | 88/94 [01:38<00:04, 1.37it/s] Loading safetensors checkpoint shards: 95% Completed | 89/94 [01:39<00:03, 1.50it/s] Loading safetensors checkpoint shards: 96% Completed | 90/94 [01:39<00:02, 1.60it/s] Loading safetensors checkpoint shards: 97% Completed | 91/94 [01:40<00:02, 1.49it/s] Loading safetensors checkpoint shards: 98% Completed | 92/94 [01:41<00:01, 1.46it/s] Loading safetensors checkpoint shards: 100% Completed | 94/94 [01:41<00:00, 2.11it/s] Loading safetensors checkpoint shards: 100% Completed | 94/94 [01:41<00:00, 1.08s/it] (Worker pid=597) (Worker_PP0_TP0 pid=597) (Worker pid=597) (Worker_PP0_TP0 pid=597) INFO 03-24 10:49:25 [default_loader.py:293] Loading weights took 101.51 seconds (Worker pid=597) (Worker_PP0_TP0 pid=597) INFO 03-24 10:49:26 [gpu_model_runner.py:4285] Model loading took 25.99 GiB memory and 110.436769 seconds (Worker pid=659) (Worker_PP1_TP0 pid=659) INFO 03-24 10:49:39 [gpu_model_runner.py:5180] Skipping memory profiling for multimodal encoder and encoder cache. (Worker pid=639) (Worker_PP0_TP3 pid=639) INFO 03-24 10:49:39 [gpu_model_runner.py:5180] Skipping memory profiling for multimodal encoder and encoder cache. (Worker pid=619) (Worker_PP0_TP2 pid=619) INFO 03-24 10:49:39 [gpu_model_runner.py:5180] Skipping memory profiling for multimodal encoder and encoder cache. (Worker pid=597) (Worker_PP0_TP0 pid=597) INFO 03-24 10:49:39 [gpu_model_runner.py:5180] Skipping memory profiling for multimodal encoder and encoder cache. (Worker pid=699) (Worker_PP1_TP2 pid=699) INFO 03-24 10:49:39 [gpu_model_runner.py:5180] Skipping memory profiling for multimodal encoder and encoder cache. (Worker pid=603) (Worker_PP0_TP1 pid=603) INFO 03-24 10:49:39 [gpu_model_runner.py:5180] Skipping memory profiling for multimodal encoder and encoder cache. (Worker pid=679) (Worker_PP1_TP1 pid=679) INFO 03-24 10:49:39 [gpu_model_runner.py:5180] Skipping memory profiling for multimodal encoder and encoder cache. (Worker pid=719) (Worker_PP1_TP3 pid=719) INFO 03-24 10:49:39 [gpu_model_runner.py:5180] Skipping memory profiling for multimodal encoder and encoder cache. (Worker pid=597) (Worker_PP0_TP0 pid=597) WARNING 03-24 10:49:42 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=512,N=256,device_name=Arcturus_GL-XL_[Instinct_MI100],dtype=int4_w4a16.json (Worker pid=597) (Worker_PP0_TP0 pid=597) INFO 03-24 10:49:47 [gpu_worker.py:423] Available KV cache memory: 2.32 GiB (EngineCore_DP0 pid=495) INFO 03-24 10:49:52 [kv_cache_utils.py:1314] GPU KV cache size: 25,344 tokens (EngineCore_DP0 pid=495) INFO 03-24 10:49:52 [kv_cache_utils.py:1319] Maximum concurrency for 8,000 tokens per request: 8.82x (EngineCore_DP0 pid=495) INFO 03-24 10:49:55 [core.py:282] init engine (profile, create kv cache, warmup model) took 16.19 seconds (EngineCore_DP0 pid=495) INFO 03-24 10:50:07 [vllm.py:747] Asynchronous scheduling is enabled. (EngineCore_DP0 pid=495) WARNING 03-24 10:50:07 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none (EngineCore_DP0 pid=495) WARNING 03-24 10:50:07 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored. (EngineCore_DP0 pid=495) INFO 03-24 10:50:07 [vllm.py:930] Cudagraph is disabled under eager mode (APIServer pid=1) INFO 03-24 10:50:07 [api_server.py:495] Supported tasks: ['generate'] (APIServer pid=1) INFO 03-24 10:50:08 [parser_manager.py:202] "auto" tool choice has been enabled. (APIServer pid=1) WARNING 03-24 10:50:08 [model.py:1354] Default vLLM sampling parameters have been overridden by the model's generation_config.json: {'temperature': 0.6, 'top_k': 20, 'top_p': 0.95}. If this is not intended, please relaunch vLLM instance with --generation-config vllm. (APIServer pid=1) INFO 03-24 10:50:08 [parser_manager.py:202] "auto" tool choice has been enabled. (APIServer pid=1) INFO 03-24 10:50:08 [serving.py:185] Warming up chat template processing... (APIServer pid=1) INFO 03-24 10:50:11 [hf.py:318] Detected the chat template content format to be 'string'. You can set --chat-template-content-format to override this. (APIServer pid=1) INFO 03-24 10:50:11 [serving.py:210] Chat template warmup completed in 2893.0ms (APIServer pid=1) INFO 03-24 10:50:11 [parser_manager.py:202] "auto" tool choice has been enabled. (APIServer pid=1) INFO 03-24 10:50:11 [api_server.py:500] Starting vLLM API server 0 on http://0.0.0.0:8000 (APIServer pid=1) INFO 03-24 10:50:11 [launcher.py:38] Available routes are: (APIServer pid=1) INFO 03-24 10:50:11 [launcher.py:47] Route: /openapi.json, Methods: HEAD, GET (APIServer pid=1) INFO 03-24 10:50:11 [launcher.py:47] Route: /docs, Methods: HEAD, GET (APIServer pid=1) INFO 03-24 10:50:11 [launcher.py:47] Route: /docs/oauth2-redirect, Methods: HEAD, GET (APIServer pid=1) INFO 03-24 10:50:11 [launcher.py:47] Route: /redoc, Methods: HEAD, GET (APIServer pid=1) INFO 03-24 10:50:11 [launcher.py:47] Route: /tokenize, Methods: POST (APIServer pid=1) INFO 03-24 10:50:11 [launcher.py:47] Route: /detokenize, Methods: POST (APIServer pid=1) INFO 03-24 10:50:11 [launcher.py:47] Route: /load, Methods: GET (APIServer pid=1) INFO 03-24 10:50:11 [launcher.py:47] Route: /version, Methods: GET (APIServer pid=1) INFO 03-24 10:50:11 [launcher.py:47] Route: /health, Methods: GET (APIServer pid=1) INFO 03-24 10:50:11 [launcher.py:47] Route: /metrics, Methods: GET (APIServer pid=1) INFO 03-24 10:50:11 [launcher.py:47] Route: /v1/models, Methods: GET (APIServer pid=1) INFO 03-24 10:50:11 [launcher.py:47] Route: /ping, Methods: GET (APIServer pid=1) INFO 03-24 10:50:11 [launcher.py:47] Route: /ping, Methods: POST (APIServer pid=1) INFO 03-24 10:50:11 [launcher.py:47] Route: /invocations, Methods: POST (APIServer pid=1) INFO 03-24 10:50:11 [launcher.py:47] Route: /v1/chat/completions, Methods: POST (APIServer pid=1) INFO 03-24 10:50:11 [launcher.py:47] Route: /v1/chat/completions/render, Methods: POST (APIServer pid=1) INFO 03-24 10:50:11 [launcher.py:47] Route: /v1/responses, Methods: POST (APIServer pid=1) INFO 03-24 10:50:11 [launcher.py:47] Route: /v1/responses/{response_id}, Methods: GET (APIServer pid=1) INFO 03-24 10:50:11 [launcher.py:47] Route: /v1/responses/{response_id}/cancel, Methods: POST (APIServer pid=1) INFO 03-24 10:50:11 [launcher.py:47] Route: /v1/completions, Methods: POST (APIServer pid=1) INFO 03-24 10:50:11 [launcher.py:47] Route: /v1/completions/render, Methods: POST (APIServer pid=1) INFO 03-24 10:50:11 [launcher.py:47] Route: /v1/messages, Methods: POST (APIServer pid=1) INFO 03-24 10:50:11 [launcher.py:47] Route: /v1/messages/count_tokens, Methods: POST (APIServer pid=1) INFO 03-24 10:50:11 [launcher.py:47] Route: /inference/v1/generate, Methods: POST (APIServer pid=1) INFO 03-24 10:50:11 [launcher.py:47] Route: /scale_elastic_ep, Methods: POST (APIServer pid=1) INFO 03-24 10:50:11 [launcher.py:47] Route: /is_scaling_elastic_ep, Methods: POST (APIServer pid=1) INFO: Started server process [1] (APIServer pid=1) INFO: Waiting for application startup. (APIServer pid=1) INFO: Application startup complete. (APIServer pid=1) INFO: 172.17.0.2:37824 - "GET /v1/models HTTP/1.1" 200 OK (APIServer pid=1) INFO: 172.17.0.2:47092 - "POST /v1/chat/completions HTTP/1.1" 200 OK (EngineCore_DP0 pid=495) INFO 03-24 10:51:46 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization). (Worker pid=639) (Worker_PP0_TP3 pid=639) /usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py:659: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /app/pytorch/torch/csrc/utils/tensor_new.cpp:1581.) (Worker pid=639) (Worker_PP0_TP3 pid=639) object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8) [rank3]:[W324 10:51:52.524872127 ProcessGroupNCCL.cpp:4004] Warning: An unbatched P2P op (send/recv) was called on this ProcessGroup with size 2. In lazy initialization mode, this will result in a new 2-rank NCCL communicator to be created. (function operator()) [rank7]:[W324 10:51:52.525488786 ProcessGroupNCCL.cpp:4004] Warning: An unbatched P2P op (send/recv) was called on this ProcessGroup with size 2. In lazy initialization mode, this will result in a new 2-rank NCCL communicator to be created. (function operator()) (Worker pid=597) (Worker_PP0_TP0 pid=597) /usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py:659: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /app/pytorch/torch/csrc/utils/tensor_new.cpp:1581.) (Worker pid=597) (Worker_PP0_TP0 pid=597) object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8) [rank0]:[W324 10:51:52.527694873 ProcessGroupNCCL.cpp:4004] Warning: An unbatched P2P op (send/recv) was called on this ProcessGroup with size 2. In lazy initialization mode, this will result in a new 2-rank NCCL communicator to be created. (function operator()) [rank4]:[W324 10:51:52.528073643 ProcessGroupNCCL.cpp:4004] Warning: An unbatched P2P op (send/recv) was called on this ProcessGroup with size 2. In lazy initialization mode, this will result in a new 2-rank NCCL communicator to be created. (function operator()) (Worker pid=603) (Worker_PP0_TP1 pid=603) /usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py:659: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /app/pytorch/torch/csrc/utils/tensor_new.cpp:1581.) (Worker pid=603) (Worker_PP0_TP1 pid=603) object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8) [rank1]:[W324 10:51:52.548049593 ProcessGroupNCCL.cpp:4004] Warning: An unbatched P2P op (send/recv) was called on this ProcessGroup with size 2. In lazy initialization mode, this will result in a new 2-rank NCCL communicator to be created. (function operator()) [rank5]:[W324 10:51:52.548291953 ProcessGroupNCCL.cpp:4004] Warning: An unbatched P2P op (send/recv) was called on this ProcessGroup with size 2. In lazy initialization mode, this will result in a new 2-rank NCCL communicator to be created. (function operator()) (Worker pid=619) (Worker_PP0_TP2 pid=619) /usr/local/lib/python3.12/dist-packages/vllm/distributed/parallel_state.py:659: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /app/pytorch/torch/csrc/utils/tensor_new.cpp:1581.) (Worker pid=619) (Worker_PP0_TP2 pid=619) object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8) [rank2]:[W324 10:51:52.843020067 ProcessGroupNCCL.cpp:4004] Warning: An unbatched P2P op (send/recv) was called on this ProcessGroup with size 2. In lazy initialization mode, this will result in a new 2-rank NCCL communicator to be created. (function operator()) [rank6]:[W324 10:51:52.843364617 ProcessGroupNCCL.cpp:4004] Warning: An unbatched P2P op (send/recv) was called on this ProcessGroup with size 2. In lazy initialization mode, this will result in a new 2-rank NCCL communicator to be created. (function operator()) (EngineCore_DP0 pid=495) INFO 03-24 10:52:46 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization). (APIServer pid=1) INFO 03-24 10:53:01 [loggers.py:259] Engine 000: Avg prompt throughput: 286.4 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.2%, Prefix cache hit rate: 0.0% (APIServer pid=1) INFO 03-24 10:53:11 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.2%, Prefix cache hit rate: 0.0% (APIServer pid=1) INFO 03-24 10:53:21 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 12.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 6.2%, Prefix cache hit rate: 0.0%

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

The issue seems to be related to the Qwen3.5 397B GPTQ model outputting only exclamation points. To fix this, we can try the following steps:

  • Disable NUMA balancing: Run the command sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing' to disable NUMA balancing, as recommended in the warnings.
  • Check model configuration: Verify that the model configuration is correct, especially the generation_config.json file, which overrides the default vLLM sampling parameters.
  • Update PyTorch and dependencies: Ensure that PyTorch and its dependencies are up-to-date, as the issue might be related to a version mismatch.
  • Try a different model: If possible, try using a different model to see if the issue is specific to the Qwen3.5 397B GPTQ model.

Example code to disable NUMA balancing:

import subprocess

# Disable NUMA balancing
subprocess.run(['sudo', 'sh', '-c', 'echo 0 > /proc/sys/kernel/numa_balancing'])

Verification

To verify that the fix worked, you can try the following:

  • Check the model output: Run the model with the same input and verify that the output is no longer only exclamation points.
  • Monitor system resources: Use tools like htop or nvidia-smi to monitor system resources and ensure that the model is utilizing the GPUs correctly.

Extra Tips

  • Check the documentation: Refer to the vLLM documentation for any specific requirements or recommendations for running the Qwen3.5 397B GPTQ model.
  • Search for similar issues: Search for similar issues on GitHub or other forums to see if others have encountered and resolved the same problem.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

vllm - 💡(How to fix) Fix [Bug]: Qwen3.5 397B GPTQ model outputs all exclamation points on ROCM [11 comments, 5 participants]