vllm - ✅ (Solved) Fix [Bug]: error in the vllm deployment model gemma-4-31B-it-unsloth-bnb-4bit [1 pull request, 1 participant]

vllm • 2026-04-21 06:13:44

GitHub stats

vllm-project/vllm#40437•Fetched 2026-04-22 07:45:38

Author

GoGo-UpUp

Participants

GoGo-UpUp

Error Message

(EngineCore pid=49951) INFO 04-21 13:58:31 [gpu_model_runner.py:4820] Model loading took 19.61 GiB memory and 7.125814 seconds
(EngineCore pid=49951) INFO 04-21 13:58:32 [gpu_model_runner.py:5753] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 3 video items of the maximum feature size.
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] EngineCore failed to start.
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] Traceback (most recent call last):
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 848, in __init__
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] super().__init__(
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 124, in __init__
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 247, in _initialize_kv_caches
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 136, in determine_available_memory
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] return self.collective_rpc("determine_available_memory")
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 80, in collective_rpc
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 370, in determine_available_memory
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] self.model_runner.profile_run()
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5782, in profile_run
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] hidden_states, last_hidden_states = self._dummy_run(
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] ^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] return func(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5474, in _dummy_run
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] outputs = self.model(
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] ^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] return self._call_impl(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] return forward_call(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/model_executor/models/gemma4_mm.py", line 1312, in forward
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] hidden_states = self.language_model.model(
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 467, in __call__
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] return self.forward(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 1226, in forward
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] hidden_states, residual = layer(
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] ^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] return self._call_impl(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] return forward_call(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 601, in forward
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] hidden_states = self.self_attn(
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] ^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] return self._call_impl(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] return forward_call(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 405, in forward
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] q, k, v = qkv.split([self.q_size, self.kv_size, self.kv_size], dim=-1)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/torch/_tensor.py", line 1066, in split
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] return torch._VF.split_with_sizes(
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] RuntimeError: split_with_sizes expects split_sizes to sum exactly to 18432 (input tensor's size at dimension -1), but got split_sizes=[16384, 2048, 2048]
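The final RuntimeError already contains the numbers needed for a first diagnosis. The following is a minimal sketch (the helper name `parse_split_error` is made up for illustration and is not part of vLLM or PyTorch) that pulls the tensor width and the requested split sizes out of the message and reports how many columns are missing:

```python
import re

# The RuntimeError text from the traceback, copied verbatim.
ERROR = (
    "split_with_sizes expects split_sizes to sum exactly to 18432 "
    "(input tensor's size at dimension -1), but got "
    "split_sizes=[16384, 2048, 2048]"
)


def parse_split_error(message: str) -> tuple[int, list[int]]:
    """Return (actual last-dim width, requested split sizes) from the message."""
    width = int(re.search(r"sum exactly to (\d+)", message).group(1))
    sizes_text = re.search(r"split_sizes=\[([^\]]*)\]", message).group(1)
    sizes = [int(part) for part in sizes_text.split(",")]
    return width, sizes


width, sizes = parse_split_error(ERROR)
shortfall = sum(sizes) - width  # columns the model expected but did not get
print(width, sizes, shortfall)
```

The shortfall equals one `kv_size` block (2048), which suggests that exactly one of the K/V projections is missing from the fused QKV output.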

Fix Action

PR fix notes

PR #40606: [Bugfix] Gemma-4: Add bnb QuantState alias hook on k_eq_v to load Gemma-4 BNB 4-bit weights

Repository: vllm-project/vllm
Author: zhangj1an
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/40606

Description (problem / solution / changelog)

Purpose

Closes vllm-project/vllm#40437.

When we load a Gemma 4 checkpoint with --quantization bitsandbytes and attention_k_eq_v=true, such as unsloth/gemma-4-31B-it-unsloth-bnb-4bit or unsloth/gemma-4-31B-unsloth-bnb-4bit, vLLM crashes at engine startup (EngineCore fails during profile_run).

This happens because, for these models, the full_attention layers ship only q_proj and k_proj in the checkpoint; v_proj is omitted because V is identical to K by design. On the main branch, when BitsAndBytesModelLoader builds a QuantState dict, it therefore cannot register v_proj entries. The quantised dimension ends up as q_size + k_size = 16384 + 2048 = 18432 instead of q_size + k_size + v_size = 16384 + 2048 + 2048 = 20480, which produces the split_with_sizes mismatch.
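The arithmetic can be checked with a few lines of Python. This is only an illustrative sketch: `fused_qkv_width` is a hypothetical helper, not a vLLM function, and it models the fused QKV width simply as the sum of the projections that were registered.

```python
def fused_qkv_width(q_size: int, kv_size: int, registered: set[str]) -> int:
    """Sum the output widths of the projections that were actually registered."""
    widths = {"q_proj": q_size, "k_proj": kv_size, "v_proj": kv_size}
    return sum(width for name, width in widths.items() if name in registered)


# Sizes taken from the error message: split_sizes=[16384, 2048, 2048].
Q_SIZE, KV_SIZE = 16384, 2048

expected = fused_qkv_width(Q_SIZE, KV_SIZE, {"q_proj", "k_proj", "v_proj"})
actual = fused_qkv_width(Q_SIZE, KV_SIZE, {"q_proj", "k_proj"})  # v_proj missing

print(expected, actual, expected - actual)
```

The gap between the two widths is one `kv_size` block, matching the 18432 versus 20480 mismatch reported by `split_with_sizes`.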

To fix it, we let Gemma 4 declare the aliases, including k_proj and v_proj, that need to be copied into the QuantState dict. BitsAndBytesModelLoader then picks up these aliases, so the v_proj information is included.
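The idea of an alias hook can be sketched in plain Python. Everything below is an assumption made for illustration: the function name, the alias mapping format, the parameter keys, and the string placeholders standing in for QuantState objects are all hypothetical, and the real changes in bitsandbytes_loader.py and gemma4.py may look quite different.

```python
from typing import Any


def apply_quant_state_aliases(
    quant_states: dict[str, Any], aliases: dict[str, str]
) -> dict[str, Any]:
    """Return a copy of quant_states where each aliased key reuses its source's state.

    `aliases` maps a source name fragment to the alias fragment that should
    receive the same state, e.g. {".k_proj.": ".v_proj."}.
    """
    patched = dict(quant_states)
    for key, state in quant_states.items():
        for source, alias in aliases.items():
            if source in key:
                # setdefault keeps a real v_proj entry if the checkpoint has one.
                patched.setdefault(key.replace(source, alias), state)
    return patched


# Hypothetical parameter names; strings stand in for QuantState objects.
states = {
    "layers.0.self_attn.q_proj.weight": "quant-state-q",
    "layers.0.self_attn.k_proj.weight": "quant-state-k",
}
patched = apply_quant_state_aliases(states, {".k_proj.": ".v_proj."})
print(sorted(patched))
```

Using `setdefault` means layers that do ship their own v_proj entry would keep it, so only the layers where V is omitted get the copied state.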

Note that we should not edit the Gemma 4 model files to load K and V separately, as was done in https://github.com/vllm-project/vllm/pull/40606/commits/ed2bab8a1ed9a9391703e6c14e04eed124cb034a, because that breaks GQA. This leaves the alias hook as the only way to fix the issue.

Test Plan

Run the following:

CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
    --model /root/models/gemma-4-31B-it-unsloth-bnb-4bit \
    --gpu-memory-utilization 0.6 --max-model-len 16392 \
    --port 18003 --max-num-seqs 1 \
    --enforce-eager --quantization bitsandbytes

Test Result

<details><summary> before fix : EngineCore crashes during profile_run. </summary>

INFO 04-22 09:02:32 [gpu_model_runner.py:4820] Model loading took 19.61 GiB memory and 7.13 seconds
INFO 04-22 09:02:32 [gpu_model_runner.py:5753] Encoder cache will be initialized with a budget of 8192 tokens, ...
ERROR 04-22 09:02:33 [core.py:1108] EngineCore failed to start.
ERROR 04-22 09:02:33 [core.py:1108] Traceback (most recent call last):
...
ERROR 04-22 09:02:33 [core.py:1108]   File "/usr/local/lib/python3.11/dist-packages/vllm/v1/worker/gpu_worker.py", line 370, in determine_available_memory
ERROR 04-22 09:02:33 [core.py:1108]     self.model_runner.profile_run()
ERROR 04-22 09:02:33 [core.py:1108]   File "/usr/local/lib/python3.11/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5782, in profile_run
ERROR 04-22 09:02:33 [core.py:1108]     hidden_states, last_hidden_states = self._dummy_run(
ERROR 04-22 09:02:33 [core.py:1108]   File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/gemma4_mm.py", line 1312, in forward
ERROR 04-22 09:02:33 [core.py:1108]     hidden_states = self.language_model.model(...)
ERROR 04-22 09:02:33 [core.py:1108]   File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/gemma4.py", line 1226, in forward
ERROR 04-22 09:02:33 [core.py:1108]     hidden_states, residual = layer(...)
ERROR 04-22 09:02:33 [core.py:1108]   File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/gemma4.py", line 601, in forward
ERROR 04-22 09:02:33 [core.py:1108]     hidden_states = self.self_attn(...)
ERROR 04-22 09:02:33 [core.py:1108]   File "/usr/local/lib/python3.11/dist-packages/vllm/model_executor/models/gemma4.py", line 405, in forward
ERROR 04-22 09:02:33 [core.py:1108]     q, k, v = qkv.split([self.q_size, self.kv_size, self.kv_size], dim=-1)
ERROR 04-22 09:02:33 [core.py:1108] RuntimeError: split_with_sizes expects split_sizes to sum exactly to 18432 (input tensor's size at dimension -1), but got split_sizes=[16384, 2048, 2048]

</details> <details> <summary> after fix: bnb-4bit version can be used normally. </summary>

$ curl -s -X POST http://127.0.0.1:18003/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{"model":"/root/models/gemma-4-31B-it-unsloth-bnb-4bit",
         "messages":[{"role":"system","content":"You are a helpful assistant."},
                     {"role":"user","content":"What is the capital of France? Answer in one sentence."}],
         "max_tokens":80,"temperature":0}'

The capital of France is Paris.

</details> <details> <summary> no regression: google/gemma-4-E2B-it still runs. </summary>

INFO 04-22 10:38:36 [default_loader.py:384] Loading weights took 1.90 seconds
INFO 04-22 10:38:37 [gpu_model_runner.py:4820] Model loading took 9.89 GiB memory and 2.60 seconds

</details>
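For scripted checks, the curl request from the "after fix" result can also be built with the Python standard library. This is a sketch under the same assumptions as the test plan (server on 127.0.0.1:18003, model served under its local path); `build_chat_request` is a hypothetical helper name, and the network call is left commented out because it needs the server to be running.

```python
import json
import urllib.request


def build_chat_request(
    base_url: str, model: str, question: str, max_tokens: int = 80
) -> urllib.request.Request:
    """Build the same POST request as the curl command in the test result."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": question},
        ],
        "max_tokens": max_tokens,
        "temperature": 0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    "http://127.0.0.1:18003",
    "/root/models/gemma-4-31B-it-unsloth-bnb-4bit",
    "What is the capital of France? Answer in one sentence.",
)

# Sending the request requires the server from the test plan to be up:
# with urllib.request.urlopen(request) as response:
#     reply = json.load(response)
#     print(reply["choices"][0]["message"]["content"])
```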

Changed files

vllm/model_executor/model_loader/bitsandbytes_loader.py (modified, +17/-0)
vllm/model_executor/models/gemma4.py (modified, +14/-0)
vllm/model_executor/models/gemma4_mm.py (modified, +20/-0)

Code Example

Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.3 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version                : Could not collect
CMake version                : version 3.28.1
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.10.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.13 | packaged by conda-forge | (main, Mar  5 2026, 16:50:00) [GCC 14.3.0] (64-bit runtime)
Python platform              : Linux-5.15.0-141-generic-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.3.107
CUDA_MODULE_LOADING set to   : LAZY
GPU models and configuration :
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3

Nvidia driver version        : 550.54.15
cuDNN version                : Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.0.0
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        52 bits physical, 57 bits virtual
Byte Order:                           Little Endian
CPU(s):                               128
On-line CPU(s) list:                  0-127
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Xeon(R) Gold 6430
CPU family:                           6
Model:                                143
Thread(s) per core:                   2
Core(s) per socket:                   32
Socket(s):                            2
Stepping:                             8
CPU max MHz:                          3400.0000
CPU min MHz:                          800.0000
BogoMIPS:                             4200.00
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization:                       VT-x
L1d cache:                            3 MiB (64 instances)
L1i cache:                            2 MiB (64 instances)
L2 cache:                             128 MiB (64 instances)
L3 cache:                             120 MiB (2 instances)
NUMA node(s):                         2
NUMA node0 CPU(s):                    0-31,64-95
NUMA node1 CPU(s):                    32-63,96-127
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI BHI_DIS_S
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.6
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.5.0.dev0
[pip3] nvidia-cutlass-dsl-libs-base==4.5.0.dev0
[pip3] nvidia-ml-py==13.595.45
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvshmem-cu12==3.4.5
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] pyzmq==27.1.0
[pip3] torch==2.10.0
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.10.0
[pip3] torchvision==0.25.0
[pip3] transformers==5.6.0.dev0
[pip3] triton==3.6.0
[conda] flashinfer-python         0.6.6                    pypi_0    pypi
[conda] numpy                     2.2.6                    pypi_0    pypi
[conda] nvidia-cublas-cu12        12.8.4.1                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.8.90                  pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.8.93                  pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.8.90                  pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.10.2.21                pypi_0    pypi
[conda] nvidia-cudnn-frontend     1.18.0                   pypi_0    pypi
[conda] nvidia-cufft-cu12         11.3.3.83                pypi_0    pypi
[conda] nvidia-cufile-cu12        1.13.1.3                 pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.9.90                pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.7.3.90                pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.5.8.93                pypi_0    pypi
[conda] nvidia-cusparselt-cu12    0.7.1                    pypi_0    pypi
[conda] nvidia-cutlass-dsl        4.5.0.dev0               pypi_0    pypi
[conda] nvidia-cutlass-dsl-libs-base 4.5.0.dev0               pypi_0    pypi
[conda] nvidia-ml-py              13.595.45                pypi_0    pypi
[conda] nvidia-nccl-cu12          2.27.5                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.8.93                  pypi_0    pypi
[conda] nvidia-nvshmem-cu12       3.4.5                    pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.8.90                  pypi_0    pypi
[conda] pyzmq                     27.1.0                   pypi_0    pypi
[conda] torch                     2.10.0                   pypi_0    pypi
[conda] torch-c-dlpack-ext        0.1.5                    pypi_0    pypi
[conda] torchaudio                2.10.0                   pypi_0    pypi
[conda] torchvision               0.25.0                   pypi_0    pypi
[conda] transformers              5.6.0.dev0               pypi_0    pypi
[conda] triton                    3.6.0                    pypi_0    pypi

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.19.1
vLLM Build Flags:
  CUDA Archs: 5.2 6.0 6.1 7.0 7.2 7.5 8.0 8.6 8.7 9.0+PTX; ROCm: Disabled; XPU: Disabled
GPU Topology:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7    NIC0    NIC1    NIC2    NIC3    NIC4    NIC5    NIC6    NIC7    NIC8    NIC9    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      NV18    NV18    NV18    NV18    NV18    NV18    NV18    PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     0-31,64-95      0               N/A
GPU1    NV18     X      NV18    NV18    NV18    NV18    NV18    NV18    SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     0-31,64-95      0               N/A
GPU2    NV18    NV18     X      NV18    NV18    NV18    NV18    NV18    SYS     SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     0-31,64-95      0               N/A
GPU3    NV18    NV18    NV18     X      NV18    NV18    NV18    NV18    SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS     0-31,64-95      0               N/A
GPU4    NV18    NV18    NV18    NV18     X      NV18    NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS     32-63,96-127    1               N/A
GPU5    NV18    NV18    NV18    NV18    NV18     X      NV18    NV18    SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     SYS     SYS     32-63,96-127    1               N/A
GPU6    NV18    NV18    NV18    NV18    NV18    NV18     X      NV18    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     SYS     32-63,96-127    1               N/A
GPU7    NV18    NV18    NV18    NV18    NV18    NV18    NV18     X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     32-63,96-127    1               N/A
NIC0    PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC1    SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC2    SYS     SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS     SYS     SYS     SYS
NIC3    SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS     SYS     SYS     SYS
NIC4    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      PIX     SYS     SYS     SYS     SYS
NIC5    SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX      X      SYS     SYS     SYS     SYS
NIC6    SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS     SYS
NIC7    SYS     SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS     SYS
NIC8    SYS     SYS     SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X      SYS
NIC9    SYS     SYS     SYS     SYS     SYS     SYS     SYS     PIX     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS     SYS      X

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1
  NIC2: mlx5_2
  NIC3: mlx5_3
  NIC4: mlx5_4
  NIC5: mlx5_5
  NIC6: mlx5_6
  NIC7: mlx5_7
  NIC8: mlx5_8
  NIC9: mlx5_9

==============================
     Environment Variables
==============================
LD_LIBRARY_PATH=/data/miniforge3/envs/gemma4/lib:/usr/local/lib/python3.10/dist-packages/torch/lib:/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64
NVIDIA_VISIBLE_DEVICES=GPU-24671c1e-079f-6322-626e-1c7bd58a02c0,GPU-9c856eb0-ae7d-b482-de7b-6dd1bcc628bd,GPU-734876d6-5318-d93e-a0b7-cb6f63190773,GPU-97d2f361-4d30-670b-1b2d-601d0f930f6d,GPU-a9b7fc96-a634-86fe-6175-369ff7d0dccc,GPU-71988e85-8559-41ab-8495-34484d4cc5ba,GPU-c9005258-78e3-6e32-a60e-52157c238bd7,GPU-ccc1fe3c-96b2-d25a-169c-7a39b99e992e
CUBLAS_VERSION=12.3.4.1
NVIDIA_REQUIRE_CUDA=cuda>=9.0
CUDA_CACHE_DISABLE=1
TORCH_CUDA_ARCH_LIST=5.2 6.0 6.1 7.0 7.2 7.5 8.0 8.6 8.7 9.0+PTX
NCCL_VERSION=2.19.stable.20231214+cuda12.3
NCCL_SOCKET_IFNAME=eth0
NVIDIA_DRIVER_CAPABILITIES=compute,utility,video
NCCL_DEBUG=INFO
VLLM_WORKER_MULTIPROC_METHOD=spawn
NVIDIA_PRODUCT_NAME=PyTorch
CUDA_VERSION=12.3.2.001
PYTORCH_VERSION=2.3.0a0+ebedce2
PYTORCH_BUILD_NUMBER=0
MAX_JOBS=4
CUDNN_VERSION=9.0.0.306
PYTORCH_HOME=/opt/pytorch/pytorch
NVIDIA_BUILD_ID=82611821
CUDA_DRIVER_VERSION=545.23.08
PYTORCH_BUILD_VERSION=2.3.0a0+ebedce2
CUDA_HOME=/usr/local/cuda
CUDA_HOME=/usr/local/cuda
CUDA_MODULE_LOADING=LAZY
NVIDIA_PYTORCH_VERSION=24.02
TORCH_ALLOW_TF32_CUBLAS_OVERRIDE=1
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_root
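
The environment block mixes paths from more than one Python installation: `LD_LIBRARY_PATH` lists library directories under `python3.10`, while the reported interpreter is Python 3.12. The issue does not establish whether this contributes to the failure. The sketch below is only a quick way to list such entries. `foreign_python_entries` is a hypothetical helper written for illustration, and the path string is copied from the `LD_LIBRARY_PATH` line above.

```python
import re


def foreign_python_entries(ld_library_path: str, running_version: str) -> list[str]:
    """Return entries that sit under a pythonX.Y directory other than running_version."""
    flagged = []
    for entry in ld_library_path.split(":"):
        match = re.search(r"/python(\d+\.\d+)/", entry)
        if match and match.group(1) != running_version:
            flagged.append(entry)
    return flagged


# Copied from the LD_LIBRARY_PATH line in the environment dump.
ld_library_path = (
    "/data/miniforge3/envs/gemma4/lib:"
    "/usr/local/lib/python3.10/dist-packages/torch/lib:"
    "/usr/local/lib/python3.10/dist-packages/torch_tensorrt/lib:"
    "/usr/local/cuda/compat/lib:"
    "/usr/local/nvidia/lib:"
    "/usr/local/nvidia/lib64"
)

# "3.12" is the interpreter version reported in the environment info.
flagged = foreign_python_entries(ld_library_path, "3.12")
for entry in flagged:
    print(entry)
```

With the values above, the two flagged entries are the `torch/lib` and `torch_tensorrt/lib` directories under `python3.10`.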

---

LD_LIBRARY_PATH=/data/miniforge3/envs/gemma4/lib:$LD_LIBRARY_PATH CUDA_VISIBLE_DEVICES=4 python -m vllm.entrypoints.openai.api_server --model gemma-4-31B-it-unsloth-bnb-4bit/ --gpu-memory-utilization 0.6 --max-model-len 16392 --port 18003 --max-num-seqs 1 --enforce-eager --quantization bitsandbytes

---

(EngineCore pid=49951) INFO 04-21 13:58:31 [gpu_model_runner.py:4820] Model loading took 19.61 GiB memory and 7.125814 seconds
(EngineCore pid=49951) INFO 04-21 13:58:32 [gpu_model_runner.py:5753] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 3 video items of the maximum feature size.
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] EngineCore failed to start.
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] Traceback (most recent call last):
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 848, in __init__
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     super().__init__(
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 124, in __init__
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 247, in _initialize_kv_caches
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 136, in determine_available_memory
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     return self.collective_rpc("determine_available_memory")
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 80, in collective_rpc
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 370, in determine_available_memory
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     self.model_runner.profile_run()
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5782, in profile_run
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     hidden_states, last_hidden_states = self._dummy_run(
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]                                         ^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5474, in _dummy_run
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     outputs = self.model(
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]               ^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     return self._call_impl(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     return forward_call(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/model_executor/models/gemma4_mm.py", line 1312, in forward
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     hidden_states = self.language_model.model(
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 467, in __call__
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     return self.forward(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 1226, in forward
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     hidden_states, residual = layer(
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]                               ^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     return self._call_impl(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     return forward_call(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 601, in forward
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     hidden_states = self.self_attn(
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]                     ^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     return self._call_impl(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     return forward_call(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 405, in forward
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     q, k, v = qkv.split([self.q_size, self.kv_size, self.kv_size], dim=-1)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/torch/_tensor.py", line 1066, in split
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     return torch._VF.split_with_sizes(
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] RuntimeError: split_with_sizes expects split_sizes to sum exactly to 18432 (input tensor's size at dimension -1), but got split_sizes=[16384, 2048, 2048]
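
The final line of the traceback is an arithmetic mismatch, which the following sketch makes explicit. It is plain Python with no vLLM or PyTorch dependency. `check_split` is a hypothetical helper written for illustration, and the sizes are copied from the error message. It only restates the numbers in the log and does not identify the root cause.

```python
def check_split(last_dim: int, split_sizes: list[int]) -> int:
    """Return how far the requested split sizes overshoot the tensor's last dimension.

    A result of 0 means the sizes sum exactly to last_dim; any other value is
    the condition reported by the RuntimeError in the log.
    """
    return sum(split_sizes) - last_dim


# Values copied from the RuntimeError in the log above.
qkv_last_dim = 18432
q_size, kv_size = 16384, 2048

excess = check_split(qkv_last_dim, [q_size, kv_size, kv_size])
print(f"requested {q_size + 2 * kv_size}, available {qkv_last_dim}, excess {excess}")
```

The excess is 2048, exactly one `kv_size`: the fused `qkv` tensor is wide enough for `q_size` plus a single KV-sized slice (16384 + 2048 = 18432), not two. The log does not say why the projection has that width for this checkpoint.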

---

from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_ID = "gemma-4-31B-it-unsloth-bnb-4bit"

# Load model
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto"
)
# Prompt
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a short prose"},
]

# Process input
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
inputs = processor(text=text, return_tensors="pt").to(model.device)
input_len = inputs["input_ids"].shape[-1]

# Generate output
outputs = model.generate(**inputs, max_new_tokens=1024)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)

# Parse output
processor.parse_response(response)
print(response)

---

The old clock on the mantel had forgotten how to keep time, its brass gears surrendered to a layer of velvet dust. It sat in a room that smelled of dried lavender and old paper, where the sunlight filtered through heavy linen curtains in long, slanted columns of gold.

Elara sat in the wingback chair, her fingers tracing the embossed leather of a journal that hadn't been opened in forty years. Outside, the autumn wind chased copper leaves across the cobblestones, a frantic dance of departure. She remembered when the house had been full of noise—the rhythmic thumping of boots, the bright collision of laughter, the scent of roasting coffee. Now, the silence was a physical thing, a heavy blanket that draped over the furniture and settled in the corners of the ceiling.

She opened the book to a pressed cornflower, its blue now a ghost of a color. As she read the handwritten ink—looped and hurried, written by a hand that had long since vanished—she felt the stillness of the room shift. For a fleeting moment, the air grew warm, and the silence didn't feel like an absence, but like a breath held in anticipation.

She closed her eyes and listened. Not to the ticking of the broken clock, but to the echo of a voice that lived only in the spaces between the heartbeats, reminding her that nothing is ever truly gone as long as there is a place for it to be remembered.<turn|>


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.3 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version                : Could not collect
CMake version                : version 3.28.1
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.10.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A
XPU used to build PyTorch    : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.13 | packaged by conda-forge | (main, Mar  5 2026, 16:50:00) [GCC 14.3.0] (64-bit runtime)
Python platform              : Linux-5.15.0-141-generic-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : 12.3.107
CUDA_MODULE_LOADING set to   : LAZY
GPU models and configuration :
GPU 0: NVIDIA H100 80GB HBM3
GPU 1: NVIDIA H100 80GB HBM3
GPU 2: NVIDIA H100 80GB HBM3
GPU 3: NVIDIA H100 80GB HBM3
GPU 4: NVIDIA H100 80GB HBM3
GPU 5: NVIDIA H100 80GB HBM3
GPU 6: NVIDIA H100 80GB HBM3
GPU 7: NVIDIA H100 80GB HBM3

Nvidia driver version        : 550.54.15
cuDNN version                : Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.0.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.0.0
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        52 bits physical, 57 bits virtual
Byte Order:                           Little Endian
CPU(s):                               128
On-line CPU(s) list:                  0-127
Vendor ID:                            GenuineIntel
Model name:                           Intel(R) Xeon(R) Gold 6430
CPU family:                           6
Model:                                143
Thread(s) per core:                   2
Core(s) per socket:                   32
Socket(s):                            2
Stepping:                             8
CPU max MHz:                          3400.0000
CPU min MHz:                          800.0000
BogoMIPS:                             4200.00
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 invpcid_single intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization:                       VT-x
L1d cache:                            3 MiB (64 instances)
L1i cache:                            2 MiB (64 instances)
L2 cache:                             128 MiB (64 instances)
L3 cache:                             120 MiB (2 instances)
NUMA node(s):                         2
NUMA node0 CPU(s):                    0-31,64-95
NUMA node1 CPU(s):                    32-63,96-127
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI BHI_DIS_S
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

==============================
Versions of relevant libraries
==============================
[pip3] flashinfer-python==0.6.6
[pip3] numpy==2.2.6
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.5.0.dev0
[pip3] nvidia-cutlass-dsl-libs-base==4.5.0.dev0
[pip3] nvidia-ml-py==13.595.45
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvshmem-cu12==3.4.5
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] pyzmq==27.1.0
[pip3] torch==2.10.0
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.10.0
[pip3] torchvision==0.25.0
[pip3] transformers==5.6.0.dev0
[pip3] triton==3.6.0
[conda] flashinfer-python         0.6.6                    pypi_0    pypi
[conda] numpy                     2.2.6                    pypi_0    pypi
[conda] nvidia-cublas-cu12        12.8.4.1                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.8.90                  pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.8.93                  pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.8.90                  pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.10.2.21                pypi_0    pypi
[conda] nvidia-cudnn-frontend     1.18.0                   pypi_0    pypi
[conda] nvidia-cufft-cu12         11.3.3.83                pypi_0    pypi
[conda] nvidia-cufile-cu12        1.13.1.3                 pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.9.90                pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.7.3.90                pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.5.8.93                pypi_0    pypi
[conda] nvidia-cusparselt-cu12    0.7.1                    pypi_0    pypi
[conda] nvidia-cutlass-dsl        4.5.0.dev0               pypi_0    pypi
[conda] nvidia-cutlass-dsl-libs-base 4.5.0.dev0               pypi_0    pypi
[conda] nvidia-ml-py              13.595.45                pypi_0    pypi
[conda] nvidia-nccl-cu12          2.27.5                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.8.93                  pypi_0    pypi
[conda] nvidia-nvshmem-cu12       3.4.5                    pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.8.90                  pypi_0    pypi
[conda] pyzmq                     27.1.0                   pypi_0    pypi
[conda] torch                     2.10.0                   pypi_0    pypi
[conda] torch-c-dlpack-ext        0.1.5                    pypi_0    pypi
[conda] torchaudio                2.10.0                   pypi_0    pypi
[conda] torchvision               0.25.0                   pypi_0    pypi
[conda] transformers              5.6.0.dev0               pypi_0    pypi
[conda] triton                    3.6.0                    pypi_0    pypi

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.19.1
vLLM Build Flags:
  CUDA Archs: 5.2 6.0 6.1 7.0 7.2 7.5 8.0 8.6 8.7 9.0+PTX; ROCm: Disabled; XPU: Disabled
</details>

🐛 Describe the bug

When I tried to deploy the gemma-4-31B-it-unsloth-bnb-4bit model through vLLM, an error occurred during deployment.

My deployment command is as follows:

LD_LIBRARY_PATH=/data/miniforge3/envs/gemma4/lib:$LD_LIBRARY_PATH CUDA_VISIBLE_DEVICES=4 python -m vllm.entrypoints.openai.api_server --model gemma-4-31B-it-unsloth-bnb-4bit/ --gpu-memory-utilization 0.6 --max-model-len 16392 --port 18003 --max-num-seqs 1 --enforce-eager --quantization bitsandbytes

The error message is as follows:

(EngineCore pid=49951) INFO 04-21 13:58:31 [gpu_model_runner.py:4820] Model loading took 19.61 GiB memory and 7.125814 seconds
(EngineCore pid=49951) INFO 04-21 13:58:32 [gpu_model_runner.py:5753] Encoder cache will be initialized with a budget of 8192 tokens, and profiled with 3 video items of the maximum feature size.
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] EngineCore failed to start.
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] Traceback (most recent call last):
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1082, in run_engine_core
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 848, in __init__
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     super().__init__(
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 124, in __init__
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 247, in _initialize_kv_caches
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/v1/executor/abstract.py", line 136, in determine_available_memory
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     return self.collective_rpc("determine_available_memory")
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 80, in collective_rpc
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 370, in determine_available_memory
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     self.model_runner.profile_run()
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5782, in profile_run
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     hidden_states, last_hidden_states = self._dummy_run(
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]                                         ^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     return func(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 5474, in _dummy_run
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     outputs = self.model(
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]               ^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     return self._call_impl(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     return forward_call(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/model_executor/models/gemma4_mm.py", line 1312, in forward
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     hidden_states = self.language_model.model(
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/compilation/decorators.py", line 467, in __call__
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     return self.forward(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 1226, in forward
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     hidden_states, residual = layer(
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]                               ^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     return self._call_impl(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     return forward_call(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 601, in forward
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     hidden_states = self.self_attn(
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]                     ^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     return self._call_impl(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     return forward_call(*args, **kwargs)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/vllm/model_executor/models/gemma4.py", line 405, in forward
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     q, k, v = qkv.split([self.q_size, self.kv_size, self.kv_size], dim=-1)
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]   File "/data/miniforge3/envs/gemma4_unsloth/lib/python3.12/site-packages/torch/_tensor.py", line 1066, in split
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]     return torch._VF.split_with_sizes(
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=49951) ERROR 04-21 13:58:46 [core.py:1108] RuntimeError: split_with_sizes expects split_sizes to sum exactly to 18432 (input tensor's size at dimension -1), but got split_sizes=[16384, 2048, 2048]

When I load the same model with Transformers and run inference, it produces output normally. The code I used to load the model and run inference with Transformers is as follows:

from transformers import AutoProcessor, AutoModelForCausalLM

MODEL_ID = "gemma-4-31B-it-unsloth-bnb-4bit"

# Load model
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    dtype="auto",
    device_map="auto"
)
# Prompt
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a short prose"},
]

# Process input
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
inputs = processor(text=text, return_tensors="pt").to(model.device)
input_len = inputs["input_ids"].shape[-1]

# Generate output
outputs = model.generate(**inputs, max_new_tokens=1024)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)

# Parse output
processor.parse_response(response)
print(response)

The code execution result is as follows:

The old clock on the mantel had forgotten how to keep time, its brass gears surrendered to a layer of velvet dust. It sat in a room that smelled of dried lavender and old paper, where the sunlight filtered through heavy linen curtains in long, slanted columns of gold.

Elara sat in the wingback chair, her fingers tracing the embossed leather of a journal that hadn't been opened in forty years. Outside, the autumn wind chased copper leaves across the cobblestones, a frantic dance of departure. She remembered when the house had been full of noise—the rhythmic thumping of boots, the bright collision of laughter, the scent of roasting coffee. Now, the silence was a physical thing, a heavy blanket that draped over the furniture and settled in the corners of the ceiling.

She opened the book to a pressed cornflower, its blue now a ghost of a color. As she read the handwritten ink—looped and hurried, written by a hand that had long since vanished—she felt the stillness of the room shift. For a fleeting moment, the air grew warm, and the silence didn't feel like an absence, but like a breath held in anticipation.

She closed her eyes and listened. Not to the ticking of the broken clock, but to the echo of a voice that lived only in the spaces between the heartbeats, reminding her that nothing is ever truly gone as long as there is a place for it to be remembered.<turn|>

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

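The failure is a shape mismatch inside vLLM's Gemma 4 attention forward pass (gemma4.py, line 405), not a problem with the request input. The fused QKV tensor has a last dimension of 18432, while the split asks for 16384 + 2048 + 2048 = 20480. The crash happens in the startup profiling run (profile_run), before any request is served.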

Guidance

Compare the sizes in the error: The tensor's last dimension is 18432, but the requested split sizes [16384, 2048, 2048] sum to 20480. The difference, 2048, equals exactly one kv_size block.
Rule out the request path: The error is raised in profile_run during engine startup, so prompt length and --max-model-len are unlikely to be the cause.
Look at weight loading: The model loads and generates correctly with Transformers, which suggests the checkpoint itself is usable and that the problem lies in how vLLM loads or lays out the bitsandbytes 4-bit attention weights for this model.
Check version compatibility: This page lists PR #40606 as the fix for this issue, so confirm whether the installed vLLM build includes it before changing anything else.

Example

The issue does not include a minimal reproducer for the failing call. The Transformers snippet in the issue runs a different code path and does not trigger the failing split.
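
Although the issue itself gives no reproducer for the failing call, the numbers in the error message can be checked with a short sketch. The helper below is hypothetical (it is not part of vLLM or PyTorch); it only parses the final traceback line quoted in the issue and compares the tensor width with the sum of the requested split sizes.

```python
import re

# Final line of the traceback quoted in the issue.
ERROR_LINE = (
    "RuntimeError: split_with_sizes expects split_sizes to sum exactly to 18432 "
    "(input tensor's size at dimension -1), but got split_sizes=[16384, 2048, 2048]"
)


def parse_split_error(message: str) -> dict:
    """Extract the tensor width and the requested split sizes from the error text."""
    match = re.search(r"sum exactly to (\d+).*split_sizes=\[([\d,\s]+)\]", message)
    if match is None:
        raise ValueError("message does not look like a split_with_sizes error")
    actual_width = int(match.group(1))
    split_sizes = [int(part) for part in match.group(2).split(",")]
    requested_width = sum(split_sizes)
    return {
        "actual_width": actual_width,
        "split_sizes": split_sizes,
        "requested_width": requested_width,
        # Positive value: the split asks for more columns than the tensor has.
        "shortfall": requested_width - actual_width,
    }


report = parse_split_error(ERROR_LINE)
print(report)
```

The reported shortfall equals the size of one of the two trailing split entries, which narrows the problem to the key/value part of the fused projection rather than the query part.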

Notes

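The RuntimeError is raised by qkv.split([self.q_size, self.kv_size, self.kv_size], dim=-1) in vllm/model_executor/models/gemma4.py. The issue was reported for the gemma-4-31B-it-unsloth-bnb-4bit checkpoint served with --quantization bitsandbytes. Whether other Gemma 4 checkpoints or other quantization methods are affected is not established here.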
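
One reading that is consistent with the numbers is that the fused projection held the query block plus only one key/value-sized block. This is an assumption: the issue text does not confirm it, although the `k_eq_v` in the title of the pull request listed on this page points in the same direction. The sketch below only does the arithmetic; the layout names are illustrative and are not vLLM identifiers.

```python
# Sizes copied from the error message in the issue.
Q_SIZE = 16384
KV_SIZE = 2048
ACTUAL_WIDTH = 18432


def fused_width(q_size: int, kv_size: int, kv_blocks: int) -> int:
    """Width of a fused projection holding Q plus `kv_blocks` KV-sized blocks."""
    return q_size + kv_blocks * kv_size


# Hypothetical layouts to compare against the observed tensor width.
layouts = {
    "q + k + v (what the split requests)": fused_width(Q_SIZE, KV_SIZE, 2),
    "q + one kv block": fused_width(Q_SIZE, KV_SIZE, 1),
}

matching = [name for name, width in layouts.items() if width == ACTUAL_WIDTH]
for name, width in layouts.items():
    print(f"{name}: {width}")
print("matches observed width:", matching)
```

Only the single-block layout reproduces the observed width, so a missing or aliased key/value block is a plausible place to look, but the sketch does not prove that this is what vLLM actually built.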

Recommendation

Changing the input size is unlikely to help, because the failure occurs before any request is processed. Try a vLLM build that includes the fix from PR #40606, which this page lists as the fix for the issue. Until then, the model can be run through Transformers as shown in the issue. If the error persists after upgrading, report it to the vLLM maintainers with the full traceback.

#api #agent execution #callback error #model loading #environment variable

PR fix notes

PR #40606: [Bugfix] Gemma-4: Add bnb QuantState alias hook on k_eq_v to load Gemma-4 BNB 4-bit weights
