vllm - ✅ (Solved) Fix Bug: ValueError: too many values to unpack in dispatch_cpu_unquantized_gemm when loading Qwen3.5-4B [2 pull requests, 1 comment, 2 participants]

GitHub stats
vllm-project/vllm#38591 (fetched 2026-04-08 01:53:14)
Comments: 1 · Participants: 2 · Timeline events: 2 (1 comment, 1 cross-reference) · Reactions: 0

When attempting to run offline inference with the newly supported Qwen/Qwen3.5-4B model on a non-CUDA environment (macOS Apple Silicon / CPU), the LLM initialization crashes during weight loading.

The crash occurs inside dispatch_cpu_unquantized_gemm because it assumes all linear layer weights are exactly 2-dimensional (N, K = layer.weight.size()), but the new Qwen3_5ForConditionalGeneration architecture appears to pass layers with non-2D weights through the unquantized GEMM CPU dispatch. The "too many values to unpack" message points to a weight with more than two dimensions; a 1D weight would raise "not enough values to unpack" instead.

Note: The model loads and serves correctly when running on CUDA environments via the OpenAI compatible server, indicating this is specifically a bug with the CPU/MPS fallback weight loader in vLLM.

Error Message

INFO 03-30 11:44:47 [cpu_model_runner.py:71] Starting to load model Qwen/Qwen3.5-4B...
...
(EngineCore pid=95230) ERROR 03-30 11:44:54 [core.py:1108] EngineCore failed to start.
(EngineCore pid=95230) ERROR 03-30 11:44:54 [core.py:1108] Traceback (most recent call last):
...
(EngineCore pid=95230)   File "vllm/model_executor/model_loader/utils.py", line 107, in process_weights_after_loading
(EngineCore pid=95230)     quant_method.process_weights_after_loading(module)
(EngineCore pid=95230)   File "vllm/model_executor/layers/linear.py", line 218, in process_weights_after_loading
(EngineCore pid=95230)     dispatch_cpu_unquantized_gemm(layer, remove_weight=True)
(EngineCore pid=95230)   File "vllm/model_executor/layers/utils.py", line 231, in dispatch_cpu_unquantized_gemm
(EngineCore pid=95230)     N, K = layer.weight.size()
(EngineCore pid=95230)     ^^^^
(EngineCore pid=95230) ValueError: too many values to unpack (expected 2)
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {'EngineCore': 1}

Root Cause

dispatch_cpu_unquantized_gemm unconditionally unpacks layer.weight.size() into two values (N, K = layer.weight.size()), which only works for 2D weights. According to PR #38600, multimodal models like Qwen3.5 have vision encoder layers (e.g., Conv3dLayer) with 5D weights, and these also pass through the unquantized CPU GEMM dispatch, so the unpacking raises ValueError during weight post-processing.
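
The wording of the error narrows down the shape involved. torch.Size is a tuple subclass, so two-name unpacking behaves like unpacking a plain tuple. The minimal sketch below uses plain tuples (the shapes are illustrative, not read from the model, and `unpack_error` is a helper invented for this example) to show that "too many values" can only come from a weight with more than two dimensions, while a 1D weight fails with a different message.

```python
def unpack_error(shape):
    """Return the ValueError message raised by `N, K = shape`, or None on success."""
    try:
        N, K = shape  # same pattern as `N, K = layer.weight.size()`
    except ValueError as exc:
        return str(exc)
    return None


linear_shape = (4096, 1024)          # 2D: unpacks cleanly
conv3d_shape = (1024, 3, 2, 16, 16)  # 5D: illustrative Conv3d-style kernel
bias_like_shape = (4096,)            # 1D: fails, but with a different message

for shape in (linear_shape, conv3d_shape, bias_like_shape):
    print(len(shape), "dims ->", unpack_error(shape))
```

Only the 5D case reports "too many values to unpack", which matches the traceback and is consistent with a convolution kernel reaching the dispatch function.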

Fix Action

PR fix notes

PR #38600: [Bugfix] too many values to unpack in dispatch_cpu_unquantized_gemm

Description (problem / solution / changelog)

Purpose

Fix #38591: ValueError when loading Qwen3.5-4B on CPU/Mac environments.

The dispatch_cpu_unquantized_gemm function assumes all linear layer weights are exactly 2-dimensional (N, K = layer.weight.size()), but multimodal models like Qwen3.5 have vision encoder layers (e.g., Conv3dLayer) with 5D weights. This causes a ValueError: too many values to unpack (expected 2) crash during model loading on CPU platforms.

Root Cause: The function unconditionally unpacks layer.weight.size() into two values, which fails for non-2D weight tensors.

Fix: Add a dimension check to skip non-2D weights and fall back to standard torch.nn.functional.linear. This allows Conv layers (3D/4D/5D weights) to bypass the GEMM optimization path that only supports 2D weight tensors.
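
As a rough illustration of that guard, the sketch below models the routing decision without importing vLLM or PyTorch. `FakeLayer`, `choose_cpu_linear_path`, the path labels, and the layer names and shapes are all hypothetical, invented for this example; the PR's actual diff is not reproduced here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class FakeLayer:
    """Hypothetical stand-in for a layer; only the weight shape matters here."""
    name: str
    weight_shape: Tuple[int, ...]


def choose_cpu_linear_path(layer: FakeLayer) -> str:
    """Pick a path for the layer, mirroring the dimension check described above."""
    if len(layer.weight_shape) != 2:
        # Non-2D weights (e.g., Conv kernels) skip the GEMM-specific setup.
        return "fallback"
    n, k = layer.weight_shape  # safe: exactly two dimensions at this point
    return f"optimized_gemm(N={n}, K={k})"


layers = [
    FakeLayer("mlp.down_proj", (2560, 9728)),               # illustrative 2D weight
    FakeLayer("visual.patch_embed", (1024, 3, 2, 16, 16)),  # illustrative 5D weight
]
for layer in layers:
    print(layer.name, "->", choose_cpu_linear_path(layer))
```

Checking the rank before unpacking keeps the 2D path untouched, so linear layers behave exactly as before and only the shapes that previously crashed take the new branch.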

Test Plan

  1. Test Qwen3.5-4B loading on CPU:
python -c "
from vllm import LLM
llm = LLM(
    model='Qwen/Qwen3.5-4B',
    trust_remote_code=True,
    max_model_len=16384,
    enforce_eager=True,
)
print('Model loaded successfully!')
"
  2. Run existing CPU tests to ensure no regression:
.venv/bin/python -m pytest tests/kernels/test_onednn.py -v

Test Result

Before fix: Model loading crashes with:
ValueError: too many values to unpack (expected 2)
  File "vllm/model_executor/layers/utils.py", line 231, in dispatch_cpu_unquantized_gemm
    N, K = layer.weight.size()

After fix: Model loads successfully on CPU/Mac. Non-2D weight layers (Conv) fall back to standard linear implementation, while 2D weight layers continue to use optimized GEMM kernels.



Changed files

  • vllm/model_executor/layers/utils.py (modified, +7/-0)

PR #3099: feat: add vLLM Chat Generator

Description (problem / solution / changelog)

Related Issues

Proposed Changes:

  • Add vllm-haystack integration scaffolding
  • Implement a vLLM Chat Generator: similar to OpenAIChatGenerator but specifically handles reasoning

How did you test it?

CI: new unit tests and integration tests (using Qwen/Qwen3-0.6B)

Notes for the reviewer

I initially inherited from OpenAIChatGenerator, then I realized that I was overriding most methods. Plus, vLLM is a bit simpler: no tools_strict handling, no different endpoint for structured generation... So I ended up building a simple standalone component.

Changed files

  • .github/labeler.yml (modified, +5/-0)
  • .github/workflows/CI_coverage_comment.yml (modified, +2/-1)
  • .github/workflows/CI_workflows_linting.yml (modified, +1/-1)
  • .github/workflows/vllm.yml (added, +180/-0)
  • README.md (modified, +1/-0)
  • integrations/vllm/LICENSE.txt (added, +201/-0)
  • integrations/vllm/README.md (added, +20/-0)
  • integrations/vllm/pydoc/config_docusaurus.yml (added, +13/-0)
  • integrations/vllm/pyproject.toml (added, +161/-0)
  • integrations/vllm/src/haystack_integrations/components/generators/py.typed (added, +0/-0)
  • integrations/vllm/src/haystack_integrations/components/generators/vllm/__init__.py (added, +7/-0)
  • integrations/vllm/src/haystack_integrations/components/generators/vllm/chat/__init__.py (added, +3/-0)
  • integrations/vllm/src/haystack_integrations/components/generators/vllm/chat/chat_generator.py (added, +549/-0)
  • integrations/vllm/tests/__init__.py (added, +3/-0)
  • integrations/vllm/tests/test_chat_generator.py (added, +634/-0)


Bug Report: ValueError: too many values to unpack in dispatch_cpu_unquantized_gemm when loading Qwen3.5-4B

Environment

  • OS: macOS (Apple Silicon M-series)
  • vLLM Version: Nightly / Source (commit ab1a6a43fa9500697dd01e73aa372c8777cd7a5b)
  • Python Version: 3.12.12
  • Model: Qwen/Qwen3.5-4B

Steps to Reproduce

Run the standard offline LLM initialization OR run the api_server on a CPU/Mac machine:

Method 1 (Offline inference):

from vllm import LLM

# Fails during initialization before generation even starts
llm = LLM(
    model="Qwen/Qwen3.5-4B",
    trust_remote_code=True,
    max_model_len=16384,
    enforce_eager=True,
)

Method 2 (Serving):

python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen3.5-4B

Traceback

INFO 03-30 11:44:47 [cpu_model_runner.py:71] Starting to load model Qwen/Qwen3.5-4B...
...
(EngineCore pid=95230) ERROR 03-30 11:44:54 [core.py:1108] EngineCore failed to start.
(EngineCore pid=95230) ERROR 03-30 11:44:54 [core.py:1108] Traceback (most recent call last):
...
(EngineCore pid=95230)   File "vllm/model_executor/model_loader/utils.py", line 107, in process_weights_after_loading
(EngineCore pid=95230)     quant_method.process_weights_after_loading(module)
(EngineCore pid=95230)   File "vllm/model_executor/layers/linear.py", line 218, in process_weights_after_loading
(EngineCore pid=95230)     dispatch_cpu_unquantized_gemm(layer, remove_weight=True)
(EngineCore pid=95230)   File "vllm/model_executor/layers/utils.py", line 231, in dispatch_cpu_unquantized_gemm
(EngineCore pid=95230)     N, K = layer.weight.size()
(EngineCore pid=95230)     ^^^^
(EngineCore pid=95230) ValueError: too many values to unpack (expected 2)
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {'EngineCore': 1}

Analysis / Possible Fix

In vllm/model_executor/layers/utils.py inside dispatch_cpu_unquantized_gemm:

def dispatch_cpu_unquantized_gemm(layer, remove_weight=True):
    # ...
    N, K = layer.weight.size() # <--- CRASH HERE

For Qwen3_5ForConditionalGeneration, one of the layers iterating through this process has a weight tensor that does not have exactly 2 dimensions (likely the vision encoder layers or a specific 1D projection bias treated as a weight).

Adding a dimension check or modifying the CPU dispatch logic to handle Qwen3.5's multimodal weight shapes should resolve this for Mac/CPU users.
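
To see which layers might be affected before patching anything, one option is to list every weight whose shape is not 2D. The helper below is only a sketch: `find_non_2d_weights` is a hypothetical name, the example names and shapes are illustrative rather than read from the checkpoint, and it takes plain (name, shape) pairs so it runs without PyTorch. The result is a list of candidates; not every entry necessarily reaches dispatch_cpu_unquantized_gemm.

```python
from typing import Iterable, List, Tuple

Shape = Tuple[int, ...]


def find_non_2d_weights(
    named_shapes: Iterable[Tuple[str, Shape]],
) -> List[Tuple[str, Shape]]:
    """Return (name, shape) pairs for parameters named '*.weight' that are not 2D."""
    return [
        (name, shape)
        for name, shape in named_shapes
        if name.endswith(".weight") and len(shape) != 2
    ]


# Illustrative names and shapes, not read from the real checkpoint.
example = [
    ("model.layers.0.mlp.up_proj.weight", (9728, 2560)),
    ("model.layers.0.mlp.up_proj.bias", (9728,)),
    ("visual.patch_embed.proj.weight", (1024, 3, 2, 16, 16)),
]
for name, shape in find_non_2d_weights(example):
    print(f"{name}: {len(shape)}D {shape}")

# With PyTorch, the input could be built from a loaded module as:
#   ((n, tuple(p.shape)) for n, p in model.named_parameters())
```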

extent analysis

Fix Plan

To resolve the ValueError: too many values to unpack issue in dispatch_cpu_unquantized_gemm, we need to modify the CPU dispatch logic to handle multimodal weight shapes in the Qwen3.5-4B model.

Step-by-Step Solution

  • Modify the dispatch_cpu_unquantized_gemm function in vllm/model_executor/layers/utils.py to check the number of dimensions in the layer.weight tensor.
  • Handle cases where the weight tensor has more or less than 2 dimensions.

Example code:

import torch

def dispatch_cpu_unquantized_gemm(layer, remove_weight=True):
    # Sketch only: the guard follows the fix described in PR #38600.
    if layer.weight.dim() != 2:
        # Non-2D weights (e.g., 5D Conv3d kernels in the vision encoder)
        # cannot use the 2D-only GEMM path, so fall back to the standard
        # linear op. The attribute name `cpu_linear` is an assumption.
        layer.cpu_linear = torch.nn.functional.linear
        return
    N, K = layer.weight.size()  # safe: the weight is known to be 2D here
    # ... existing logic for 2D weights (unchanged)
  • Leave the existing 2D path unchanged so that linear layers keep using the optimized GEMM kernels.

Verification

To verify the fix, run the offline LLM initialization or the api_server on a CPU/Mac machine using the modified code. The ValueError: too many values to unpack error should be resolved, and the model should load and serve correctly.

Extra Tips

  • Ensure that the modified logic handles all possible weight shapes in the Qwen3.5-4B model.
  • Test the modified code thoroughly to ensure that it works correctly for all use cases.
  • Consider adding additional logging or debugging statements to help diagnose any future issues.
