vllm - ✅(Solved) Fix Bug: ValueError: too many values to unpack in dispatch_cpu_unquantized_gemm when loading Qwen3.5-4B [2 pull requests, 1 comment, 2 participants]

vllm · 2026-03-30 22:17:51


vllm-project/vllm#38591•Fetched 2026-04-08 01:53:14


Author

miguel-flowstate

Participants

boymucheng

miguel-flowstate

Timeline (top)

commented ×1, cross-referenced ×1

When attempting to run offline inference with the newly supported Qwen/Qwen3.5-4B model on a non-CUDA environment (macOS Apple Silicon / CPU), the LLM initialization crashes during weight loading.

The crash occurs inside dispatch_cpu_unquantized_gemm because it assumes all linear layer weights are exactly 2-dimensional (N, K = layer.weight.size()), but the new Qwen3_5ForConditionalGeneration architecture appears to have layers with 1D or 3D+ weights being passed through the unquantized GEMM CPU dispatch.
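The failure mode is ordinary Python tuple unpacking: `torch.Size` is a tuple subclass, so `N, K = layer.weight.size()` only succeeds when the shape has exactly two entries. The sketch below reproduces the same class of error with plain tuples standing in for weight shapes. The function name and the specific shapes are illustrative assumptions, not values read from the model.

```python
def unpack_as_gemm_shape(shape):
    """Mimic `N, K = layer.weight.size()` on a plain tuple standing in for torch.Size."""
    n, k = shape  # raises ValueError unless len(shape) == 2
    return n, k


# Illustrative shapes only (assumed, not read from the Qwen3.5-4B checkpoint).
linear_shape = (4096, 1024)           # 2D: (out_features, in_features)
conv3d_shape = (1024, 3, 2, 16, 16)   # 5D: (out, in, depth, height, width)
bias_like_shape = (4096,)             # 1D

print(unpack_as_gemm_shape(linear_shape))

for shape in (conv3d_shape, bias_like_shape):
    try:
        unpack_as_gemm_shape(shape)
    except ValueError as exc:
        print(f"{len(shape)}D shape {shape}: {exc}")
```

A 1D shape also fails, but with "not enough values to unpack", so the "too many values" message in the traceback points to a weight with three or more dimensions.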

Note: The model loads and serves correctly when running on CUDA environments via the OpenAI compatible server, indicating this is specifically a bug with the CPU/MPS fallback weight loader in vLLM.

Error Message

INFO 03-30 11:44:47 [cpu_model_runner.py:71] Starting to load model Qwen/Qwen3.5-4B...
...
(EngineCore pid=95230) ERROR 03-30 11:44:54 [core.py:1108] EngineCore failed to start.
(EngineCore pid=95230) ERROR 03-30 11:44:54 [core.py:1108] Traceback (most recent call last):
...
(EngineCore pid=95230)   File "vllm/model_executor/model_loader/utils.py", line 107, in process_weights_after_loading
(EngineCore pid=95230)     quant_method.process_weights_after_loading(module)
(EngineCore pid=95230)   File "vllm/model_executor/layers/linear.py", line 218, in process_weights_after_loading
(EngineCore pid=95230)     dispatch_cpu_unquantized_gemm(layer, remove_weight=True)
(EngineCore pid=95230)   File "vllm/model_executor/layers/utils.py", line 231, in dispatch_cpu_unquantized_gemm
(EngineCore pid=95230)     N, K = layer.weight.size()
(EngineCore pid=95230)     ^^^^
(EngineCore pid=95230) ValueError: too many values to unpack (expected 2)
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {'EngineCore': 1}


PR fix notes

PR #38600: [Bugfix] too many values to unpack in dispatch_cpu_unquantized_gemm

Repository: vllm-project/vllm
Author: boymucheng
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/38600

Description (problem / solution / changelog)

Purpose

Fix #38591: ValueError when loading Qwen3.5-4B on CPU/Mac environments.

The dispatch_cpu_unquantized_gemm function assumes all linear layer weights are exactly 2-dimensional (N, K = layer.weight.size()), but multimodal models like Qwen3.5 have vision encoder layers (e.g., Conv3dLayer) with 5D weights. This causes a ValueError: too many values to unpack (expected 2) crash during model loading on CPU platforms.

Root Cause: The function unconditionally unpacks layer.weight.size() into two values, which fails for non-2D weight tensors.

Fix: Add a dimension check to skip non-2D weights and fall back to standard torch.nn.functional.linear. This allows Conv layers (3D/4D/5D weights) to bypass the GEMM optimization path that only supports 2D weight tensors.
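The dimension check described above can be sketched as follows. This is not the PR's actual diff: `select_cpu_linear_path`, the returned labels, and the example shapes are hypothetical, used only to illustrate guarding the 2D-only unpack.

```python
from typing import Sequence


def select_cpu_linear_path(weight_shape: Sequence[int]) -> str:
    """Return which path a layer would take under a 2D-only dimension guard.

    Hypothetical helper: it returns a label instead of touching a real layer,
    so the control flow is easy to inspect.
    """
    if len(weight_shape) != 2:
        # Conv-style weights (3D/4D/5D) never reach the `N, K = ...` unpack.
        return "fallback"
    n, k = weight_shape  # safe: exactly two dimensions
    # Kernel selection based on (n, k) would go here in a real dispatcher.
    return "optimized_gemm"


print(select_cpu_linear_path((4096, 1024)))          # 2D linear weight
print(select_cpu_linear_path((1024, 3, 2, 16, 16)))  # 5D Conv3d-style weight
```

Placing the check before the unpack keeps the 2D path unchanged, which matches the PR's stated goal that 2D layers continue to use the optimized kernels.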

Test Plan

Test Qwen3.5-4B loading on CPU:

python -c "
from vllm import LLM
llm = LLM(
    model='Qwen/Qwen3.5-4B',
    trust_remote_code=True,
    max_model_len=16384,
    enforce_eager=True,
)
print('Model loaded successfully!')
"

Run existing CPU tests to ensure no regression:

.venv/bin/python -m pytest tests/kernels/test_onednn.py -v

Test Result

Before fix: Model loading crashes with "ValueError: too many values to unpack (expected 2)", raised at vllm/model_executor/layers/utils.py, line 231, in dispatch_cpu_unquantized_gemm, on the statement N, K = layer.weight.size().

After fix: Model loads successfully on CPU/Mac. Non-2D weight layers (Conv) fall back to standard linear implementation, while 2D weight layers continue to use optimized GEMM kernels.


Changed files

vllm/model_executor/layers/utils.py (modified, +7/-0)

PR #3099: feat: add vLLM Chat Generator

Repository: deepset-ai/haystack-core-integrations
Author: anakin87
State: closed | merged: True
Link: https://github.com/deepset-ai/haystack-core-integrations/pull/3099

Description (problem / solution / changelog)

Related Issues

part of https://github.com/deepset-ai/haystack-core-integrations/issues/2007
fixes #1958

Proposed Changes:

Add vllm-haystack integration scaffolding
Implement a vLLM Chat Generator: similar to OpenAIChatGenerator but specifically handles reasoning

How did you test it?

CI: new unit tests and integration tests (using Qwen/Qwen3-0.6B)

Notes for the reviewer

I initially inherited from OpenAIChatGenerator, then I realized that I was overriding most methods. Plus, vLLM is a bit simpler: no tools_strict handling, no different endpoint for structured generation... So I ended up building a simple standalone component.

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.

Changed files

.github/labeler.yml (modified, +5/-0)
.github/workflows/CI_coverage_comment.yml (modified, +2/-1)
.github/workflows/CI_workflows_linting.yml (modified, +1/-1)
.github/workflows/vllm.yml (added, +180/-0)
README.md (modified, +1/-0)
integrations/vllm/LICENSE.txt (added, +201/-0)
integrations/vllm/README.md (added, +20/-0)
integrations/vllm/pydoc/config_docusaurus.yml (added, +13/-0)
integrations/vllm/pyproject.toml (added, +161/-0)
integrations/vllm/src/haystack_integrations/components/generators/py.typed (added, +0/-0)
integrations/vllm/src/haystack_integrations/components/generators/vllm/__init__.py (added, +7/-0)
integrations/vllm/src/haystack_integrations/components/generators/vllm/chat/__init__.py (added, +3/-0)
integrations/vllm/src/haystack_integrations/components/generators/vllm/chat/chat_generator.py (added, +549/-0)
integrations/vllm/tests/__init__.py (added, +3/-0)
integrations/vllm/tests/test_chat_generator.py (added, +634/-0)


Bug Report: `ValueError: too many values to unpack` in `dispatch_cpu_unquantized_gemm` when loading Qwen3.5-4B

Description

Note: The model loads and serves correctly when running on CUDA environments via the OpenAI compatible server, indicating this is specifically a bug with the CPU/MPS fallback weight loader in vLLM.

Environment

OS: macOS (Apple Silicon M-series)
vLLM Version: Nightly / Source (commit ab1a6a43fa9500697dd01e73aa372c8777cd7a5b)
Python Version: 3.12.12
Model: Qwen/Qwen3.5-4B

Steps to Reproduce

Run the standard offline LLM initialization OR run the api_server on a CPU/Mac machine:

Method 1 (Offline inference):

from vllm import LLM

# Fails during initialization before generation even starts
llm = LLM(
    model="Qwen/Qwen3.5-4B",
    trust_remote_code=True,
    max_model_len=16384,
    enforce_eager=True,
)

Method 2 (Serving):

python -m vllm.entrypoints.openai.api_server --model Qwen/Qwen3.5-4B

Traceback

INFO 03-30 11:44:47 [cpu_model_runner.py:71] Starting to load model Qwen/Qwen3.5-4B...
...
(EngineCore pid=95230) ERROR 03-30 11:44:54 [core.py:1108] EngineCore failed to start.
(EngineCore pid=95230) ERROR 03-30 11:44:54 [core.py:1108] Traceback (most recent call last):
...
(EngineCore pid=95230)   File "vllm/model_executor/model_loader/utils.py", line 107, in process_weights_after_loading
(EngineCore pid=95230)     quant_method.process_weights_after_loading(module)
(EngineCore pid=95230)   File "vllm/model_executor/layers/linear.py", line 218, in process_weights_after_loading
(EngineCore pid=95230)     dispatch_cpu_unquantized_gemm(layer, remove_weight=True)
(EngineCore pid=95230)   File "vllm/model_executor/layers/utils.py", line 231, in dispatch_cpu_unquantized_gemm
(EngineCore pid=95230)     N, K = layer.weight.size()
(EngineCore pid=95230)     ^^^^
(EngineCore pid=95230) ValueError: too many values to unpack (expected 2)
RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {'EngineCore': 1}

Analysis / Possible Fix

In vllm/model_executor/layers/utils.py inside dispatch_cpu_unquantized_gemm:

def dispatch_cpu_unquantized_gemm(layer, remove_weight=True):
    # ...
    N, K = layer.weight.size() # <--- CRASH HERE

For Qwen3_5ForConditionalGeneration, at least one layer that reaches this function has a weight tensor that is not exactly 2-dimensional (likely a vision encoder layer, or a 1D projection bias treated as a weight).
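To find out which layer carries the offending weight, one option is to scan the model's modules and report every `weight` whose rank is not 2. The helper below is a sketch: `find_non_2d_weights` is a hypothetical name, and it assumes only that a module may expose a `weight` with a `dim()` method and a `shape`, as `torch.Tensor` does. The stand-in classes and module names are invented so the example runs without PyTorch.

```python
from typing import Iterable, List, Tuple


def find_non_2d_weights(
    named_modules: Iterable[Tuple[str, object]],
) -> List[Tuple[str, tuple]]:
    """Collect (module name, weight shape) for weights that are not 2D."""
    offenders = []
    for name, module in named_modules:
        weight = getattr(module, "weight", None)
        if weight is None:
            continue  # module has no weight (e.g. an activation function)
        if weight.dim() != 2:
            offenders.append((name, tuple(weight.shape)))
    return offenders


class FakeTensor:
    """Stand-in exposing the two tensor attributes the helper relies on."""

    def __init__(self, *shape):
        self.shape = shape

    def dim(self):
        return len(self.shape)


class FakeModule:
    def __init__(self, weight=None):
        self.weight = weight


# Hypothetical module names and shapes, for illustration only.
modules = [
    ("language_model.q_proj", FakeModule(FakeTensor(4096, 1024))),
    ("visual.patch_embed.proj", FakeModule(FakeTensor(1024, 3, 2, 16, 16))),
    ("language_model.act_fn", FakeModule()),
]
print(find_non_2d_weights(modules))
```

With a real PyTorch model the same function could be called as `find_non_2d_weights(model.named_modules())`. Normalization layers would also be listed because their weights are 1D, so only the entries that go through the unquantized linear path are relevant to this crash.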

Adding a dimension check or modifying the CPU dispatch logic to handle Qwen3.5's multimodal weight shapes should resolve this for Mac/CPU users.

Extended analysis

Fix Plan

To resolve the ValueError: too many values to unpack issue in dispatch_cpu_unquantized_gemm, the CPU dispatch logic must tolerate the non-2D weights that Qwen3.5-4B passes through it, such as the 5D vision encoder weights identified in PR #38600.

Step-by-Step Solution

Modify the dispatch_cpu_unquantized_gemm function in vllm/model_executor/layers/utils.py to check the number of dimensions of layer.weight before unpacking its size.
Handle weight tensors with more or fewer than 2 dimensions by skipping the 2D-only GEMM path.

Example code:

def dispatch_cpu_unquantized_gemm(layer, remove_weight=True):
    weight_size = layer.weight.size()
    if len(weight_size) != 2:
        # Non-2D weights (e.g. 5D Conv3d kernels in the vision encoder, or
        # 1D tensors) cannot use the 2D-only GEMM kernels. Following the
        # approach described in PR #38600, skip the optimized path here.
        # The fallback wiring is elided in this sketch; see the PR for the
        # actual change.
        return
    N, K = weight_size
    # existing logic for 2D weights

How the fallback is wired depends on the surrounding dispatch code. PR #38600 describes falling back to the standard torch.nn.functional.linear for non-2D weights rather than reshaping them for the GEMM kernels.

Verification

To verify the fix, run the offline LLM initialization or the api_server on a CPU/Mac machine using the modified code. The ValueError: too many values to unpack error should be resolved, and the model should load and serve correctly.

Extra Tips

Ensure that the modified logic handles all possible weight shapes in the Qwen3.5-4B model.
Test the modified code thoroughly to ensure that it works correctly for all use cases.
Consider adding additional logging or debugging statements to help diagnose any future issues.
