vllm - ✅(Solved) Fix [Bug]: Qwen 3.5 fails to load from GGUF [2 pull requests, 3 comments, 4 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
vllm-project/vllm#38122Fetched 2026-04-08 01:32:10
View on GitHub
Comments
3
Participants
4
Timeline
15
Reactions
3
Timeline (top)
subscribed ×7commented ×3mentioned ×3cross-referenced ×1

Error Message

[2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] EngineCore failed to start. [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] Traceback (most recent call last): [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1073, in run_engine_core [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs) [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] return func(*args, **kwargs) [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^ [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 839, in init [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] super().init( [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 112, in init [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] self.model_executor = executor_class(vllm_config) [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^ [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] return func(*args, **kwargs) [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^ [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 103, in init [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] self._init_executor() [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 50, in _init_executor [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] self.driver_worker.load_model() [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 335, in load_model [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] self.model_runner.load_model(load_dummy_weights=dummy_weights) [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] return func(*args, **kwargs) [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^ [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4497, in load_model [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] self.model = model_loader.load_model( [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^ [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/gguf_loader.py", line 339, in load_model [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] gguf_weights_map = self._get_gguf_weights_map(model_config) [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/gguf_loader.py", line 159, in _get_gguf_weights_map [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] vision_num_layers = config.vision_config.num_hidden_layers [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/transformers/configuration_utils.py", line 207, in getattribute [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] return super().getattribute(key) [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ [2026-03-25 17:30:37] (EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] AttributeError: 'Qwen3_5VisionConfig' object has no attribute 'num_hidden_layers'

Root Cause

I have fixed that locally, but now it's failing because the num_hidden_layers isn't set in the vision config:

[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] EngineCore failed to start.
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] Traceback (most recent call last):
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1073, in run_engine_core
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     return func(*args, **kwargs)
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 839, in __init__
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     super().__init__(
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 112, in __init__
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     self.model_executor = executor_class(vllm_config)
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     return func(*args, **kwargs)
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 103, in __init__
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     self._init_executor()
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 50, in _init_executor
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     self.driver_worker.load_model()
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 335, in load_model
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     self.model_runner.load_model(load_dummy_weights=dummy_weights)
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     return func(*args, **kwargs)
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4497, in load_model
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     self.model = model_loader.load_model(
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]                  ^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/gguf_loader.py", line 339, in load_model
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     gguf_weights_map = self._get_gguf_weights_map(model_config)
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/gguf_loader.py", line 159, in _get_gguf_weights_map
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     vision_num_layers = config.vision_config.num_hidden_layers
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/transformers/configuration_utils.py", line 207, in __getattribute__
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     return super().__getattribute__(key)
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] AttributeError: 'Qwen3_5VisionConfig' object has no attribute 'num_hidden_layers'

Fix Action

PR fix notes

PR #38140: [Bugfix] Fix Qwen 3.5 GGUF loading: add model type mapping and vision config d…

Description (problem / solution / changelog)

Purpose

Fixes #38122

This PR fixes two bugs that prevent Qwen 3.5 models from loading via GGUF format:

  1. Unknown gguf model_type: qwen3_5 - The GGUF library uses qwen35 as the architecture name, but HuggingFace config uses qwen3_5 with an underscore. This PR adds the necessary model type mapping.

  2. AttributeError: 'Qwen3_5VisionConfig' object has no attribute 'num_hidden_layers' - The Qwen3_5VisionConfig class uses depth instead of num_hidden_layers to represent the number of vision transformer layers. This PR adds a fallback to check for depth when num_hidden_layers is not available.

Changes Made

  • vllm/model_executor/model_loader/gguf_loader.py:

    • Added qwen3_5qwen35 model type mapping (similar to existing qwen3_moeqwen3moe pattern)
    • Added getattr fallback to use depth attribute when num_hidden_layers is missing in vision config
  • tests/models/test_qwen35_gguf.py (new file):

    • Added comprehensive tests for Qwen 3.5 GGUF support

Test Plan

Run the new Qwen 3.5 GGUF tests:

pytest tests/models/test_qwen35_gguf.py -v

Test Result

============================ test session starts ==============================
platform linux -- Python 3.12.13, pytest-8.4.2, pluggy-1.6.0 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /content/vllm
configfile: pyproject.toml
plugins: anyio-4.12.1, langsmith-0.7.18, typeguard-4.5.1
collected 9 items                                                              

vllm/tests/models/test_qwen35_gguf.py::TestQwen35GGUFSupport::test_qwen35_model_type_mapping PASSED [ 11%]
vllm/tests/models/test_qwen35_gguf.py::TestQwen35GGUFSupport::test_qwen35_vision_config_depth_fallback PASSED [ 22%]
vllm/tests/models/test_qwen35_gguf.py::TestQwen35GGUFSupport::test_qwen35_vision_config_custom_depth PASSED [ 33%]
vllm/tests/models/test_qwen35_gguf.py::TestQwen35GGUFSupport::test_qwen35_text_config_has_num_hidden_layers PASSED [ 44%]
vllm/tests/models/test_qwen35_gguf.py::TestQwen35GGUFSupport::test_qwen35_full_config_structure PASSED [ 55%]
vllm/tests/models/test_qwen35_gguf.py::TestQwen35GGUFLoaderIntegration::test_get_gguf_weights_map_with_qwen35_text_only PASSED [ 66%]
vllm/tests/models/test_qwen35_gguf.py::TestQwen35GGUFLoaderIntegration::test_get_gguf_weights_map_with_qwen35_multimodal PASSED [ 77%]
vllm/tests/models/test_qwen35_gguf.py::TestVisionConfigCompatibility::test_fallback_handles_both_attributes PASSED [ 88%]
vllm/tests/models/test_qwen35_gguf.py::TestVisionConfigCompatibility::test_fallback_returns_none_for_missing_attributes PASSED [100%]

=============================== warnings summary ===============================
vllm/vllm/__init__.py:7
  /content/vllm/vllm/__init__.py:7: RuntimeWarning: Failed to read commit hash:
  No module named 'vllm._version'
    from .version import __version__, __version_tuple__  # isort:skip

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================= 9 passed, 1 warning in 6.41s =========================

Manual verification with the original reproduction command:

vllm serve unsloth/Qwen3.5-27B-GGUF:Q4_K_M \
    --host 0.0.0.0 \
    --port 8000 \
    --tokenizer=Qwen/Qwen3.5-27B \
    --hf-config-path=Qwen/Qwen3.5-27B

Before fix: RuntimeError: Unknown gguf model_type: qwen3_5 After fix: Model loads successfully


<details> <summary> Essential Elements of an Effective PR Description Checklist </summary>
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.
</details>

Changed files

  • tests/models/test_qwen35_gguf.py (added, +229/-0)
  • vllm/model_executor/model_loader/gguf_loader.py (modified, +19/-2)

PR #39559: [Model] Add GGUF support for Qwen 3.5 dense and MoE models

Description (problem / solution / changelog)

Purpose

Add GGUF support for Qwen 3.5 dense and MoE models

Fixes: #39198, #36456, #38122

Test Plan

# Qwen 3.5 Dense
vllm serve unsloth/Qwen3.5-0.8B-GGUF:UD-IQ2_XXS --tokenizer Qwen/Qwen3.5-0.8B --hf-config-path Qwen/Qwen3.5-0.8B
# Qwen 3.5 MoE
vllm serve unsloth/Qwen3.5-35B-A3B-GGUF:UD-IQ2_XXS --tokenizer Qwen/Qwen3.5-35B-A3B-GGUF

Test Result

Qwen3.5 Dense

<details> <summary>before</summary>
vllm serve unsloth/Qwen3.5-0.8B-GGUF:UD-IQ2_XXS --tokenizer Qwen/Qwen3.5-0.8B --hf-config-path Qwen/Qwen3.5-0.8B
(APIServer pid=2639330) INFO 04-11 18:49:05 [utils.py:299]
(APIServer pid=2639330) INFO 04-11 18:49:05 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=2639330) INFO 04-11 18:49:05 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.19.1rc1.dev122+g83aea2147
(APIServer pid=2639330) INFO 04-11 18:49:05 [utils.py:299]   █▄█▀ █     █     █     █  model   unsloth/Qwen3.5-0.8B-GGUF:UD-IQ2_XXS
(APIServer pid=2639330) INFO 04-11 18:49:05 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=2639330) INFO 04-11 18:49:05 [utils.py:299]
(APIServer pid=2639330) INFO 04-11 18:49:05 [utils.py:233] non-default args: {'model_tag': 'unsloth/Qwen3.5-0.8B-GGUF:UD-IQ2_XXS', 'model': 'unsloth/Qwen3.5-0.8B-GGUF:UD-IQ2_XXS', 'tokenizer': 'Qwen/Qwen3.5-0.8B', 'hf_config_path': 'Qwen/Qwen3.5-0.8B'}
(APIServer pid=2639330) WARNING 04-11 18:49:05 [gguf_utils.py:60] Non-standard GGUF quant type 'UD-IQ2_XXS' detected.
(APIServer pid=2639330) INFO 04-11 18:49:07 [model.py:554] Resolved architecture: Qwen3_5ForConditionalGeneration
(APIServer pid=2639330) INFO 04-11 18:49:07 [model.py:1684] Using max model len 262144
(APIServer pid=2639330) INFO 04-11 18:49:07 [vllm.py:799] Asynchronous scheduling is enabled.
(APIServer pid=2639330) INFO 04-11 18:49:07 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(EngineCore pid=2639990) INFO 04-11 18:49:23 [core.py:107] Initializing a V1 LLM engine (v0.19.1rc1.dev122+g83aea2147) with config: model='unsloth/Qwen3.5-0.8B-GGUF:UD-IQ2_XXS', speculative_config=None, tokenizer='Qwen/Qwen3.5-0.8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=gguf, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=gguf, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=unsloth/Qwen3.5-0.8B-GGUF:UD-IQ2_XXS, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto')
(EngineCore pid=2639990) INFO 04-11 18:49:25 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.16.1.10:59959 backend=nccl
(EngineCore pid=2639990) INFO 04-11 18:49:25 [parallel_state.py:1713] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=2639990) WARNING 04-11 18:49:25 [gguf_utils.py:60] Non-standard GGUF quant type 'UD-IQ2_XXS' detected.
(EngineCore pid=2639990) INFO 04-11 18:49:34 [gpu_model_runner.py:4735] Starting to load model unsloth/Qwen3.5-0.8B-GGUF:UD-IQ2_XXS...
(EngineCore pid=2639990) ERROR 04-11 18:49:35 [core.py:1112] EngineCore failed to start.
(EngineCore pid=2639990) ERROR 04-11 18:49:35 [core.py:1112] Traceback (most recent call last):
(EngineCore pid=2639990) ERROR 04-11 18:49:35 [core.py:1112]   File "/home/name/.test/.gpu/vllm/vllm/v1/engine/core.py", line 1086, in run_engine_core
(EngineCore pid=2639990) ERROR 04-11 18:49:35 [core.py:1112]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=2639990) ERROR 04-11 18:49:35 [core.py:1112]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2639990) ERROR 04-11 18:49:35 [core.py:1112]   File "/home/name/.test/.gpu/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=2639990) ERROR 04-11 18:49:35 [core.py:1112]     return func(*args, **kwargs)
(EngineCore pid=2639990) ERROR 04-11 18:49:35 [core.py:1112]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2639990) ERROR 04-11 18:49:35 [core.py:1112]   File "/home/name/.test/.gpu/vllm/vllm/v1/engine/core.py", line 850, in __init__
(EngineCore pid=2639990) ERROR 04-11 18:49:35 [core.py:1112]     super().__init__(
(EngineCore pid=2639990) ERROR 04-11 18:49:35 [core.py:1112]   File "/home/name/.test/.gpu/vllm/vllm/v1/engine/core.py", line 116, in __init__
(EngineCore pid=2639990) ERROR 04-11 18:49:35 [core.py:1112]     self.model_executor = executor_class(vllm_config)
(EngineCore pid=2639990) ERROR 04-11 18:49:35 [core.py:1112]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2639990) ERROR 04-11 18:49:35 [core.py:1112]   File "/home/name/.test/.gpu/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=2639990) ERROR 04-11 18:49:35 [core.py:1112]     return func(*args, **kwargs)
(EngineCore pid=2639990) ERROR 04-11 18:49:35 [core.py:1112]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2639990) ERROR 04-11 18:49:35 [core.py:1112]   File "/home/name/.test/.gpu/vllm/vllm/v1/executor/abstract.py", line 109, in __init__
(EngineCore pid=2639990) ERROR 04-11 18:49:35 [core.py:1112]     self._init_executor()
(EngineCore pid=2639990) ERROR 04-11 18:49:35 [core.py:1112]   File "/home/name/.test/.gpu/vllm/vllm/v1/executor/uniproc_executor.py", line 52, in _init_executor
(EngineCore pid=2639990) ERROR 04-11 18:49:35 [core.py:1112]     self.driver_worker.load_model()
(EngineCore pid=2639990) ERROR 04-11 18:49:35 [core.py:1112]   File "/home/name/.test/.gpu/vllm/vllm/v1/worker/gpu_worker.py", line 323, in load_model
(EngineCore pid=2639990) ERROR 04-11 18:49:35 [core.py:1112]     self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(EngineCore pid=2639990) ERROR 04-11 18:49:35 [core.py:1112]   File "/home/name/.test/.gpu/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=2639990) ERROR 04-11 18:49:35 [core.py:1112]     return func(*args, **kwargs)
(EngineCore pid=2639990) ERROR 04-11 18:49:35 [core.py:1112]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2639990) ERROR 04-11 18:49:35 [core.py:1112]   File "/home/name/.test/.gpu/vllm/vllm/v1/worker/gpu_model_runner.py", line 4751, in load_model
(EngineCore pid=2639990) ERROR 04-11 18:49:35 [core.py:1112]     self.model = model_loader.load_model(
(EngineCore pid=2639990) ERROR 04-11 18:49:35 [core.py:1112]                  ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2639990) ERROR 04-11 18:49:35 [core.py:1112]   File "/home/name/.test/.gpu/vllm/vllm/model_executor/model_loader/gguf_loader.py", line 406, in load_model
(EngineCore pid=2639990) ERROR 04-11 18:49:35 [core.py:1112]     gguf_weights_map = self._get_gguf_weights_map(model_config)
(EngineCore pid=2639990) ERROR 04-11 18:49:35 [core.py:1112]                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2639990) ERROR 04-11 18:49:35 [core.py:1112]   File "/home/name/.test/.gpu/vllm/vllm/model_executor/model_loader/gguf_loader.py", line 204, in _get_gguf_weights_map
(EngineCore pid=2639990) ERROR 04-11 18:49:35 [core.py:1112]     raise RuntimeError(f"Unknown gguf model_type: {model_type}")
(EngineCore pid=2639990) ERROR 04-11 18:49:35 [core.py:1112] RuntimeError: Unknown gguf model_type: qwen3_5
(EngineCore pid=2639990) Process EngineCore:
(EngineCore pid=2639990) Traceback (most recent call last):
(EngineCore pid=2639990)   File "/home/name/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=2639990)     self.run()
(EngineCore pid=2639990)   File "/home/name/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=2639990)     self._target(*self._args, **self._kwargs)
(EngineCore pid=2639990)   File "/home/name/.test/.gpu/vllm/vllm/v1/engine/core.py", line 1116, in run_engine_core
(EngineCore pid=2639990)     raise e
(EngineCore pid=2639990)   File "/home/name/.test/.gpu/vllm/vllm/v1/engine/core.py", line 1086, in run_engine_core
(EngineCore pid=2639990)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=2639990)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2639990)   File "/home/name/.test/.gpu/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=2639990)     return func(*args, **kwargs)
(EngineCore pid=2639990)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2639990)   File "/home/name/.test/.gpu/vllm/vllm/v1/engine/core.py", line 850, in __init__
(EngineCore pid=2639990)     super().__init__(
(EngineCore pid=2639990)   File "/home/name/.test/.gpu/vllm/vllm/v1/engine/core.py", line 116, in __init__
(EngineCore pid=2639990)     self.model_executor = executor_class(vllm_config)
(EngineCore pid=2639990)                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2639990)   File "/home/name/.test/.gpu/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=2639990)     return func(*args, **kwargs)
(EngineCore pid=2639990)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2639990)   File "/home/name/.test/.gpu/vllm/vllm/v1/executor/abstract.py", line 109, in __init__
(EngineCore pid=2639990)     self._init_executor()
(EngineCore pid=2639990)   File "/home/name/.test/.gpu/vllm/vllm/v1/executor/uniproc_executor.py", line 52, in _init_executor
(EngineCore pid=2639990)     self.driver_worker.load_model()
(EngineCore pid=2639990)   File "/home/name/.test/.gpu/vllm/vllm/v1/worker/gpu_worker.py", line 323, in load_model
(EngineCore pid=2639990)     self.model_runner.load_model(load_dummy_weights=load_dummy_weights)
(EngineCore pid=2639990)   File "/home/name/.test/.gpu/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=2639990)     return func(*args, **kwargs)
(EngineCore pid=2639990)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2639990)   File "/home/name/.test/.gpu/vllm/vllm/v1/worker/gpu_model_runner.py", line 4751, in load_model
(EngineCore pid=2639990)     self.model = model_loader.load_model(
(EngineCore pid=2639990)                  ^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2639990)   File "/home/name/.test/.gpu/vllm/vllm/model_executor/model_loader/gguf_loader.py", line 406, in load_model
(EngineCore pid=2639990)     gguf_weights_map = self._get_gguf_weights_map(model_config)
(EngineCore pid=2639990)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=2639990)   File "/home/name/.test/.gpu/vllm/vllm/model_executor/model_loader/gguf_loader.py", line 204, in _get_gguf_weights_map
(EngineCore pid=2639990)     raise RuntimeError(f"Unknown gguf model_type: {model_type}")
(EngineCore pid=2639990) RuntimeError: Unknown gguf model_type: qwen3_5
[rank0]:[W411 18:49:36.785516042 ProcessGroupNCCL.cpp:1575] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=2639330) Traceback (most recent call last):
(APIServer pid=2639330)   File "/home/name/.test/.gpu/vllm/.venv/bin/vllm", line 10, in <module>
(APIServer pid=2639330)     sys.exit(main())
(APIServer pid=2639330)              ^^^^^^
(APIServer pid=2639330)   File "/home/name/.test/.gpu/vllm/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=2639330)     args.dispatch_function(args)
(APIServer pid=2639330)   File "/home/name/.test/.gpu/vllm/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=2639330)     uvloop.run(run_server(args))
(APIServer pid=2639330)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=2639330)     return __asyncio.run(
(APIServer pid=2639330)            ^^^^^^^^^^^^^^
(APIServer pid=2639330)   File "/home/name/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=2639330)     return runner.run(main)
(APIServer pid=2639330)            ^^^^^^^^^^^^^^^^
(APIServer pid=2639330)   File "/home/name/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=2639330)     return self._loop.run_until_complete(task)
(APIServer pid=2639330)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2639330)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=2639330)   File "/home/name/.test/.gpu/vllm/.venv/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=2639330)     return await main
(APIServer pid=2639330)            ^^^^^^^^^^
(APIServer pid=2639330)   File "/home/name/.test/.gpu/vllm/vllm/entrypoints/openai/api_server.py", line 686, in run_server
(APIServer pid=2639330)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=2639330)   File "/home/name/.test/.gpu/vllm/vllm/entrypoints/openai/api_server.py", line 700, in run_server_worker
(APIServer pid=2639330)     async with build_async_engine_client(
(APIServer pid=2639330)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2639330)   File "/home/name/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=2639330)     return await anext(self.gen)
(APIServer pid=2639330)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2639330)   File "/home/name/.test/.gpu/vllm/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=2639330)     async with build_async_engine_client_from_engine_args(
(APIServer pid=2639330)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2639330)   File "/home/name/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=2639330)     return await anext(self.gen)
(APIServer pid=2639330)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2639330)   File "/home/name/.test/.gpu/vllm/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=2639330)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=2639330)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2639330)   File "/home/name/.test/.gpu/vllm/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=2639330)     return cls(
(APIServer pid=2639330)            ^^^^
(APIServer pid=2639330)   File "/home/name/.test/.gpu/vllm/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=2639330)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=2639330)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2639330)   File "/home/name/.test/.gpu/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=2639330)     return func(*args, **kwargs)
(APIServer pid=2639330)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2639330)   File "/home/name/.test/.gpu/vllm/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
(APIServer pid=2639330)     return AsyncMPClient(*client_args)
(APIServer pid=2639330)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2639330)   File "/home/name/.test/.gpu/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=2639330)     return func(*args, **kwargs)
(APIServer pid=2639330)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2639330)   File "/home/name/.test/.gpu/vllm/vllm/v1/engine/core_client.py", line 890, in __init__
(APIServer pid=2639330)     super().__init__(
(APIServer pid=2639330)   File "/home/name/.test/.gpu/vllm/vllm/v1/engine/core_client.py", line 551, in __init__
(APIServer pid=2639330)     with launch_core_engines(
(APIServer pid=2639330)          ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=2639330)   File "/home/name/.local/share/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=2639330)     next(self.gen)
(APIServer pid=2639330)   File "/home/name/.test/.gpu/vllm/vllm/v1/engine/utils.py", line 1094, in launch_core_engines
(APIServer pid=2639330)     wait_for_engine_startup(
(APIServer pid=2639330)   File "/home/name/.test/.gpu/vllm/vllm/v1/engine/utils.py", line 1153, in wait_for_engine_startup
(APIServer pid=2639330)     raise RuntimeError(
(APIServer pid=2639330) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
</details> <details> <summary>after</summary>
vllm serve unsloth/Qwen3.5-0.8B-GGUF:UD-IQ2_XXS --tokenizer Qwen/Qwen3.5-0.8B --hf-config-path Qwen/Qwen3.5-0.8B
(APIServer pid=1622311) INFO 04-13 22:08:32 [utils.py:299] 
(APIServer pid=1622311) INFO 04-13 22:08:32 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=1622311) INFO 04-13 22:08:32 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.19.1rc1.dev164+g55d037e2e.d20260410
(APIServer pid=1622311) INFO 04-13 22:08:32 [utils.py:299]   █▄█▀ █     █     █     █  model   unsloth/Qwen3.5-0.8B-GGUF:UD-IQ2_XXS
(APIServer pid=1622311) INFO 04-13 22:08:32 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1622311) INFO 04-13 22:08:32 [utils.py:299] 
(APIServer pid=1622311) INFO 04-13 22:08:32 [utils.py:233] non-default args: {'model_tag': 'unsloth/Qwen3.5-0.8B-GGUF:UD-IQ2_XXS', 'model': 'unsloth/Qwen3.5-0.8B-GGUF:UD-IQ2_XXS', 'tokenizer': 'Qwen/Qwen3.5-0.8B', 'hf_config_path': 'Qwen/Qwen3.5-0.8B'}
(APIServer pid=1622311) WARNING 04-13 22:08:32 [gguf_utils.py:62] Non-standard GGUF quant type 'UD-IQ2_XXS' detected.
(APIServer pid=1622311) INFO 04-13 22:08:34 [model.py:554] Resolved architecture: Qwen3_5ForConditionalGeneration
(APIServer pid=1622311) INFO 04-13 22:08:34 [model.py:1684] Using max model len 262144
(APIServer pid=1622311) INFO 04-13 22:08:34 [vllm.py:809] Asynchronous scheduling is enabled.
(APIServer pid=1622311) INFO 04-13 22:08:34 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(APIServer pid=1622311) `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
(APIServer pid=1622311) The `use_fast` parameter is deprecated and will be removed in a future version. Use `backend="torchvision"` instead of `use_fast=True`, or `backend="pil"` instead of `use_fast=False`.
(EngineCore pid=1623438) INFO 04-13 22:08:55 [core.py:107] Initializing a V1 LLM engine (v0.19.1rc1.dev164+g55d037e2e.d20260410) with config: model='unsloth/Qwen3.5-0.8B-GGUF:UD-IQ2_XXS', speculative_config=None, tokenizer='Qwen/Qwen3.5-0.8B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=gguf, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=gguf, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=unsloth/Qwen3.5-0.8B-GGUF:UD-IQ2_XXS, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto')
(EngineCore pid=1623438) `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
(EngineCore pid=1623438) INFO 04-13 22:08:58 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.16.1.10:39833 backend=nccl
(EngineCore pid=1623438) INFO 04-13 22:08:58 [parallel_state.py:1713] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=1623438) WARNING 04-13 22:08:58 [gguf_utils.py:62] Non-standard GGUF quant type 'UD-IQ2_XXS' detected.
(EngineCore pid=1623438) The `use_fast` parameter is deprecated and will be removed in a future version. Use `backend="torchvision"` instead of `use_fast=True`, or `backend="pil"` instead of `use_fast=False`.
(EngineCore pid=1623438) INFO 04-13 22:09:11 [gpu_model_runner.py:4750] Starting to load model unsloth/Qwen3.5-0.8B-GGUF:UD-IQ2_XXS...
(EngineCore pid=1623438) The fast path is not available because one of the required library is not installed. Falling back to torch implementation. To install follow https://github.com/fla-org/flash-linear-attention#installation and https://github.com/Dao-AILab/causal-conv1d
(EngineCore pid=1623438) INFO 04-13 22:09:24 [gguf_loader.py:443] Loading extra mm_proj weights from /home/name/.cache/huggingface/hub/models--unsloth--Qwen3.5-0.8B-GGUF/snapshots/6ab461498e2023f6e3c1baea90a8f0fe38ab64d0/mmproj-BF16.gguf...
(EngineCore pid=1623438) INFO 04-13 22:09:24 [cuda.py:422] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore pid=1623438) INFO 04-13 22:09:24 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore pid=1623438) INFO 04-13 22:09:24 [gdn_linear_attn.py:155] Using Triton/FLA GDN prefill kernel
(EngineCore pid=1623438) INFO 04-13 22:09:25 [cuda.py:366] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=1623438) INFO 04-13 22:09:25 [flash_attn.py:637] Using FlashAttention version 2
(EngineCore pid=1623438) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore pid=1623438) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(EngineCore pid=1623438) INFO 04-13 22:09:32 [gpu_model_runner.py:4835] Model loading took 0.95 GiB memory and 20.688811 seconds
(EngineCore pid=1623438) INFO 04-13 22:09:32 [interface.py:606] Setting attention block size to 544 tokens to ensure that attention page size is >= mamba page size.
(EngineCore pid=1623438) INFO 04-13 22:09:32 [interface.py:630] Padding mamba page size by 2.64% to ensure that mamba page size and attention page size are exactly equal.
(EngineCore pid=1623438) INFO 04-13 22:09:32 [gpu_model_runner.py:5784] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore pid=1623438) INFO 04-13 22:09:33 [backends.py:1070] Using cache directory: /home/name/.cache/vllm/torch_compile_cache/3ff83d0cde/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=1623438) INFO 04-13 22:09:33 [backends.py:1130] Dynamo bytecode transform time: 0.60 s
(EngineCore pid=1623438) INFO 04-13 22:09:34 [backends.py:286] Directly load the compiled graph(s) for compile range (1, 2048) from the cache, took 0.904 s
(EngineCore pid=1623438) INFO 04-13 22:09:34 [decorators.py:305] Directly load AOT compilation from path /home/name/.cache/vllm/torch_compile_cache/torch_aot_compile/910686eaa28fa02dabfb763dc29f80d7cc4efce33d0e6008f20f0e94258b227f/rank_0_0/model
(EngineCore pid=1623438) INFO 04-13 22:09:34 [monitor.py:48] torch.compile took 1.61 s in total
(EngineCore pid=1623438) INFO 04-13 22:09:35 [monitor.py:76] Initial profiling/warmup run took 0.11 s
(EngineCore pid=1623438) INFO 04-13 22:09:35 [kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=512
(EngineCore pid=1623438) INFO 04-13 22:09:35 [gpu_model_runner.py:5914] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=35 (largest=256)
(EngineCore pid=1623438) INFO 04-13 22:09:36 [gpu_model_runner.py:5993] Estimated CUDA graph memory: 0.73 GiB total
(EngineCore pid=1623438) INFO 04-13 22:09:36 [gpu_worker.py:436] Available KV cache memory: 18.76 GiB
(EngineCore pid=1623438) INFO 04-13 22:09:36 [gpu_worker.py:470] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.9000 to 0.9312 to maintain the same effective KV cache size.
(EngineCore pid=1623438) INFO 04-13 22:09:36 [kv_cache_utils.py:1319] GPU KV cache size: 409,632 tokens
(EngineCore pid=1623438) INFO 04-13 22:09:36 [kv_cache_utils.py:1324] Maximum concurrency for 262,144 tokens per request: 6.21x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████| 51/51 [00:01<00:00, 47.86it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████████████████████████████████████| 35/35 [00:00<00:00, 49.09it/s]
(EngineCore pid=1623438) INFO 04-13 22:09:38 [gpu_model_runner.py:6084] Graph capturing finished in 2 secs, took 0.72 GiB
(EngineCore pid=1623438) INFO 04-13 22:09:38 [gpu_worker.py:597] CUDA graph pool memory: 0.72 GiB (actual), 0.73 GiB (estimated), difference: 0.01 GiB (1.3%).
(EngineCore pid=1623438) INFO 04-13 22:09:38 [core.py:285] init engine (profile, create kv cache, warmup model) took 6.29 seconds
(EngineCore pid=1623438) INFO 04-13 22:09:38 [vllm.py:809] Asynchronous scheduling is enabled.
(EngineCore pid=1623438) INFO 04-13 22:09:38 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(APIServer pid=1622311) INFO 04-13 22:09:38 [api_server.py:600] Supported tasks: ['generate']
(APIServer pid=1622311) INFO 04-13 22:09:47 [hf.py:314] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1622311) INFO 04-13 22:09:56 [base.py:245] Multi-modal warmup completed in 8.255s
(APIServer pid=1622311) INFO 04-13 22:09:57 [api_server.py:604] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=1622311) INFO 04-13 22:09:57 [launcher.py:37] Available routes are:
(APIServer pid=1622311) INFO 04-13 22:09:57 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=1622311) INFO 04-13 22:09:57 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=1622311) INFO 04-13 22:09:57 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=1622311) INFO 04-13 22:09:57 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=1622311) INFO 04-13 22:09:57 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1622311) INFO 04-13 22:09:57 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1622311) INFO 04-13 22:09:57 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1622311) INFO 04-13 22:09:57 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1622311) INFO 04-13 22:09:57 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1622311) INFO 04-13 22:09:57 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1622311) INFO 04-13 22:09:57 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1622311) INFO 04-13 22:09:57 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1622311) INFO 04-13 22:09:57 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1622311) INFO 04-13 22:09:57 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1622311) INFO 04-13 22:09:57 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1622311) INFO 04-13 22:09:57 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=1622311) INFO 04-13 22:09:57 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1622311) INFO 04-13 22:09:57 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1622311) INFO 04-13 22:09:57 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1622311) INFO 04-13 22:09:57 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1622311) INFO 04-13 22:09:57 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1622311) INFO 04-13 22:09:57 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=1622311) INFO 04-13 22:09:57 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1622311) INFO 04-13 22:09:57 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1622311) INFO 04-13 22:09:57 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1622311) INFO 04-13 22:09:57 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=1622311) INFO 04-13 22:09:57 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=1622311) INFO 04-13 22:09:57 [launcher.py:46] Route: /generative_scoring, Methods: POST
(APIServer pid=1622311) INFO:     Started server process [1622311]
(APIServer pid=1622311) INFO:     Waiting for application startup.
(APIServer pid=1622311) INFO:     Application startup complete.
(APIServer pid=1622311) INFO:     127.0.0.1:40480 - "POST /v1/chat/completions HTTP/1.1" 200 OK
</details>

Qwen3.5 MoE

<details> <summary>after</summary>
vllm serve unsloth/Qwen3.5-35B-A3B-GGUF:UD-IQ2_XXS --tokenizer Qwen/Qwen3.5-35B-A3B
(APIServer pid=1258756) INFO 04-13 19:49:54 [utils.py:299]
(APIServer pid=1258756) INFO 04-13 19:49:54 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=1258756) INFO 04-13 19:49:54 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.19.1rc1.dev164+g55d037e2e.d20260410
(APIServer pid=1258756) INFO 04-13 19:49:54 [utils.py:299]   █▄█▀ █     █     █     █  model   unsloth/Qwen3.5-35B-A3B-GGUF:UD-IQ2_XXS
(APIServer pid=1258756) INFO 04-13 19:49:54 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1258756) INFO 04-13 19:49:54 [utils.py:299]
(APIServer pid=1258756) INFO 04-13 19:49:54 [utils.py:233] non-default args: {'model_tag': 'unsloth/Qwen3.5-35B-A3B-GGUF:UD-IQ2_XXS', 'model': 'unsloth/Qwen3.5-35B-A3B-GGUF:UD-IQ2_XXS', 'tokenizer': 'Qwen/Qwen3.5-35B-A3B'}
(APIServer pid=1258756) WARNING 04-13 19:49:54 [gguf_utils.py:62] Non-standard GGUF quant type 'UD-IQ2_XXS' detected.
(APIServer pid=1258756) INFO 04-13 19:49:56 [gguf_utils.py:334] Forced Qwen3.5 multimodal architecture: Qwen3_5MoeForConditionalGeneration
(APIServer pid=1258756) INFO 04-13 19:49:56 [model.py:554] Resolved architecture: Qwen3_5MoeForConditionalGeneration
(APIServer pid=1258756) INFO 04-13 19:49:56 [model.py:1684] Using max model len 262144
(APIServer pid=1258756) INFO 04-13 19:49:56 [vllm.py:809] Asynchronous scheduling is enabled.
(APIServer pid=1258756) INFO 04-13 19:49:56 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(APIServer pid=1258756) `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
(APIServer pid=1258756) The `use_fast` parameter is deprecated and will be removed in a future version. Use `backend="torchvision"` instead of `use_fast=True`, or `backend="pil"` instead of `use_fast=False`.
(EngineCore pid=1259857) INFO 04-13 19:50:17 [core.py:107] Initializing a V1 LLM engine (v0.19.1rc1.dev164+g55d037e2e.d20260410) with config: model='unsloth/Qwen3.5-35B-A3B-GGUF:UD-IQ2_XXS', speculative_config=None, tokenizer='Qwen/Qwen3.5-35B-A3B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=gguf, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=gguf, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=unsloth/Qwen3.5-35B-A3B-GGUF:UD-IQ2_XXS, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto')
(EngineCore pid=1259857) `Qwen2VLImageProcessorFast` is deprecated. The `Fast` suffix for image processors has been removed; use `Qwen2VLImageProcessor` instead.
(EngineCore pid=1259857) INFO 04-13 19:50:19 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://172.16.1.10:51161 backend=nccl
(EngineCore pid=1259857) INFO 04-13 19:50:20 [parallel_state.py:1713] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore pid=1259857) WARNING 04-13 19:50:20 [gguf_utils.py:62] Non-standard GGUF quant type 'UD-IQ2_XXS' detected.
(EngineCore pid=1259857) The `use_fast` parameter is deprecated and will be removed in a future version. Use `backend="torchvision"` instead of `use_fast=True`, or `backend="pil"` instead of `use_fast=False`.
(EngineCore pid=1259857) INFO 04-13 19:50:32 [gpu_model_runner.py:4750] Starting to load model unsloth/Qwen3.5-35B-A3B-GGUF:UD-IQ2_XXS...
(EngineCore pid=1259857) The fast path is not available because one of the required library is not installed. Falling back to torch implementation. To install follow https://github.com/fla-org/flash-linear-attention#installation and https://github.com/Dao-AILab/causal-conv1d
(EngineCore pid=1259857) INFO 04-13 19:50:47 [gguf_loader.py:456] Loading extra mm_proj weights from /home/name/.cache/huggingface/hub/models--unsloth--Qwen3.5-35B-A3B-GGUF/snapshots/bc014a17be43adabd7066b7a86075ff935c6a4e2/mmproj-BF16.gguf...
(EngineCore pid=1259857) INFO 04-13 19:50:47 [cuda.py:422] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore pid=1259857) INFO 04-13 19:50:47 [mm_encoder_attention.py:230] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore pid=1259857) INFO 04-13 19:50:47 [gdn_linear_attn.py:155] Using Triton/FLA GDN prefill kernel
(EngineCore pid=1259857) INFO 04-13 19:50:47 [cuda.py:366] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore pid=1259857) INFO 04-13 19:50:47 [flash_attn.py:637] Using FlashAttention version 2
(EngineCore pid=1259857) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore pid=1259857) <frozen importlib._bootstrap_external>:1301: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(EngineCore pid=1259857) INFO 04-13 19:50:57 [gpu_model_runner.py:4835] Model loading took 12.21 GiB memory and 23.596208 seconds
(EngineCore pid=1259857) INFO 04-13 19:50:57 [interface.py:606] Setting attention block size to 1056 tokens to ensure that attention page size is >= mamba page size.
(EngineCore pid=1259857) INFO 04-13 19:50:57 [interface.py:630] Padding mamba page size by 0.76% to ensure that mamba page size and attention page size are exactly equal.
(EngineCore pid=1259857) INFO 04-13 19:50:57 [gpu_model_runner.py:5784] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore pid=1259857) INFO 04-13 19:51:01 [backends.py:1070] Using cache directory: /home/name/.cache/vllm/torch_compile_cache/2e45e00b32/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=1259857) INFO 04-13 19:51:01 [backends.py:1130] Dynamo bytecode transform time: 3.86 s
(EngineCore pid=1259857) INFO 04-13 19:51:03 [backends.py:373] Cache the graph of compile range (1, 2048) for later use
(EngineCore pid=1259857) INFO 04-13 19:51:15 [backends.py:391] Compiling a graph for compile range (1, 2048) takes 13.76 s
(EngineCore pid=1259857) INFO 04-13 19:51:18 [decorators.py:655] saved AOT compiled function to /home/name/.cache/vllm/torch_compile_cache/torch_aot_compile/703a7fd7463c0d2c8352e4b42e821c7d228ba254269fa8c7310bbeda6ae7ffa0/rank_0_0/model
(EngineCore pid=1259857) INFO 04-13 19:51:18 [monitor.py:48] torch.compile took 20.01 s in total
(EngineCore pid=1259857) INFO 04-13 19:51:20 [monitor.py:76] Initial profiling/warmup run took 2.18 s
(EngineCore pid=1259857) INFO 04-13 19:51:20 [kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=512
(EngineCore pid=1259857) INFO 04-13 19:51:20 [gpu_model_runner.py:5914] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=35 (largest=256)
(EngineCore pid=1259857) INFO 04-13 19:51:23 [gpu_model_runner.py:5993] Estimated CUDA graph memory: 1.62 GiB total
(EngineCore pid=1259857) INFO 04-13 19:51:24 [gpu_worker.py:436] Available KV cache memory: 6.9 GiB
(EngineCore pid=1259857) INFO 04-13 19:51:24 [gpu_worker.py:470] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.9000 to 0.9690 to maintain the same effective KV cache size.
(EngineCore pid=1259857) INFO 04-13 19:51:24 [kv_cache_utils.py:1319] GPU KV cache size: 89,760 tokens
(EngineCore pid=1259857) INFO 04-13 19:51:24 [kv_cache_utils.py:1324] Maximum concurrency for 262,144 tokens per request: 1.36x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|████████████████████████████████████████| 51/51 [00:10<00:00,  4.86it/s]
Capturing CUDA graphs (decode, FULL): 100%|███████████████████████████████████████████████████████████| 35/35 [00:04<00:00,  7.02it/s]
(EngineCore pid=1259857) INFO 04-13 19:51:40 [gpu_model_runner.py:6084] Graph capturing finished in 16 secs, took 1.71 GiB
(EngineCore pid=1259857) INFO 04-13 19:51:40 [gpu_worker.py:597] CUDA graph pool memory: 1.71 GiB (actual), 1.62 GiB (estimated), difference: 0.08 GiB (4.8%).
(EngineCore pid=1259857) INFO 04-13 19:51:40 [core.py:285] init engine (profile, create kv cache, warmup model) took 43.22 seconds
(EngineCore pid=1259857) INFO 04-13 19:51:40 [vllm.py:809] Asynchronous scheduling is enabled.
(EngineCore pid=1259857) INFO 04-13 19:51:40 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(APIServer pid=1258756) INFO 04-13 19:51:40 [api_server.py:600] Supported tasks: ['generate']
(APIServer pid=1258756) INFO 04-13 19:51:47 [hf.py:314] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1258756) INFO 04-13 19:51:55 [base.py:245] Multi-modal warmup completed in 8.099s
(APIServer pid=1258756) INFO 04-13 19:51:55 [api_server.py:604] Starting vLLM server on http://0.0.0.0:8000/
(APIServer pid=1258756) INFO 04-13 19:51:55 [launcher.py:37] Available routes are:
(APIServer pid=1258756) INFO 04-13 19:51:55 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=1258756) INFO 04-13 19:51:55 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=1258756) INFO 04-13 19:51:55 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=1258756) INFO 04-13 19:51:55 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=1258756) INFO 04-13 19:51:55 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=1258756) INFO 04-13 19:51:55 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=1258756) INFO 04-13 19:51:55 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=1258756) INFO 04-13 19:51:55 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=1258756) INFO 04-13 19:51:55 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=1258756) INFO 04-13 19:51:55 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=1258756) INFO 04-13 19:51:55 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=1258756) INFO 04-13 19:51:55 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=1258756) INFO 04-13 19:51:55 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=1258756) INFO 04-13 19:51:55 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=1258756) INFO 04-13 19:51:55 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1258756) INFO 04-13 19:51:55 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=1258756) INFO 04-13 19:51:55 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=1258756) INFO 04-13 19:51:55 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1258756) INFO 04-13 19:51:55 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1258756) INFO 04-13 19:51:55 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=1258756) INFO 04-13 19:51:55 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=1258756) INFO 04-13 19:51:55 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=1258756) INFO 04-13 19:51:55 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1258756) INFO 04-13 19:51:55 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1258756) INFO 04-13 19:51:55 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1258756) INFO 04-13 19:51:55 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=1258756) INFO 04-13 19:51:55 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=1258756) INFO 04-13 19:51:55 [launcher.py:46] Route: /generative_scoring, Methods: POST
(APIServer pid=1258756) INFO:     Started server process [1258756]
(APIServer pid=1258756) INFO:     Waiting for application startup.
(APIServer pid=1258756) INFO:     Application startup complete.
(APIServer pid=1258756) INFO:     127.0.0.1:33702 - "POST /v1/chat/completions HTTP/1.1" 200 OK
</details>
<details> <summary> Essential Elements of an Effective PR Description Checklist </summary>
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.
</details>

Changed files

  • tests/models/multimodal/generation/test_multimodal_gguf.py (modified, +73/-2)
  • vllm/model_executor/layers/linear.py (modified, +26/-6)
  • vllm/model_executor/model_loader/gguf_loader.py (modified, +233/-12)
  • vllm/transformers_utils/gguf_utils.py (modified, +25/-0)

Code Example

Environment:
      PYTHONHASHSEED:                   123
      HF_HOME:                          /data
      POD_IP:                            (v1:status.podIP)
      PROMETHEUS_MULTIPROC_DIR:         /tmp
      HF_TOKEN:                         <set to the key 'token' in secret 'hf-token'>  Optional: false
      VLLM_NO_USAGE_STATS:              1
      VLLM_ENGINE_ITERATION_TIMEOUT_S:  300
      VLLM_DO_NOT_TRACK:                1

---

Your output of `python collect_env.py` here

---

vllm
      serve
      unsloth/Qwen3.5-27B-GGUF:Q4_K_M
      --host
      0.0.0.0
      --port
      8000
      --no-enable-prefix-caching
      --max-model-len
      48000
      --gpu_memory_utilization
      0.9
      --tokenizer=Qwen/Qwen3.5-27B
      --hf-config-path=Qwen/Qwen3.5-27B
      --served-model-name=Qwen/Qwen3.5-27B
      --reasoning-parser=qwen3
      --mm-encoder-tp-mode=data
      --mm-processor-cache-type=shm
      --enable-auto-tool-choice
      --tool-call-parser=qwen3_coder
      --trust-remote-code

---

[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] EngineCore failed to start.
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] Traceback (most recent call last):
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1073, in run_engine_core
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     return func(*args, **kwargs)
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 839, in __init__
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     super().__init__(
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 112, in __init__
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     self.model_executor = executor_class(vllm_config)
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     return func(*args, **kwargs)
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 103, in __init__
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     self._init_executor()
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 50, in _init_executor
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     self.driver_worker.load_model()
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 335, in load_model
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     self.model_runner.load_model(load_dummy_weights=dummy_weights)
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     return func(*args, **kwargs)
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4497, in load_model
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     self.model = model_loader.load_model(
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]                  ^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/gguf_loader.py", line 339, in load_model
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     gguf_weights_map = self._get_gguf_weights_map(model_config)
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/gguf_loader.py", line 159, in _get_gguf_weights_map
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     vision_num_layers = config.vision_config.num_hidden_layers
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/transformers/configuration_utils.py", line 207, in __getattribute__
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     return super().__getattribute__(key)
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] AttributeError: 'Qwen3_5VisionConfig' object has no attribute 'num_hidden_layers'
RAW_BUFFERClick to expand / collapse

Your current environment

I can't get debug information from the container in my cluster as it constantly crashes and so I can't get a shell into it, but I'm just using the docker.io/vllm/vllm-openai image with following env:

Environment:
      PYTHONHASHSEED:                   123
      HF_HOME:                          /data
      POD_IP:                            (v1:status.podIP)
      PROMETHEUS_MULTIPROC_DIR:         /tmp
      HF_TOKEN:                         <set to the key 'token' in secret 'hf-token'>  Optional: false
      VLLM_NO_USAGE_STATS:              1
      VLLM_ENGINE_ITERATION_TIMEOUT_S:  300
      VLLM_DO_NOT_TRACK:                1

I'm also quite sure I've already identified the issues, but don't quite know how to fix them myself.

<!-- <details> <summary>The output of <code>python collect_env.py</code></summary> ```text Your output of `python collect_env.py` here ``` </details> -->

🐛 Describe the bug

When trying to load Qwen 3.5 from GGUF with the following command:

vllm
      serve
      unsloth/Qwen3.5-27B-GGUF:Q4_K_M
      --host
      0.0.0.0
      --port
      8000
      --no-enable-prefix-caching
      --max-model-len
      48000
      --gpu_memory_utilization
      0.9
      --tokenizer=Qwen/Qwen3.5-27B
      --hf-config-path=Qwen/Qwen3.5-27B
      --served-model-name=Qwen/Qwen3.5-27B
      --reasoning-parser=qwen3
      --mm-encoder-tp-mode=data
      --mm-processor-cache-type=shm
      --enable-auto-tool-choice
      --tool-call-parser=qwen3_coder
      --trust-remote-code

It fails with "Unknown gguf model_type: qwen3_5", here it seems like it need the same replace workaround as qwen3_moe for example, as it's qwen35 in gguf: https://github.com/vllm-project/vllm/blob/6e37c46b35e2ee799fb280180f4d582219bea3f0/vllm/model_executor/model_loader/gguf_loader.py#L128-L129 https://github.com/vllm-project/vllm/blob/6e37c46b35e2ee799fb280180f4d582219bea3f0/vllm/model_executor/model_loader/gguf_loader.py#L149-L155

I have fixed that locally, but now it's failing because the num_hidden_layers isn't set in the vision config:

[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] EngineCore failed to start.
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] Traceback (most recent call last):
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1073, in run_engine_core
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     return func(*args, **kwargs)
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 839, in __init__
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     super().__init__(
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 112, in __init__
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     self.model_executor = executor_class(vllm_config)
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     return func(*args, **kwargs)
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/abstract.py", line 103, in __init__
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     self._init_executor()
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/uniproc_executor.py", line 50, in _init_executor
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     self.driver_worker.load_model()
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 335, in load_model
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     self.model_runner.load_model(load_dummy_weights=dummy_weights)
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     return func(*args, **kwargs)
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4497, in load_model
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     self.model = model_loader.load_model(
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]                  ^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/gguf_loader.py", line 339, in load_model
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     gguf_weights_map = self._get_gguf_weights_map(model_config)
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/model_loader/gguf_loader.py", line 159, in _get_gguf_weights_map
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     vision_num_layers = config.vision_config.num_hidden_layers
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/transformers/configuration_utils.py", line 207, in __getattribute__
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]     return super().__getattribute__(key)
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
[2026-03-25  17:30:37]	(EngineCore pid=110) ERROR 03-25 16:30:37 [core.py:1099] AttributeError: 'Qwen3_5VisionConfig' object has no attribute 'num_hidden_layers'

I think there's just something missing or it being missing has to be handled with some default value in: https://github.com/vllm-project/vllm/blob/6e37c46b35e2ee799fb280180f4d582219bea3f0/vllm/model_executor/model_loader/gguf_loader.py#L161

I would like to fix it myself, but I don't know how many hidden layers that part of the model has or how to find out. I'm not that deep into the actual ML architecture.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

To fix the issue, you need to add a default value for num_hidden_layers in the Qwen3_5VisionConfig object.

Here are the steps:

  • Check the model architecture to determine the correct number of hidden layers.
  • Modify the gguf_loader.py file to include a default value for num_hidden_layers if it is missing from the Qwen3_5VisionConfig object.

Example code:

# In gguf_loader.py
if not hasattr(config.vision_config, 'num_hidden_layers'):
    config.vision_config.num_hidden_layers = 12  # Replace 12 with the correct number of hidden layers
vision_num_layers = config.vision_config.num_hidden_layers

Alternatively, you can also add a try-except block to handle the AttributeError exception:

# In gguf_loader.py
try:
    vision_num_layers = config.vision_config.num_hidden_layers
except AttributeError:
    vision_num_layers = 12  # Replace 12 with the correct number of hidden layers

Verification

To verify that the fix worked, run the vllm serve command again with the modified gguf_loader.py file. If the issue is resolved, the model should load successfully without any errors.

Extra Tips

  • Make sure to update the num_hidden_layers value to the correct number of hidden layers in the Qwen3_5VisionConfig object.
  • If you are not sure about the number of hidden layers, you can check the model architecture documentation or contact the model authors for more information.
  • Consider submitting a pull request to the vllm-project/vllm repository to include the default value for num_hidden_layers in the Qwen3_5VisionConfig object.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING