vllm - ✅ (Solved) [Bug]: model with GGUF quant type failed to run [1 pull request, 1 comment, 1 participant]

GitHub stats
vllm-project/vllm#41475 (fetched 2026-05-02 05:27:56)
Comments: 1 · Participants: 1 · Timeline: 3 · Reactions: 0
Timeline (top): commented ×1, cross-referenced ×1, labeled ×1

Error Message

(APIServer pid=1) TypeError: Path.replace() takes 2 positional arguments but 3 were given


PR fix notes

PR #41488: [codex] Fix ModelScope config probing for GGUF quant model ids

Description (problem / solution / changelog)

Purpose

Fixes #41475. Serving a GGUF model id with a quant suffix such as repo:Q4_K_M under VLLM_USE_MODELSCOPE=True crashes during speculator-config probing before model loading.

Root cause

maybe_override_with_speculators() split remote GGUF selectors but wrapped the repo id with Path(repo_id) before calling PretrainedConfig.get_config_dict(...). In the ModelScope-patched path, that value is treated as a repo id and ModelScope calls repo_id.replace("/", "___"), dispatching to Path.replace() instead of str.replace().
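
The dispatch difference described above can be reproduced without vLLM or ModelScope. The following is a minimal sketch using only the standard library; the repo id is a hypothetical placeholder, and the snippet only imitates the single `.replace("/", "___")` call named in the root cause.

```python
from pathlib import Path

repo_id = "org/model-GGUF"  # hypothetical repo id, not the one from the issue

# str.replace(old, new) substitutes text, which is what a caller
# expecting a repo id string relies on.
as_str = repo_id.replace("/", "___")

# Path.replace(target) renames a file and accepts a single argument,
# so the two-argument call fails on arity before touching the filesystem.
try:
    Path(repo_id).replace("/", "___")
    error = None
except TypeError as exc:
    error = exc

print(as_str)
print(type(error).__name__)
```

Because the failure is an argument-count mismatch, it occurs regardless of whether the path exists on disk.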

Fix

For remote GGUF repo:quant selectors, pass the split repo id as a plain string to the config probe. The original quant-suffixed model value is preserved when no speculators config exists, so later GGUF loading still sees the selected quant type.
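
The shape of the fix can be sketched as follows. This is not the vLLM code: `split_selector`, `fake_modelscope_probe`, `probe_target_before`, and `probe_target_after` are hypothetical names introduced for illustration, and the stand-in probe only mimics the one `.replace("/", "___")` call described in the root cause.

```python
from pathlib import Path
from typing import Optional, Tuple


def split_selector(model: str) -> Tuple[str, Optional[str]]:
    """Hypothetical helper: split 'repo:QUANT' into (repo, quant)."""
    repo, sep, quant = model.rpartition(":")
    return (repo, quant) if sep else (model, None)


def fake_modelscope_probe(repo_id) -> str:
    """Stand-in for the ModelScope-patched probe (an assumption, not its real code)."""
    return repo_id.replace("/", "___")


def probe_target_before(model: str) -> Path:
    repo, _ = split_selector(model)
    return Path(repo)  # pre-fix behaviour: repo id wrapped in Path


def probe_target_after(model: str) -> str:
    repo, _ = split_selector(model)
    return repo  # post-fix behaviour: plain string


model = "org/model-GGUF:Q4_K_M"  # hypothetical selector

try:
    fake_modelscope_probe(probe_target_before(model))
    before_ok = True
except TypeError:
    before_ok = False

cache_name = fake_modelscope_probe(probe_target_after(model))

# The quant-suffixed value is kept unchanged for later GGUF loading.
model_for_loading = model
```

Only the value handed to the probe changes type; the original `repo:quant` string is left untouched, which matches the note above that later GGUF loading still sees the selected quant type.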

Tests

  • ruff check vllm/transformers_utils/config.py tests/transformers_utils/test_config.py
  • ruff format --check vllm/transformers_utils/config.py tests/transformers_utils/test_config.py
  • PYTHONPATH=. uv run --no-project ... pytest -q tests/transformers_utils/test_config.py -k speculators --confcutdir=tests/transformers_utils -> 1 passed

Notes

This does not download or load the large GGUF model in tests; the regression is isolated at config-probing time.
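
A regression check of this kind can be isolated with a mocked probe, so nothing is downloaded. The sketch below is an illustration under stated assumptions, not the test added in the PR: `ConfigLoader` and `maybe_override` are hypothetical simplified stand-ins, and the `"speculators_config"` key name is assumed.

```python
from unittest import mock


class ConfigLoader:
    """Hypothetical stand-in for the config class whose probe is patched."""

    @staticmethod
    def get_config_dict(name_or_path, **kwargs):
        raise RuntimeError("network access is not expected in this check")


def maybe_override(model: str) -> str:
    """Simplified, hypothetical version of the function under test."""
    repo, sep, _quant = model.rpartition(":")
    probe_target = repo if sep else model  # plain str, never a Path
    config_dict, _ = ConfigLoader.get_config_dict(probe_target)
    if "speculators_config" not in config_dict:  # key name is an assumption
        return model  # keep the quant suffix for later loading
    return probe_target


def run_regression_check():
    model = "org/model-GGUF:Q4_K_M"  # hypothetical selector
    # Replace the probe with a mock that reports "no speculators config".
    with mock.patch.object(
        ConfigLoader, "get_config_dict", return_value=({}, {})
    ) as probe:
        result = maybe_override(model)
    probed = probe.call_args.args[0]
    return probed, result


probed, result = run_regression_check()
```

The two properties worth asserting are that the probe receives a `str` and that the returned model id still carries the quant suffix.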

Changed files

  • tests/transformers_utils/test_config.py (modified, +46/-1)
  • vllm/transformers_utils/config.py (modified, +1/-1)

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

docker vllm/vllm-openai:v0.20.0

</details>

🐛 Describe the bug

When I run Docker with the following command:

docker run -d \
  --name vllm \
  --gpus all \
  --ipc=host \
  --shm-size=16g \
  -e NCCL_P2P_DISABLE=0 \
  -e NCCL_IB_DISABLE=1 \
  -e VLLM_USE_MODELSCOPE=True \
  -p 5000:5000 \
  vllm/vllm-openai:v0.20.0 \
  --model hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF:Q4_K_M \
  --max-model-len 102400 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.9 \
  --max-num-seqs 24 \
  --max-num-batched-tokens 8192 \
  --language-model-only \
  --enable-prefix-caching \
  --default-chat-template-kwargs '{"enable_thinking":false}'

it fails to run. Here is the log:

WARNING 05-01 19:13:17 [argparse_utils.py:257] With vllm serve, you should provide the model as a positional argument or in a config file instead of via the --model option. The --model option will be removed in a future version.
(APIServer pid=1) INFO 05-01 19:13:17 [utils.py:299]
(APIServer pid=1) INFO 05-01 19:13:17 [utils.py:299] █ █ █▄ ▄█
(APIServer pid=1) INFO 05-01 19:13:17 [utils.py:299] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.20.0
(APIServer pid=1) INFO 05-01 19:13:17 [utils.py:299] █▄█▀ █ █ █ █ model hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF:Q4_K_M
(APIServer pid=1) INFO 05-01 19:13:17 [utils.py:299] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=1) INFO 05-01 19:13:17 [utils.py:299]
(APIServer pid=1) INFO 05-01 19:13:17 [utils.py:233] non-default args: {'model_tag': 'hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF:Q4_K_M', 'default_chat_template_kwargs': {'enable_thinking': False}, 'model': 'hesamation/Qwen3.6-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled-GGUF:Q4_K_M', 'max_model_len': 102400, 'gpu_memory_utilization': 0.9, 'kv_cache_dtype': 'fp8', 'enable_prefix_caching': True, 'language_model_only': True, 'max_num_batched_tokens': 8192, 'max_num_seqs': 24}
(APIServer pid=1) Traceback (most recent call last):
(APIServer pid=1)   File "/usr/local/bin/vllm", line 10, in <module>
(APIServer pid=1)     sys.exit(main())
(APIServer pid=1)              ^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/main.py", line 92, in main
(APIServer pid=1)     args.dispatch_function(args)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=1)     uvloop.run(run_server(args))
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=1)     return __asyncio.run(
(APIServer pid=1)            ^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=1)     return runner.run(main)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=1)     return self._loop.run_until_complete(task)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=1)     return await main
(APIServer pid=1)            ^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 678, in run_server
(APIServer pid=1)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 692, in run_server_worker
(APIServer pid=1)     async with build_async_engine_client(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=1)     async with build_async_engine_client_from_engine_args(
(APIServer pid=1)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=1)     return await anext(self.gen)
(APIServer pid=1)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 124, in build_async_engine_client_from_engine_args
(APIServer pid=1)     vllm_config = engine_args.create_engine_config(usage_context=usage_context)
(APIServer pid=1)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1617, in create_engine_config
(APIServer pid=1)     maybe_override_with_speculators(
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/vllm/transformers_utils/config.py", line 598, in maybe_override_with_speculators
(APIServer pid=1)     config_dict, _ = PretrainedConfig.get_config_dict(
(APIServer pid=1)                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/modelscope/utils/hf_util/patcher.py", line 209, in patch_get_config_dict
(APIServer pid=1)     model_dir = get_model_dir(pretrained_model_name_or_path,
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/modelscope/utils/hf_util/patcher.py", line 184, in get_model_dir
(APIServer pid=1)     model_dir = snapshot_download(
(APIServer pid=1)                 ^^^^^^^^^^^^^^^^^^
(APIServer pid=1)   File "/usr/local/lib/python3.12/dist-packages/modelscope/hub/snapshot_download.py", line 139, in snapshot_download
(APIServer pid=1)     repo_id.replace('/', '___'))
(APIServer pid=1)     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) TypeError: Path.replace() takes 2 positional arguments but 3 were given


extent analysis

TL;DR

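With VLLM_USE_MODELSCOPE=True, vllm 0.20.0 crashes on a remote GGUF repo:quant model id because maybe_override_with_speculators() hands the repo id to the config probe as a pathlib.Path rather than a str. ModelScope then calls repo_id.replace("/", "___"), which dispatches to Path.replace() and raises the TypeError. PR #41488 addresses this on the vLLM side.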

Guidance

  • The TypeError surfaces in ModelScope's snapshot_download(), but the value it fails on comes from vLLM: maybe_override_with_speculators() wrapped the split repo id in Path(repo_id) before calling PretrainedConfig.get_config_dict(...).
  • With VLLM_USE_MODELSCOPE=True, that call goes through ModelScope's patched get_config_dict, which treats the argument as a repo id and calls repo_id.replace("/", "___"). On a Path this dispatches to Path.replace(), which takes a single target argument.
  • Updating the modelscope library alone is unlikely to help, because the Path object is created on the vLLM side. The fix in PR #41488 passes the repo id to the config probe as a plain string.
  • As a possible workaround (not verified in the issue), download the model manually and pass the local model path to vllm instead of the remote repo:quant id.

Example

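The failure reduces to a single call: ModelScope invokes repo_id.replace("/", "___") expecting str.replace(old, new), but on a pathlib.Path the same name resolves to Path.replace(target), a rename method that accepts only one argument.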

Notes

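The issue was reported against vllm 0.20.0 (Docker image vllm/vllm-openai:v0.20.0) with VLLM_USE_MODELSCOPE=True. The crash occurs during speculator-config probing, before model loading, so the GPU memory, KV-cache, and batching options in the command are unlikely to be involved.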

Recommendation

Apply the fix: use a vLLM build that includes the change from PR #41488. As an interim workaround, download the model manually and pass the local model path in the vllm command; this workaround is not verified in the issue thread.
