vllm - [Bug]: Qwen3.5 does not work with pipeline parallelism [9 comments, 4 participants]

GitHub stats
vllm-project/vllm#36643
Fetched: 2026-04-08 00:35:43
Comments: 9
Participants: 4
Timeline events: 24
Reactions: 0
Timeline (top): commented ×9, subscribed ×8, mentioned ×6, labeled ×1

Error Message

NotImplementedError: Pipeline parallelism is not supported for this model. Supported models implement the `SupportsPP` interface.

Raised at server startup (vLLM 0.17.0) by `verify_with_parallel_config` in vllm/config/model.py (line 1065), which was called on `self.draft_model_config` from `_verify_args` in vllm/config/speculative.py (line 764). The launch used --pipeline-parallel-size 4 together with --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'. The full command and traceback are in the bug description.


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Your output of `python collect_env.py` here
</details>

🐛 Describe the bug

vLLM server command (bash)

python3 -m vllm.entrypoints.openai.api_server --served-model-name Qwen3.5-35B-A3B --model /athena/Qwen3.5-35B-A3B --gpu-memory-utilization 0.9 --tensor-parallel-size 1 --pipeline-parallel-size 4 --moe-backend marlin --max-model-len 160000 --max-num-batched-tokens 4096 --max-num-seqs 32 --distributed-executor-backend ray --enable-log-requests --enable-log-outputs --enable-auto-tool-choice --tool-call-parser qwen3_coder --enable-prefix-caching --reasoning-parser qwen3 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

error output

root@xuanwu-text-safety-qwen3-5-1355630-z8p8q:/data# nohup python3 -m vllm.entrypoints.openai.api_server --served-model-name Qwen3.5-35B-A3B --model /athena/Qwen3.5-35B-A3B --gpu-memory-utilization 0.9 --tensor-parallel-size 1 --pipeline-parallel-size 4 --moe-backend marlin --max-model-len 160000 --max-num-batched-tokens 4096 --max-num-seqs 32 --distributed-executor-backend ray --enable-log-requests --enable-log-outputs --enable-auto-tool-choice --tool-call-parser qwen3_coder --enable-prefix-caching --reasoning-parser qwen3 --speculative-config  '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' > vllm.log 2>&1 &
[1] 443977
root@xuanwu-text-safety-qwen3-5-1355630-z8p8q:/data# tail -f vllm.log  
nohup: ignoring input
(APIServer pid=443977) INFO 03-10 19:20:41 [utils.py:302] 
(APIServer pid=443977) INFO 03-10 19:20:41 [utils.py:302]        █     █     █▄   ▄█
(APIServer pid=443977) INFO 03-10 19:20:41 [utils.py:302]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.17.0
(APIServer pid=443977) INFO 03-10 19:20:41 [utils.py:302]   █▄█▀ █     █     █     █  model   /athena/Qwen3.5-35B-A3B
(APIServer pid=443977) INFO 03-10 19:20:41 [utils.py:302]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=443977) INFO 03-10 19:20:41 [utils.py:302] 
(APIServer pid=443977) INFO 03-10 19:20:41 [utils.py:238] non-default args: {'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'enable_log_outputs': True, 'model': '/athena/Qwen3.5-35B-A3B', 'max_model_len': 160000, 'served_model_name': ['Qwen3.5-35B-A3B'], 'reasoning_parser': 'qwen3', 'distributed_executor_backend': 'ray', 'pipeline_parallel_size': 4, 'enable_prefix_caching': True, 'max_num_batched_tokens': 4096, 'max_num_seqs': 32, 'moe_backend': 'marlin', 'speculative_config': {'method': 'qwen3_next_mtp', 'num_speculative_tokens': 2}, 'enable_log_requests': True}
(APIServer pid=443977) INFO 03-10 19:20:41 [model.py:531] Resolved architecture: Qwen3_5MoeForConditionalGeneration
(APIServer pid=443977) INFO 03-10 19:20:41 [model.py:1554] Using max model len 160000
(APIServer pid=443977) WARNING 03-10 19:20:41 [speculative.py:346] method `qwen3_next_mtp` is deprecated and replaced with mtp.
(APIServer pid=443977) INFO 03-10 19:20:41 [model.py:531] Resolved architecture: Qwen3_5MoeMTP
(APIServer pid=443977) INFO 03-10 19:20:41 [model.py:1554] Using max model len 262144
(APIServer pid=443977) WARNING 03-10 19:20:41 [speculative.py:487] Enabling num_speculative_tokens > 1 will run multiple times of forward on same MTP layer,which may result in lower acceptance rate
(APIServer pid=443977) Traceback (most recent call last):
(APIServer pid=443977)   File "<frozen runpy>", line 198, in _run_module_as_main
(APIServer pid=443977)   File "<frozen runpy>", line 88, in _run_code
(APIServer pid=443977)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 545, in <module>
(APIServer pid=443977)     uvloop.run(run_server(args))
(APIServer pid=443977)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=443977)     return __asyncio.run(
(APIServer pid=443977)            ^^^^^^^^^^^^^^
(APIServer pid=443977)   File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
(APIServer pid=443977)     return runner.run(main)
(APIServer pid=443977)            ^^^^^^^^^^^^^^^^
(APIServer pid=443977)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=443977)     return self._loop.run_until_complete(task)
(APIServer pid=443977)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=443977)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=443977)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=443977)     return await main
(APIServer pid=443977)            ^^^^^^^^^^
(APIServer pid=443977)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 471, in run_server
(APIServer pid=443977)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=443977)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 490, in run_server_worker
(APIServer pid=443977)     async with build_async_engine_client(
(APIServer pid=443977)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=443977)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=443977)     return await anext(self.gen)
(APIServer pid=443977)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=443977)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 96, in build_async_engine_client
(APIServer pid=443977)     async with build_async_engine_client_from_engine_args(
(APIServer pid=443977)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=443977)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=443977)     return await anext(self.gen)
(APIServer pid=443977)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=443977)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 122, in build_async_engine_client_from_engine_args
(APIServer pid=443977)     vllm_config = engine_args.create_engine_config(usage_context=usage_context)
(APIServer pid=443977)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=443977)   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1729, in create_engine_config
(APIServer pid=443977)     speculative_config = self.create_speculative_config(
(APIServer pid=443977)                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=443977)   File "/usr/local/lib/python3.12/dist-packages/vllm/engine/arg_utils.py", line 1443, in create_speculative_config
(APIServer pid=443977)     return SpeculativeConfig(**self.speculative_config)
(APIServer pid=443977)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=443977)   File "/usr/local/lib/python3.12/dist-packages/pydantic/_internal/_dataclasses.py", line 121, in __init__
(APIServer pid=443977)     s.__pydantic_validator__.validate_python(ArgsKwargs(args, kwargs), self_instance=s)
(APIServer pid=443977)   File "/usr/local/lib/python3.12/dist-packages/vllm/config/speculative.py", line 764, in _verify_args
(APIServer pid=443977)     self.draft_model_config.verify_with_parallel_config(
(APIServer pid=443977)   File "/usr/local/lib/python3.12/dist-packages/vllm/config/model.py", line 1065, in verify_with_parallel_config
(APIServer pid=443977)     raise NotImplementedError(
(APIServer pid=443977) NotImplementedError: Pipeline parallelism is not supported for this model. Supported models implement the `SupportsPP` interface.


extent analysis

Fix Plan

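The error is raised while vLLM validates the speculative-decoding configuration: the traceback shows `verify_with_parallel_config` being called on `self.draft_model_config`, and the log resolves the draft architecture as `Qwen3_5MoeMTP`. In vLLM 0.17.0 it is therefore the MTP draft model that fails the `SupportsPP` check when --pipeline-parallel-size is 4; the log does not show whether the target model (`Qwen3_5MoeForConditionalGeneration`) would pass the same check on its own. To fix this issue, stop combining pipeline parallelism with this draft model: either disable pipeline parallelism or drop the speculative configuration.

Here are the steps to fix the issue:

  • Check whether the target model and the MTP draft model support pipeline parallelism in your vLLM version. Supported models implement the `SupportsPP` interface.
  • To keep MTP speculative decoding, set --pipeline-parallel-size to 1 to disable pipeline parallelism. With --tensor-parallel-size 1 the model then runs on a single GPU, so consider raising --tensor-parallel-size if it does not fit.
  • Alternatively, keep --pipeline-parallel-size 4 and remove --speculative-config, or use a different model that supports pipeline parallelism. Whether Qwen3.5-35B-A3B starts under pipeline parallelism without the draft model is not confirmed by this log, so treat it as something to test.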
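The first step can be partly automated by reading the traceback itself. The sketch below is a hypothetical helper (the name `failing_config` is not part of vLLM) that looks for the pipeline-parallelism error and reports which config attribute the check was called on, using only the last lines of the log quoted in this issue.

```python
import re
from typing import Optional

# Last lines of the traceback quoted in this issue.
SAMPLE_LOG = """\
(APIServer pid=443977)   File "/usr/local/lib/python3.12/dist-packages/vllm/config/speculative.py", line 764, in _verify_args
(APIServer pid=443977)     self.draft_model_config.verify_with_parallel_config(
(APIServer pid=443977)   File "/usr/local/lib/python3.12/dist-packages/vllm/config/model.py", line 1065, in verify_with_parallel_config
(APIServer pid=443977)     raise NotImplementedError(
(APIServer pid=443977) NotImplementedError: Pipeline parallelism is not supported for this model. Supported models implement the `SupportsPP` interface.
"""

PP_ERROR = "Pipeline parallelism is not supported for this model"
# Matches the call site line, e.g. "self.draft_model_config.verify_with_parallel_config("
CALL_SITE = re.compile(r"self\.(\w+)\.verify_with_parallel_config\(")


def failing_config(log_text: str) -> Optional[str]:
    """Return the attribute the PP check was called on, or None if the error is absent.

    Hypothetical helper for reading logs; it does not import or call vLLM.
    """
    if PP_ERROR not in log_text:
        return None
    matches = CALL_SITE.findall(log_text)
    # Use the last call site, the one closest to the raised exception.
    return matches[-1] if matches else "unknown"


print(failing_config(SAMPLE_LOG))
```

On the excerpt above this prints draft_model_config, which is why the speculative configuration, rather than the target model alone, is the first thing to change.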

Example code changes:

# Disable pipeline parallelism and keep MTP speculative decoding.
# The method name is changed from qwen3_next_mtp to mtp because the log warns that qwen3_next_mtp is deprecated.
python3 -m vllm.entrypoints.openai.api_server --served-model-name Qwen3.5-35B-A3B --model /athena/Qwen3.5-35B-A3B --gpu-memory-utilization 0.9 --tensor-parallel-size 1 --pipeline-parallel-size 1 --moe-backend marlin --max-model-len 160000 --max-num-batched-tokens 4096 --max-num-seqs 32 --distributed-executor-backend ray --enable-log-requests --enable-log-outputs --enable-auto-tool-choice --tool-call-parser qwen3_coder --enable-prefix-caching --reasoning-parser qwen3 --speculative-config '{"method":"mtp","num_speculative_tokens":2}'
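The two workarounds differ in a single flag each, which is easy to get wrong when editing a long command by hand. As a minimal sketch, the snippet below derives both variants from a shortened copy of the original argument list. The helpers `set_flag` and `drop_flag` are hypothetical, only flags that appear in this issue are used, and the second variant is untested: the log does not show whether the target model passes the pipeline-parallelism check without the draft model.

```python
import json
import shlex

# Shortened copy of the command from this issue (only the flags relevant to the error).
ORIGINAL_ARGV = [
    "python3", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "/athena/Qwen3.5-35B-A3B",
    "--tensor-parallel-size", "1",
    "--pipeline-parallel-size", "4",
    "--speculative-config",
    json.dumps({"method": "qwen3_next_mtp", "num_speculative_tokens": 2}),
]


def set_flag(argv, flag, value):
    """Return a copy of argv with the value following `flag` replaced."""
    out = list(argv)
    out[out.index(flag) + 1] = value
    return out


def drop_flag(argv, flag):
    """Return a copy of argv with `flag` and its value removed."""
    out = list(argv)
    i = out.index(flag)
    del out[i:i + 2]
    return out


# Variant 1: keep speculative decoding, disable pipeline parallelism.
no_pp = set_flag(ORIGINAL_ARGV, "--pipeline-parallel-size", "1")
# Variant 2 (untested assumption): keep pipeline parallelism, drop the draft model.
no_spec = drop_flag(ORIGINAL_ARGV, "--speculative-config")

print(shlex.join(no_pp))
print(shlex.join(no_spec))
```

Building the variants from one list keeps the remaining flags identical, so any change in startup behaviour can be attributed to the flag that was altered.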

Verification

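To verify that the fix worked, start the API server with the updated configuration and check vllm.log. The `NotImplementedError: Pipeline parallelism is not supported for this model` traceback should no longer appear, and the server should finish starting and accept requests.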
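Beyond reading the log, one way to confirm that the server came up is to list the served models over the OpenAI-compatible API. This is a sketch under assumptions: it assumes the server listens on http://localhost:8000 (no --port was given in the issue) and that `/v1/models` returns the usual OpenAI-style list. The function names are hypothetical.

```python
import json
import urllib.request


def served_model_ids(payload: dict) -> list:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_models(base_url: str = "http://localhost:8000") -> dict:
    """Query the running server; base_url is an assumption, adjust to your deployment."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
        return json.load(resp)


# Offline illustration with a hand-written payload in the expected shape.
SAMPLE_PAYLOAD = {"object": "list", "data": [{"id": "Qwen3.5-35B-A3B", "object": "model"}]}
print(served_model_ids(SAMPLE_PAYLOAD))
```

Once the server is running, `served_model_ids(fetch_models())` should contain the name passed to --served-model-name, here Qwen3.5-35B-A3B.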

Extra Tips

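  • Check the vLLM documentation for the parallelism configurations each model supports, and remember that a speculative draft model is validated separately from the target model.
  • If you switch to a different model, make sure it is compatible with the rest of the configuration, such as the tool-call parser, the reasoning parser, and the MoE backend.
  • Read the startup log from the top: warnings such as the deprecated `qwen3_next_mtp` method name appear before the traceback and point to configuration that should be updated.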
