vllm - 💡(How to fix) Fix [Bug]: can't start b200x2 or b200x4 sm100 with nvidia/Qwen3.5-397B-A17B-NVFP4 [1 comment, 1 participant]

GitHub stats
vllm-project/vllm#38550 (fetched 2026-04-08 01:53:24)
Comments: 1 · Participants: 1 · Timeline events: 5 · Reactions: 0
Timeline (top): subscribed ×2, commented ×1, labeled ×1, mentioned ×1


Code Example

vllm serve /data/models/nvidia/Qwen3.5-397B-A17B-NVFP4 --served-model-name qwen397b-nvfp4 --tensor-parallel-size 2 --trust-remote-code --gpu-memory-utilization 0.90 --max-model-len 131072 --chat-template-content-format string --enable-auto-tool-choice --tool-call-parser qwen3_coder --host 0.0.0.0 --port 30000
/root/venv-final/lib/python3.12/site-packages/transformers/utils/hub.py:110: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
(APIServer pid=2838675) INFO 03-30 17:09:53 [utils.py:297] 
(APIServer pid=2838675) INFO 03-30 17:09:53 [utils.py:297]        █     █     █▄   ▄█
(APIServer pid=2838675) INFO 03-30 17:09:53 [utils.py:297]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.18.0
(APIServer pid=2838675) INFO 03-30 17:09:53 [utils.py:297]   █▄█▀ █     █     █     █  model   /data/models/nvidia/Qwen3.5-397B-A17B-NVFP4
(APIServer pid=2838675) INFO 03-30 17:09:53 [utils.py:297]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     
(APIServer pid=2838675) INFO 03-30 17:09:53 [utils.py:297] 
(APIServer pid=2838675) INFO 03-30 17:09:53 [utils.py:233] non-default args: {'model_tag': '/data/models/nvidia/Qwen3.5-397B-A17B-NVFP4', 'chat_template_content_format': 'string', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'host': '0.0.0.0', 'port': 30000, 'model': '/data/models/nvidia/Qwen3.5-397B-A17B-NVFP4', 'trust_remote_code': True, 'max_model_len': 131072, 'served_model_name': ['qwen397b-nvfp4'], 'tensor_parallel_size': 2}
(APIServer pid=2838675) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=2838675) INFO 03-30 17:09:53 [model.py:533] Resolved architecture: Qwen3_5MoeForConditionalGeneration
(APIServer pid=2838675) INFO 03-30 17:09:53 [model.py:1582] Using max model len 131072
(APIServer pid=2838675) INFO 03-30 17:09:53 [cache.py:212] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor.
(APIServer pid=2838675) INFO 03-30 17:09:53 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=2838675) INFO 03-30 17:09:58 [config.py:212] Setting attention block size to 4176 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=2838675) INFO 03-30 17:09:58 [config.py:243] Padding mamba page size by 0.19% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=2838675) WARNING 03-30 17:09:58 [modelopt.py:995] Detected ModelOpt NVFP4 checkpoint. Please note that the format is experimental and could change in future.
(APIServer pid=2838675) INFO 03-30 17:09:58 [vllm.py:754] Asynchronous scheduling is enabled.
(APIServer pid=2838675) INFO 03-30 17:09:58 [compilation.py:289] Enabled custom fusions: act_quant, allreduce_rms
/root/venv-final/lib/python3.12/site-packages/transformers/utils/hub.py:110: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
(EngineCore pid=2838941) INFO 03-30 17:10:07 [core.py:103] Initializing a V1 LLM engine (v0.18.0) with config: model='/data/models/nvidia/Qwen3.5-397B-A17B-NVFP4', speculative_config=None, tokenizer='/data/models/nvidia/Qwen3.5-397B-A17B-NVFP4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=modelopt_fp4, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8_e4m3, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=qwen397b-nvfp4, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 
'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': True}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=2838941) WARNING 03-30 17:10:07 [multiproc_executor.py:997] Reducing Torch parallelism from 112 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore pid=2838941) INFO 03-30 17:10:07 [multiproc_executor.py:134] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=10.253.132.10 (local), world_size=2, local_world_size=2
/root/venv-final/lib/python3.12/site-packages/transformers/utils/hub.py:110: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
/root/venv-final/lib/python3.12/site-packages/transformers/utils/hub.py:110: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
(Worker pid=2839263) INFO 03-30 17:10:13 [parallel_state.py:1395] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:56389 backend=nccl
(Worker pid=2839264) INFO 03-30 17:10:14 [parallel_state.py:1395] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:56389 backend=nccl
(Worker pid=2839263) INFO 03-30 17:10:14 [pynccl.py:111] vLLM is using nccl==2.28.9

Your current environment

I've tried various versions of vLLM (0.17.0, 0.17.1, and 0.18.0), but none of them work; the process just hangs and never starts.

🐛 Describe the bug

None of the vLLM versions I tried (0.17.0, 0.17.1, and 0.18.0) can start this model on 2 or 4 B200 GPUs (sm100): vllm serve hangs during startup and never becomes ready.



extent analysis

Fix Plan

The captured log stops right after the `vLLM is using nccl==2.28.9` line, so the process appears to stall during or shortly after distributed (NCCL) initialization of the two workers. The issue does not identify a root cause, so the steps below are general configuration and environment checks rather than a confirmed fix:

  1. Check Environment Variables: The FutureWarning shows that TRANSFORMERS_CACHE is set, which is deprecated in favor of HF_HOME. This is only a deprecation warning and is unlikely to cause the hang, but switching to HF_HOME removes the noise from the log.
  2. Torch Parallelism: The warning shows that vLLM already reduced Torch parallelism from 112 threads to 1 to avoid CPU contention, so it is informational rather than an error. If you want a different thread count, set OMP_NUM_THREADS explicitly (e.g., 4 or 8) before launching; nothing in the log ties this setting to the hang.
  3. Dependency Versions: Verify that transformers, torch, and nccl (the log reports nccl==2.28.9) are compatible with each other and with the installed vLLM version.
  4. Model and Data Path: Confirm that the model path /data/models/nvidia/Qwen3.5-397B-A17B-NVFP4 and any data paths are correct and readable by the user running vllm serve.
  5. Configuration Adjustments: Review the settings passed to vllm serve, especially --tensor-parallel-size 2 and --gpu-memory-utilization 0.90. The title mentions both 2 and 4 B200 GPUs, so make sure --tensor-parallel-size matches the number of GPUs actually visible to the process.
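
The checks in the list above can be sketched as a small preflight function. This is a minimal sketch, not part of vLLM: the function name `preflight`, its parameters, and the `config.json` check (a common convention for Hugging Face style checkpoints) are assumptions made for illustration. The caller supplies the environment mapping and the number of visible GPUs.

```python
from pathlib import Path


def preflight(env, model_path, tensor_parallel_size, visible_gpus):
    """Return a list of findings; an empty list means nothing was flagged.

    Hypothetical helper: none of these checks is a confirmed cause of the hang.
    """
    findings = []

    # TRANSFORMERS_CACHE only triggers a FutureWarning; report it as log noise.
    if "TRANSFORMERS_CACHE" in env and "HF_HOME" not in env:
        findings.append("TRANSFORMERS_CACHE is set without HF_HOME (deprecation warning only)")

    # If OMP_NUM_THREADS is set at all, it should be a positive integer.
    omp = env.get("OMP_NUM_THREADS")
    if omp is not None and not (omp.isdigit() and int(omp) > 0):
        findings.append(f"OMP_NUM_THREADS={omp!r} is not a positive integer")

    # The model directory must exist and look like a checkpoint directory.
    path = Path(model_path)
    if not path.is_dir():
        findings.append(f"model path {model_path} is not a directory")
    elif not (path / "config.json").is_file():
        findings.append(f"{model_path} has no config.json")

    # Tensor parallelism cannot exceed the GPUs the process can see.
    if tensor_parallel_size > visible_gpus:
        findings.append(
            f"--tensor-parallel-size {tensor_parallel_size} exceeds {visible_gpus} visible GPU(s)"
        )
    return findings
```

A typical call would pass `os.environ`, the model path from the issue, and the GPU count of the machine, for example `preflight(os.environ, "/data/models/nvidia/Qwen3.5-397B-A17B-NVFP4", 2, 2)`, and print whatever it returns before launching the server.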

Example Code Snippets

The issue itself does not point to a specific code change. The commands below only restate the original invocation on a single line, with HF_HOME replacing the deprecated TRANSFORMERS_CACHE and OMP_NUM_THREADS set explicitly; /path/to/hf/home is a placeholder to replace with your own cache directory:

export HF_HOME=/path/to/hf/home
export OMP_NUM_THREADS=4

vllm serve /data/models/nvidia/Qwen3.5-397B-A17B-NVFP4 --served-model-name qwen397b-nvfp4 --tensor-parallel-size 2 --trust-remote-code --gpu-memory-utilization 0.90 --max-model-len 131072 --chat-template-content-format string --enable-auto-tool-choice --tool-call-parser qwen3_coder --host 0.0.0.0 --port 30000
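
The command in the issue was pasted with a line break inside `--gpu-memory-utilization`. One way to avoid that kind of copy error is to assemble the invocation as an argument list. The sketch below uses only the flags and values that appear in the issue; the helper names `build_serve_command` and `build_env` are hypothetical, and dropping TRANSFORMERS_CACHE from the child environment is an assumption about how you might silence the deprecation warning.

```python
import shlex

MODEL_PATH = "/data/models/nvidia/Qwen3.5-397B-A17B-NVFP4"


def build_serve_command(tensor_parallel_size=2):
    """Assemble the `vllm serve` argv from the flags used in the issue."""
    return [
        "vllm", "serve", MODEL_PATH,
        "--served-model-name", "qwen397b-nvfp4",
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--trust-remote-code",
        "--gpu-memory-utilization", "0.90",
        "--max-model-len", "131072",
        "--chat-template-content-format", "string",
        "--enable-auto-tool-choice",
        "--tool-call-parser", "qwen3_coder",
        "--host", "0.0.0.0",
        "--port", "30000",
    ]


def build_env(base_env):
    """Copy the environment, preferring HF_HOME over the deprecated variable."""
    env = dict(base_env)
    env.pop("TRANSFORMERS_CACHE", None)            # assumption: remove the deprecated variable
    env.setdefault("HF_HOME", "/path/to/hf/home")  # placeholder path, replace with your own
    env.setdefault("OMP_NUM_THREADS", "4")         # example value from the fix plan
    return env


# Print a shell-safe, single-line version of the command for copy and paste.
print(shlex.join(build_serve_command()))
```

The list can be handed to `subprocess.run(build_serve_command(), env=build_env(os.environ))` directly, which sidesteps shell quoting and line wrapping altogether.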

Verification

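After applying these changes, restart vllm serve and watch its output. In the reported run the last line printed was `vLLM is using nccl==2.28.9` at 17:10:14; startup has progressed further only if new log lines appear after that point, and it has succeeded once the server accepts requests on the configured port (30000).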
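
To compare runs, it can help to extract the last vLLM log line a run reached. The sketch below parses the log line format seen in the issue, `(Process pid=N) LEVEL MM-DD HH:MM:SS [file.py:line] message`; the regular expression and the name `last_progress` are assumptions based only on that sample, so other vLLM versions may format lines differently.

```python
import re

# Matches lines such as:
# (Worker pid=2839263) INFO 03-30 17:10:14 [pynccl.py:111] vLLM is using nccl==2.28.9
LOG_LINE = re.compile(
    r"^\((?P<proc>\w+) pid=(?P<pid>\d+)\) (?P<level>[A-Z]+) "
    r"(?P<date>\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"\[(?P<source>[\w.]+):(?P<line>\d+)\] ?(?P<message>.*)$"
)


def last_progress(log_text):
    """Return (time, source, message) of the last parseable log line, or None."""
    last = None
    for raw in log_text.splitlines():
        match = LOG_LINE.match(raw.strip())
        if match:
            # Lines that are not vLLM log records (e.g. Python warnings) are skipped.
            last = (match["time"], match["source"], match["message"])
    return last
```

Running it over the log of each attempt shows whether a change moved the stall point past `pynccl.py` or left it in the same place.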

Extra Tips

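  • The reporter already tried vLLM 0.17.0, 0.17.1, and 0.18.0; before retrying, check whether a newer release or its dependencies mention fixes relevant to this model or to NVFP4 checkpoints.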
  • Consider testing with a smaller model or different configuration settings to isolate the issue.
  • If the problem persists, providing more details about your environment, such as GPU specifications and driver versions, could help in diagnosing the issue.
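
The GPU and driver details mentioned in the last tip can be gathered with `nvidia-smi`. The sketch below is one possible way to do it, assuming `nvidia-smi` is on PATH; the helper names `parse_gpu_rows` and `collect_gpu_info` are hypothetical, and the function returns an empty list when the tool is missing or fails.

```python
import shutil
import subprocess


def parse_gpu_rows(csv_text):
    """Parse `name, driver_version, memory.total` CSV rows into dictionaries."""
    rows = []
    for line in csv_text.strip().splitlines():
        parts = [part.strip() for part in line.split(",")]
        if len(parts) == 3:
            rows.append({"name": parts[0], "driver": parts[1], "memory": parts[2]})
    return rows


def collect_gpu_info():
    """Query nvidia-smi if it is available; return an empty list otherwise."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=name,driver_version,memory.total",
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_gpu_rows(result.stdout) if result.returncode == 0 else []
```

Pasting the output of `collect_gpu_info()` into the issue, together with the vLLM, torch, and NCCL versions, gives maintainers the environment details that the report currently lacks.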
