vllm - 💡(How to fix) Fix [Bug]: can't start b200x2 or b200x4 sm100 with nvidia/Qwen3.5-397B-A17B-NVFP4 [1 comments, 1 participants]

vllm · 2026-03-30 13:31:15

GitHub stats

vllm-project/vllm#38550•Fetched 2026-04-08 01:53:24

Author

evgeniiperepelkin

Participants

evgeniiperepelkin

Timeline (top)

subscribed ×2, commented ×1, labeled ×1, mentioned ×1

Error Message

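No explicit error is reported. The only repeated diagnostic is a FutureWarning from Transformers (emitted via warnings.warn), and the pasted log ends after the line "vLLM is using nccl==2.28.9".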

Code Example

vllm serve /data/models/nvidia/Qwen3.5-397B-A17B-NVFP4 --served-model-name qwen397b-nvfp4 --tensor-parallel-size 2 --trust-remote-code --gpu-memory-utilization 0.90 --max-model-len 131072 --chat-template-content-format string --enable-auto-tool-choice --tool-call-parser qwen3_coder --host 0.0.0.0 --port 30000
/root/venv-final/lib/python3.12/site-packages/transformers/utils/hub.py:110: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
(APIServer pid=2838675) INFO 03-30 17:09:53 [utils.py:297] 
(APIServer pid=2838675) INFO 03-30 17:09:53 [utils.py:297]        █     █     █▄   ▄█
(APIServer pid=2838675) INFO 03-30 17:09:53 [utils.py:297]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.18.0
(APIServer pid=2838675) INFO 03-30 17:09:53 [utils.py:297]   █▄█▀ █     █     █     █  model   /data/models/nvidia/Qwen3.5-397B-A17B-NVFP4
(APIServer pid=2838675) INFO 03-30 17:09:53 [utils.py:297]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=2838675) INFO 03-30 17:09:53 [utils.py:297] 
(APIServer pid=2838675) INFO 03-30 17:09:53 [utils.py:233] non-default args: {'model_tag': '/data/models/nvidia/Qwen3.5-397B-A17B-NVFP4', 'chat_template_content_format': 'string', 'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'host': '0.0.0.0', 'port': 30000, 'model': '/data/models/nvidia/Qwen3.5-397B-A17B-NVFP4', 'trust_remote_code': True, 'max_model_len': 131072, 'served_model_name': ['qwen397b-nvfp4'], 'tensor_parallel_size': 2}
(APIServer pid=2838675) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=2838675) INFO 03-30 17:09:53 [model.py:533] Resolved architecture: Qwen3_5MoeForConditionalGeneration
(APIServer pid=2838675) INFO 03-30 17:09:53 [model.py:1582] Using max model len 131072
(APIServer pid=2838675) INFO 03-30 17:09:53 [cache.py:212] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor.
(APIServer pid=2838675) INFO 03-30 17:09:53 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=2838675) INFO 03-30 17:09:58 [config.py:212] Setting attention block size to 4176 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=2838675) INFO 03-30 17:09:58 [config.py:243] Padding mamba page size by 0.19% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=2838675) WARNING 03-30 17:09:58 [modelopt.py:995] Detected ModelOpt NVFP4 checkpoint. Please note that the format is experimental and could change in future.
(APIServer pid=2838675) INFO 03-30 17:09:58 [vllm.py:754] Asynchronous scheduling is enabled.
(APIServer pid=2838675) INFO 03-30 17:09:58 [compilation.py:289] Enabled custom fusions: act_quant, allreduce_rms
/root/venv-final/lib/python3.12/site-packages/transformers/utils/hub.py:110: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
(EngineCore pid=2838941) INFO 03-30 17:10:07 [core.py:103] Initializing a V1 LLM engine (v0.18.0) with config: model='/data/models/nvidia/Qwen3.5-397B-A17B-NVFP4', speculative_config=None, tokenizer='/data/models/nvidia/Qwen3.5-397B-A17B-NVFP4', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=modelopt_fp4, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8_e4m3, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=qwen397b-nvfp4, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 
'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': True}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=2838941) WARNING 03-30 17:10:07 [multiproc_executor.py:997] Reducing Torch parallelism from 112 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed.
(EngineCore pid=2838941) INFO 03-30 17:10:07 [multiproc_executor.py:134] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=10.253.132.10 (local), world_size=2, local_world_size=2
/root/venv-final/lib/python3.12/site-packages/transformers/utils/hub.py:110: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
/root/venv-final/lib/python3.12/site-packages/transformers/utils/hub.py:110: FutureWarning: Using `TRANSFORMERS_CACHE` is deprecated and will be removed in v5 of Transformers. Use `HF_HOME` instead.
  warnings.warn(
(Worker pid=2839263) INFO 03-30 17:10:13 [parallel_state.py:1395] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:56389 backend=nccl
(Worker pid=2839264) INFO 03-30 17:10:14 [parallel_state.py:1395] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:56389 backend=nccl
(Worker pid=2839263) INFO 03-30 17:10:14 [pynccl.py:111] vLLM is using nccl==2.28.9


Your current environment

<details> I've tried various versions of vLLM — 0.17.0, 0.17.1, and 0.18.0 — but none of them work; the process just hangs and never starts </details>

🐛 Describe the bug

I've tried various versions of vLLM — 0.17.0, 0.17.1, and 0.18.0 — but none of them work; the process just hangs and never starts



extent analysis

Fix Plan

The process hangs during startup without printing an error, and the pasted log ends right after the workers report the NCCL version (vLLM is using nccl==2.28.9). Nothing in the log identifies the cause, so the steps below are checks to narrow it down rather than a confirmed fix:

Check Environment Variables: The FutureWarning about TRANSFORMERS_CACHE is a deprecation notice from Transformers, not an error. Setting HF_HOME instead removes the warning, but it is unlikely to be related to the hang.
Torch Parallelism: The log shows that vLLM already reduced Torch parallelism from 112 threads to 1 to avoid CPU contention. OMP_NUM_THREADS can be set explicitly (e.g., 4 or 8) to tune this, but that raises the thread count above the value vLLM chose rather than lowering it.
Dependency Versions: Verify that transformers, torch, and NCCL (2.28.9 in the log) are compatible with each other and with the installed vLLM release. The reporter saw the same hang on 0.17.0, 0.17.1, and 0.18.0.
Model and Data Path: Confirm that the model path /data/models/nvidia/Qwen3.5-397B-A17B-NVFP4 exists and is readable by the user running vllm serve. The log shows the architecture being resolved from this path (Qwen3_5MoeForConditionalGeneration), so at least the model configuration is being found.
Configuration Adjustments: Review the settings passed to vllm serve, especially --tensor-parallel-size (2 in the log; the title also mentions a 4-GPU setup) and --max-model-len 131072. The value of --gpu-memory-utilization does not appear among the non-default args in the log, which suggests 0.90 matches the default.
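
The path and environment checks above can be scripted and run before launching. This is a minimal sketch, not part of vLLM: the helper name preflight is hypothetical, and it assumes a local Hugging Face-style checkpoint directory that contains a config.json.

```python
import os
from pathlib import Path


def preflight(model_dir, env=None):
    """Return a list of findings; an empty list means nothing was flagged."""
    env = os.environ if env is None else env
    findings = []

    path = Path(model_dir)
    if not path.is_dir():
        findings.append(f"model directory not found: {path}")
    elif not (path / "config.json").is_file():
        # Assumption: a Hugging Face-style checkpoint ships a config.json.
        findings.append(f"no config.json in {path}")

    # The startup log shows a FutureWarning about TRANSFORMERS_CACHE.
    # It is only a deprecation notice; HF_HOME is the suggested replacement.
    if "TRANSFORMERS_CACHE" in env and "HF_HOME" not in env:
        findings.append("TRANSFORMERS_CACHE is set without HF_HOME (deprecated)")

    # The log reports vLLM reducing Torch parallelism to 1 thread and
    # suggests OMP_NUM_THREADS for tuning.
    if "OMP_NUM_THREADS" not in env:
        findings.append("OMP_NUM_THREADS is unset; vLLM will pick its own value")

    return findings


if __name__ == "__main__":
    for finding in preflight("/data/models/nvidia/Qwen3.5-397B-A17B-NVFP4"):
        print("WARN:", finding)
```

None of these findings explains a hang on its own; the script only rules out the simplest configuration mistakes.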

Example Code Snippets

The issue does not point to a specific code change. The snippet below only sets the two environment variables discussed above (the HF_HOME path and the thread count are placeholders) and reruns the original command:

export HF_HOME=/path/to/hf/home
export OMP_NUM_THREADS=4

vllm serve /data/models/nvidia/Qwen3.5-397B-A17B-NVFP4 --served-model-name qwen397b-nvfp4 --tensor-parallel-size 2 --trust-remote-code --gpu-memory-utilization 0.90 --max-model-len 131072 --chat-template-content-format string --enable-auto-tool-choice --tool-call-parser qwen3_coder --host 0.0.0.0 --port 30000
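
Since the pasted log ends at the "vLLM is using nccl==2.28.9" line with no error, more verbose logging may help locate the stall. The sketch below only assembles the command and environment and does not start the server. NCCL_DEBUG is a standard NCCL variable and VLLM_LOGGING_LEVEL is a vLLM environment variable; the helper name build_debug_launch and the chosen levels are assumptions for illustration.

```python
import os
import shlex


def build_debug_launch(model_dir, tp_size=2, base_env=None):
    """Return (argv, env) for a vllm serve run with more verbose logging."""
    env = dict(os.environ if base_env is None else base_env)
    env["NCCL_DEBUG"] = "INFO"           # NCCL prints its own init/transport logs
    env["VLLM_LOGGING_LEVEL"] = "DEBUG"  # more detailed vLLM logging

    argv = [
        "vllm", "serve", model_dir,
        "--served-model-name", "qwen397b-nvfp4",
        "--tensor-parallel-size", str(tp_size),
        "--gpu-memory-utilization", "0.90",
        "--max-model-len", "131072",
        "--host", "0.0.0.0",
        "--port", "30000",
    ]
    return argv, env


if __name__ == "__main__":
    argv, env = build_debug_launch("/data/models/nvidia/Qwen3.5-397B-A17B-NVFP4")
    # Print the command instead of running it; pass argv and env to
    # subprocess.run(argv, env=env) to actually launch the server.
    print("NCCL_DEBUG=INFO VLLM_LOGGING_LEVEL=DEBUG", shlex.join(argv))
```

The tool-calling and chat-template flags from the original command are left out here to keep the reproduction small; add them back if the hang turns out to depend on them.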

Verification

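After applying these changes, restart vllm serve and watch its output. Startup has succeeded once the log continues past the last line shown above (vLLM is using nccl==2.28.9) and the server accepts requests on port 30000. If the log stops at the same line again, the changes did not address the hang.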
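
To see where startup stalls, it can help to list the last line each process logged. The sketch below parses the log prefix format seen in this issue, "(Name pid=N) LEVEL MM-DD HH:MM:SS [file:line] message". The function name is hypothetical and the sample lines are abridged from the log above.

```python
import re

LOG_LINE = re.compile(
    r"^\((?P<proc>\w+) pid=(?P<pid>\d+)\) (?P<level>[A-Z]+) "
    r"(?P<ts>\d\d-\d\d \d\d:\d\d:\d\d) \[(?P<src>[^\]]+)\] ?(?P<msg>.*)$"
)


def last_line_per_process(lines):
    """Map (process name, pid) to the last (timestamp, source, message) it logged."""
    last = {}
    for line in lines:
        m = LOG_LINE.match(line.rstrip("\n"))
        if m:  # skip Python warnings and wrapped continuation lines
            last[(m["proc"], int(m["pid"]))] = (m["ts"], m["src"], m["msg"])
    return last


# Abridged sample taken from the log in this issue.
SAMPLE = """\
(EngineCore pid=2838941) INFO 03-30 17:10:07 [multiproc_executor.py:134] DP group leader: node_rank=0
(Worker pid=2839263) INFO 03-30 17:10:13 [parallel_state.py:1395] world_size=2 rank=0 local_rank=0
(Worker pid=2839264) INFO 03-30 17:10:14 [parallel_state.py:1395] world_size=2 rank=1 local_rank=1
(Worker pid=2839263) INFO 03-30 17:10:14 [pynccl.py:111] vLLM is using nccl==2.28.9
""".splitlines()

if __name__ == "__main__":
    for (proc, pid), (ts, src, msg) in last_line_per_process(SAMPLE).items():
        print(f"{proc} {pid}: {ts} [{src}] {msg}")
```

On the sample, worker 2839263 last logged from pynccl.py while worker 2839264 last logged from parallel_state.py, which is the kind of difference worth including in a bug report.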

Extra Tips

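Keep dependencies current, since the log itself notes that the ModelOpt NVFP4 checkpoint format is experimental and could change.
Consider testing with a smaller model or a different configuration (for example, another --tensor-parallel-size) to isolate whether the hang is specific to this checkpoint.
If the problem persists, add environment details to the issue. The "Your current environment" section currently contains only the bug description, so GPU specifications and driver versions are missing.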
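
The GPU and driver details mentioned above can be gathered with nvidia-smi. The sketch below is one way to do it from Python, assuming nvidia-smi is on the PATH; the function names are hypothetical, and the query fields used (index, name, driver_version, memory.total) are standard nvidia-smi properties.

```python
import shutil
import subprocess


def parse_rows(text):
    """Split nvidia-smi CSV output into a list of stripped field lists."""
    return [
        [field.strip() for field in line.split(",")]
        for line in text.splitlines()
        if line.strip()
    ]


def gpu_summary(timeout=10):
    """Return one row per GPU, or None if nvidia-smi is missing or fails."""
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None
    try:
        result = subprocess.run(
            [exe,
             "--query-gpu=index,name,driver_version,memory.total",
             "--format=csv,noheader"],
            capture_output=True, text=True, timeout=timeout, check=True,
        )
    except (subprocess.SubprocessError, OSError):
        return None
    return parse_rows(result.stdout)


if __name__ == "__main__":
    rows = gpu_summary()
    if rows is None:
        print("nvidia-smi not available")
    else:
        for index, name, driver, memory in rows:
            print(f"GPU {index}: {name}, driver {driver}, {memory}")
```

Pasting this output into the issue gives maintainers the hardware and driver context that the report currently lacks.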
