vllm - ✅(Solved) Fix [Bug]: Why does setting `--pipeline-parallel-size > 1` result in an OOM error, but `--tensor-parallel-size> 1` does not? [1 pull requests, 1 participants]

vllm-project/vllm#36861 (fetched 2026-04-08 00:34:09)

Error Message

root@xuanwu-text-safety-qwen3-5-1358612-cfrh7:/data# python3 -m vllm.entrypoints.openai.api_server --served-model-name Qwen3.5-35B-A3B --model /athena/Qwen3.5-35B-A3B --gpu-memory-utilization 0.9 --tensor-parallel-size 1 --pipeline-parallel-size 4 --max-model-len 20000 --max-num-batched-tokens 512 --max-num-seqs 8 --distributed-executor-backend ray --enable-log-requests --enable-log-outputs --enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser qwen3
(APIServer pid=38276) INFO 03-11 11:57:52 [utils.py:302] 
(APIServer pid=38276) INFO 03-11 11:57:52 [utils.py:302]        █     █     █▄   ▄█
(APIServer pid=38276) INFO 03-11 11:57:52 [utils.py:302]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.17.0
(APIServer pid=38276) INFO 03-11 11:57:52 [utils.py:302]   █▄█▀ █     █     █     █  model   /athena/Qwen3.5-35B-A3B
(APIServer pid=38276) INFO 03-11 11:57:52 [utils.py:302]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=38276) INFO 03-11 11:57:52 [utils.py:302] 
(APIServer pid=38276) INFO 03-11 11:57:52 [utils.py:238] non-default args: {'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'enable_log_outputs': True, 'model': '/athena/Qwen3.5-35B-A3B', 'max_model_len': 20000, 'served_model_name': ['Qwen3.5-35B-A3B'], 'reasoning_parser': 'qwen3', 'distributed_executor_backend': 'ray', 'pipeline_parallel_size': 4, 'max_num_batched_tokens': 512, 'max_num_seqs': 8, 'enable_log_requests': True}
(APIServer pid=38276) INFO 03-11 11:57:52 [model.py:531] Resolved architecture: Qwen3_5MoeForConditionalGeneration
(APIServer pid=38276) INFO 03-11 11:57:53 [model.py:1554] Using max model len 20000
(APIServer pid=38276) INFO 03-11 11:57:53 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=512.
(APIServer pid=38276) INFO 03-11 11:57:54 [config.py:544] Setting attention block size to 1056 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=38276) INFO 03-11 11:57:54 [config.py:575] Padding mamba page size by 0.76% to ensure that mamba page size and attention page size are exactly equal. (APIServer pid=38276) WARNING 03-11 11:57:54 [vllm.py:736] Async scheduling will be disabled because it is not supported with the ray distributed executor backend (only mp, uni, and external_launcher are supported). (APIServer pid=38276) INFO 03-11 11:57:54 [vllm.py:747] Asynchronous scheduling is disabled. (EngineCore_DP0 pid=38456) INFO 03-11 11:58:06 [core.py:101] Initializing a V1 LLM engine (v0.17.0) with config: model='/athena/Qwen3.5-35B-A3B', speculative_config=None, tokenizer='/athena/Qwen3.5-35B-A3B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=20000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=4, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen3.5-35B-A3B, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 
'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [512], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 16, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []} (EngineCore_DP0 pid=38456) WARNING 03-11 11:58:06 [ray_utils.py:352] Tensor parallel size (4) exceeds available GPUs (1). This may result in Ray placement group allocation failures. Consider reducing tensor_parallel_size to 1 or less, or ensure your Ray cluster has 4 GPUs available. (EngineCore_DP0 pid=38456) 2026-03-11 11:58:06,185 INFO worker.py:1669 -- Using address 10.214.65.182:6397 set in the environment variable RAY_ADDRESS (EngineCore_DP0 pid=38456) 2026-03-11 11:58:06,187 INFO worker.py:1810 -- Connecting to existing Ray cluster at address: 10.214.65.182:6397... 
(EngineCore_DP0 pid=38456) 2026-03-11 11:58:06,199 INFO worker.py:2004 -- Connected to Ray cluster. View the dashboard at http://10.214.65.182:8265 (EngineCore_DP0 pid=38456) /usr/local/lib/python3.12/dist-packages/ray/private/worker.py:2052: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0 (EngineCore_DP0 pid=38456) warnings.warn( (EngineCore_DP0 pid=38456) INFO 03-11 11:58:06 [ray_utils.py:417] No current placement group found. Creating a new placement group. (EngineCore_DP0 pid=38456) INFO 03-11 11:58:11 [ray_env.py:100] Env var prefixes to copy: ['HF', 'HUGGING_FACE_', 'LMCACHE_', 'NCCL_', 'UCX_', 'VLLM_'] (EngineCore_DP0 pid=38456) INFO 03-11 11:58:11 [ray_env.py:101] Copying the following environment variables to workers: ['LD_LIBRARY_PATH', 'NCCL_VERSION', 'VLLM_WORKER_MULTIPROC_METHOD'] (EngineCore_DP0 pid=38456) INFO 03-11 11:58:11 [ray_env.py:111] To exclude env vars from copying, add them to /root/.config/vllm/ray_non_carry_over_env_vars.json (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) WARNING 03-11 11:58:11 [system_utils.py:38] Overwriting environment variable LD_LIBRARY_PATH from '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/targets/x86_64-linux/lib/stubs:/usr/local/jdk/jre/lib/amd64/server' to '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/targets/x86_64-linux/lib/stubs:/usr/local/jdk/jre/lib/amd64/server' (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) WARNING 03-11 11:58:12 [worker_base.py:301] Missing shared_worker_lock argument from executor. 
This argument is needed for mm_processor_cache_type='shm'. (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:12 [parallel_state.py:1393] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://10.214.65.182:56301 backend=nccl (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) <frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead. (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) <frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead. (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:13 [pynccl.py:111] vLLM is using nccl==2.27.5 (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:13 [parallel_state.py:1715] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:18 [base.py:106] Offloader set to NoopOffloader (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14816, ip=10.214.31.213) WARNING 03-11 11:58:11 [system_utils.py:38] Overwriting environment variable LD_LIBRARY_PATH from '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/targets/x86_64-linux/lib/stubs:/usr/local/jdk/jre/lib/amd64/server' to '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/targets/x86_64-linux/lib/stubs:/usr/local/jdk/jre/lib/amd64/server' [repeated 3x across cluster] (Ray deduplicates logs by default. 
Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.) (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) WARNING 03-11 11:58:12 [worker_base.py:301] Missing shared_worker_lock argument from executor. This argument is needed for mm_processor_cache_type='shm'. [repeated 3x across cluster] (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14816, ip=10.214.31.213) INFO 03-11 11:58:13 [parallel_state.py:1393] world_size=4 rank=2 local_rank=0 distributed_init_method=tcp://10.214.65.182:56301 backend=nccl [repeated 3x across cluster] (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14816, ip=10.214.31.213) INFO 03-11 11:58:13 [parallel_state.py:1715] rank 2 in world size 4 is assigned as DP rank 0, PP rank 2, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A [repeated 3x across cluster] (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:19 [gpu_model_runner.py:4255] Starting to load model /athena/Qwen3.5-35B-A3B... (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:19 [cuda.py:453] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:19 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention. (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:19 [unquantized.py:186] Using TRITON backend for Unquantized MoE (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:20 [cuda.py:405] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']. 
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:20 [flash_attn.py:587] Using FlashAttention version 2 Loading safetensors checkpoint shards: 0% Completed | 0/14 [00:00<?, ?it/s] (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) <frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead. [repeated 3x across cluster] (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) <frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead. [repeated 3x across cluster] Loading safetensors checkpoint shards: 14% Completed | 2/14 [00:00<00:02, 5.21it/s] Loading safetensors checkpoint shards: 21% Completed | 3/14 [00:00<00:03, 3.50it/s] Loading safetensors checkpoint shards: 29% Completed | 4/14 [00:01<00:02, 3.79it/s] Loading safetensors checkpoint shards: 43% Completed | 6/14 [00:01<00:01, 5.54it/s] Loading safetensors checkpoint shards: 50% Completed | 7/14 [00:01<00:01, 4.25it/s] Loading safetensors checkpoint shards: 57% Completed | 8/14 [00:02<00:01, 3.57it/s] Loading safetensors checkpoint shards: 64% Completed | 9/14 [00:02<00:01, 3.67it/s] Loading safetensors checkpoint shards: 71% Completed | 10/14 [00:02<00:01, 3.25it/s] Loading safetensors checkpoint shards: 86% Completed | 12/14 [00:02<00:00, 4.17it/s] (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:23 [default_loader.py:293] Loading weights took 3.31 seconds Loading safetensors checkpoint shards: 93% Completed | 13/14 [00:03<00:00, 3.64it/s] Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:03<00:00, 4.05it/s] Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:03<00:00, 3.97it/s] 
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:23 [gpu_model_runner.py:4338] Model loading took 17.58 GiB memory and 3.468356 seconds (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:19 [base.py:106] Offloader set to NoopOffloader [repeated 3x across cluster] (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14816, ip=10.214.31.213) INFO 03-11 11:58:24 [gpu_model_runner.py:5254] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size. (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:38 [backends.py:916] Using cache directory: /root/.cache/vllm/torch_compile_cache/b0831fa56d/rank_3_0/backbone for vLLM's torch.compile (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:38 [backends.py:976] Dynamo bytecode transform time: 2.30 s (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:19 [cuda.py:453] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention [repeated 3x across cluster] (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:19 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention. [repeated 3x across cluster] (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:20 [unquantized.py:186] Using TRITON backend for Unquantized MoE [repeated 3x across cluster] (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:20 [cuda.py:405] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']. 
[repeated 3x across cluster] (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:20 [flash_attn.py:587] Using FlashAttention version 2 [repeated 3x across cluster] (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:23 [default_loader.py:293] Loading weights took 3.60 seconds [repeated 3x across cluster] (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:24 [gpu_model_runner.py:4338] Model loading took 17.59 GiB memory and 3.702975 seconds [repeated 3x across cluster] (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:24 [gpu_model_runner.py:5254] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size. [repeated 3x across cluster] (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:39 [backends.py:350] Cache the graph of compile range (1, 512) for later use (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) WARNING 03-11 11:58:41 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! 
Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=512,device_name=NVIDIA_A10.json (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14816, ip=10.214.31.213) INFO 03-11 11:58:38 [backends.py:916] Using cache directory: /root/.cache/vllm/torch_compile_cache/108f95b37d/rank_2_0/backbone for vLLM's torch.compile [repeated 3x across cluster] (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14816, ip=10.214.31.213) INFO 03-11 11:58:38 [backends.py:976] Dynamo bytecode transform time: 2.20 s [repeated 3x across cluster] (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:45 [backends.py:366] Compiling a graph for compile range (1, 512) takes 7.18 s (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:45 [monitor.py:35] torch.compile takes 9.55 s in total (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:45 [decorators.py:580] saving AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/bad3c8e6c906073b11190b5c18d3d28bcd4afee92113dc44df568d7e6e8e6d0a/rank_1_0/model (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14816, ip=10.214.31.213) INFO 03-11 11:58:40 [backends.py:350] Cache the graph of compile range (1, 512) for later use [repeated 3x across cluster] (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:46 [decorators.py:588] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/bad3c8e6c906073b11190b5c18d3d28bcd4afee92113dc44df568d7e6e8e6d0a/rank_1_0/model (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:46 [gpu_worker.py:424] Available KV cache memory: 0.34 GiB (EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) WARNING 03-11 11:58:43 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! 
Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=512,device_name=NVIDIA_A10.json [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100] EngineCore failed to start.
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100] Traceback (most recent call last):
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1090, in run_engine_core
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 834, in __init__
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]     super().__init__(
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 120, in __init__
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 263, in _initialize_kv_caches
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]     kv_cache_configs = get_kv_cache_configs(
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]                        ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1572, in get_kv_cache_configs
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]     _check_enough_kv_cache_memory(
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 623, in _check_enough_kv_cache_memory
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]     raise ValueError(
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100] ValueError: No available memory for the cache blocks. Try increasing gpu_memory_utilization when initializing the engine. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more details.
(EngineCore_DP0 pid=38456) Process EngineCore_DP0:
(EngineCore_DP0 pid=38456) Traceback (most recent call last):
(EngineCore_DP0 pid=38456)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=38456)     self.run()
(EngineCore_DP0 pid=38456)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=38456)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1104, in run_engine_core
(EngineCore_DP0 pid=38456)     raise e
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1090, in run_engine_core
(EngineCore_DP0 pid=38456)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=38456)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=38456)     return func(*args, **kwargs)
(EngineCore_DP0 pid=38456)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 834, in __init__
(EngineCore_DP0 pid=38456)     super().__init__(
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 120, in __init__
(EngineCore_DP0 pid=38456)     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=38456)                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=38456)     return func(*args, **kwargs)
(EngineCore_DP0 pid=38456)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 263, in _initialize_kv_caches
(EngineCore_DP0 pid=38456)     kv_cache_configs = get_kv_cache_configs(
(EngineCore_DP0 pid=38456)                        ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1572, in get_kv_cache_configs
(EngineCore_DP0 pid=38456)     _check_enough_kv_cache_memory(
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 623, in _check_enough_kv_cache_memory
(EngineCore_DP0 pid=38456)     raise ValueError(
(EngineCore_DP0 pid=38456) ValueError: No available memory for the cache blocks. Try increasing gpu_memory_utilization when initializing the engine. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more details.
(EngineCore_DP0 pid=38456) INFO 03-11 11:58:53 [ray_executor.py:119] Shutting down Ray distributed executor. If you see error log from logging.cc regarding SIGTERM received, please ignore because this is the expected termination process in Ray.
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:47 [backends.py:366] Compiling a graph for compile range (1, 512) takes 8.40 s [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:47 [monitor.py:35] torch.compile takes 10.98 s in total [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:47 [decorators.py:580] saving AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/c35a1b05c386859b85bba45ca51d80e0afe8c7ca11247054a14a82723fa78d52/rank_0_0/model [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:47 [decorators.py:588] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/c35a1b05c386859b85bba45ca51d80e0afe8c7ca11247054a14a82723fa78d52/rank_0_0/model [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:53 [gpu_worker.py:424] Available KV cache memory: -0.6 GiB [repeated 3x across cluster]
(APIServer pid=38276) Traceback (most recent call last):
(APIServer pid=38276)   File "<frozen runpy>", line 198, in _run_module_as_main
(APIServer pid=38276)   File "<frozen runpy>", line 88, in _run_code
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 545, in <module>
(APIServer pid=38276)     uvloop.run(run_server(args))
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=38276)     return __asyncio.run(
(APIServer pid=38276)            ^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
(APIServer pid=38276)     return runner.run(main)
(APIServer pid=38276)            ^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=38276)     return self._loop.run_until_complete(task)
(APIServer pid=38276)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=38276)     return await main
(APIServer pid=38276)            ^^^^^^^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 471, in run_server
(APIServer pid=38276)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 490, in run_server_worker
(APIServer pid=38276)     async with build_async_engine_client(
(APIServer pid=38276)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=38276)     return await anext(self.gen)
(APIServer pid=38276)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 96, in build_async_engine_client
(APIServer pid=38276)     async with build_async_engine_client_from_engine_args(
(APIServer pid=38276)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=38276)     return await anext(self.gen)
(APIServer pid=38276)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 137, in build_async_engine_client_from_engine_args
(APIServer pid=38276)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=38276)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=38276)     return cls(
(APIServer pid=38276)            ^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=38276)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=38276)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=38276)     return func(*args, **kwargs)
(APIServer pid=38276)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 127, in make_async_mp_client
(APIServer pid=38276)     return AsyncMPClient(*client_args)
(APIServer pid=38276)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=38276)     return func(*args, **kwargs)
(APIServer pid=38276)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 911, in __init__
(APIServer pid=38276)     super().__init__(
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 569, in __init__
(APIServer pid=38276)     with launch_core_engines(
(APIServer pid=38276)          ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=38276)     next(self.gen)
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 951, in launch_core_engines
(APIServer pid=38276)     wait_for_engine_startup(
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1010, in wait_for_engine_startup
(APIServer pid=38276)     raise RuntimeError(
(APIServer pid=38276) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
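The figures scattered through this log can be combined into a rough per-GPU memory budget. The sketch below is only an illustration: it pulls the "Model loading took" and "Available KV cache memory" values out of three log lines copied from above, then back-computes how much memory must have gone to everything other than weights and KV cache. It assumes the KV cache budget is roughly `total memory × gpu_memory_utilization − weights − other overhead`, and it assumes about 22 GiB of usable memory per A10, a figure the log does not state. The helper names (`extract`, `implied_overhead_gib`) are hypothetical and not part of vLLM.

```python
import re

# Excerpts copied from the failing pipeline-parallel log above.
LOG_LINES = [
    "(RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:23 "
    "[gpu_model_runner.py:4338] Model loading took 17.58 GiB memory and 3.468356 seconds",
    "(RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:46 "
    "[gpu_worker.py:424] Available KV cache memory: 0.34 GiB",
    "(RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:53 "
    "[gpu_worker.py:424] Available KV cache memory: -0.6 GiB",
]

WEIGHTS_RE = re.compile(r"Model loading took ([\d.]+) GiB")
KV_RE = re.compile(r"Available KV cache memory: (-?[\d.]+) GiB")


def extract(pattern, lines):
    """Return every float captured by `pattern` across `lines`, in order."""
    return [float(m.group(1)) for line in lines if (m := pattern.search(line))]


def implied_overhead_gib(total_gib, utilization, weights_gib, kv_gib):
    """Memory left unexplained by weights and KV cache under the assumed budget model."""
    budget = total_gib * utilization
    return budget - weights_gib - kv_gib


ASSUMED_TOTAL_GIB = 22.0       # ASSUMPTION: usable memory per A10; not reported in the log
GPU_MEMORY_UTILIZATION = 0.9   # from --gpu-memory-utilization 0.9 in the command

weights = extract(WEIGHTS_RE, LOG_LINES)
kv_values = extract(KV_RE, LOG_LINES)

for kv in kv_values:
    # The 17.58 GiB figure comes from one worker; the others reported about the same.
    overhead = implied_overhead_gib(
        ASSUMED_TOTAL_GIB, GPU_MEMORY_UTILIZATION, weights[0], kv
    )
    print(f"KV cache {kv:+.2f} GiB -> implied non-weight overhead {overhead:.2f} GiB")
```

Under these assumptions the weights alone consume most of the 0.9 budget on each pipeline stage, so a few GiB of activation, encoder-cache, or compilation overhead is enough to push the KV cache budget to zero or below, which is consistent with the `ValueError` in the traceback.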

Root Cause

I have four nodes, each with one A10 GPU, joined in a Ray cluster. If I switch to `--tensor-parallel-size 4 --pipeline-parallel-size 1`, it works fine. The command that works:

python3 -m vllm.entrypoints.openai.api_server --served-model-name Qwen3.5-35B-A3B --model /athena/Qwen3.5-35B-A3B --gpu-memory-utilization 0.9 --tensor-parallel-size 4 --pipeline-parallel-size 1 --max-model-len 160000 --max-num-batched-tokens 4096 --max-num-seqs 32 --distributed-executor-backend ray --enable-log-requests --enable-log-outputs --enable-auto-tool-choice --tool-call-parser qwen3_coder --enable-prefix-caching --reasoning-parser qwen3

output log

root@xuanwu-text-safety-qwen3-5-1358612-cfrh7:/data# python3 -m vllm.entrypoints.openai.api_server --served-model-name Qwen3.5-35B-A3B --model /athena/Qwen3.5-35B-A3B --gpu-memory-utilization 0.9 --tensor-parallel-size 4 --pipeline-parallel-size 1 --max-model-len 160000 --max-num-batched-tokens 4096 --max-num-seqs 32 --distributed-executor-backend ray --enable-log-requests --enable-log-outputs --enable-auto-tool-choice --tool-call-parser qwen3_coder --enable-prefix-caching --reasoning-parser qwen3
(APIServer pid=45912) INFO 03-11 12:21:59 [utils.py:302] 
(APIServer pid=45912) INFO 03-11 12:21:59 [utils.py:302]        █     █     █▄   ▄█
(APIServer pid=45912) INFO 03-11 12:21:59 [utils.py:302]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.17.0
(APIServer pid=45912) INFO 03-11 12:21:59 [utils.py:302]   █▄█▀ █     █     █     █  model   /athena/Qwen3.5-35B-A3B
(APIServer pid=45912) INFO 03-11 12:21:59 [utils.py:302]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=45912) INFO 03-11 12:21:59 [utils.py:302] 
(APIServer pid=45912) INFO 03-11 12:21:59 [utils.py:238] non-default args: {'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'enable_log_outputs': True, 'model': '/athena/Qwen3.5-35B-A3B', 'max_model_len': 160000, 'served_model_name': ['Qwen3.5-35B-A3B'], 'reasoning_parser': 'qwen3', 'distributed_executor_backend': 'ray', 'tensor_parallel_size': 4, 'enable_prefix_caching': True, 'max_num_batched_tokens': 4096, 'max_num_seqs': 32, 'enable_log_requests': True}
(APIServer pid=45912) INFO 03-11 12:21:59 [model.py:531] Resolved architecture: Qwen3_5MoeForConditionalGeneration
(APIServer pid=45912) INFO 03-11 12:21:59 [model.py:1554] Using max model len 160000
(APIServer pid=45912) INFO 03-11 12:21:59 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=4096.
(APIServer pid=45912) WARNING 03-11 12:21:59 [config.py:381] Mamba cache mode is set to 'align' for Qwen3_5MoeForConditionalGeneration by default when prefix caching is enabled
(APIServer pid=45912) INFO 03-11 12:21:59 [config.py:401] Warning: Prefix caching in Mamba cache 'align' mode is currently enabled. Its support for Mamba layers is experimental. Please report any issues you may observe.
(APIServer pid=45912) INFO 03-11 12:22:00 [config.py:544] Setting attention block size to 528 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=45912) INFO 03-11 12:22:00 [config.py:575] Padding mamba page size by 0.76% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=45912) WARNING 03-11 12:22:00 [vllm.py:736] Async scheduling will be disabled because it is not supported with the `ray` distributed executor backend (only `mp`, `uni`, and `external_launcher` are supported).
(APIServer pid=45912) INFO 03-11 12:22:00 [vllm.py:747] Asynchronous scheduling is disabled.
(EngineCore_DP0 pid=46071) INFO 03-11 12:22:12 [core.py:101] Initializing a V1 LLM engine (v0.17.0) with config: model='/athena/Qwen3.5-35B-A3B', speculative_config=None, tokenizer='/athena/Qwen3.5-35B-A3B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=160000, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen3.5-35B-A3B, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 
'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [4096], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 64, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=46071) WARNING 03-11 12:22:12 [ray_utils.py:352] Tensor parallel size (4) exceeds available GPUs (1). This may result in Ray placement group allocation failures. Consider reducing tensor_parallel_size to 1 or less, or ensure your Ray cluster has 4 GPUs available.
(EngineCore_DP0 pid=46071) 2026-03-11 12:22:12,089      INFO worker.py:1669 -- Using address 10.214.65.182:6397 set in the environment variable RAY_ADDRESS
(EngineCore_DP0 pid=46071) 2026-03-11 12:22:12,091      INFO worker.py:1810 -- Connecting to existing Ray cluster at address: 10.214.65.182:6397...
(EngineCore_DP0 pid=46071) 2026-03-11 12:22:12,102      INFO worker.py:2004 -- Connected to Ray cluster. View the dashboard at http://10.214.65.182:8265 
(EngineCore_DP0 pid=46071) /usr/local/lib/python3.12/dist-packages/ray/_private/worker.py:2052: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
(EngineCore_DP0 pid=46071)   warnings.warn(
(EngineCore_DP0 pid=46071) INFO 03-11 12:22:12 [ray_utils.py:417] No current placement group found. Creating a new placement group.
(EngineCore_DP0 pid=46071) WARNING 03-11 12:22:12 [ray_utils.py:228] tensor_parallel_size=4 is bigger than a reserved number of GPUs (1 GPUs) in a node e875191d64b14e913e979dca23cbd93d0cf1bd73b8d99bc5f187b375. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 4 GPUs available at each node.
(EngineCore_DP0 pid=46071) WARNING 03-11 12:22:12 [ray_utils.py:228] tensor_parallel_size=4 is bigger than a reserved number of GPUs (1 GPUs) in a node fa31ba4f52f86b36f53b503e21ba5a07f032b3cb3307a54e93a6049e. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 4 GPUs available at each node.
(EngineCore_DP0 pid=46071) WARNING 03-11 12:22:12 [ray_utils.py:228] tensor_parallel_size=4 is bigger than a reserved number of GPUs (1 GPUs) in a node 2e9e8346878efe22127046a23c414dce509677951707201671611d5f. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 4 GPUs available at each node.
(EngineCore_DP0 pid=46071) WARNING 03-11 12:22:12 [ray_utils.py:228] tensor_parallel_size=4 is bigger than a reserved number of GPUs (1 GPUs) in a node 00be922b013b02b4c028d57e54de872af38430b268b1c538f8580458. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 4 GPUs available at each node.
(EngineCore_DP0 pid=46071) INFO 03-11 12:22:17 [ray_env.py:100] Env var prefixes to copy: ['HF_', 'HUGGING_FACE_', 'LMCACHE_', 'NCCL_', 'UCX_', 'VLLM_']
(EngineCore_DP0 pid=46071) INFO 03-11 12:22:17 [ray_env.py:101] Copying the following environment variables to workers: ['LD_LIBRARY_PATH', 'NCCL_VERSION', 'VLLM_WORKER_MULTIPROC_METHOD']
(EngineCore_DP0 pid=46071) INFO 03-11 12:22:17 [ray_env.py:111] To exclude env vars from copying, add them to /root/.config/vllm/ray_non_carry_over_env_vars.json
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15740, ip=10.214.31.213) WARNING 03-11 12:22:17 [system_utils.py:38] Overwriting environment variable LD_LIBRARY_PATH from '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/targets/x86_64-linux/lib/stubs:/usr/local/jdk/jre/lib/amd64/server' to '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/targets/x86_64-linux/lib/stubs:/usr/local/jdk/jre/lib/amd64/server'
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) WARNING 03-11 12:22:18 [worker_base.py:301] Missing `shared_worker_lock` argument from executor. This argument is needed for mm_processor_cache_type='shm'.
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=46266) INFO 03-11 12:22:18 [parallel_state.py:1393] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://10.214.65.182:43113 backend=nccl
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) <frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) <frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=46266) INFO 03-11 12:22:19 [pynccl.py:111] vLLM is using nccl==2.27.5
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) WARNING 03-11 12:22:19 [symm_mem.py:67] SymmMemCommunicator: Device capability 8.6 not supported, communicator is not available.
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) WARNING 03-11 12:22:19 [custom_all_reduce.py:92] Custom allreduce is disabled because this process group spans across nodes.
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15740, ip=10.214.31.213) INFO 03-11 12:22:19 [parallel_state.py:1715] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 2, EP rank 2, EPLB rank N/A
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:22:25 [base.py:106] Offloader set to NoopOffloader
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=46266) WARNING 03-11 12:22:17 [system_utils.py:38] Overwriting environment variable LD_LIBRARY_PATH from '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/targets/x86_64-linux/lib/stubs:/usr/local/jdk/jre/lib/amd64/server' to '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/targets/x86_64-linux/lib/stubs:/usr/local/jdk/jre/lib/amd64/server' [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15740, ip=10.214.31.213) WARNING 03-11 12:22:18 [worker_base.py:301] Missing `shared_worker_lock` argument from executor. This argument is needed for mm_processor_cache_type='shm'. [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) INFO 03-11 12:22:18 [parallel_state.py:1393] world_size=4 rank=3 local_rank=0 distributed_init_method=tcp://10.214.65.182:43113 backend=nccl [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) WARNING 03-11 12:22:19 [symm_mem.py:67] SymmMemCommunicator: Device capability 8.6 not supported, communicator is not available. [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) WARNING 03-11 12:22:19 [custom_all_reduce.py:92] Custom allreduce is disabled because this process group spans across nodes. [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) INFO 03-11 12:22:19 [parallel_state.py:1715] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 3, EP rank 3, EPLB rank N/A [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=46266) INFO 03-11 12:22:25 [gpu_model_runner.py:4255] Starting to load model /athena/Qwen3.5-35B-A3B...
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:22:26 [cuda.py:453] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:22:26 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) INFO 03-11 12:22:26 [unquantized.py:186] Using TRITON backend for Unquantized MoE
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15740, ip=10.214.31.213) INFO 03-11 12:22:26 [cuda.py:405] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15740, ip=10.214.31.213) INFO 03-11 12:22:26 [flash_attn.py:587] Using FlashAttention version 2
Loading safetensors checkpoint shards:   0% Completed | 0/14 [00:00<?, ?it/s]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15740, ip=10.214.31.213) <frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead. [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15740, ip=10.214.31.213) <frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead. [repeated 3x across cluster]
Loading safetensors checkpoint shards:   7% Completed | 1/14 [00:00<00:04,  3.04it/s]
Loading safetensors checkpoint shards:  14% Completed | 2/14 [00:00<00:04,  2.92it/s]
Loading safetensors checkpoint shards:  21% Completed | 3/14 [00:01<00:03,  2.92it/s]
Loading safetensors checkpoint shards:  29% Completed | 4/14 [00:01<00:03,  2.92it/s]
Loading safetensors checkpoint shards:  36% Completed | 5/14 [00:01<00:03,  2.93it/s]
Loading safetensors checkpoint shards:  43% Completed | 6/14 [00:02<00:02,  2.94it/s]
Loading safetensors checkpoint shards:  50% Completed | 7/14 [00:02<00:02,  2.96it/s]
Loading safetensors checkpoint shards:  57% Completed | 8/14 [00:02<00:02,  2.97it/s]
Loading safetensors checkpoint shards:  64% Completed | 9/14 [00:03<00:01,  2.73it/s]
Loading safetensors checkpoint shards:  71% Completed | 10/14 [00:03<00:01,  2.17it/s]
Loading safetensors checkpoint shards:  79% Completed | 11/14 [00:04<00:01,  1.92it/s]
Loading safetensors checkpoint shards:  86% Completed | 12/14 [00:05<00:01,  1.76it/s]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:22:31 [default_loader.py:293] Loading weights took 5.43 seconds
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) INFO 03-11 12:22:25 [base.py:106] Offloader set to NoopOffloader [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15740, ip=10.214.31.213) INFO 03-11 12:22:26 [cuda.py:453] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15740, ip=10.214.31.213) INFO 03-11 12:22:26 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention. [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=46266) INFO 03-11 12:22:26 [unquantized.py:186] Using TRITON backend for Unquantized MoE [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) INFO 03-11 12:22:26 [cuda.py:405] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']. [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) INFO 03-11 12:22:26 [flash_attn.py:587] Using FlashAttention version 2 [repeated 3x across cluster]
Loading safetensors checkpoint shards:  93% Completed | 13/14 [00:05<00:00,  1.84it/s]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:22:32 [gpu_model_runner.py:4338] Model loading took 16.52 GiB memory and 5.748502 seconds
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:05<00:00,  2.28it/s]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:05<00:00,  2.40it/s]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=46266) 
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:22:33 [gpu_model_runner.py:5254] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:23:11 [backends.py:916] Using cache directory: /root/.cache/vllm/torch_compile_cache/5cdc441eb9/rank_1_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:23:11 [backends.py:976] Dynamo bytecode transform time: 6.25 s
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) INFO 03-11 12:22:32 [default_loader.py:293] Loading weights took 6.15 seconds [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) INFO 03-11 12:22:32 [gpu_model_runner.py:4338] Model loading took 16.52 GiB memory and 6.458880 seconds [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15740, ip=10.214.31.213) INFO 03-11 12:22:33 [gpu_model_runner.py:5254] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size. [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:23:13 [backends.py:350] Cache the graph of compile range (1, 4096) for later use
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) WARNING 03-11 12:23:15 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=128,device_name=NVIDIA_A10.json
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:23:23 [backends.py:366] Compiling a graph for compile range (1, 4096) takes 10.56 s
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:23:23 [monitor.py:35] torch.compile takes 18.01 s in total
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:23:23 [decorators.py:580] saving AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/84065e09023fc5c1e8dd6ff467d3800407a3d4c8e95644be4e9021b4d135d8c5/rank_1_0/model
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=46266) INFO 03-11 12:23:12 [backends.py:916] Using cache directory: /root/.cache/vllm/torch_compile_cache/5cdc441eb9/rank_0_0/backbone for vLLM's torch.compile [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=46266) INFO 03-11 12:23:12 [backends.py:976] Dynamo bytecode transform time: 7.21 s [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=46266) INFO 03-11 12:23:14 [backends.py:350] Cache the graph of compile range (1, 4096) for later use [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) WARNING 03-11 12:23:15 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=128,device_name=NVIDIA_A10.json [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:23:24 [decorators.py:588] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/84065e09023fc5c1e8dd6ff467d3800407a3d4c8e95644be4e9021b4d135d8c5/rank_1_0/model
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:23:25 [gpu_worker.py:424] Available KV cache memory: 2.09 GiB
(EngineCore_DP0 pid=46071) INFO 03-11 12:23:26 [kv_cache_utils.py:1314] GPU KV cache size: 54,384 tokens
(EngineCore_DP0 pid=46071) INFO 03-11 12:23:26 [kv_cache_utils.py:1319] Maximum concurrency for 160,000 tokens per request: 1.34x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 11/11 [00:02<00:00,  3.91it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 7/7 [00:01<00:00,  4.00it/s]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:23:31 [gpu_model_runner.py:5360] Graph capturing finished in 5 secs, took 0.60 GiB
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) INFO 03-11 12:23:23 [backends.py:366] Compiling a graph for compile range (1, 4096) takes 9.67 s [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) INFO 03-11 12:23:23 [monitor.py:35] torch.compile takes 18.02 s in total [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) INFO 03-11 12:23:23 [decorators.py:580] saving AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/84065e09023fc5c1e8dd6ff467d3800407a3d4c8e95644be4e9021b4d135d8c5/rank_3_0/model [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=46266) INFO 03-11 12:23:25 [decorators.py:588] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/84065e09023fc5c1e8dd6ff467d3800407a3d4c8e95644be4e9021b4d135d8c5/rank_0_0/model [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=46266) INFO 03-11 12:23:26 [gpu_worker.py:424] Available KV cache memory: 2.09 GiB [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) INFO 03-11 12:23:31 [core.py:282] init engine (profile, create kv cache, warmup model) took 58.65 seconds
(EngineCore_DP0 pid=46071) INFO 03-11 12:23:38 [vllm.py:747] Asynchronous scheduling is disabled.
(APIServer pid=45912) INFO 03-11 12:23:38 [api_server.py:495] Supported tasks: ['generate']
(APIServer pid=45912) INFO 03-11 12:23:38 [logger.py:28] `--enable-log-requests` is set but the minimum log level is higher than DEBUG. Only limited information will be logged to minimize overhead. To view more details, set `VLLM_LOGGING_LEVEL=DEBUG`.
(APIServer pid=45912) INFO 03-11 12:23:38 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=45912) WARNING 03-11 12:23:38 [model.py:1355] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'top_k': 20, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=45912) INFO 03-11 12:23:38 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=45912) INFO 03-11 12:23:38 [serving.py:185] Warming up chat template processing...
(APIServer pid=45912) INFO 03-11 12:23:39 [hf.py:318] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=45912) INFO 03-11 12:23:39 [serving.py:210] Chat template warmup completed in 1213.4ms
(APIServer pid=45912) INFO 03-11 12:23:39 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=45912) INFO 03-11 12:23:39 [api_server.py:500] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:38] Available routes are:
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /docs, Methods: HEAD, GET
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /redoc, Methods: HEAD, GET
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /tokenize, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /detokenize, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /load, Methods: GET
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /version, Methods: GET
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /health, Methods: GET
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /metrics, Methods: GET
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /v1/models, Methods: GET
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /ping, Methods: GET
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /ping, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /invocations, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /v1/chat/completions, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /v1/responses, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /v1/completions, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /v1/completions/render, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /v1/messages, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /inference/v1/generate, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=45912) INFO:     Started server process [45912]
(APIServer pid=45912) INFO:     Waiting for application startup.
(APIServer pid=45912) INFO:     Application startup complete.
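
To compare this working TP=4 run against a failing run, it can help to pull the per-worker memory figures out of the log instead of scanning it by eye. The following is a minimal sketch, not part of vLLM: `memory_by_worker` and `METRICS` are hypothetical names, and the regular expressions assume the message wording seen in this v0.17.0 log ("Model loading took ... GiB", "Available KV cache memory: ... GiB", "Graph capturing finished in ... secs, took ... GiB"), which may differ in other versions.

```python
import re
from collections import defaultdict

# Matches the Ray worker prefix; the ip part is absent for the local worker.
WORKER = re.compile(r"\(RayWorkerWrapper pid=(\d+)(?:, ip=([\d.]+))?\)")

# Hypothetical metric names mapped to the log messages seen in this run.
METRICS = {
    "weights_gib": re.compile(r"Model loading took ([\d.]+) GiB"),
    "kv_cache_gib": re.compile(r"Available KV cache memory: ([\d.]+) GiB"),
    "cuda_graph_gib": re.compile(
        r"Graph capturing finished in \d+ secs, took ([\d.]+) GiB"
    ),
}


def memory_by_worker(log_text):
    """Return {worker: {metric: GiB}} for every worker line that reports memory."""
    found = defaultdict(dict)
    for line in log_text.splitlines():
        worker = WORKER.search(line)
        if worker is None:
            continue  # APIServer / EngineCore lines carry no per-worker figure
        pid, ip = worker.group(1), worker.group(2)
        key = f"pid={pid}@{ip}" if ip else f"pid={pid}"
        for name, pattern in METRICS.items():
            match = pattern.search(line)
            if match:
                found[key][name] = float(match.group(1))
    return dict(found)


if __name__ == "__main__":
    import sys

    # Usage: python parse_vllm_mem.py vllm_tp4.log
    with open(sys.argv[1], encoding="utf-8") as fh:
        for worker, figures in memory_by_worker(fh.read()).items():
            print(worker, figures)
```

For the log above, this would report 16.52 GiB of weights, 2.09 GiB of KV cache memory, and 0.60 GiB for CUDA graphs on worker pid=33834. Ray deduplicates identical lines by default (the "[repeated 3x across cluster]" suffix), so not every worker appears separately; the log itself points to `RAY_DEDUP_LOGS=0` for disabling that.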

However, when I switch to PP > 1, it fails with an OOM error:

python3 -m vllm.entrypoints.openai.api_server --served-model-name Qwen3.5-35B-A3B --model /athena/Qwen3.5-35B-A3B --gpu-memory-utilization 0.9 --tensor-parallel-size 1 --pipeline-parallel-size 4 --max-model-len 160000 --max-num-batched-tokens 4096 --max-num-seqs 32 --distributed-executor-backend ray --enable-log-requests --enable-log-outputs --enable-auto-tool-choice --tool-call-parser qwen3_coder --enable-prefix-caching --reasoning-parser qwen3
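
Since the comparison only makes sense if the two launches differ in nothing but the parallelism layout, the flags can be diffed mechanically. This is a rough sketch using only the standard library; `parse_flags` and `diff_flags` are hypothetical helpers, and the parser assumes every option is a long `--flag` followed by at most one value, which holds for the commands in this report.

```python
import shlex


def parse_flags(command):
    """Map each --flag to its value, or to True for a bare switch."""
    tokens = shlex.split(command)
    flags = {}
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token.startswith("--"):
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            # Simplification: a following token starting with "-" is treated
            # as the next option, so negative numeric values are not handled.
            if nxt is not None and not nxt.startswith("-"):
                flags[token] = nxt
                i += 2
                continue
            flags[token] = True
        i += 1
    return flags


def diff_flags(cmd_a, cmd_b):
    """Return {flag: (value_in_a, value_in_b)} for flags that differ."""
    a, b = parse_flags(cmd_a), parse_flags(cmd_b)
    return {
        flag: (a.get(flag), b.get(flag))
        for flag in sorted(a.keys() | b.keys())
        if a.get(flag) != b.get(flag)
    }


# The two commands as quoted in this report, split for readability.
PREFIX = (
    "python3 -m vllm.entrypoints.openai.api_server "
    "--served-model-name Qwen3.5-35B-A3B --model /athena/Qwen3.5-35B-A3B "
    "--gpu-memory-utilization 0.9 "
)
SUFFIX = (
    " --max-model-len 160000 --max-num-batched-tokens 4096 --max-num-seqs 32 "
    "--distributed-executor-backend ray --enable-log-requests "
    "--enable-log-outputs --enable-auto-tool-choice "
    "--tool-call-parser qwen3_coder --enable-prefix-caching "
    "--reasoning-parser qwen3"
)
TP_CMD = PREFIX + "--tensor-parallel-size 4 --pipeline-parallel-size 1" + SUFFIX
PP_CMD = PREFIX + "--tensor-parallel-size 1 --pipeline-parallel-size 4" + SUFFIX

print(diff_flags(TP_CMD, PP_CMD))
# {'--pipeline-parallel-size': ('1', '4'), '--tensor-parallel-size': ('4', '1')}
```

The same helper can be pointed at the command echoed at the top of the output log that follows, which uses --max-model-len 20000, --max-num-batched-tokens 512, and --max-num-seqs 8 and omits --enable-prefix-caching, so the logged PP run is not flag-for-flag identical to the command quoted here.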

output log

root@xuanwu-text-safety-qwen3-5-1358612-cfrh7:/data# python3 -m vllm.entrypoints.openai.api_server --served-model-name Qwen3.5-35B-A3B --model /athena/Qwen3.5-35B-A3B --gpu-memory-utilization 0.9 --tensor-parallel-size 1 --pipeline-parallel-size 4 --max-model-len 20000 --max-num-batched-tokens 512 --max-num-seqs 8 --distributed-executor-backend ray --enable-log-requests --enable-log-outputs --enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser qwen3
(APIServer pid=38276) INFO 03-11 11:57:52 [utils.py:302] 
(APIServer pid=38276) INFO 03-11 11:57:52 [utils.py:302]        █     █     █▄   ▄█
(APIServer pid=38276) INFO 03-11 11:57:52 [utils.py:302]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.17.0
(APIServer pid=38276) INFO 03-11 11:57:52 [utils.py:302]   █▄█▀ █     █     █     █  model   /athena/Qwen3.5-35B-A3B
(APIServer pid=38276) INFO 03-11 11:57:52 [utils.py:302]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=38276) INFO 03-11 11:57:52 [utils.py:302] 
(APIServer pid=38276) INFO 03-11 11:57:52 [utils.py:238] non-default args: {'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'enable_log_outputs': True, 'model': '/athena/Qwen3.5-35B-A3B', 'max_model_len': 20000, 'served_model_name': ['Qwen3.5-35B-A3B'], 'reasoning_parser': 'qwen3', 'distributed_executor_backend': 'ray', 'pipeline_parallel_size': 4, 'max_num_batched_tokens': 512, 'max_num_seqs': 8, 'enable_log_requests': True}
(APIServer pid=38276) INFO 03-11 11:57:52 [model.py:531] Resolved architecture: Qwen3_5MoeForConditionalGeneration
(APIServer pid=38276) INFO 03-11 11:57:53 [model.py:1554] Using max model len 20000
(APIServer pid=38276) INFO 03-11 11:57:53 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=512.
(APIServer pid=38276) INFO 03-11 11:57:54 [config.py:544] Setting attention block size to 1056 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=38276) INFO 03-11 11:57:54 [config.py:575] Padding mamba page size by 0.76% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=38276) WARNING 03-11 11:57:54 [vllm.py:736] Async scheduling will be disabled because it is not supported with the `ray` distributed executor backend (only `mp`, `uni`, and `external_launcher` are supported).
(APIServer pid=38276) INFO 03-11 11:57:54 [vllm.py:747] Asynchronous scheduling is disabled.
(EngineCore_DP0 pid=38456) INFO 03-11 11:58:06 [core.py:101] Initializing a V1 LLM engine (v0.17.0) with config: model='/athena/Qwen3.5-35B-A3B', speculative_config=None, tokenizer='/athena/Qwen3.5-35B-A3B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=20000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=4, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen3.5-35B-A3B, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 
'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [512], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 16, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=38456) WARNING 03-11 11:58:06 [ray_utils.py:352] Tensor parallel size (4) exceeds available GPUs (1). This may result in Ray placement group allocation failures. Consider reducing tensor_parallel_size to 1 or less, or ensure your Ray cluster has 4 GPUs available.
(EngineCore_DP0 pid=38456) 2026-03-11 11:58:06,185      INFO worker.py:1669 -- Using address 10.214.65.182:6397 set in the environment variable RAY_ADDRESS
(EngineCore_DP0 pid=38456) 2026-03-11 11:58:06,187      INFO worker.py:1810 -- Connecting to existing Ray cluster at address: 10.214.65.182:6397...
(EngineCore_DP0 pid=38456) 2026-03-11 11:58:06,199      INFO worker.py:2004 -- Connected to Ray cluster. View the dashboard at http://10.214.65.182:8265 
(EngineCore_DP0 pid=38456) /usr/local/lib/python3.12/dist-packages/ray/_private/worker.py:2052: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
(EngineCore_DP0 pid=38456)   warnings.warn(
(EngineCore_DP0 pid=38456) INFO 03-11 11:58:06 [ray_utils.py:417] No current placement group found. Creating a new placement group.
(EngineCore_DP0 pid=38456) INFO 03-11 11:58:11 [ray_env.py:100] Env var prefixes to copy: ['HF_', 'HUGGING_FACE_', 'LMCACHE_', 'NCCL_', 'UCX_', 'VLLM_']
(EngineCore_DP0 pid=38456) INFO 03-11 11:58:11 [ray_env.py:101] Copying the following environment variables to workers: ['LD_LIBRARY_PATH', 'NCCL_VERSION', 'VLLM_WORKER_MULTIPROC_METHOD']
(EngineCore_DP0 pid=38456) INFO 03-11 11:58:11 [ray_env.py:111] To exclude env vars from copying, add them to /root/.config/vllm/ray_non_carry_over_env_vars.json
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) WARNING 03-11 11:58:11 [system_utils.py:38] Overwriting environment variable LD_LIBRARY_PATH from '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/targets/x86_64-linux/lib/stubs:/usr/local/jdk/jre/lib/amd64/server' to '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/targets/x86_64-linux/lib/stubs:/usr/local/jdk/jre/lib/amd64/server'
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) WARNING 03-11 11:58:12 [worker_base.py:301] Missing `shared_worker_lock` argument from executor. This argument is needed for mm_processor_cache_type='shm'.
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:12 [parallel_state.py:1393] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://10.214.65.182:56301 backend=nccl
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) <frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) <frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:13 [pynccl.py:111] vLLM is using nccl==2.27.5
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:13 [parallel_state.py:1715] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:18 [base.py:106] Offloader set to NoopOffloader
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14816, ip=10.214.31.213) WARNING 03-11 11:58:11 [system_utils.py:38] Overwriting environment variable LD_LIBRARY_PATH from '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/targets/x86_64-linux/lib/stubs:/usr/local/jdk/jre/lib/amd64/server' to '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/targets/x86_64-linux/lib/stubs:/usr/local/jdk/jre/lib/amd64/server' [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) WARNING 03-11 11:58:12 [worker_base.py:301] Missing `shared_worker_lock` argument from executor. This argument is needed for mm_processor_cache_type='shm'. [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14816, ip=10.214.31.213) INFO 03-11 11:58:13 [parallel_state.py:1393] world_size=4 rank=2 local_rank=0 distributed_init_method=tcp://10.214.65.182:56301 backend=nccl [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14816, ip=10.214.31.213) INFO 03-11 11:58:13 [parallel_state.py:1715] rank 2 in world size 4 is assigned as DP rank 0, PP rank 2, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:19 [gpu_model_runner.py:4255] Starting to load model /athena/Qwen3.5-35B-A3B...
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:19 [cuda.py:453] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:19 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:19 [unquantized.py:186] Using TRITON backend for Unquantized MoE
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:20 [cuda.py:405] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:20 [flash_attn.py:587] Using FlashAttention version 2
Loading safetensors checkpoint shards:   0% Completed | 0/14 [00:00<?, ?it/s]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) <frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead. [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) <frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead. [repeated 3x across cluster]
Loading safetensors checkpoint shards:  14% Completed | 2/14 [00:00<00:02,  5.21it/s]
Loading safetensors checkpoint shards:  21% Completed | 3/14 [00:00<00:03,  3.50it/s]
Loading safetensors checkpoint shards:  29% Completed | 4/14 [00:01<00:02,  3.79it/s]
Loading safetensors checkpoint shards:  43% Completed | 6/14 [00:01<00:01,  5.54it/s]
Loading safetensors checkpoint shards:  50% Completed | 7/14 [00:01<00:01,  4.25it/s]
Loading safetensors checkpoint shards:  57% Completed | 8/14 [00:02<00:01,  3.57it/s]
Loading safetensors checkpoint shards:  64% Completed | 9/14 [00:02<00:01,  3.67it/s]
Loading safetensors checkpoint shards:  71% Completed | 10/14 [00:02<00:01,  3.25it/s]
Loading safetensors checkpoint shards:  86% Completed | 12/14 [00:02<00:00,  4.17it/s]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:23 [default_loader.py:293] Loading weights took 3.31 seconds
Loading safetensors checkpoint shards:  93% Completed | 13/14 [00:03<00:00,  3.64it/s]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:03<00:00,  4.05it/s]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:03<00:00,  3.97it/s]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) 
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:23 [gpu_model_runner.py:4338] Model loading took 17.58 GiB memory and 3.468356 seconds
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:19 [base.py:106] Offloader set to NoopOffloader [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14816, ip=10.214.31.213) INFO 03-11 11:58:24 [gpu_model_runner.py:5254] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:38 [backends.py:916] Using cache directory: /root/.cache/vllm/torch_compile_cache/b0831fa56d/rank_3_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:38 [backends.py:976] Dynamo bytecode transform time: 2.30 s
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:19 [cuda.py:453] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:19 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention. [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:20 [unquantized.py:186] Using TRITON backend for Unquantized MoE [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:20 [cuda.py:405] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']. [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:20 [flash_attn.py:587] Using FlashAttention version 2 [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:23 [default_loader.py:293] Loading weights took 3.60 seconds [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:24 [gpu_model_runner.py:4338] Model loading took 17.59 GiB memory and 3.702975 seconds [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:24 [gpu_model_runner.py:5254] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size. [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:39 [backends.py:350] Cache the graph of compile range (1, 512) for later use
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) WARNING 03-11 11:58:41 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=512,device_name=NVIDIA_A10.json
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14816, ip=10.214.31.213) INFO 03-11 11:58:38 [backends.py:916] Using cache directory: /root/.cache/vllm/torch_compile_cache/108f95b37d/rank_2_0/backbone for vLLM's torch.compile [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14816, ip=10.214.31.213) INFO 03-11 11:58:38 [backends.py:976] Dynamo bytecode transform time: 2.20 s [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:45 [backends.py:366] Compiling a graph for compile range (1, 512) takes 7.18 s
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:45 [monitor.py:35] torch.compile takes 9.55 s in total
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:45 [decorators.py:580] saving AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/bad3c8e6c906073b11190b5c18d3d28bcd4afee92113dc44df568d7e6e8e6d0a/rank_1_0/model
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14816, ip=10.214.31.213) INFO 03-11 11:58:40 [backends.py:350] Cache the graph of compile range (1, 512) for later use [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:46 [decorators.py:588] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/bad3c8e6c906073b11190b5c18d3d28bcd4afee92113dc44df568d7e6e8e6d0a/rank_1_0/model
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:46 [gpu_worker.py:424] Available KV cache memory: 0.34 GiB
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) WARNING 03-11 11:58:43 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=512,device_name=NVIDIA_A10.json [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100] EngineCore failed to start.
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100] Traceback (most recent call last):
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1090, in run_engine_core
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 834, in __init__
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]     super().__init__(
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 120, in __init__
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 263, in _initialize_kv_caches
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]     kv_cache_configs = get_kv_cache_configs(
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]                        ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1572, in get_kv_cache_configs
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]     _check_enough_kv_cache_memory(
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 623, in _check_enough_kv_cache_memory
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]     raise ValueError(
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100] ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more details.
(EngineCore_DP0 pid=38456) Process EngineCore_DP0:
(EngineCore_DP0 pid=38456) Traceback (most recent call last):
(EngineCore_DP0 pid=38456)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=38456)     self.run()
(EngineCore_DP0 pid=38456)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=38456)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1104, in run_engine_core
(EngineCore_DP0 pid=38456)     raise e
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1090, in run_engine_core
(EngineCore_DP0 pid=38456)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=38456)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=38456)     return func(*args, **kwargs)
(EngineCore_DP0 pid=38456)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 834, in __init__
(EngineCore_DP0 pid=38456)     super().__init__(
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 120, in __init__
(EngineCore_DP0 pid=38456)     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=38456)                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=38456)     return func(*args, **kwargs)
(EngineCore_DP0 pid=38456)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 263, in _initialize_kv_caches
(EngineCore_DP0 pid=38456)     kv_cache_configs = get_kv_cache_configs(
(EngineCore_DP0 pid=38456)                        ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1572, in get_kv_cache_configs
(EngineCore_DP0 pid=38456)     _check_enough_kv_cache_memory(
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 623, in _check_enough_kv_cache_memory
(EngineCore_DP0 pid=38456)     raise ValueError(
(EngineCore_DP0 pid=38456) ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more details.
(EngineCore_DP0 pid=38456) INFO 03-11 11:58:53 [ray_executor.py:119] Shutting down Ray distributed executor. If you see error log from logging.cc regarding SIGTERM received, please ignore because this is the expected termination process in Ray.
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:47 [backends.py:366] Compiling a graph for compile range (1, 512) takes 8.40 s [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:47 [monitor.py:35] torch.compile takes 10.98 s in total [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:47 [decorators.py:580] saving AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/c35a1b05c386859b85bba45ca51d80e0afe8c7ca11247054a14a82723fa78d52/rank_0_0/model [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:47 [decorators.py:588] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/c35a1b05c386859b85bba45ca51d80e0afe8c7ca11247054a14a82723fa78d52/rank_0_0/model [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:53 [gpu_worker.py:424] Available KV cache memory: -0.6 GiB [repeated 3x across cluster]
(APIServer pid=38276) Traceback (most recent call last):
(APIServer pid=38276)   File "<frozen runpy>", line 198, in _run_module_as_main
(APIServer pid=38276)   File "<frozen runpy>", line 88, in _run_code
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 545, in <module>
(APIServer pid=38276)     uvloop.run(run_server(args))
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=38276)     return __asyncio.run(
(APIServer pid=38276)            ^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
(APIServer pid=38276)     return runner.run(main)
(APIServer pid=38276)            ^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=38276)     return self._loop.run_until_complete(task)
(APIServer pid=38276)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=38276)     return await main
(APIServer pid=38276)            ^^^^^^^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 471, in run_server
(APIServer pid=38276)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 490, in run_server_worker
(APIServer pid=38276)     async with build_async_engine_client(
(APIServer pid=38276)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=38276)     return await anext(self.gen)
(APIServer pid=38276)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 96, in build_async_engine_client
(APIServer pid=38276)     async with build_async_engine_client_from_engine_args(
(APIServer pid=38276)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=38276)     return await anext(self.gen)
(APIServer pid=38276)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 137, in build_async_engine_client_from_engine_args
(APIServer pid=38276)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=38276)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=38276)     return cls(
(APIServer pid=38276)            ^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=38276)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=38276)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=38276)     return func(*args, **kwargs)
(APIServer pid=38276)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 127, in make_async_mp_client
(APIServer pid=38276)     return AsyncMPClient(*client_args)
(APIServer pid=38276)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=38276)     return func(*args, **kwargs)
(APIServer pid=38276)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 911, in __init__
(APIServer pid=38276)     super().__init__(
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 569, in __init__
(APIServer pid=38276)     with launch_core_engines(
(APIServer pid=38276)          ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=38276)     next(self.gen)
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 951, in launch_core_engines
(APIServer pid=38276)     wait_for_engine_startup(
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1010, in wait_for_engine_startup
(APIServer pid=38276)     raise RuntimeError(
(APIServer pid=38276) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
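
The decisive lines in the log above are the per-worker "Available KV cache memory" reports: 0.34 GiB on one worker and -0.6 GiB on another. A minimal sketch for pulling those values out of a saved log is shown here. It assumes the message format seen in this log; the names `KV_LINE` and `kv_cache_reports` are illustrative and not part of vLLM. Ray deduplicates repeated lines by default, so not every worker's value necessarily appears.

```python
import re

# Matches the worker pid and the reported KV cache headroom on one log line.
# The format is assumed from the log in this issue, not from a vLLM API.
KV_LINE = re.compile(
    r"\(RayWorkerWrapper pid=(?P<pid>\d+)[^)]*\).*"
    r"Available KV cache memory: (?P<gib>-?\d+(?:\.\d+)?) GiB"
)


def kv_cache_reports(log_text: str) -> list[tuple[int, float]]:
    """Return (worker pid, available KV cache GiB) for each matching line."""
    return [(int(m["pid"]), float(m["gib"])) for m in KV_LINE.finditer(log_text)]


# Two lines copied from the log above.
SAMPLE = (
    "(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) "
    "INFO 03-11 11:58:46 [gpu_worker.py:424] Available KV cache memory: 0.34 GiB\n"
    "(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) "
    "INFO 03-11 11:58:53 [gpu_worker.py:424] Available KV cache memory: -0.6 GiB "
    "[repeated 3x across cluster]\n"
)

reports = kv_cache_reports(SAMPLE)
# Workers at or below zero have no room left for cache blocks.
starved = [(pid, gib) for pid, gib in reports if gib <= 0]
print(reports)
print(starved)
```

Any worker reported at or below zero is enough to make engine startup fail with the ValueError shown in the traceback.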

Fix Action

Fixed

PR fix notes

PR #36904: [WIP][BugFix] Fix PP OOM for Qwen3Next/Qwen3_5 by guarding embed_tokens and lm_head

Description (problem / solution / changelog)

Purpose

Fixes #36861.

When using pipeline parallelism (PP > 1), embed_tokens (VocabParallelEmbedding) is allocated on all PP ranks instead of only the first rank. Similarly, Qwen3NextForCausalLM.lm_head (ParallelLMHead) is allocated on all ranks instead of only the last rank. This wastes significant GPU memory (~1-1.5 GB per unnecessary layer), which on tight-memory GPUs (e.g., A10 24GB) pushes the available KV cache memory to ≤ 0, triggering:

ValueError: No available memory for the cache blocks.
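
As a rough sanity check on the "~1-1.5 GB per unnecessary layer" figure, the size of one unsharded embedding or LM-head matrix is vocab_size x hidden_size x bytes per element. The sketch below is only an estimate: `vocab_size=250_000` and `hidden_size=2048` are illustrative assumptions, not values read from the model config, while 2 bytes per element corresponds to the bfloat16 dtype shown in the log.

```python
GIB = 1024 ** 3


def embedding_bytes(vocab_size: int, hidden_size: int, bytes_per_elem: int = 2) -> int:
    """Bytes held by one unsharded (vocab_size x hidden_size) weight matrix."""
    return vocab_size * hidden_size * bytes_per_elem


# Illustrative shape only (assumed, not taken from the model config).
size = embedding_bytes(vocab_size=250_000, hidden_size=2048)
print(f"{size / GIB:.2f} GiB per redundant copy")  # about 0.95 GiB
```

With the workers in this issue reporting between 0.34 GiB and -0.6 GiB of KV cache headroom, a redundant matrix of this order is large enough to decide whether startup succeeds.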

This PR adds PP rank guards following the canonical pattern used by Llama and other models:

  • Qwen3NextModel.embed_tokens: Only allocate on first rank (or last rank when tie_word_embeddings=True), use PPMissingLayer() otherwise.
  • Qwen3NextForCausalLM.lm_head: Only allocate on last rank, use PPMissingLayer() otherwise.
  • Qwen3_5Model.embed_tokens: Same guard as above.

No changes to forward() methods (already guarded with is_first_rank/is_last_rank), weight loading (both AutoWeightsLoader and is_pp_missing_parameter() already handle PPMissingLayer), or Qwen3_5ForCausalLMBase.lm_head (already correctly guarded).
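
The guard logic described above can be sketched without vLLM as follows. This is a simplified illustration, not the PR's actual diff: `PPMissingLayer` and `Embedding` here are local stand-ins for the real layer classes, and `build_embed_tokens` and `build_lm_head` are hypothetical helper names introduced only for this example.

```python
from dataclasses import dataclass


class PPMissingLayer:
    """Stand-in placeholder: holds no weights on ranks that do not need the layer."""


@dataclass
class Embedding:
    """Stand-in for a real embedding or LM-head layer that would allocate weights."""
    vocab_size: int
    hidden_size: int


def build_embed_tokens(is_first_rank: bool, is_last_rank: bool,
                       tie_word_embeddings: bool,
                       vocab_size: int, hidden_size: int):
    # First rank embeds the input tokens; with tied embeddings the last
    # rank also needs the weights because lm_head shares them.
    if is_first_rank or (tie_word_embeddings and is_last_rank):
        return Embedding(vocab_size, hidden_size)
    return PPMissingLayer()


def build_lm_head(is_last_rank: bool, vocab_size: int, hidden_size: int):
    # Only the last pipeline stage produces logits.
    if is_last_rank:
        return Embedding(vocab_size, hidden_size)
    return PPMissingLayer()


# Walk a 4-stage pipeline (PP=4, as in the failing command) with untied embeddings.
pp_size = 4
layout = []
for rank in range(pp_size):
    first, last = rank == 0, rank == pp_size - 1
    embed = build_embed_tokens(first, last, False, vocab_size=1000, hidden_size=64)
    head = build_lm_head(last, vocab_size=1000, hidden_size=64)
    layout.append((type(embed).__name__, type(head).__name__))
    print(rank, layout[-1])
```

With this guard, the two middle ranks hold neither matrix, the first rank holds only the embedding, and the last rank holds only the LM head.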

Test Plan

Test Result

Changed files

  • vllm/model_executor/models/qwen3_5.py (modified, +9/-4)
  • vllm/model_executor/models/qwen3_next.py (modified, +19/-9)

Code Example


python3 -m vllm.entrypoints.openai.api_server --served-model-name Qwen3.5-35B-A3B --model /athena/Qwen3.5-35B-A3B --gpu-memory-utilization 0.9 --tensor-parallel-size 4 --pipeline-parallel-size 1 --max-model-len 160000 --max-num-batched-tokens 4096 --max-num-seqs 32 --distributed-executor-backend ray --enable-log-requests --enable-log-outputs --enable-auto-tool-choice --tool-call-parser qwen3_coder --enable-prefix-caching --reasoning-parser qwen3

---

root@xuanwu-text-safety-qwen3-5-1358612-cfrh7:/data# python3 -m vllm.entrypoints.openai.api_server --served-model-name Qwen3.5-35B-A3B --model /athena/Qwen3.5-35B-A3B --gpu-memory-utilization 0.9 --tensor-parallel-size 4 --pipeline-parallel-size 1 --max-model-len 160000 --max-num-batched-tokens 4096 --max-num-seqs 32 --distributed-executor-backend ray --enable-log-requests --enable-log-outputs --enable-auto-tool-choice --tool-call-parser qwen3_coder --enable-prefix-caching --reasoning-parser qwen3
(APIServer pid=45912) INFO 03-11 12:21:59 [utils.py:302] 
(APIServer pid=45912) INFO 03-11 12:21:59 [utils.py:302]        █     █     █▄   ▄█
(APIServer pid=45912) INFO 03-11 12:21:59 [utils.py:302]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.17.0
(APIServer pid=45912) INFO 03-11 12:21:59 [utils.py:302]   █▄█▀ █     █     █     █  model   /athena/Qwen3.5-35B-A3B
(APIServer pid=45912) INFO 03-11 12:21:59 [utils.py:302]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     
(APIServer pid=45912) INFO 03-11 12:21:59 [utils.py:302] 
(APIServer pid=45912) INFO 03-11 12:21:59 [utils.py:238] non-default args: {'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'enable_log_outputs': True, 'model': '/athena/Qwen3.5-35B-A3B', 'max_model_len': 160000, 'served_model_name': ['Qwen3.5-35B-A3B'], 'reasoning_parser': 'qwen3', 'distributed_executor_backend': 'ray', 'tensor_parallel_size': 4, 'enable_prefix_caching': True, 'max_num_batched_tokens': 4096, 'max_num_seqs': 32, 'enable_log_requests': True}
(APIServer pid=45912) INFO 03-11 12:21:59 [model.py:531] Resolved architecture: Qwen3_5MoeForConditionalGeneration
(APIServer pid=45912) INFO 03-11 12:21:59 [model.py:1554] Using max model len 160000
(APIServer pid=45912) INFO 03-11 12:21:59 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=4096.
(APIServer pid=45912) WARNING 03-11 12:21:59 [config.py:381] Mamba cache mode is set to 'align' for Qwen3_5MoeForConditionalGeneration by default when prefix caching is enabled
(APIServer pid=45912) INFO 03-11 12:21:59 [config.py:401] Warning: Prefix caching in Mamba cache 'align' mode is currently enabled. Its support for Mamba layers is experimental. Please report any issues you may observe.
(APIServer pid=45912) INFO 03-11 12:22:00 [config.py:544] Setting attention block size to 528 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=45912) INFO 03-11 12:22:00 [config.py:575] Padding mamba page size by 0.76% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=45912) WARNING 03-11 12:22:00 [vllm.py:736] Async scheduling will be disabled because it is not supported with the `ray` distributed executor backend (only `mp`, `uni`, and `external_launcher` are supported).
(APIServer pid=45912) INFO 03-11 12:22:00 [vllm.py:747] Asynchronous scheduling is disabled.
(EngineCore_DP0 pid=46071) INFO 03-11 12:22:12 [core.py:101] Initializing a V1 LLM engine (v0.17.0) with config: model='/athena/Qwen3.5-35B-A3B', speculative_config=None, tokenizer='/athena/Qwen3.5-35B-A3B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=160000, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen3.5-35B-A3B, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 
'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [4096], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 64, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=46071) WARNING 03-11 12:22:12 [ray_utils.py:352] Tensor parallel size (4) exceeds available GPUs (1). This may result in Ray placement group allocation failures. Consider reducing tensor_parallel_size to 1 or less, or ensure your Ray cluster has 4 GPUs available.
(EngineCore_DP0 pid=46071) 2026-03-11 12:22:12,089      INFO worker.py:1669 -- Using address 10.214.65.182:6397 set in the environment variable RAY_ADDRESS
(EngineCore_DP0 pid=46071) 2026-03-11 12:22:12,091      INFO worker.py:1810 -- Connecting to existing Ray cluster at address: 10.214.65.182:6397...
(EngineCore_DP0 pid=46071) 2026-03-11 12:22:12,102      INFO worker.py:2004 -- Connected to Ray cluster. View the dashboard at http://10.214.65.182:8265 
(EngineCore_DP0 pid=46071) /usr/local/lib/python3.12/dist-packages/ray/_private/worker.py:2052: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
(EngineCore_DP0 pid=46071)   warnings.warn(
(EngineCore_DP0 pid=46071) INFO 03-11 12:22:12 [ray_utils.py:417] No current placement group found. Creating a new placement group.
(EngineCore_DP0 pid=46071) WARNING 03-11 12:22:12 [ray_utils.py:228] tensor_parallel_size=4 is bigger than a reserved number of GPUs (1 GPUs) in a node e875191d64b14e913e979dca23cbd93d0cf1bd73b8d99bc5f187b375. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 4 GPUs available at each node.
(EngineCore_DP0 pid=46071) WARNING 03-11 12:22:12 [ray_utils.py:228] tensor_parallel_size=4 is bigger than a reserved number of GPUs (1 GPUs) in a node fa31ba4f52f86b36f53b503e21ba5a07f032b3cb3307a54e93a6049e. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 4 GPUs available at each node.
(EngineCore_DP0 pid=46071) WARNING 03-11 12:22:12 [ray_utils.py:228] tensor_parallel_size=4 is bigger than a reserved number of GPUs (1 GPUs) in a node 2e9e8346878efe22127046a23c414dce509677951707201671611d5f. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 4 GPUs available at each node.
(EngineCore_DP0 pid=46071) WARNING 03-11 12:22:12 [ray_utils.py:228] tensor_parallel_size=4 is bigger than a reserved number of GPUs (1 GPUs) in a node 00be922b013b02b4c028d57e54de872af38430b268b1c538f8580458. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 4 GPUs available at each node.
(EngineCore_DP0 pid=46071) INFO 03-11 12:22:17 [ray_env.py:100] Env var prefixes to copy: ['HF_', 'HUGGING_FACE_', 'LMCACHE_', 'NCCL_', 'UCX_', 'VLLM_']
(EngineCore_DP0 pid=46071) INFO 03-11 12:22:17 [ray_env.py:101] Copying the following environment variables to workers: ['LD_LIBRARY_PATH', 'NCCL_VERSION', 'VLLM_WORKER_MULTIPROC_METHOD']
(EngineCore_DP0 pid=46071) INFO 03-11 12:22:17 [ray_env.py:111] To exclude env vars from copying, add them to /root/.config/vllm/ray_non_carry_over_env_vars.json
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15740, ip=10.214.31.213) WARNING 03-11 12:22:17 [system_utils.py:38] Overwriting environment variable LD_LIBRARY_PATH from '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/targets/x86_64-linux/lib/stubs:/usr/local/jdk/jre/lib/amd64/server' to '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/targets/x86_64-linux/lib/stubs:/usr/local/jdk/jre/lib/amd64/server'
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) WARNING 03-11 12:22:18 [worker_base.py:301] Missing `shared_worker_lock` argument from executor. This argument is needed for mm_processor_cache_type='shm'.
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=46266) INFO 03-11 12:22:18 [parallel_state.py:1393] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://10.214.65.182:43113 backend=nccl
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) <frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) <frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=46266) INFO 03-11 12:22:19 [pynccl.py:111] vLLM is using nccl==2.27.5
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) WARNING 03-11 12:22:19 [symm_mem.py:67] SymmMemCommunicator: Device capability 8.6 not supported, communicator is not available.
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) WARNING 03-11 12:22:19 [custom_all_reduce.py:92] Custom allreduce is disabled because this process group spans across nodes.
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15740, ip=10.214.31.213) INFO 03-11 12:22:19 [parallel_state.py:1715] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 2, EP rank 2, EPLB rank N/A
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:22:25 [base.py:106] Offloader set to NoopOffloader
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=46266) WARNING 03-11 12:22:17 [system_utils.py:38] Overwriting environment variable LD_LIBRARY_PATH from '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/targets/x86_64-linux/lib/stubs:/usr/local/jdk/jre/lib/amd64/server' to '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/targets/x86_64-linux/lib/stubs:/usr/local/jdk/jre/lib/amd64/server' [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15740, ip=10.214.31.213) WARNING 03-11 12:22:18 [worker_base.py:301] Missing `shared_worker_lock` argument from executor. This argument is needed for mm_processor_cache_type='shm'. [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) INFO 03-11 12:22:18 [parallel_state.py:1393] world_size=4 rank=3 local_rank=0 distributed_init_method=tcp://10.214.65.182:43113 backend=nccl [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) WARNING 03-11 12:22:19 [symm_mem.py:67] SymmMemCommunicator: Device capability 8.6 not supported, communicator is not available. [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) WARNING 03-11 12:22:19 [custom_all_reduce.py:92] Custom allreduce is disabled because this process group spans across nodes. [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) INFO 03-11 12:22:19 [parallel_state.py:1715] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 3, EP rank 3, EPLB rank N/A [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=46266) INFO 03-11 12:22:25 [gpu_model_runner.py:4255] Starting to load model /athena/Qwen3.5-35B-A3B...
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:22:26 [cuda.py:453] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:22:26 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) INFO 03-11 12:22:26 [unquantized.py:186] Using TRITON backend for Unquantized MoE
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15740, ip=10.214.31.213) INFO 03-11 12:22:26 [cuda.py:405] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15740, ip=10.214.31.213) INFO 03-11 12:22:26 [flash_attn.py:587] Using FlashAttention version 2
Loading safetensors checkpoint shards:   0% Completed | 0/14 [00:00<?, ?it/s]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15740, ip=10.214.31.213) <frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead. [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15740, ip=10.214.31.213) <frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead. [repeated 3x across cluster]
Loading safetensors checkpoint shards:   7% Completed | 1/14 [00:00<00:04,  3.04it/s]
Loading safetensors checkpoint shards:  14% Completed | 2/14 [00:00<00:04,  2.92it/s]
Loading safetensors checkpoint shards:  21% Completed | 3/14 [00:01<00:03,  2.92it/s]
Loading safetensors checkpoint shards:  29% Completed | 4/14 [00:01<00:03,  2.92it/s]
Loading safetensors checkpoint shards:  36% Completed | 5/14 [00:01<00:03,  2.93it/s]
Loading safetensors checkpoint shards:  43% Completed | 6/14 [00:02<00:02,  2.94it/s]
Loading safetensors checkpoint shards:  50% Completed | 7/14 [00:02<00:02,  2.96it/s]
Loading safetensors checkpoint shards:  57% Completed | 8/14 [00:02<00:02,  2.97it/s]
Loading safetensors checkpoint shards:  64% Completed | 9/14 [00:03<00:01,  2.73it/s]
Loading safetensors checkpoint shards:  71% Completed | 10/14 [00:03<00:01,  2.17it/s]
Loading safetensors checkpoint shards:  79% Completed | 11/14 [00:04<00:01,  1.92it/s]
Loading safetensors checkpoint shards:  86% Completed | 12/14 [00:05<00:01,  1.76it/s]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:22:31 [default_loader.py:293] Loading weights took 5.43 seconds
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) INFO 03-11 12:22:25 [base.py:106] Offloader set to NoopOffloader [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15740, ip=10.214.31.213) INFO 03-11 12:22:26 [cuda.py:453] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15740, ip=10.214.31.213) INFO 03-11 12:22:26 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention. [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=46266) INFO 03-11 12:22:26 [unquantized.py:186] Using TRITON backend for Unquantized MoE [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) INFO 03-11 12:22:26 [cuda.py:405] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']. [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) INFO 03-11 12:22:26 [flash_attn.py:587] Using FlashAttention version 2 [repeated 3x across cluster]
Loading safetensors checkpoint shards:  93% Completed | 13/14 [00:05<00:00,  1.84it/s]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:22:32 [gpu_model_runner.py:4338] Model loading took 16.52 GiB memory and 5.748502 seconds
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:05<00:00,  2.28it/s]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:05<00:00,  2.40it/s]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=46266) 
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:22:33 [gpu_model_runner.py:5254] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:23:11 [backends.py:916] Using cache directory: /root/.cache/vllm/torch_compile_cache/5cdc441eb9/rank_1_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:23:11 [backends.py:976] Dynamo bytecode transform time: 6.25 s
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) INFO 03-11 12:22:32 [default_loader.py:293] Loading weights took 6.15 seconds [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) INFO 03-11 12:22:32 [gpu_model_runner.py:4338] Model loading took 16.52 GiB memory and 6.458880 seconds [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15740, ip=10.214.31.213) INFO 03-11 12:22:33 [gpu_model_runner.py:5254] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size. [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:23:13 [backends.py:350] Cache the graph of compile range (1, 4096) for later use
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) WARNING 03-11 12:23:15 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=128,device_name=NVIDIA_A10.json
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:23:23 [backends.py:366] Compiling a graph for compile range (1, 4096) takes 10.56 s
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:23:23 [monitor.py:35] torch.compile takes 18.01 s in total
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:23:23 [decorators.py:580] saving AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/84065e09023fc5c1e8dd6ff467d3800407a3d4c8e95644be4e9021b4d135d8c5/rank_1_0/model
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=46266) INFO 03-11 12:23:12 [backends.py:916] Using cache directory: /root/.cache/vllm/torch_compile_cache/5cdc441eb9/rank_0_0/backbone for vLLM's torch.compile [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=46266) INFO 03-11 12:23:12 [backends.py:976] Dynamo bytecode transform time: 7.21 s [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=46266) INFO 03-11 12:23:14 [backends.py:350] Cache the graph of compile range (1, 4096) for later use [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) WARNING 03-11 12:23:15 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=128,device_name=NVIDIA_A10.json [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:23:24 [decorators.py:588] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/84065e09023fc5c1e8dd6ff467d3800407a3d4c8e95644be4e9021b4d135d8c5/rank_1_0/model
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:23:25 [gpu_worker.py:424] Available KV cache memory: 2.09 GiB
(EngineCore_DP0 pid=46071) INFO 03-11 12:23:26 [kv_cache_utils.py:1314] GPU KV cache size: 54,384 tokens
(EngineCore_DP0 pid=46071) INFO 03-11 12:23:26 [kv_cache_utils.py:1319] Maximum concurrency for 160,000 tokens per request: 1.34x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 11/11 [00:02<00:00,  3.91it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 7/7 [00:01<00:00,  4.00it/s]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:23:31 [gpu_model_runner.py:5360] Graph capturing finished in 5 secs, took 0.60 GiB
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) INFO 03-11 12:23:23 [backends.py:366] Compiling a graph for compile range (1, 4096) takes 9.67 s [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) INFO 03-11 12:23:23 [monitor.py:35] torch.compile takes 18.02 s in total [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) INFO 03-11 12:23:23 [decorators.py:580] saving AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/84065e09023fc5c1e8dd6ff467d3800407a3d4c8e95644be4e9021b4d135d8c5/rank_3_0/model [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=46266) INFO 03-11 12:23:25 [decorators.py:588] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/84065e09023fc5c1e8dd6ff467d3800407a3d4c8e95644be4e9021b4d135d8c5/rank_0_0/model [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=46266) INFO 03-11 12:23:26 [gpu_worker.py:424] Available KV cache memory: 2.09 GiB [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) INFO 03-11 12:23:31 [core.py:282] init engine (profile, create kv cache, warmup model) took 58.65 seconds
(EngineCore_DP0 pid=46071) INFO 03-11 12:23:38 [vllm.py:747] Asynchronous scheduling is disabled.
(APIServer pid=45912) INFO 03-11 12:23:38 [api_server.py:495] Supported tasks: ['generate']
(APIServer pid=45912) INFO 03-11 12:23:38 [logger.py:28] `--enable-log-requests` is set but the minimum log level is higher than DEBUG. Only limited information will be logged to minimize overhead. To view more details, set `VLLM_LOGGING_LEVEL=DEBUG`.
(APIServer pid=45912) INFO 03-11 12:23:38 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=45912) WARNING 03-11 12:23:38 [model.py:1355] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'top_k': 20, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=45912) INFO 03-11 12:23:38 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=45912) INFO 03-11 12:23:38 [serving.py:185] Warming up chat template processing...
(APIServer pid=45912) INFO 03-11 12:23:39 [hf.py:318] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=45912) INFO 03-11 12:23:39 [serving.py:210] Chat template warmup completed in 1213.4ms
(APIServer pid=45912) INFO 03-11 12:23:39 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=45912) INFO 03-11 12:23:39 [api_server.py:500] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:38] Available routes are:
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /docs, Methods: HEAD, GET
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /redoc, Methods: HEAD, GET
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /tokenize, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /detokenize, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /load, Methods: GET
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /version, Methods: GET
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /health, Methods: GET
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /metrics, Methods: GET
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /v1/models, Methods: GET
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /ping, Methods: GET
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /ping, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /invocations, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /v1/chat/completions, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /v1/responses, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /v1/completions, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /v1/completions/render, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /v1/messages, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /inference/v1/generate, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=45912) INFO:     Started server process [45912]
(APIServer pid=45912) INFO:     Waiting for application startup.
(APIServer pid=45912) INFO:     Application startup complete.

---

python3 -m vllm.entrypoints.openai.api_server --served-model-name Qwen3.5-35B-A3B --model /athena/Qwen3.5-35B-A3B --gpu-memory-utilization 0.9 --tensor-parallel-size 1 --pipeline-parallel-size 4 --max-model-len 160000 --max-num-batched-tokens 4096 --max-num-seqs 32 --distributed-executor-backend ray --enable-log-requests --enable-log-outputs --enable-auto-tool-choice --tool-call-parser qwen3_coder --enable-prefix-caching --reasoning-parser qwen3

---

root@xuanwu-text-safety-qwen3-5-1358612-cfrh7:/data# python3 -m vllm.entrypoints.openai.api_server --served-model-name Qwen3.5-35B-A3B --model /athena/Qwen3.5-35B-A3B --gpu-memory-utilization 0.9 --tensor-parallel-size 1 --pipeline-parallel-size 4 --max-model-len 20000 --max-num-batched-tokens 512 --max-num-seqs 8 --distributed-executor-backend ray --enable-log-requests --enable-log-outputs --enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser qwen3
(APIServer pid=38276) INFO 03-11 11:57:52 [utils.py:302] 
(APIServer pid=38276) INFO 03-11 11:57:52 [utils.py:302]        █     █     █▄   ▄█
(APIServer pid=38276) INFO 03-11 11:57:52 [utils.py:302]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.17.0
(APIServer pid=38276) INFO 03-11 11:57:52 [utils.py:302]   █▄█▀ █     █     █     █  model   /athena/Qwen3.5-35B-A3B
(APIServer pid=38276) INFO 03-11 11:57:52 [utils.py:302]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=38276) INFO 03-11 11:57:52 [utils.py:302] 
(APIServer pid=38276) INFO 03-11 11:57:52 [utils.py:238] non-default args: {'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'enable_log_outputs': True, 'model': '/athena/Qwen3.5-35B-A3B', 'max_model_len': 20000, 'served_model_name': ['Qwen3.5-35B-A3B'], 'reasoning_parser': 'qwen3', 'distributed_executor_backend': 'ray', 'pipeline_parallel_size': 4, 'max_num_batched_tokens': 512, 'max_num_seqs': 8, 'enable_log_requests': True}
(APIServer pid=38276) INFO 03-11 11:57:52 [model.py:531] Resolved architecture: Qwen3_5MoeForConditionalGeneration
(APIServer pid=38276) INFO 03-11 11:57:53 [model.py:1554] Using max model len 20000
(APIServer pid=38276) INFO 03-11 11:57:53 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=512.
(APIServer pid=38276) INFO 03-11 11:57:54 [config.py:544] Setting attention block size to 1056 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=38276) INFO 03-11 11:57:54 [config.py:575] Padding mamba page size by 0.76% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=38276) WARNING 03-11 11:57:54 [vllm.py:736] Async scheduling will be disabled because it is not supported with the `ray` distributed executor backend (only `mp`, `uni`, and `external_launcher` are supported).
(APIServer pid=38276) INFO 03-11 11:57:54 [vllm.py:747] Asynchronous scheduling is disabled.
(EngineCore_DP0 pid=38456) INFO 03-11 11:58:06 [core.py:101] Initializing a V1 LLM engine (v0.17.0) with config: model='/athena/Qwen3.5-35B-A3B', speculative_config=None, tokenizer='/athena/Qwen3.5-35B-A3B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=20000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=4, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen3.5-35B-A3B, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 
'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [512], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 16, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=38456) WARNING 03-11 11:58:06 [ray_utils.py:352] Tensor parallel size (4) exceeds available GPUs (1). This may result in Ray placement group allocation failures. Consider reducing tensor_parallel_size to 1 or less, or ensure your Ray cluster has 4 GPUs available.
(EngineCore_DP0 pid=38456) 2026-03-11 11:58:06,185      INFO worker.py:1669 -- Using address 10.214.65.182:6397 set in the environment variable RAY_ADDRESS
(EngineCore_DP0 pid=38456) 2026-03-11 11:58:06,187      INFO worker.py:1810 -- Connecting to existing Ray cluster at address: 10.214.65.182:6397...
(EngineCore_DP0 pid=38456) 2026-03-11 11:58:06,199      INFO worker.py:2004 -- Connected to Ray cluster. View the dashboard at http://10.214.65.182:8265 
(EngineCore_DP0 pid=38456) /usr/local/lib/python3.12/dist-packages/ray/_private/worker.py:2052: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
(EngineCore_DP0 pid=38456)   warnings.warn(
(EngineCore_DP0 pid=38456) INFO 03-11 11:58:06 [ray_utils.py:417] No current placement group found. Creating a new placement group.
(EngineCore_DP0 pid=38456) INFO 03-11 11:58:11 [ray_env.py:100] Env var prefixes to copy: ['HF_', 'HUGGING_FACE_', 'LMCACHE_', 'NCCL_', 'UCX_', 'VLLM_']
(EngineCore_DP0 pid=38456) INFO 03-11 11:58:11 [ray_env.py:101] Copying the following environment variables to workers: ['LD_LIBRARY_PATH', 'NCCL_VERSION', 'VLLM_WORKER_MULTIPROC_METHOD']
(EngineCore_DP0 pid=38456) INFO 03-11 11:58:11 [ray_env.py:111] To exclude env vars from copying, add them to /root/.config/vllm/ray_non_carry_over_env_vars.json
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) WARNING 03-11 11:58:11 [system_utils.py:38] Overwriting environment variable LD_LIBRARY_PATH from '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/targets/x86_64-linux/lib/stubs:/usr/local/jdk/jre/lib/amd64/server' to '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/targets/x86_64-linux/lib/stubs:/usr/local/jdk/jre/lib/amd64/server'
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) WARNING 03-11 11:58:12 [worker_base.py:301] Missing `shared_worker_lock` argument from executor. This argument is needed for mm_processor_cache_type='shm'.
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:12 [parallel_state.py:1393] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://10.214.65.182:56301 backend=nccl
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) <frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) <frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:13 [pynccl.py:111] vLLM is using nccl==2.27.5
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:13 [parallel_state.py:1715] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:18 [base.py:106] Offloader set to NoopOffloader
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14816, ip=10.214.31.213) WARNING 03-11 11:58:11 [system_utils.py:38] Overwriting environment variable LD_LIBRARY_PATH from '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/targets/x86_64-linux/lib/stubs:/usr/local/jdk/jre/lib/amd64/server' to '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/targets/x86_64-linux/lib/stubs:/usr/local/jdk/jre/lib/amd64/server' [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) WARNING 03-11 11:58:12 [worker_base.py:301] Missing `shared_worker_lock` argument from executor. This argument is needed for mm_processor_cache_type='shm'. [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14816, ip=10.214.31.213) INFO 03-11 11:58:13 [parallel_state.py:1393] world_size=4 rank=2 local_rank=0 distributed_init_method=tcp://10.214.65.182:56301 backend=nccl [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14816, ip=10.214.31.213) INFO 03-11 11:58:13 [parallel_state.py:1715] rank 2 in world size 4 is assigned as DP rank 0, PP rank 2, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:19 [gpu_model_runner.py:4255] Starting to load model /athena/Qwen3.5-35B-A3B...
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:19 [cuda.py:453] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:19 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:19 [unquantized.py:186] Using TRITON backend for Unquantized MoE
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:20 [cuda.py:405] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:20 [flash_attn.py:587] Using FlashAttention version 2
Loading safetensors checkpoint shards:   0% Completed | 0/14 [00:00<?, ?it/s]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) <frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead. [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) <frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead. [repeated 3x across cluster]
Loading safetensors checkpoint shards:  14% Completed | 2/14 [00:00<00:02,  5.21it/s]
Loading safetensors checkpoint shards:  21% Completed | 3/14 [00:00<00:03,  3.50it/s]
Loading safetensors checkpoint shards:  29% Completed | 4/14 [00:01<00:02,  3.79it/s]
Loading safetensors checkpoint shards:  43% Completed | 6/14 [00:01<00:01,  5.54it/s]
Loading safetensors checkpoint shards:  50% Completed | 7/14 [00:01<00:01,  4.25it/s]
Loading safetensors checkpoint shards:  57% Completed | 8/14 [00:02<00:01,  3.57it/s]
Loading safetensors checkpoint shards:  64% Completed | 9/14 [00:02<00:01,  3.67it/s]
Loading safetensors checkpoint shards:  71% Completed | 10/14 [00:02<00:01,  3.25it/s]
Loading safetensors checkpoint shards:  86% Completed | 12/14 [00:02<00:00,  4.17it/s]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:23 [default_loader.py:293] Loading weights took 3.31 seconds
Loading safetensors checkpoint shards:  93% Completed | 13/14 [00:03<00:00,  3.64it/s]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:03<00:00,  4.05it/s]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:03<00:00,  3.97it/s]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) 
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:23 [gpu_model_runner.py:4338] Model loading took 17.58 GiB memory and 3.468356 seconds
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:19 [base.py:106] Offloader set to NoopOffloader [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14816, ip=10.214.31.213) INFO 03-11 11:58:24 [gpu_model_runner.py:5254] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:38 [backends.py:916] Using cache directory: /root/.cache/vllm/torch_compile_cache/b0831fa56d/rank_3_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:38 [backends.py:976] Dynamo bytecode transform time: 2.30 s
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:19 [cuda.py:453] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:19 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention. [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:20 [unquantized.py:186] Using TRITON backend for Unquantized MoE [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:20 [cuda.py:405] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']. [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:20 [flash_attn.py:587] Using FlashAttention version 2 [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:23 [default_loader.py:293] Loading weights took 3.60 seconds [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:24 [gpu_model_runner.py:4338] Model loading took 17.59 GiB memory and 3.702975 seconds [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:24 [gpu_model_runner.py:5254] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size. [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:39 [backends.py:350] Cache the graph of compile range (1, 512) for later use
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) WARNING 03-11 11:58:41 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=512,device_name=NVIDIA_A10.json
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14816, ip=10.214.31.213) INFO 03-11 11:58:38 [backends.py:916] Using cache directory: /root/.cache/vllm/torch_compile_cache/108f95b37d/rank_2_0/backbone for vLLM's torch.compile [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14816, ip=10.214.31.213) INFO 03-11 11:58:38 [backends.py:976] Dynamo bytecode transform time: 2.20 s [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:45 [backends.py:366] Compiling a graph for compile range (1, 512) takes 7.18 s
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:45 [monitor.py:35] torch.compile takes 9.55 s in total
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:45 [decorators.py:580] saving AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/bad3c8e6c906073b11190b5c18d3d28bcd4afee92113dc44df568d7e6e8e6d0a/rank_1_0/model
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14816, ip=10.214.31.213) INFO 03-11 11:58:40 [backends.py:350] Cache the graph of compile range (1, 512) for later use [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:46 [decorators.py:588] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/bad3c8e6c906073b11190b5c18d3d28bcd4afee92113dc44df568d7e6e8e6d0a/rank_1_0/model
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:46 [gpu_worker.py:424] Available KV cache memory: 0.34 GiB
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) WARNING 03-11 11:58:43 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=512,device_name=NVIDIA_A10.json [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100] EngineCore failed to start.
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100] Traceback (most recent call last):
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1090, in run_engine_core
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 834, in __init__
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]     super().__init__(
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 120, in __init__
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 263, in _initialize_kv_caches
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]     kv_cache_configs = get_kv_cache_configs(
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]                        ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1572, in get_kv_cache_configs
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]     _check_enough_kv_cache_memory(
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 623, in _check_enough_kv_cache_memory
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]     raise ValueError(
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100] ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more details.
(EngineCore_DP0 pid=38456) Process EngineCore_DP0:
(EngineCore_DP0 pid=38456) Traceback (most recent call last):
(EngineCore_DP0 pid=38456)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=38456)     self.run()
(EngineCore_DP0 pid=38456)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=38456)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1104, in run_engine_core
(EngineCore_DP0 pid=38456)     raise e
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1090, in run_engine_core
(EngineCore_DP0 pid=38456)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=38456)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=38456)     return func(*args, **kwargs)
(EngineCore_DP0 pid=38456)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 834, in __init__
(EngineCore_DP0 pid=38456)     super().__init__(
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 120, in __init__
(EngineCore_DP0 pid=38456)     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=38456)                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=38456)     return func(*args, **kwargs)
(EngineCore_DP0 pid=38456)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 263, in _initialize_kv_caches
(EngineCore_DP0 pid=38456)     kv_cache_configs = get_kv_cache_configs(
(EngineCore_DP0 pid=38456)                        ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1572, in get_kv_cache_configs
(EngineCore_DP0 pid=38456)     _check_enough_kv_cache_memory(
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 623, in _check_enough_kv_cache_memory
(EngineCore_DP0 pid=38456)     raise ValueError(
(EngineCore_DP0 pid=38456) ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more details.
(EngineCore_DP0 pid=38456) INFO 03-11 11:58:53 [ray_executor.py:119] Shutting down Ray distributed executor. If you see error log from logging.cc regarding SIGTERM received, please ignore because this is the expected termination process in Ray.
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:47 [backends.py:366] Compiling a graph for compile range (1, 512) takes 8.40 s [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:47 [monitor.py:35] torch.compile takes 10.98 s in total [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:47 [decorators.py:580] saving AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/c35a1b05c386859b85bba45ca51d80e0afe8c7ca11247054a14a82723fa78d52/rank_0_0/model [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:47 [decorators.py:588] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/c35a1b05c386859b85bba45ca51d80e0afe8c7ca11247054a14a82723fa78d52/rank_0_0/model [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:53 [gpu_worker.py:424] Available KV cache memory: -0.6 GiB [repeated 3x across cluster]
(APIServer pid=38276) Traceback (most recent call last):
(APIServer pid=38276)   File "<frozen runpy>", line 198, in _run_module_as_main
(APIServer pid=38276)   File "<frozen runpy>", line 88, in _run_code
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 545, in <module>
(APIServer pid=38276)     uvloop.run(run_server(args))
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=38276)     return __asyncio.run(
(APIServer pid=38276)            ^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
(APIServer pid=38276)     return runner.run(main)
(APIServer pid=38276)            ^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=38276)     return self._loop.run_until_complete(task)
(APIServer pid=38276)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=38276)     return await main
(APIServer pid=38276)            ^^^^^^^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 471, in run_server
(APIServer pid=38276)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 490, in run_server_worker
(APIServer pid=38276)     async with build_async_engine_client(
(APIServer pid=38276)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=38276)     return await anext(self.gen)
(APIServer pid=38276)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 96, in build_async_engine_client
(APIServer pid=38276)     async with build_async_engine_client_from_engine_args(
(APIServer pid=38276)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=38276)     return await anext(self.gen)
(APIServer pid=38276)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 137, in build_async_engine_client_from_engine_args
(APIServer pid=38276)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=38276)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=38276)     return cls(
(APIServer pid=38276)            ^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=38276)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=38276)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=38276)     return func(*args, **kwargs)
(APIServer pid=38276)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 127, in make_async_mp_client
(APIServer pid=38276)     return AsyncMPClient(*client_args)
(APIServer pid=38276)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=38276)     return func(*args, **kwargs)
(APIServer pid=38276)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 911, in __init__
(APIServer pid=38276)     super().__init__(
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 569, in __init__
(APIServer pid=38276)     with launch_core_engines(
(APIServer pid=38276)          ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=38276)     next(self.gen)
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 951, in launch_core_engines
(APIServer pid=38276)     wait_for_engine_startup(
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1010, in wait_for_engine_startup
(APIServer pid=38276)     raise RuntimeError(
(APIServer pid=38276) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Your output of `python collect_env.py` here
</details>

🐛 Describe the bug

I have four nodes, each with one A10 GPU, running as a Ray cluster. With `--tensor-parallel-size 1 --pipeline-parallel-size 4`, the engine fails to start with the error above. If I switch to `--tensor-parallel-size 4 --pipeline-parallel-size 1`, it works fine:

python3 -m vllm.entrypoints.openai.api_server --served-model-name Qwen3.5-35B-A3B --model /athena/Qwen3.5-35B-A3B --gpu-memory-utilization 0.9 --tensor-parallel-size 4 --pipeline-parallel-size 1 --max-model-len 160000 --max-num-batched-tokens 4096 --max-num-seqs 32 --distributed-executor-backend ray --enable-log-requests --enable-log-outputs --enable-auto-tool-choice --tool-call-parser qwen3_coder --enable-prefix-caching --reasoning-parser qwen3
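
A rough way to compare the two runs is to pull the per-worker memory lines out of each log. The sketch below is not part of vLLM: `summarize_memory` is a hypothetical helper, and its regular expressions assume the message wording that appears in the 0.17.0 logs quoted in this issue. The sample lines are copied from the failing pipeline-parallel run.

```python
import re

# Patterns follow the log wording quoted in this issue (vLLM 0.17.0);
# other versions may phrase these messages differently (assumption).
WORKER = re.compile(r"\(RayWorkerWrapper pid=(\d+)")
LOADED = re.compile(r"Model loading took ([\d.]+) GiB memory")
KV_FREE = re.compile(r"Available KV cache memory: (-?[\d.]+) GiB")


def summarize_memory(log_text):
    """Map each Ray worker pid to the memory figures found on its log lines."""
    summary = {}
    for line in log_text.splitlines():
        worker = WORKER.search(line)
        if worker is None:
            continue  # not a worker line
        entry = summary.setdefault(worker.group(1), {})
        loaded = LOADED.search(line)
        if loaded:
            entry["weights_gib"] = float(loaded.group(1))
        kv_free = KV_FREE.search(line)
        if kv_free:
            entry["kv_gib"] = float(kv_free.group(1))
    # Keep only workers that reported at least one of the two values.
    return {pid: vals for pid, vals in summary.items() if vals}


sample = """\
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:23 [gpu_model_runner.py:4338] Model loading took 17.58 GiB memory and 3.468356 seconds
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:46 [gpu_worker.py:424] Available KV cache memory: 0.34 GiB
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:53 [gpu_worker.py:424] Available KV cache memory: -0.6 GiB [repeated 3x across cluster]
"""

for pid, vals in summarize_memory(sample).items():
    print(pid, vals)
```

Because Ray deduplicates logs by default, a line marked `[repeated 3x across cluster]` carries only one worker's value. Setting `RAY_DEDUP_LOGS=0`, as the Ray message in the log suggests, should make every worker print its own figures.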

output log

root@xuanwu-text-safety-qwen3-5-1358612-cfrh7:/data# python3 -m vllm.entrypoints.openai.api_server --served-model-name Qwen3.5-35B-A3B --model /athena/Qwen3.5-35B-A3B --gpu-memory-utilization 0.9 --tensor-parallel-size 4 --pipeline-parallel-size 1 --max-model-len 160000 --max-num-batched-tokens 4096 --max-num-seqs 32 --distributed-executor-backend ray --enable-log-requests --enable-log-outputs --enable-auto-tool-choice --tool-call-parser qwen3_coder --enable-prefix-caching --reasoning-parser qwen3
(APIServer pid=45912) INFO 03-11 12:21:59 [utils.py:302] 
(APIServer pid=45912) INFO 03-11 12:21:59 [utils.py:302]        █     █     █▄   ▄█
(APIServer pid=45912) INFO 03-11 12:21:59 [utils.py:302]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.17.0
(APIServer pid=45912) INFO 03-11 12:21:59 [utils.py:302]   █▄█▀ █     █     █     █  model   /athena/Qwen3.5-35B-A3B
(APIServer pid=45912) INFO 03-11 12:21:59 [utils.py:302]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=45912) INFO 03-11 12:21:59 [utils.py:302] 
(APIServer pid=45912) INFO 03-11 12:21:59 [utils.py:238] non-default args: {'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'enable_log_outputs': True, 'model': '/athena/Qwen3.5-35B-A3B', 'max_model_len': 160000, 'served_model_name': ['Qwen3.5-35B-A3B'], 'reasoning_parser': 'qwen3', 'distributed_executor_backend': 'ray', 'tensor_parallel_size': 4, 'enable_prefix_caching': True, 'max_num_batched_tokens': 4096, 'max_num_seqs': 32, 'enable_log_requests': True}
(APIServer pid=45912) INFO 03-11 12:21:59 [model.py:531] Resolved architecture: Qwen3_5MoeForConditionalGeneration
(APIServer pid=45912) INFO 03-11 12:21:59 [model.py:1554] Using max model len 160000
(APIServer pid=45912) INFO 03-11 12:21:59 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=4096.
(APIServer pid=45912) WARNING 03-11 12:21:59 [config.py:381] Mamba cache mode is set to 'align' for Qwen3_5MoeForConditionalGeneration by default when prefix caching is enabled
(APIServer pid=45912) INFO 03-11 12:21:59 [config.py:401] Warning: Prefix caching in Mamba cache 'align' mode is currently enabled. Its support for Mamba layers is experimental. Please report any issues you may observe.
(APIServer pid=45912) INFO 03-11 12:22:00 [config.py:544] Setting attention block size to 528 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=45912) INFO 03-11 12:22:00 [config.py:575] Padding mamba page size by 0.76% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=45912) WARNING 03-11 12:22:00 [vllm.py:736] Async scheduling will be disabled because it is not supported with the `ray` distributed executor backend (only `mp`, `uni`, and `external_launcher` are supported).
(APIServer pid=45912) INFO 03-11 12:22:00 [vllm.py:747] Asynchronous scheduling is disabled.
(EngineCore_DP0 pid=46071) INFO 03-11 12:22:12 [core.py:101] Initializing a V1 LLM engine (v0.17.0) with config: model='/athena/Qwen3.5-35B-A3B', speculative_config=None, tokenizer='/athena/Qwen3.5-35B-A3B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=160000, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen3.5-35B-A3B, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 
'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [4096], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 64, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=46071) WARNING 03-11 12:22:12 [ray_utils.py:352] Tensor parallel size (4) exceeds available GPUs (1). This may result in Ray placement group allocation failures. Consider reducing tensor_parallel_size to 1 or less, or ensure your Ray cluster has 4 GPUs available.
(EngineCore_DP0 pid=46071) 2026-03-11 12:22:12,089      INFO worker.py:1669 -- Using address 10.214.65.182:6397 set in the environment variable RAY_ADDRESS
(EngineCore_DP0 pid=46071) 2026-03-11 12:22:12,091      INFO worker.py:1810 -- Connecting to existing Ray cluster at address: 10.214.65.182:6397...
(EngineCore_DP0 pid=46071) 2026-03-11 12:22:12,102      INFO worker.py:2004 -- Connected to Ray cluster. View the dashboard at http://10.214.65.182:8265 
(EngineCore_DP0 pid=46071) /usr/local/lib/python3.12/dist-packages/ray/_private/worker.py:2052: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
(EngineCore_DP0 pid=46071)   warnings.warn(
(EngineCore_DP0 pid=46071) INFO 03-11 12:22:12 [ray_utils.py:417] No current placement group found. Creating a new placement group.
(EngineCore_DP0 pid=46071) WARNING 03-11 12:22:12 [ray_utils.py:228] tensor_parallel_size=4 is bigger than a reserved number of GPUs (1 GPUs) in a node e875191d64b14e913e979dca23cbd93d0cf1bd73b8d99bc5f187b375. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 4 GPUs available at each node.
(EngineCore_DP0 pid=46071) WARNING 03-11 12:22:12 [ray_utils.py:228] tensor_parallel_size=4 is bigger than a reserved number of GPUs (1 GPUs) in a node fa31ba4f52f86b36f53b503e21ba5a07f032b3cb3307a54e93a6049e. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 4 GPUs available at each node.
(EngineCore_DP0 pid=46071) WARNING 03-11 12:22:12 [ray_utils.py:228] tensor_parallel_size=4 is bigger than a reserved number of GPUs (1 GPUs) in a node 2e9e8346878efe22127046a23c414dce509677951707201671611d5f. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 4 GPUs available at each node.
(EngineCore_DP0 pid=46071) WARNING 03-11 12:22:12 [ray_utils.py:228] tensor_parallel_size=4 is bigger than a reserved number of GPUs (1 GPUs) in a node 00be922b013b02b4c028d57e54de872af38430b268b1c538f8580458. Tensor parallel workers can be spread out to 2+ nodes which can degrade the performance unless you have fast interconnect across nodes, like Infiniband. To resolve this issue, make sure you have more than 4 GPUs available at each node.
(EngineCore_DP0 pid=46071) INFO 03-11 12:22:17 [ray_env.py:100] Env var prefixes to copy: ['HF_', 'HUGGING_FACE_', 'LMCACHE_', 'NCCL_', 'UCX_', 'VLLM_']
(EngineCore_DP0 pid=46071) INFO 03-11 12:22:17 [ray_env.py:101] Copying the following environment variables to workers: ['LD_LIBRARY_PATH', 'NCCL_VERSION', 'VLLM_WORKER_MULTIPROC_METHOD']
(EngineCore_DP0 pid=46071) INFO 03-11 12:22:17 [ray_env.py:111] To exclude env vars from copying, add them to /root/.config/vllm/ray_non_carry_over_env_vars.json
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15740, ip=10.214.31.213) WARNING 03-11 12:22:17 [system_utils.py:38] Overwriting environment variable LD_LIBRARY_PATH from '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/targets/x86_64-linux/lib/stubs:/usr/local/jdk/jre/lib/amd64/server' to '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/targets/x86_64-linux/lib/stubs:/usr/local/jdk/jre/lib/amd64/server'
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) WARNING 03-11 12:22:18 [worker_base.py:301] Missing `shared_worker_lock` argument from executor. This argument is needed for mm_processor_cache_type='shm'.
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=46266) INFO 03-11 12:22:18 [parallel_state.py:1393] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://10.214.65.182:43113 backend=nccl
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) <frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) <frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=46266) INFO 03-11 12:22:19 [pynccl.py:111] vLLM is using nccl==2.27.5
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) WARNING 03-11 12:22:19 [symm_mem.py:67] SymmMemCommunicator: Device capability 8.6 not supported, communicator is not available.
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) WARNING 03-11 12:22:19 [custom_all_reduce.py:92] Custom allreduce is disabled because this process group spans across nodes.
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15740, ip=10.214.31.213) INFO 03-11 12:22:19 [parallel_state.py:1715] rank 2 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 2, EP rank 2, EPLB rank N/A
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:22:25 [base.py:106] Offloader set to NoopOffloader
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=46266) WARNING 03-11 12:22:17 [system_utils.py:38] Overwriting environment variable LD_LIBRARY_PATH from '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/targets/x86_64-linux/lib/stubs:/usr/local/jdk/jre/lib/amd64/server' to '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/targets/x86_64-linux/lib/stubs:/usr/local/jdk/jre/lib/amd64/server' [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15740, ip=10.214.31.213) WARNING 03-11 12:22:18 [worker_base.py:301] Missing `shared_worker_lock` argument from executor. This argument is needed for mm_processor_cache_type='shm'. [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) INFO 03-11 12:22:18 [parallel_state.py:1393] world_size=4 rank=3 local_rank=0 distributed_init_method=tcp://10.214.65.182:43113 backend=nccl [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) WARNING 03-11 12:22:19 [symm_mem.py:67] SymmMemCommunicator: Device capability 8.6 not supported, communicator is not available. [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) WARNING 03-11 12:22:19 [custom_all_reduce.py:92] Custom allreduce is disabled because this process group spans across nodes. [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) INFO 03-11 12:22:19 [parallel_state.py:1715] rank 3 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 3, EP rank 3, EPLB rank N/A [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=46266) INFO 03-11 12:22:25 [gpu_model_runner.py:4255] Starting to load model /athena/Qwen3.5-35B-A3B...
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:22:26 [cuda.py:453] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:22:26 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) INFO 03-11 12:22:26 [unquantized.py:186] Using TRITON backend for Unquantized MoE
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15740, ip=10.214.31.213) INFO 03-11 12:22:26 [cuda.py:405] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15740, ip=10.214.31.213) INFO 03-11 12:22:26 [flash_attn.py:587] Using FlashAttention version 2
Loading safetensors checkpoint shards:   0% Completed | 0/14 [00:00<?, ?it/s]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15740, ip=10.214.31.213) <frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead. [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15740, ip=10.214.31.213) <frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead. [repeated 3x across cluster]
Loading safetensors checkpoint shards:   7% Completed | 1/14 [00:00<00:04,  3.04it/s]
Loading safetensors checkpoint shards:  14% Completed | 2/14 [00:00<00:04,  2.92it/s]
Loading safetensors checkpoint shards:  21% Completed | 3/14 [00:01<00:03,  2.92it/s]
Loading safetensors checkpoint shards:  29% Completed | 4/14 [00:01<00:03,  2.92it/s]
Loading safetensors checkpoint shards:  36% Completed | 5/14 [00:01<00:03,  2.93it/s]
Loading safetensors checkpoint shards:  43% Completed | 6/14 [00:02<00:02,  2.94it/s]
Loading safetensors checkpoint shards:  50% Completed | 7/14 [00:02<00:02,  2.96it/s]
Loading safetensors checkpoint shards:  57% Completed | 8/14 [00:02<00:02,  2.97it/s]
Loading safetensors checkpoint shards:  64% Completed | 9/14 [00:03<00:01,  2.73it/s]
Loading safetensors checkpoint shards:  71% Completed | 10/14 [00:03<00:01,  2.17it/s]
Loading safetensors checkpoint shards:  79% Completed | 11/14 [00:04<00:01,  1.92it/s]
Loading safetensors checkpoint shards:  86% Completed | 12/14 [00:05<00:01,  1.76it/s]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:22:31 [default_loader.py:293] Loading weights took 5.43 seconds
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) INFO 03-11 12:22:25 [base.py:106] Offloader set to NoopOffloader [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15740, ip=10.214.31.213) INFO 03-11 12:22:26 [cuda.py:453] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15740, ip=10.214.31.213) INFO 03-11 12:22:26 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention. [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=46266) INFO 03-11 12:22:26 [unquantized.py:186] Using TRITON backend for Unquantized MoE [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) INFO 03-11 12:22:26 [cuda.py:405] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']. [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) INFO 03-11 12:22:26 [flash_attn.py:587] Using FlashAttention version 2 [repeated 3x across cluster]
Loading safetensors checkpoint shards:  93% Completed | 13/14 [00:05<00:00,  1.84it/s]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:22:32 [gpu_model_runner.py:4338] Model loading took 16.52 GiB memory and 5.748502 seconds
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:05<00:00,  2.28it/s]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:05<00:00,  2.40it/s]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=46266) 
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:22:33 [gpu_model_runner.py:5254] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:23:11 [backends.py:916] Using cache directory: /root/.cache/vllm/torch_compile_cache/5cdc441eb9/rank_1_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:23:11 [backends.py:976] Dynamo bytecode transform time: 6.25 s
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) INFO 03-11 12:22:32 [default_loader.py:293] Loading weights took 6.15 seconds [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) INFO 03-11 12:22:32 [gpu_model_runner.py:4338] Model loading took 16.52 GiB memory and 6.458880 seconds [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15740, ip=10.214.31.213) INFO 03-11 12:22:33 [gpu_model_runner.py:5254] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size. [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:23:13 [backends.py:350] Cache the graph of compile range (1, 4096) for later use
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) WARNING 03-11 12:23:15 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=128,device_name=NVIDIA_A10.json
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:23:23 [backends.py:366] Compiling a graph for compile range (1, 4096) takes 10.56 s
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:23:23 [monitor.py:35] torch.compile takes 18.01 s in total
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:23:23 [decorators.py:580] saving AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/84065e09023fc5c1e8dd6ff467d3800407a3d4c8e95644be4e9021b4d135d8c5/rank_1_0/model
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=46266) INFO 03-11 12:23:12 [backends.py:916] Using cache directory: /root/.cache/vllm/torch_compile_cache/5cdc441eb9/rank_0_0/backbone for vLLM's torch.compile [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=46266) INFO 03-11 12:23:12 [backends.py:976] Dynamo bytecode transform time: 7.21 s [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=46266) INFO 03-11 12:23:14 [backends.py:350] Cache the graph of compile range (1, 4096) for later use [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) WARNING 03-11 12:23:15 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=128,device_name=NVIDIA_A10.json [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:23:24 [decorators.py:588] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/84065e09023fc5c1e8dd6ff467d3800407a3d4c8e95644be4e9021b4d135d8c5/rank_1_0/model
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:23:25 [gpu_worker.py:424] Available KV cache memory: 2.09 GiB
(EngineCore_DP0 pid=46071) INFO 03-11 12:23:26 [kv_cache_utils.py:1314] GPU KV cache size: 54,384 tokens
(EngineCore_DP0 pid=46071) INFO 03-11 12:23:26 [kv_cache_utils.py:1319] Maximum concurrency for 160,000 tokens per request: 1.34x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|██████████| 11/11 [00:02<00:00,  3.91it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████| 7/7 [00:01<00:00,  4.00it/s]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=33834, ip=10.214.117.111) INFO 03-11 12:23:31 [gpu_model_runner.py:5360] Graph capturing finished in 5 secs, took 0.60 GiB
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) INFO 03-11 12:23:23 [backends.py:366] Compiling a graph for compile range (1, 4096) takes 9.67 s [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) INFO 03-11 12:23:23 [monitor.py:35] torch.compile takes 18.02 s in total [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=15732, ip=10.214.79.32) INFO 03-11 12:23:23 [decorators.py:580] saving AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/84065e09023fc5c1e8dd6ff467d3800407a3d4c8e95644be4e9021b4d135d8c5/rank_3_0/model [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=46266) INFO 03-11 12:23:25 [decorators.py:588] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/84065e09023fc5c1e8dd6ff467d3800407a3d4c8e95644be4e9021b4d135d8c5/rank_0_0/model [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) (RayWorkerWrapper pid=46266) INFO 03-11 12:23:26 [gpu_worker.py:424] Available KV cache memory: 2.09 GiB [repeated 3x across cluster]
(EngineCore_DP0 pid=46071) INFO 03-11 12:23:31 [core.py:282] init engine (profile, create kv cache, warmup model) took 58.65 seconds
(EngineCore_DP0 pid=46071) INFO 03-11 12:23:38 [vllm.py:747] Asynchronous scheduling is disabled.
(APIServer pid=45912) INFO 03-11 12:23:38 [api_server.py:495] Supported tasks: ['generate']
(APIServer pid=45912) INFO 03-11 12:23:38 [logger.py:28] `--enable-log-requests` is set but the minimum log level is higher than DEBUG. Only limited information will be logged to minimize overhead. To view more details, set `VLLM_LOGGING_LEVEL=DEBUG`.
(APIServer pid=45912) INFO 03-11 12:23:38 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=45912) WARNING 03-11 12:23:38 [model.py:1355] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'top_k': 20, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=45912) INFO 03-11 12:23:38 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=45912) INFO 03-11 12:23:38 [serving.py:185] Warming up chat template processing...
(APIServer pid=45912) INFO 03-11 12:23:39 [hf.py:318] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=45912) INFO 03-11 12:23:39 [serving.py:210] Chat template warmup completed in 1213.4ms
(APIServer pid=45912) INFO 03-11 12:23:39 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=45912) INFO 03-11 12:23:39 [api_server.py:500] Starting vLLM API server 0 on http://0.0.0.0:8000
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:38] Available routes are:
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /docs, Methods: HEAD, GET
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /redoc, Methods: HEAD, GET
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /tokenize, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /detokenize, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /load, Methods: GET
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /version, Methods: GET
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /health, Methods: GET
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /metrics, Methods: GET
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /v1/models, Methods: GET
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /ping, Methods: GET
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /ping, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /invocations, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /v1/chat/completions, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /v1/responses, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /v1/completions, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /v1/completions/render, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /v1/messages, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /inference/v1/generate, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=45912) INFO 03-11 12:23:39 [launcher.py:47] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=45912) INFO:     Started server process [45912]
(APIServer pid=45912) INFO:     Waiting for application startup.
(APIServer pid=45912) INFO:     Application startup complete.
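
The per-worker memory figures in the log above (model weights, available KV cache, CUDA graph overhead) can be pulled out with a few regular expressions, which makes it easier to compare one run against another. This is a minimal sketch, assuming the log line formats printed by vLLM 0.17.0 as shown above. `PATTERNS` and `summarize_memory` are hypothetical helper names and are not part of vLLM.

```python
import re

# Regexes modeled on the log lines shown above (vLLM 0.17.0).
# These names are illustrative only; they are not a vLLM API.
PATTERNS = {
    "model_load_gib": re.compile(r"Model loading took ([\d.]+) GiB memory"),
    "kv_cache_gib": re.compile(r"Available KV cache memory: ([\d.]+) GiB"),
    "cuda_graph_gib": re.compile(
        r"Graph capturing finished in \d+ secs, took ([\d.]+) GiB"
    ),
}


def summarize_memory(log_text: str) -> dict[str, list[float]]:
    """Collect every memory figure (in GiB) reported in a startup log."""
    found: dict[str, list[float]] = {name: [] for name in PATTERNS}
    for line in log_text.splitlines():
        for name, pattern in PATTERNS.items():
            match = pattern.search(line)
            if match:
                found[name].append(float(match.group(1)))
    return found


# Three lines taken from the TP=4 log above, with the Ray prefixes trimmed.
SAMPLE_LOG = """\
INFO 03-11 12:22:32 [gpu_model_runner.py:4338] Model loading took 16.52 GiB memory and 5.748502 seconds
INFO 03-11 12:23:25 [gpu_worker.py:424] Available KV cache memory: 2.09 GiB
INFO 03-11 12:23:31 [gpu_model_runner.py:5360] Graph capturing finished in 5 secs, took 0.60 GiB
"""

summary = summarize_memory(SAMPLE_LOG)
print(summary)
```

Running the same function over the full log of each configuration gives one list of values per metric, so the runs can be compared line by line. Note that Ray deduplicates repeated worker lines by default, so a log may show fewer entries than there are workers unless `RAY_DEDUP_LOGS=0` is set.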

However, when I switch to PP > 1, the result is an OOM error. The vllm serve command (bash):

python3 -m vllm.entrypoints.openai.api_server --served-model-name Qwen3.5-35B-A3B --model /athena/Qwen3.5-35B-A3B --gpu-memory-utilization 0.9 --tensor-parallel-size 1 --pipeline-parallel-size 4 --max-model-len 160000 --max-num-batched-tokens 4096 --max-num-seqs 32 --distributed-executor-backend ray --enable-log-requests --enable-log-outputs --enable-auto-tool-choice --tool-call-parser qwen3_coder --enable-prefix-caching --reasoning-parser qwen3

Output log:

root@xuanwu-text-safety-qwen3-5-1358612-cfrh7:/data# python3 -m vllm.entrypoints.openai.api_server --served-model-name Qwen3.5-35B-A3B --model /athena/Qwen3.5-35B-A3B --gpu-memory-utilization 0.9 --tensor-parallel-size 1 --pipeline-parallel-size 4 --max-model-len 20000 --max-num-batched-tokens 512 --max-num-seqs 8 --distributed-executor-backend ray --enable-log-requests --enable-log-outputs --enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser qwen3
(APIServer pid=38276) INFO 03-11 11:57:52 [utils.py:302] 
(APIServer pid=38276) INFO 03-11 11:57:52 [utils.py:302]        █     █     █▄   ▄█
(APIServer pid=38276) INFO 03-11 11:57:52 [utils.py:302]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.17.0
(APIServer pid=38276) INFO 03-11 11:57:52 [utils.py:302]   █▄█▀ █     █     █     █  model   /athena/Qwen3.5-35B-A3B
(APIServer pid=38276) INFO 03-11 11:57:52 [utils.py:302]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=38276) INFO 03-11 11:57:52 [utils.py:302] 
(APIServer pid=38276) INFO 03-11 11:57:52 [utils.py:238] non-default args: {'enable_auto_tool_choice': True, 'tool_call_parser': 'qwen3_coder', 'enable_log_outputs': True, 'model': '/athena/Qwen3.5-35B-A3B', 'max_model_len': 20000, 'served_model_name': ['Qwen3.5-35B-A3B'], 'reasoning_parser': 'qwen3', 'distributed_executor_backend': 'ray', 'pipeline_parallel_size': 4, 'max_num_batched_tokens': 512, 'max_num_seqs': 8, 'enable_log_requests': True}
(APIServer pid=38276) INFO 03-11 11:57:52 [model.py:531] Resolved architecture: Qwen3_5MoeForConditionalGeneration
(APIServer pid=38276) INFO 03-11 11:57:53 [model.py:1554] Using max model len 20000
(APIServer pid=38276) INFO 03-11 11:57:53 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=512.
(APIServer pid=38276) INFO 03-11 11:57:54 [config.py:544] Setting attention block size to 1056 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=38276) INFO 03-11 11:57:54 [config.py:575] Padding mamba page size by 0.76% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=38276) WARNING 03-11 11:57:54 [vllm.py:736] Async scheduling will be disabled because it is not supported with the `ray` distributed executor backend (only `mp`, `uni`, and `external_launcher` are supported).
(APIServer pid=38276) INFO 03-11 11:57:54 [vllm.py:747] Asynchronous scheduling is disabled.
(EngineCore_DP0 pid=38456) INFO 03-11 11:58:06 [core.py:101] Initializing a V1 LLM engine (v0.17.0) with config: model='/athena/Qwen3.5-35B-A3B', speculative_config=None, tokenizer='/athena/Qwen3.5-35B-A3B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=20000, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=4, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen3.5-35B-A3B, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 
'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [512], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 16, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=38456) WARNING 03-11 11:58:06 [ray_utils.py:352] Tensor parallel size (4) exceeds available GPUs (1). This may result in Ray placement group allocation failures. Consider reducing tensor_parallel_size to 1 or less, or ensure your Ray cluster has 4 GPUs available.
(EngineCore_DP0 pid=38456) 2026-03-11 11:58:06,185      INFO worker.py:1669 -- Using address 10.214.65.182:6397 set in the environment variable RAY_ADDRESS
(EngineCore_DP0 pid=38456) 2026-03-11 11:58:06,187      INFO worker.py:1810 -- Connecting to existing Ray cluster at address: 10.214.65.182:6397...
(EngineCore_DP0 pid=38456) 2026-03-11 11:58:06,199      INFO worker.py:2004 -- Connected to Ray cluster. View the dashboard at http://10.214.65.182:8265 
(EngineCore_DP0 pid=38456) /usr/local/lib/python3.12/dist-packages/ray/_private/worker.py:2052: FutureWarning: Tip: In future versions of Ray, Ray will no longer override accelerator visible devices env var if num_gpus=0 or num_gpus=None (default). To enable this behavior and turn off this error message, set RAY_ACCEL_ENV_VAR_OVERRIDE_ON_ZERO=0
(EngineCore_DP0 pid=38456)   warnings.warn(
(EngineCore_DP0 pid=38456) INFO 03-11 11:58:06 [ray_utils.py:417] No current placement group found. Creating a new placement group.
(EngineCore_DP0 pid=38456) INFO 03-11 11:58:11 [ray_env.py:100] Env var prefixes to copy: ['HF_', 'HUGGING_FACE_', 'LMCACHE_', 'NCCL_', 'UCX_', 'VLLM_']
(EngineCore_DP0 pid=38456) INFO 03-11 11:58:11 [ray_env.py:101] Copying the following environment variables to workers: ['LD_LIBRARY_PATH', 'NCCL_VERSION', 'VLLM_WORKER_MULTIPROC_METHOD']
(EngineCore_DP0 pid=38456) INFO 03-11 11:58:11 [ray_env.py:111] To exclude env vars from copying, add them to /root/.config/vllm/ray_non_carry_over_env_vars.json
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) WARNING 03-11 11:58:11 [system_utils.py:38] Overwriting environment variable LD_LIBRARY_PATH from '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/targets/x86_64-linux/lib/stubs:/usr/local/jdk/jre/lib/amd64/server' to '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/targets/x86_64-linux/lib/stubs:/usr/local/jdk/jre/lib/amd64/server'
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) WARNING 03-11 11:58:12 [worker_base.py:301] Missing `shared_worker_lock` argument from executor. This argument is needed for mm_processor_cache_type='shm'.
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:12 [parallel_state.py:1393] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://10.214.65.182:56301 backend=nccl
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) <frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) <frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:13 [pynccl.py:111] vLLM is using nccl==2.27.5
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:13 [parallel_state.py:1715] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:18 [base.py:106] Offloader set to NoopOffloader
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14816, ip=10.214.31.213) WARNING 03-11 11:58:11 [system_utils.py:38] Overwriting environment variable LD_LIBRARY_PATH from '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/targets/x86_64-linux/lib/stubs:/usr/local/jdk/jre/lib/amd64/server' to '/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/lib/python3.12/dist-packages/cv2/../../lib64:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda/lib64:/usr/local/cuda/targets/x86_64-linux/lib/stubs:/usr/local/jdk/jre/lib/amd64/server' [repeated 3x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) WARNING 03-11 11:58:12 [worker_base.py:301] Missing `shared_worker_lock` argument from executor. This argument is needed for mm_processor_cache_type='shm'. [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14816, ip=10.214.31.213) INFO 03-11 11:58:13 [parallel_state.py:1393] world_size=4 rank=2 local_rank=0 distributed_init_method=tcp://10.214.65.182:56301 backend=nccl [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14816, ip=10.214.31.213) INFO 03-11 11:58:13 [parallel_state.py:1715] rank 2 in world size 4 is assigned as DP rank 0, PP rank 2, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:19 [gpu_model_runner.py:4255] Starting to load model /athena/Qwen3.5-35B-A3B...
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:19 [cuda.py:453] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:19 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention.
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:19 [unquantized.py:186] Using TRITON backend for Unquantized MoE
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:20 [cuda.py:405] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION'].
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:20 [flash_attn.py:587] Using FlashAttention version 2
Loading safetensors checkpoint shards:   0% Completed | 0/14 [00:00<?, ?it/s]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) <frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead. [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) <frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead. [repeated 3x across cluster]
Loading safetensors checkpoint shards:  14% Completed | 2/14 [00:00<00:02,  5.21it/s]
Loading safetensors checkpoint shards:  21% Completed | 3/14 [00:00<00:03,  3.50it/s]
Loading safetensors checkpoint shards:  29% Completed | 4/14 [00:01<00:02,  3.79it/s]
Loading safetensors checkpoint shards:  43% Completed | 6/14 [00:01<00:01,  5.54it/s]
Loading safetensors checkpoint shards:  50% Completed | 7/14 [00:01<00:01,  4.25it/s]
Loading safetensors checkpoint shards:  57% Completed | 8/14 [00:02<00:01,  3.57it/s]
Loading safetensors checkpoint shards:  64% Completed | 9/14 [00:02<00:01,  3.67it/s]
Loading safetensors checkpoint shards:  71% Completed | 10/14 [00:02<00:01,  3.25it/s]
Loading safetensors checkpoint shards:  86% Completed | 12/14 [00:02<00:00,  4.17it/s]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:23 [default_loader.py:293] Loading weights took 3.31 seconds
Loading safetensors checkpoint shards:  93% Completed | 13/14 [00:03<00:00,  3.64it/s]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:03<00:00,  4.05it/s]
Loading safetensors checkpoint shards: 100% Completed | 14/14 [00:03<00:00,  3.97it/s]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) 
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:23 [gpu_model_runner.py:4338] Model loading took 17.58 GiB memory and 3.468356 seconds
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:19 [base.py:106] Offloader set to NoopOffloader [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14816, ip=10.214.31.213) INFO 03-11 11:58:24 [gpu_model_runner.py:5254] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size.
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:38 [backends.py:916] Using cache directory: /root/.cache/vllm/torch_compile_cache/b0831fa56d/rank_3_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:38 [backends.py:976] Dynamo bytecode transform time: 2.30 s
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:19 [cuda.py:453] Using backend AttentionBackendEnum.FLASH_ATTN for vit attention [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:19 [mm_encoder_attention.py:215] Using AttentionBackendEnum.FLASH_ATTN for MMEncoderAttention. [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:20 [unquantized.py:186] Using TRITON backend for Unquantized MoE [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:20 [cuda.py:405] Using FLASH_ATTN attention backend out of potential backends: ['FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION']. [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:20 [flash_attn.py:587] Using FlashAttention version 2 [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:23 [default_loader.py:293] Loading weights took 3.60 seconds [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:24 [gpu_model_runner.py:4338] Model loading took 17.59 GiB memory and 3.702975 seconds [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:24 [gpu_model_runner.py:5254] Encoder cache will be initialized with a budget of 16384 tokens, and profiled with 1 image items of the maximum feature size. [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:39 [backends.py:350] Cache the graph of compile range (1, 512) for later use
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) WARNING 03-11 11:58:41 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=512,device_name=NVIDIA_A10.json
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14816, ip=10.214.31.213) INFO 03-11 11:58:38 [backends.py:916] Using cache directory: /root/.cache/vllm/torch_compile_cache/108f95b37d/rank_2_0/backbone for vLLM's torch.compile [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14816, ip=10.214.31.213) INFO 03-11 11:58:38 [backends.py:976] Dynamo bytecode transform time: 2.20 s [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:45 [backends.py:366] Compiling a graph for compile range (1, 512) takes 7.18 s
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:45 [monitor.py:35] torch.compile takes 9.55 s in total
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:45 [decorators.py:580] saving AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/bad3c8e6c906073b11190b5c18d3d28bcd4afee92113dc44df568d7e6e8e6d0a/rank_1_0/model
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14816, ip=10.214.31.213) INFO 03-11 11:58:40 [backends.py:350] Cache the graph of compile range (1, 512) for later use [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:46 [decorators.py:588] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/bad3c8e6c906073b11190b5c18d3d28bcd4afee92113dc44df568d7e6e8e6d0a/rank_1_0/model
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) INFO 03-11 11:58:46 [gpu_worker.py:424] Available KV cache memory: 0.34 GiB
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) WARNING 03-11 11:58:43 [fused_moe.py:1093] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/fused_moe/configs/E=256,N=512,device_name=NVIDIA_A10.json [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100] EngineCore failed to start.
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100] Traceback (most recent call last):
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1090, in run_engine_core
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 834, in __init__
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]     super().__init__(
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 120, in __init__
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 263, in _initialize_kv_caches
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]     kv_cache_configs = get_kv_cache_configs(
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]                        ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1572, in get_kv_cache_configs
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]     _check_enough_kv_cache_memory(
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 623, in _check_enough_kv_cache_memory
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100]     raise ValueError(
(EngineCore_DP0 pid=38456) ERROR 03-11 11:58:53 [core.py:1100] ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more details.
(EngineCore_DP0 pid=38456) Process EngineCore_DP0:
(EngineCore_DP0 pid=38456) Traceback (most recent call last):
(EngineCore_DP0 pid=38456)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=38456)     self.run()
(EngineCore_DP0 pid=38456)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=38456)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1104, in run_engine_core
(EngineCore_DP0 pid=38456)     raise e
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1090, in run_engine_core
(EngineCore_DP0 pid=38456)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=38456)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=38456)     return func(*args, **kwargs)
(EngineCore_DP0 pid=38456)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 834, in __init__
(EngineCore_DP0 pid=38456)     super().__init__(
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 120, in __init__
(EngineCore_DP0 pid=38456)     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(
(EngineCore_DP0 pid=38456)                                                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=38456)     return func(*args, **kwargs)
(EngineCore_DP0 pid=38456)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 263, in _initialize_kv_caches
(EngineCore_DP0 pid=38456)     kv_cache_configs = get_kv_cache_configs(
(EngineCore_DP0 pid=38456)                        ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 1572, in get_kv_cache_configs
(EngineCore_DP0 pid=38456)     _check_enough_kv_cache_memory(
(EngineCore_DP0 pid=38456)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/core/kv_cache_utils.py", line 623, in _check_enough_kv_cache_memory
(EngineCore_DP0 pid=38456)     raise ValueError(
(EngineCore_DP0 pid=38456) ValueError: No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine. See https://docs.vllm.ai/en/latest/configuration/conserving_memory/ for more details.
(EngineCore_DP0 pid=38456) INFO 03-11 11:58:53 [ray_executor.py:119] Shutting down Ray distributed executor. If you see error log from logging.cc regarding SIGTERM received, please ignore because this is the expected termination process in Ray.
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:47 [backends.py:366] Compiling a graph for compile range (1, 512) takes 8.40 s [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:47 [monitor.py:35] torch.compile takes 10.98 s in total [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:47 [decorators.py:580] saving AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/c35a1b05c386859b85bba45ca51d80e0afe8c7ca11247054a14a82723fa78d52/rank_0_0/model [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=38635) INFO 03-11 11:58:47 [decorators.py:588] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/c35a1b05c386859b85bba45ca51d80e0afe8c7ca11247054a14a82723fa78d52/rank_0_0/model [repeated 3x across cluster]
(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) INFO 03-11 11:58:53 [gpu_worker.py:424] Available KV cache memory: -0.6 GiB [repeated 3x across cluster]
(APIServer pid=38276) Traceback (most recent call last):
(APIServer pid=38276)   File "<frozen runpy>", line 198, in _run_module_as_main
(APIServer pid=38276)   File "<frozen runpy>", line 88, in _run_code
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 545, in <module>
(APIServer pid=38276)     uvloop.run(run_server(args))
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=38276)     return __asyncio.run(
(APIServer pid=38276)            ^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
(APIServer pid=38276)     return runner.run(main)
(APIServer pid=38276)            ^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=38276)     return self._loop.run_until_complete(task)
(APIServer pid=38276)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=38276)     return await main
(APIServer pid=38276)            ^^^^^^^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 471, in run_server
(APIServer pid=38276)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 490, in run_server_worker
(APIServer pid=38276)     async with build_async_engine_client(
(APIServer pid=38276)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=38276)     return await anext(self.gen)
(APIServer pid=38276)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 96, in build_async_engine_client
(APIServer pid=38276)     async with build_async_engine_client_from_engine_args(
(APIServer pid=38276)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=38276)     return await anext(self.gen)
(APIServer pid=38276)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/entrypoints/openai/api_server.py", line 137, in build_async_engine_client_from_engine_args
(APIServer pid=38276)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=38276)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=38276)     return cls(
(APIServer pid=38276)            ^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=38276)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=38276)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=38276)     return func(*args, **kwargs)
(APIServer pid=38276)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 127, in make_async_mp_client
(APIServer pid=38276)     return AsyncMPClient(*client_args)
(APIServer pid=38276)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=38276)     return func(*args, **kwargs)
(APIServer pid=38276)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 911, in __init__
(APIServer pid=38276)     super().__init__(
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 569, in __init__
(APIServer pid=38276)     with launch_core_engines(
(APIServer pid=38276)          ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=38276)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=38276)     next(self.gen)
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 951, in launch_core_engines
(APIServer pid=38276)     wait_for_engine_startup(
(APIServer pid=38276)   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/utils.py", line 1010, in wait_for_engine_startup
(APIServer pid=38276)     raise RuntimeError(
(APIServer pid=38276) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

What is the difference between --tensor-parallel-size and --pipeline-parallel-size?


extent analysis

Fix Plan

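The startup failure is not a CUDA out-of-memory error during weight loading; it is vLLM's KV cache budget check. In the log above, each of the four pipeline stages reports about 17.6 GiB used for model weights (the MoE config file name indicates NVIDIA A10 GPUs), and after profiling the workers report 0.34 GiB and -0.6 GiB of available KV cache memory. Because at least one rank has no memory left for cache blocks, _check_enough_kv_cache_memory raises the ValueError. The log does not show why the same model fits with tensor parallelism, so treat the steps below as things to try rather than a confirmed fix:

  • Keep --tensor-parallel-size × --pipeline-parallel-size no larger than the number of GPUs available. Lowering --pipeline-parallel-size on its own places more layers, and therefore more weights, on each remaining GPU, so it is unlikely to help here.
  • Raise --gpu-memory-utilization above 0.9 only if nothing else is using those GPUs; the worst rank in this log is about 0.6 GiB short.
  • Use a smaller or quantized checkpoint if the weights alone leave no room for the cache.
  • The issue title reports that --tensor-parallel-size > 1 does not trigger the error, so moving the same four GPUs from pipeline to tensor parallelism is a possible workaround.

Here is the original command with that last change applied (untested here; it relies only on the behavior reported in the issue title):

python3 -m vllm.entrypoints.openai.api_server --served-model-name Qwen3.5-35B-A3B --model /athena/Qwen3.5-35B-A3B --gpu-memory-utilization 0.9 --tensor-parallel-size 4 --pipeline-parallel-size 1 --max-model-len 20000 --max-num-batched-tokens 512 --max-num-seqs 8 --distributed-executor-backend ray --enable-log-requests --enable-log-outputs --enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser qwen3

Note that the right values for --tensor-parallel-size, --pipeline-parallel-size, and --gpu-memory-utilization depend on the specific hardware and model being used.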
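
The "No available memory for the cache blocks" check can be reasoned about with simple arithmetic. The following is a minimal sketch under a simplified accounting model (KV cache budget = total GPU memory × gpu_memory_utilization − weights − everything else measured during profiling); vLLM's real accounting has more terms. Only the 17.58 GiB weight figure and the 0.9 utilization come from the log. The 24 GiB total and the 4.6 GiB "other" figure are hypothetical values chosen for illustration, and the function names are made up for this example.

```python
def kv_cache_budget_gib(total_gib: float, utilization: float,
                        weights_gib: float, other_gib: float) -> float:
    """Memory left for KV cache blocks under the simplified model.

    A result <= 0 corresponds to the situation in which the engine
    refuses to start because no cache blocks can be allocated.
    """
    return total_gib * utilization - weights_gib - other_gib


def min_utilization(total_gib: float, weights_gib: float,
                    other_gib: float, target_kv_gib: float) -> float:
    """Smallest utilization fraction that leaves target_kv_gib for the cache."""
    return (weights_gib + other_gib + target_kv_gib) / total_gib


if __name__ == "__main__":
    TOTAL_GIB = 24.0     # assumption: nominal GPU capacity, not read from the log
    OTHER_GIB = 4.6      # hypothetical: activations, encoder profiling, etc.
    WEIGHTS_GIB = 17.58  # from the log: "Model loading took 17.58 GiB memory"

    budget = kv_cache_budget_gib(TOTAL_GIB, 0.9, WEIGHTS_GIB, OTHER_GIB)
    print(f"KV cache budget at 0.9 utilization: {budget:.2f} GiB")

    needed = min_utilization(TOTAL_GIB, WEIGHTS_GIB, OTHER_GIB, 1.0)
    print(f"Utilization needed for 1 GiB of cache: {needed:.3f}")
```

Plugging in measured values for your own GPUs shows quickly whether a small increase in --gpu-memory-utilization can close the gap or whether the per-GPU weight footprint has to shrink first.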

Verification

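To verify that the fix worked, confirm that the ValueError "No available memory for the cache blocks" no longer appears and that every worker's "Available KV cache memory" line in the startup log reports a positive value. Ray deduplicates repeated log lines by default, so set RAY_DEDUP_LOGS=0 (as the log itself suggests) to see the value for each rank. While the server is running, nvidia-smi on each node shows whether actual GPU memory usage is within the expected range.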
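
The log check can be scripted. The sketch below assumes the Ray worker log format seen in this issue ("(RayWorkerWrapper pid=..., ip=...) ... Available KV cache memory: X GiB"); other executors or vLLM versions may format these lines differently. The function names are hypothetical helpers, not part of vLLM.

```python
import re

# Matches the worker prefix and the KV cache line as they appear in this issue's log.
KV_LINE = re.compile(
    r"\(RayWorkerWrapper pid=(?P<pid>\d+)(?:, ip=(?P<ip>[\d.]+))?\).*?"
    r"Available KV cache memory: (?P<gib>-?\d+(?:\.\d+)?) GiB"
)


def kv_cache_report(log_text: str) -> list[tuple[str, str, float]]:
    """Return (pid, ip, available_gib) for every KV cache line in the log."""
    report = []
    for line in log_text.splitlines():
        match = KV_LINE.search(line)
        if match:
            ip = match.group("ip") or "local"  # the head-node worker has no ip field
            report.append((match.group("pid"), ip, float(match.group("gib"))))
    return report


def failing_workers(report: list[tuple[str, str, float]]) -> list[tuple[str, str, float]]:
    """Workers whose reported KV cache budget is zero or negative."""
    return [entry for entry in report if entry[2] <= 0.0]


if __name__ == "__main__":
    # Two lines copied from the log in this issue.
    sample = (
        "(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=29278, ip=10.214.117.111) "
        "INFO 03-11 11:58:46 [gpu_worker.py:424] Available KV cache memory: 0.34 GiB\n"
        "(EngineCore_DP0 pid=38456) (RayWorkerWrapper pid=14803, ip=10.214.79.32) "
        "INFO 03-11 11:58:53 [gpu_worker.py:424] Available KV cache memory: -0.6 GiB\n"
    )
    for pid, ip, gib in failing_workers(kv_cache_report(sample)):
        print(f"worker pid={pid} ip={ip} has no KV cache memory ({gib} GiB)")
```

A small positive value is not a guarantee of success either: the cache still has to hold at least one request of --max-model-len tokens, so treat this script as a first filter only.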

Extra Tips

  • The error message links to the vLLM memory guide (https://docs.vllm.ai/en/latest/configuration/conserving_memory/); check it for options that apply to your model and hardware.
  • Lowering --max-num-seqs reduces the batch-related memory measured during profiling, although it is already set to 8 in this command.
  • If the error persists, try reducing --max-model-len or --max-num-batched-tokens; the latter is already 512 here, so --max-model-len is the setting with more room to shrink.
