vllm - [Feature]: DeepSeek V4 cannot run; please support SM120 GPUs, e.g. RTX 5090 and RTX PRO 6000 [2 comments, 2 participants]

GitHub stats
vllm-project/vllm#40802, fetched 2026-04-25 06:03:57
Comments: 2
Participants: 2
Timeline: 7
Reactions: 0
Timeline (top): subscribed ×4, commented ×2, labeled ×1

Error Message

time="2026-04-24T18:49:16+08:00" level=error msg="error waiting for container: unexpected EOF"
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] WorkerProc hit an exception.
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] Traceback (most recent call last):
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 966, in worker_busy_loop
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] output = func(*args, **kwargs)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return func(*args, **kwargs)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 370, in determine_available_memory
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] self.model_runner.profile_run()
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5826, in profile_run
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] hidden_states, last_hidden_states = self._dummy_run(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return func(*args, **kwargs)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5514, in _dummy_run
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] outputs = self.model(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return self.runnable(*args, **kwargs)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return self._call_impl(*args, **kwargs)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return forward_call(*args, **kwargs)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v4.py", line 833, in forward

Root Cause

PS C:\Users\wuwen> docker run --gpus all --privileged --ipc=host -p 8005:8005 -e CUDA_DEVICE_ORDER=PCI_BUS_ID -e CUDA_VISIBLE_DEVICES=0,1 -e NCCL_CUMEM_ENABLE=0 -v I:\AI-Chat\models\Deepseek:/models vllm/vllm-openai:deepseekv4-cu130 /models/DeepSeek-V4-Flash --trust-remote-code --kv-cache-dtype fp8 --block-size 256 --enable-expert-parallel --tensor-parallel-size 2 --attention_config.use_fp4_indexer_cache=True --tokenizer-mode deepseek_v4 --tool-call-parser deepseek_v4 --enable-auto-tool-choice --reasoning-parser deepseek_v4 --host 0.0.0.0 --served-model-name VLLM-MODEL --gpu-memory-utilization 0.95 --async-scheduling --enable-prefix-caching --max-num-seqs 2
(APIServer pid=1) INFO 04-24 10:27:00 [utils.py:299]
(APIServer pid=1) INFO 04-24 10:27:00 [utils.py:299] █ █ █▄ ▄█
(APIServer pid=1) INFO 04-24 10:27:00 [utils.py:299] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.1.dev15830+g8d599d76a
(APIServer pid=1) INFO 04-24 10:27:00 [utils.py:299] █▄█▀ █ █ █ █ model /models/DeepSeek-V4-Flash
(APIServer pid=1) INFO 04-24 10:27:00 [utils.py:299] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=1) INFO 04-24 10:27:00 [utils.py:299]
(APIServer pid=1) INFO 04-24 10:27:00 [utils.py:233] non-default args: {'model_tag': '/models/DeepSeek-V4-Flash', 'enable_auto_tool_choice': True, 'tool_call_parser': 'deepseek_v4', 'host': '0.0.0.0', 'model': '/models/DeepSeek-V4-Flash', 'tokenizer_mode': 'deepseek_v4', 'trust_remote_code': True, 'served_model_name': ['VLLM-MODEL'], 'reasoning_parser': 'deepseek_v4', 'tensor_parallel_size': 2, 'enable_expert_parallel': True, 'block_size': 256, 'gpu_memory_utilization': 0.95, 'kv_cache_dtype': 'fp8', 'enable_prefix_caching': True, 'max_num_seqs': 2, 'async_scheduling': True, 'attention_config': AttentionConfig(backend=None, flash_attn_version=None, use_prefill_decode_attention=False, flash_attn_max_num_splits_for_cuda_graph=32, use_cudnn_prefill=False, use_trtllm_ragged_deepseek_prefill=False, use_trtllm_attention=None, disable_flashinfer_prefill=True, disable_flashinfer_q_quantization=False, use_prefill_query_quantization=False, use_fp4_indexer_cache=True)}
(APIServer pid=1) INFO 04-24 10:27:00 [config.py:763] Detected quantization_config.scale_fmt=ue8m0; enabling UE8M0 for DeepGEMM.
(APIServer pid=1) INFO 04-24 10:27:00 [config.py:449] Replacing legacy 'type' key with 'rope_type'
(APIServer pid=1) INFO 04-24 10:27:11 [model.py:555] Resolved architecture: DeepseekV4ForCausalLM
(APIServer pid=1) INFO 04-24 10:27:11 [model.py:1689] Using max model len 1048576
(APIServer pid=1) INFO 04-24 10:27:11 [cache.py:267] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
(APIServer pid=1) INFO 04-24 10:27:11 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=1) INFO 04-24 10:27:11 [vllm.py:819] Asynchronous scheduling is enabled.
(APIServer pid=1) INFO 04-24 10:27:11 [kernel.py:201] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native'])
(APIServer pid=1) WARNING 04-24 10:27:12 [vllm.py:1234] Auto-initialization of reasoning token IDs failed. Please check whether your reasoning parser has implemented the reasoning_start_str and reasoning_end_str.
(APIServer pid=1) INFO 04-24 10:27:12 [compilation.py:294] Enabled custom fusions: norm_quant, act_quant (EngineCore pid=273) INFO 04-24 10:27:20 [core.py:108] Initializing a V1 LLM engine (v0.1.dev15830+g8d599d76a) with config: model='/models/DeepSeek-V4-Flash', speculative_config=None, tokenizer='/models/DeepSeek-V4-Flash', skip_tokenizer_init=False, tokenizer_mode=deepseek_v4, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=1048576, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=deepseek_v4_fp8, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='deepseek_v4', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=VLLM-MODEL, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'none', '+quant_fp8'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 
'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::deepseek_v4_attention', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 4, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto') (EngineCore pid=273) WARNING 04-24 10:27:20 [multiproc_executor.py:1038] Reducing Torch parallelism from 32 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed. 
(EngineCore pid=273) INFO 04-24 10:27:20 [multiproc_executor.py:138] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=172.17.0.2 (local), world_size=2, local_world_size=2 WARNING 04-24 10:27:29 [interface.py:686] Using 'pin_memory=False' as WSL is detected. This may slow down the performance. WARNING 04-24 10:27:29 [interface.py:686] Using 'pin_memory=False' as WSL is detected. This may slow down the performance. (Worker pid=408) INFO 04-24 10:27:29 [parallel_state.py:1400] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:41607 backend=nccl (Worker pid=409) INFO 04-24 10:27:29 [parallel_state.py:1400] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:41607 backend=nccl (Worker pid=408) INFO 04-24 10:27:30 [pynccl.py:111] vLLM is using nccl==2.28.9 (Worker_TP1_EP1 pid=409) torch_dtype is deprecated! Use dtype instead! (Worker_TP0_EP0 pid=408) torch_dtype is deprecated! Use dtype instead! Loading safetensors checkpoint shards: 0% Completed | 0/46 [00:00<?, ?it/s] Loading safetensors checkpoint shards: 2% Completed | 1/46 [00:06<04:41, 6.26s/it] (Worker pid=408) WARNING 04-24 10:27:37 [symm_mem.py:66] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available. (Worker pid=409) WARNING 04-24 10:27:37 [symm_mem.py:66] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available. (Worker pid=408) WARNING 04-24 10:27:37 [custom_all_reduce.py:164] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly. (Worker pid=409) WARNING 04-24 10:27:37 [custom_all_reduce.py:164] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly. 
(Worker pid=408) INFO 04-24 10:27:40 [parallel_state.py:1713] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A (Worker_TP0_EP0 pid=408) INFO 04-24 10:27:44 [gpu_model_runner.py:4763] Starting to load model /models/DeepSeek-V4-Flash... (Worker_TP0_EP0 pid=408) INFO 04-24 10:27:45 [init.py:384] Selected CutlassFp8BlockScaledMMKernel for Fp8LinearMethod (Worker_TP0_EP0 pid=408) INFO 04-24 10:27:49 [deepseek_v4_attention.py:607] Using DeepSeek's fp8_ds_mla KV cache format. To use standard fp8 kv-cache format, please set --attention-backend FLASHINFER_MLA_SPARSE (Worker_TP0_EP0 pid=408) INFO 04-24 10:27:49 [layer.py:400] [EP Rank 0/2] Expert parallelism is enabled. Expert placement strategy: linear. Local/global number of experts: 128/256. Experts local to global index map: 0->0, 1->1, 2->2, 3->3, 4->4, 5->5, 6->6, 7->7, 8->8, 9->9, 10->10, 11->11, 12->12, 13->13, 14->14, 15->15, 16->16, 17->17, 18->18, 19->19, 20->20, 21->21, 22->22, 23->23, 24->24, 25->25, 26->26, 27->27, 28->28, 29->29, 30->30, 31->31, 32->32, 33->33, 34->34, 35->35, 36->36, 37->37, 38->38, 39->39, 40->40, 41->41, 42->42, 43->43, 44->44, 45->45, 46->46, 47->47, 48->48, 49->49, 50->50, 51->51, 52->52, 53->53, 54->54, 55->55, 56->56, 57->57, 58->58, 59->59, 60->60, 61->61, 62->62, 63->63, 64->64, 65->65, 66->66, 67->67, 68->68, 69->69, 70->70, 71->71, 72->72, 73->73, 74->74, 75->75, 76->76, 77->77, 78->78, 79->79, 80->80, 81->81, 82->82, 83->83, 84->84, 85->85, 86->86, 87->87, 88->88, 89->89, 90->90, 91->91, 92->92, 93->93, 94->94, 95->95, 96->96, 97->97, 98->98, 99->99, 100->100, 101->101, 102->102, 103->103, 104->104, 105->105, 106->106, 107->107, 108->108, 109->109, 110->110, 111->111, 112->112, 113->113, 114->114, 115->115, 116->116, 117->117, 118->118, 119->119, 120->120, 121->121, 122->122, 123->123, 124->124, 125->125, 126->126, 127->127. (Worker_TP0_EP0 pid=408) INFO 04-24 10:27:50 [mxfp4.py:481] Using 'MARLIN' Mxfp4 MoE backend. 
(Worker_TP0_EP0 pid=408) INFO 04-24 10:27:50 [deepseek_v4_attention.py:969] Using MXFP4 indexer cache for Lighening Indexer. (Worker_TP0_EP0 pid=408) INFO 04-24 10:27:54 [weight_utils.py:904] Filesystem type for checkpoints: 9P. Checkpoint size: 148.66 GiB. Available RAM: 0.65 GiB. (Worker_TP0_EP0 pid=408) INFO 04-24 10:27:54 [weight_utils.py:934] Auto-prefetch is disabled because the filesystem (9P) is not a recognized network FS (NFS/Lustre) and the checkpoint size (148.66 GiB) exceeds 90% of available RAM (0.65 GiB). Loading safetensors checkpoint shards: 4% Completed | 2/46 [00:33<13:38, 18.60s/it] Loading safetensors checkpoint shards: 7% Completed | 3/46 [00:52<13:19, 18.59s/it] Loading safetensors checkpoint shards: 9% Completed | 4/46 [01:12<13:25, 19.18s/it] Loading safetensors checkpoint shards: 11% Completed | 5/46 [01:28<12:22, 18.12s/it] Loading safetensors checkpoint shards: 13% Completed | 6/46 [01:49<12:50, 19.27s/it] Loading safetensors checkpoint shards: 15% Completed | 7/46 [02:06<11:59, 18.44s/it] Loading safetensors checkpoint shards: 17% Completed | 8/46 [02:27<12:11, 19.26s/it] Loading safetensors checkpoint shards: 20% Completed | 9/46 [02:45<11:32, 18.71s/it] Loading safetensors checkpoint shards: 22% Completed | 10/46 [03:06<11:37, 19.38s/it] Loading safetensors checkpoint shards: 24% Completed | 11/46 [03:18<10:06, 17.32s/it] Loading safetensors checkpoint shards: 26% Completed | 12/46 [03:39<10:21, 18.28s/it] Loading safetensors checkpoint shards: 28% Completed | 13/46 [03:59<10:25, 18.94s/it] Loading safetensors checkpoint shards: 30% Completed | 14/46 [04:20<10:26, 19.58s/it] Loading safetensors checkpoint shards: 33% Completed | 15/46 [04:40<10:11, 19.74s/it] Loading safetensors checkpoint shards: 35% Completed | 16/46 [05:00<09:54, 19.80s/it] Loading safetensors checkpoint shards: 37% Completed | 17/46 [05:22<09:50, 20.36s/it] Loading safetensors checkpoint shards: 39% Completed | 18/46 [05:44<09:44, 20.88s/it] Loading safetensors 
checkpoint shards: 41% Completed | 19/46 [06:06<09:33, 21.23s/it] Loading safetensors checkpoint shards: 43% Completed | 20/46 [06:28<09:14, 21.32s/it] Loading safetensors checkpoint shards: 46% Completed | 21/46 [06:50<09:00, 21.61s/it] Loading safetensors checkpoint shards: 48% Completed | 22/46 [07:11<08:37, 21.57s/it] Loading safetensors checkpoint shards: 50% Completed | 23/46 [07:32<08:06, 21.16s/it] Loading safetensors checkpoint shards: 52% Completed | 24/46 [07:50<07:26, 20.29s/it] Loading safetensors checkpoint shards: 54% Completed | 25/46 [08:07<06:47, 19.43s/it] Loading safetensors checkpoint shards: 57% Completed | 26/46 [08:25<06:17, 18.85s/it] Loading safetensors checkpoint shards: 59% Completed | 27/46 [08:41<05:42, 18.03s/it] Loading safetensors checkpoint shards: 61% Completed | 28/46 [08:57<05:13, 17.44s/it] Loading safetensors checkpoint shards: 63% Completed | 29/46 [09:12<04:46, 16.82s/it] Loading safetensors checkpoint shards: 65% Completed | 30/46 [09:29<04:28, 16.78s/it] Loading safetensors checkpoint shards: 67% Completed | 31/46 [09:44<04:06, 16.40s/it] Loading safetensors checkpoint shards: 70% Completed | 32/46 [10:00<03:47, 16.22s/it] Loading safetensors checkpoint shards: 72% Completed | 33/46 [10:17<03:31, 16.30s/it] Loading safetensors checkpoint shards: 74% Completed | 34/46 [11:22<06:13, 31.10s/it] Loading safetensors checkpoint shards: 76% Completed | 35/46 [12:43<08:25, 45.97s/it] Loading safetensors checkpoint shards: 78% Completed | 36/46 [13:53<08:50, 53.06s/it] Loading safetensors checkpoint shards: 80% Completed | 37/46 [15:06<08:51, 59.02s/it] Loading safetensors checkpoint shards: 83% Completed | 38/46 [15:57<07:35, 56.89s/it] Loading safetensors checkpoint shards: 85% Completed | 39/46 [16:28<05:43, 49.12s/it] Loading safetensors checkpoint shards: 87% Completed | 40/46 [16:50<04:05, 40.94s/it] Loading safetensors checkpoint shards: 89% Completed | 41/46 [17:10<02:53, 34.65s/it] Loading safetensors checkpoint shards: 
91% Completed | 42/46 [18:05<02:42, 40.72s/it] Loading safetensors checkpoint shards: 93% Completed | 43/46 [18:32<01:49, 36.66s/it] Loading safetensors checkpoint shards: 96% Completed | 44/46 [19:32<01:27, 43.69s/it] time="2026-04-24T18:49:16+08:00" level=error msg="error waiting for container: unexpected EOF" Loading safetensors checkpoint shards: 98% Completed | 45/46 [21:28<01:05, 65.38s/it] Loading safetensors checkpoint shards: 100% Completed | 46/46 [21:30<00:00, 46.37s/it] Loading safetensors checkpoint shards: 100% Completed | 46/46 [21:30<00:00, 28.07s/it] (Worker_TP0_EP0 pid=408) (Worker_TP0_EP0 pid=408) INFO 04-24 10:49:25 [default_loader.py:384] Loading weights took 1292.19 seconds (Worker_TP0_EP0 pid=408) INFO 04-24 10:49:27 [mxfp4.py:1238] Using MoEPrepareAndFinalizeNoDPEPModular (Worker_TP0_EP0 pid=408) INFO 04-24 10:50:00 [gpu_model_runner.py:4848] Model loading took 74.05 GiB memory and 1307.649939 seconds (Worker_TP0_EP0 pid=408) INFO 04-24 10:50:20 [backends.py:1070] Using cache directory: /root/.cache/vllm/torch_compile_cache/6df291e80d/rank_0_0/backbone for vLLM's torch.compile (Worker_TP0_EP0 pid=408) INFO 04-24 10:50:20 [backends.py:1130] Dynamo bytecode transform time: 19.33 s (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] WorkerProc hit an exception. 
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] Traceback (most recent call last):
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 966, in worker_busy_loop
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] output = func(*args, **kwargs)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return func(*args, **kwargs)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 370, in determine_available_memory
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] self.model_runner.profile_run()
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5826, in profile_run
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] hidden_states, last_hidden_states = self._dummy_run(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return func(*args, **kwargs)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5514, in _dummy_run
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] outputs = self.model(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return self.runnable(*args, **kwargs)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return self._call_impl(*args, **kwargs)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return forward_call(*args, **kwargs)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v4.py", line 833, in forward
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] hidden_states = self.model(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 611, in __call__
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] self.aot_compiled_fn = self.aot_compile(*args, **kwargs)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/wrapper.py", line 183, in aot_compile
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return self._compiled_callable.aot_compile((args, kwargs))
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 873, in aot_compile
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return aot_compile_fullgraph(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/aot_compile.py", line 368, in aot_compile_fullgraph
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] compiled_fn = backend(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/__init__.py", line 2535, in __call__
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return self.compiler_fn(model, inputs, **self.kwargs)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/lib/python3.12/contextlib.py", line 81, in inner
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return func(*args, **kwds)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/backends.py", line 1196, in __call__
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] PiecewiseCompileInterpreter(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return func(*args, **kwargs)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/backends.py", line 722, in run
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return super().run(*args)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/fx/interpreter.py", line 200, in run
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] self.env[node] = self.run_node(node)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/fx/interpreter.py", line 297, in run_node
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return getattr(self, n.op)(n.target, args, kwargs)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/backends.py", line 749, in call_module
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] piecewise_backend = PiecewiseBackend(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/piecewise_backend.py", line 189, in __init__
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] self.compile_all_ranges()
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/piecewise_backend.py", line 265, in compile_all_ranges
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] range_entry.runnable = self.vllm_backend.compiler_manager.compile(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return func(*args, **kwargs)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/backends.py", line 348, in compile
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] compiled_graph, handle = self.compiler.compile(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/compiler_interface.py", line 351, in compile
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] compiled_graph = standalone_compile(graph, example_inputs, **compile_kwargs)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/__init__.py", line 444, in standalone_compile
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return standalone_compile(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/standalone_compile.py", line 444, in standalone_compile
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] compiled_fn = compile_fx(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_fx.py", line 2527, in compile_fx
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return compile_fx(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_fx.py", line 2578, in compile_fx
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return _maybe_wrap_and_compile_fx_main(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_fx.py", line 2655, in _maybe_wrap_and_compile_fx_main
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return _compile_fx_main(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_fx.py", line 2864, in _compile_fx_main
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] raise e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_fx.py", line 1053, in _compile_fx_inner
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] raise InductorError(e, currentframe()).with_traceback(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_fx.py", line 1037, in _compile_fx_inner
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] mb_compiled_graph = fx_codegen_and_compile(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_fx.py", line 1798, in fx_codegen_and_compile
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_fx.py", line 1344, in codegen_and_compile
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] _recursive_post_grad_passes(gm, is_inference=is_inference)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_fx.py", line 583, in _recursive_post_grad_passes
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] post_grad_passes(gm, is_inference)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/fx_passes/post_grad.py", line 358, in post_grad_passes
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] GraphTransformObserver(gm, "decompose_auto_functionalized").apply_graph_pass(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/fx/passes/graph_transform_observer.py", line 103, in apply_graph_pass
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return pass_fn(self.gm.graph)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/fx_passes/post_grad.py", line 1392, in decompose_auto_functionalized
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] raise AssertionError("auto_functionalized was not removed")
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] torch._inductor.exc.InductorError: AssertionError: auto_functionalized was not removed
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] Traceback (most recent call last):
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 
[multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 966, in worker_busy_loop (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] output = func(*args, **kwargs) (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^ (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return func(*args, **kwargs) (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^ (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 370, in determine_available_memory (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] self.model_runner.profile_run() (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5826, in profile_run (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] hidden_states, last_hidden_states = self._dummy_run( (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^ (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return func(*args, **kwargs) (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^ (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5514, in _dummy_run (Worker_TP0_EP0 
pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] outputs = self.model( (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^ (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 254, in call (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return self.runnable(*args, **kwargs) (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return self._call_impl(*args, **kwargs) (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1790, in _call_impl (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return forward_call(*args, **kwargs) (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v4.py", line 833, in forward (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] hidden_states = self.model( (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^ (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 611, in call (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] self.aot_compiled_fn = 
self.aot_compile(*args, **kwargs) (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/wrapper.py", line 183, in aot_compile (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return self._compiled_callable.aot_compile((args, kwargs)) (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/dynamo/eval_frame.py", line 873, in aot_compile (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return aot_compile_fullgraph( (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^ (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/dynamo/aot_compile.py", line 368, in aot_compile_fullgraph (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] compiled_fn = backend( (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^ (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/init.py", line 2535, in call (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return self.compiler_fn(model, inputs, **self.kwargs) (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/lib/python3.12/contextlib.py", line 81, in inner (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return func(*args, **kwds) (Worker_TP0_EP0 pid=408) ERROR 04-24 
10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^ (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/backends.py", line 1196, in call (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] PiecewiseCompileInterpreter( (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return func(*args, **kwargs) (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^ (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/backends.py", line 722, in run (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return super().run(*args) (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^ (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/fx/interpreter.py", line 200, in run (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] self.env[node] = self.run_node(node) (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^ (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/fx/interpreter.py", line 297, in run_node (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return getattr(self, n.op)(n.target, args, kwargs) (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/backends.py", line 749, 
in call_module (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] piecewise_backend = PiecewiseBackend( (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^ (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/piecewise_backend.py", line 189, in init (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] self.compile_all_ranges() (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/piecewise_backend.py", line 265, in compile_all_ranges (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] range_entry.runnable = self.vllm_backend.compiler_manager.compile( (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
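The flattened traceback above is hard to read because every frame carries the same worker log prefix. The following is a minimal sketch for splitting such a blob back into individual entries and pulling out the final exception line. It assumes the prefix always has the shape seen in this excerpt, `(Worker_… pid=N) ERROR MM-DD HH:MM:SS [file.py:line]`. The names `split_worker_log` and `final_exception` are hypothetical helpers for illustration and are not part of vLLM.

```python
import re
from typing import List, Optional

# Prefix seen on every traceback line in the excerpt above, e.g.
# "(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971]".
# Its exact shape is an assumption based on this one log.
PREFIX = re.compile(
    r"\(Worker\w* pid=\d+\) ERROR \d{2}-\d{2} \d{2}:\d{2}:\d{2} \[[\w.]+:\d+\]"
)

# An entry such as "torch._inductor.exc.InductorError: AssertionError: ...".
EXCEPTION_LINE = re.compile(r"^[\w.]+(Error|Exception): ")


def split_worker_log(flat: str) -> List[str]:
    """Split a flattened log into one entry per original log line."""
    return [part.strip() for part in PREFIX.split(flat) if part.strip()]


def final_exception(entries: List[str]) -> Optional[str]:
    """Return the last entry that looks like 'package.SomeError: message'."""
    for entry in reversed(entries):
        if EXCEPTION_LINE.match(entry):
            return entry
    return None


# A short fragment copied from the log above, used as sample input.
SAMPLE = (
    '(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] '
    'raise AssertionError("auto_functionalized was not removed") '
    '(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] '
    'torch._inductor.exc.InductorError: AssertionError: '
    'auto_functionalized was not removed '
    '(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] '
    'Traceback (most recent call last):'
)

if __name__ == "__main__":
    entries = split_worker_log(SAMPLE)
    for entry in entries:
        print(entry)
    print("final exception:", final_exception(entries))
```

Run against the full log, this isolates the `torch._inductor.exc.InductorError` entry, the exception that ends the first traceback before the same traceback is logged a second time.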


🚀 The feature, motivation and pitch

PS C:\Users\wuwen> docker run --gpus all --privileged --ipc=host -p 8005:8005 -e CUDA_DEVICE_ORDER=PCI_BUS_ID -e CUDA_VISIBLE_DEVICES=0,1 -e NCCL_CUMEM_ENABLE=0 -v I:\AI-Chat\models\Deepseek:/models vllm/vllm-openai:deepseekv4-cu130 /models/DeepSeek-V4-Flash --trust-remote-code --kv-cache-dtype fp8 --block-size 256 --enable-expert-parallel --tensor-parallel-size 2 --attention_config.use_fp4_indexer_cache=True --tokenizer-mode deepseek_v4 --tool-call-parser deepseek_v4 --enable-auto-tool-choice --reasoning-parser deepseek_v4 --host 0.0.0.0 --served-model-name VLLM-MODEL --gpu-memory-utilization 0.95 --async-scheduling --enable-prefix-caching --max-num-seqs 2
(APIServer pid=1) INFO 04-24 10:27:00 [utils.py:299]
(APIServer pid=1) INFO 04-24 10:27:00 [utils.py:299] █ █ █▄ ▄█
(APIServer pid=1) INFO 04-24 10:27:00 [utils.py:299] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.1.dev15830+g8d599d76a
(APIServer pid=1) INFO 04-24 10:27:00 [utils.py:299] █▄█▀ █ █ █ █ model /models/DeepSeek-V4-Flash
(APIServer pid=1) INFO 04-24 10:27:00 [utils.py:299] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=1) INFO 04-24 10:27:00 [utils.py:299]
(APIServer pid=1) INFO 04-24 10:27:00 [utils.py:233] non-default args: {'model_tag': '/models/DeepSeek-V4-Flash', 'enable_auto_tool_choice': True, 'tool_call_parser': 'deepseek_v4', 'host': '0.0.0.0', 'model': '/models/DeepSeek-V4-Flash', 'tokenizer_mode': 'deepseek_v4', 'trust_remote_code': True, 'served_model_name': ['VLLM-MODEL'], 'reasoning_parser': 'deepseek_v4', 'tensor_parallel_size': 2, 'enable_expert_parallel': True, 'block_size': 256, 'gpu_memory_utilization': 0.95, 'kv_cache_dtype': 'fp8', 'enable_prefix_caching': True, 'max_num_seqs': 2, 'async_scheduling': True, 'attention_config': AttentionConfig(backend=None, flash_attn_version=None, use_prefill_decode_attention=False, flash_attn_max_num_splits_for_cuda_graph=32, use_cudnn_prefill=False, use_trtllm_ragged_deepseek_prefill=False, use_trtllm_attention=None, disable_flashinfer_prefill=True,
disable_flashinfer_q_quantization=False, use_prefill_query_quantization=False, use_fp4_indexer_cache=True)} (APIServer pid=1) INFO 04-24 10:27:00 [config.py:763] Detected quantization_config.scale_fmt=ue8m0; enabling UE8M0 for DeepGEMM. (APIServer pid=1) INFO 04-24 10:27:00 [config.py:449] Replacing legacy 'type' key with 'rope_type' (APIServer pid=1) INFO 04-24 10:27:11 [model.py:555] Resolved architecture: DeepseekV4ForCausalLM (APIServer pid=1) INFO 04-24 10:27:11 [model.py:1689] Using max model len 1048576 (APIServer pid=1) INFO 04-24 10:27:11 [cache.py:267] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor (APIServer pid=1) INFO 04-24 10:27:11 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=8192. (APIServer pid=1) INFO 04-24 10:27:11 [vllm.py:819] Asynchronous scheduling is enabled. (APIServer pid=1) INFO 04-24 10:27:11 [kernel.py:201] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['native']) (APIServer pid=1) WARNING 04-24 10:27:12 [vllm.py:1234] Auto-initialization of reasoning token IDs failed. Please check whether your reasoning parser has implemented the reasoning_start_str and reasoning_end_str. 
(APIServer pid=1) INFO 04-24 10:27:12 [compilation.py:294] Enabled custom fusions: norm_quant, act_quant (EngineCore pid=273) INFO 04-24 10:27:20 [core.py:108] Initializing a V1 LLM engine (v0.1.dev15830+g8d599d76a) with config: model='/models/DeepSeek-V4-Flash', speculative_config=None, tokenizer='/models/DeepSeek-V4-Flash', skip_tokenizer_init=False, tokenizer_mode=deepseek_v4, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=1048576, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=deepseek_v4_fp8, quantization_config=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='deepseek_v4', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=VLLM-MODEL, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'none', '+quant_fp8'], 'ir_enable_torch_wrap': True, 'splitting_ops': ['vllm::unified_attention_with_output', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 
'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::deepseek_v4_attention', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 4, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['native']), enable_flashinfer_autotune=True, moe_backend='auto') (EngineCore pid=273) WARNING 04-24 10:27:20 [multiproc_executor.py:1038] Reducing Torch parallelism from 32 threads to 1 to avoid unnecessary CPU contention. Set OMP_NUM_THREADS in the external environment to tune this value as needed. 
(EngineCore pid=273) INFO 04-24 10:27:20 [multiproc_executor.py:138] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=172.17.0.2 (local), world_size=2, local_world_size=2 WARNING 04-24 10:27:29 [interface.py:686] Using 'pin_memory=False' as WSL is detected. This may slow down the performance. WARNING 04-24 10:27:29 [interface.py:686] Using 'pin_memory=False' as WSL is detected. This may slow down the performance. (Worker pid=408) INFO 04-24 10:27:29 [parallel_state.py:1400] world_size=2 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:41607 backend=nccl (Worker pid=409) INFO 04-24 10:27:29 [parallel_state.py:1400] world_size=2 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:41607 backend=nccl (Worker pid=408) INFO 04-24 10:27:30 [pynccl.py:111] vLLM is using nccl==2.28.9 (Worker_TP1_EP1 pid=409) torch_dtype is deprecated! Use dtype instead! (Worker_TP0_EP0 pid=408) torch_dtype is deprecated! Use dtype instead! Loading safetensors checkpoint shards: 0% Completed | 0/46 [00:00<?, ?it/s] Loading safetensors checkpoint shards: 2% Completed | 1/46 [00:06<04:41, 6.26s/it] (Worker pid=408) WARNING 04-24 10:27:37 [symm_mem.py:66] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available. (Worker pid=409) WARNING 04-24 10:27:37 [symm_mem.py:66] SymmMemCommunicator: Device capability 12.0 not supported, communicator is not available. (Worker pid=408) WARNING 04-24 10:27:37 [custom_all_reduce.py:164] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly. (Worker pid=409) WARNING 04-24 10:27:37 [custom_all_reduce.py:164] Custom allreduce is disabled because your platform lacks GPU P2P capability or P2P test failed. To silence this warning, specify disable_custom_all_reduce=True explicitly. 
(Worker pid=408) INFO 04-24 10:27:40 [parallel_state.py:1713] rank 0 in world size 2 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A (Worker_TP0_EP0 pid=408) INFO 04-24 10:27:44 [gpu_model_runner.py:4763] Starting to load model /models/DeepSeek-V4-Flash... (Worker_TP0_EP0 pid=408) INFO 04-24 10:27:45 [init.py:384] Selected CutlassFp8BlockScaledMMKernel for Fp8LinearMethod (Worker_TP0_EP0 pid=408) INFO 04-24 10:27:49 [deepseek_v4_attention.py:607] Using DeepSeek's fp8_ds_mla KV cache format. To use standard fp8 kv-cache format, please set --attention-backend FLASHINFER_MLA_SPARSE (Worker_TP0_EP0 pid=408) INFO 04-24 10:27:49 [layer.py:400] [EP Rank 0/2] Expert parallelism is enabled. Expert placement strategy: linear. Local/global number of experts: 128/256. Experts local to global index map: 0->0, 1->1, 2->2, 3->3, 4->4, 5->5, 6->6, 7->7, 8->8, 9->9, 10->10, 11->11, 12->12, 13->13, 14->14, 15->15, 16->16, 17->17, 18->18, 19->19, 20->20, 21->21, 22->22, 23->23, 24->24, 25->25, 26->26, 27->27, 28->28, 29->29, 30->30, 31->31, 32->32, 33->33, 34->34, 35->35, 36->36, 37->37, 38->38, 39->39, 40->40, 41->41, 42->42, 43->43, 44->44, 45->45, 46->46, 47->47, 48->48, 49->49, 50->50, 51->51, 52->52, 53->53, 54->54, 55->55, 56->56, 57->57, 58->58, 59->59, 60->60, 61->61, 62->62, 63->63, 64->64, 65->65, 66->66, 67->67, 68->68, 69->69, 70->70, 71->71, 72->72, 73->73, 74->74, 75->75, 76->76, 77->77, 78->78, 79->79, 80->80, 81->81, 82->82, 83->83, 84->84, 85->85, 86->86, 87->87, 88->88, 89->89, 90->90, 91->91, 92->92, 93->93, 94->94, 95->95, 96->96, 97->97, 98->98, 99->99, 100->100, 101->101, 102->102, 103->103, 104->104, 105->105, 106->106, 107->107, 108->108, 109->109, 110->110, 111->111, 112->112, 113->113, 114->114, 115->115, 116->116, 117->117, 118->118, 119->119, 120->120, 121->121, 122->122, 123->123, 124->124, 125->125, 126->126, 127->127. (Worker_TP0_EP0 pid=408) INFO 04-24 10:27:50 [mxfp4.py:481] Using 'MARLIN' Mxfp4 MoE backend. 
(Worker_TP0_EP0 pid=408) INFO 04-24 10:27:50 [deepseek_v4_attention.py:969] Using MXFP4 indexer cache for Lighening Indexer. (Worker_TP0_EP0 pid=408) INFO 04-24 10:27:54 [weight_utils.py:904] Filesystem type for checkpoints: 9P. Checkpoint size: 148.66 GiB. Available RAM: 0.65 GiB. (Worker_TP0_EP0 pid=408) INFO 04-24 10:27:54 [weight_utils.py:934] Auto-prefetch is disabled because the filesystem (9P) is not a recognized network FS (NFS/Lustre) and the checkpoint size (148.66 GiB) exceeds 90% of available RAM (0.65 GiB). Loading safetensors checkpoint shards: 4% Completed | 2/46 [00:33<13:38, 18.60s/it] Loading safetensors checkpoint shards: 7% Completed | 3/46 [00:52<13:19, 18.59s/it] Loading safetensors checkpoint shards: 9% Completed | 4/46 [01:12<13:25, 19.18s/it] Loading safetensors checkpoint shards: 11% Completed | 5/46 [01:28<12:22, 18.12s/it] Loading safetensors checkpoint shards: 13% Completed | 6/46 [01:49<12:50, 19.27s/it] Loading safetensors checkpoint shards: 15% Completed | 7/46 [02:06<11:59, 18.44s/it] Loading safetensors checkpoint shards: 17% Completed | 8/46 [02:27<12:11, 19.26s/it] Loading safetensors checkpoint shards: 20% Completed | 9/46 [02:45<11:32, 18.71s/it] Loading safetensors checkpoint shards: 22% Completed | 10/46 [03:06<11:37, 19.38s/it] Loading safetensors checkpoint shards: 24% Completed | 11/46 [03:18<10:06, 17.32s/it] Loading safetensors checkpoint shards: 26% Completed | 12/46 [03:39<10:21, 18.28s/it] Loading safetensors checkpoint shards: 28% Completed | 13/46 [03:59<10:25, 18.94s/it] Loading safetensors checkpoint shards: 30% Completed | 14/46 [04:20<10:26, 19.58s/it] Loading safetensors checkpoint shards: 33% Completed | 15/46 [04:40<10:11, 19.74s/it] Loading safetensors checkpoint shards: 35% Completed | 16/46 [05:00<09:54, 19.80s/it] Loading safetensors checkpoint shards: 37% Completed | 17/46 [05:22<09:50, 20.36s/it] Loading safetensors checkpoint shards: 39% Completed | 18/46 [05:44<09:44, 20.88s/it] Loading safetensors 
checkpoint shards: 41% Completed | 19/46 [06:06<09:33, 21.23s/it] Loading safetensors checkpoint shards: 43% Completed | 20/46 [06:28<09:14, 21.32s/it] Loading safetensors checkpoint shards: 46% Completed | 21/46 [06:50<09:00, 21.61s/it] Loading safetensors checkpoint shards: 48% Completed | 22/46 [07:11<08:37, 21.57s/it] Loading safetensors checkpoint shards: 50% Completed | 23/46 [07:32<08:06, 21.16s/it] Loading safetensors checkpoint shards: 52% Completed | 24/46 [07:50<07:26, 20.29s/it] Loading safetensors checkpoint shards: 54% Completed | 25/46 [08:07<06:47, 19.43s/it] Loading safetensors checkpoint shards: 57% Completed | 26/46 [08:25<06:17, 18.85s/it] Loading safetensors checkpoint shards: 59% Completed | 27/46 [08:41<05:42, 18.03s/it] Loading safetensors checkpoint shards: 61% Completed | 28/46 [08:57<05:13, 17.44s/it] Loading safetensors checkpoint shards: 63% Completed | 29/46 [09:12<04:46, 16.82s/it] Loading safetensors checkpoint shards: 65% Completed | 30/46 [09:29<04:28, 16.78s/it] Loading safetensors checkpoint shards: 67% Completed | 31/46 [09:44<04:06, 16.40s/it] Loading safetensors checkpoint shards: 70% Completed | 32/46 [10:00<03:47, 16.22s/it] Loading safetensors checkpoint shards: 72% Completed | 33/46 [10:17<03:31, 16.30s/it] Loading safetensors checkpoint shards: 74% Completed | 34/46 [11:22<06:13, 31.10s/it] Loading safetensors checkpoint shards: 76% Completed | 35/46 [12:43<08:25, 45.97s/it] Loading safetensors checkpoint shards: 78% Completed | 36/46 [13:53<08:50, 53.06s/it] Loading safetensors checkpoint shards: 80% Completed | 37/46 [15:06<08:51, 59.02s/it] Loading safetensors checkpoint shards: 83% Completed | 38/46 [15:57<07:35, 56.89s/it] Loading safetensors checkpoint shards: 85% Completed | 39/46 [16:28<05:43, 49.12s/it] Loading safetensors checkpoint shards: 87% Completed | 40/46 [16:50<04:05, 40.94s/it] Loading safetensors checkpoint shards: 89% Completed | 41/46 [17:10<02:53, 34.65s/it] Loading safetensors checkpoint shards: 
91% Completed | 42/46 [18:05<02:42, 40.72s/it] Loading safetensors checkpoint shards: 93% Completed | 43/46 [18:32<01:49, 36.66s/it] Loading safetensors checkpoint shards: 96% Completed | 44/46 [19:32<01:27, 43.69s/it] time="2026-04-24T18:49:16+08:00" level=error msg="error waiting for container: unexpected EOF" Loading safetensors checkpoint shards: 98% Completed | 45/46 [21:28<01:05, 65.38s/it] Loading safetensors checkpoint shards: 100% Completed | 46/46 [21:30<00:00, 46.37s/it] Loading safetensors checkpoint shards: 100% Completed | 46/46 [21:30<00:00, 28.07s/it] (Worker_TP0_EP0 pid=408) (Worker_TP0_EP0 pid=408) INFO 04-24 10:49:25 [default_loader.py:384] Loading weights took 1292.19 seconds (Worker_TP0_EP0 pid=408) INFO 04-24 10:49:27 [mxfp4.py:1238] Using MoEPrepareAndFinalizeNoDPEPModular (Worker_TP0_EP0 pid=408) INFO 04-24 10:50:00 [gpu_model_runner.py:4848] Model loading took 74.05 GiB memory and 1307.649939 seconds (Worker_TP0_EP0 pid=408) INFO 04-24 10:50:20 [backends.py:1070] Using cache directory: /root/.cache/vllm/torch_compile_cache/6df291e80d/rank_0_0/backbone for vLLM's torch.compile (Worker_TP0_EP0 pid=408) INFO 04-24 10:50:20 [backends.py:1130] Dynamo bytecode transform time: 19.33 s (Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] WorkerProc hit an exception. 
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] Traceback (most recent call last):
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 966, in worker_busy_loop
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] output = func(*args, **kwargs)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return func(*args, **kwargs)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 370, in determine_available_memory
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] self.model_runner.profile_run()
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5826, in profile_run
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] hidden_states, last_hidden_states = self._dummy_run(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return func(*args, **kwargs)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 5514, in _dummy_run
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] outputs = self.model(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return self.runnable(*args, **kwargs)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return self._call_impl(*args, **kwargs)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return forward_call(*args, **kwargs)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v4.py", line 833, in forward
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] hidden_states = self.model(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 611, in __call__
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] self.aot_compiled_fn = self.aot_compile(*args, **kwargs)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/wrapper.py", line 183, in aot_compile
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return self._compiled_callable.aot_compile((args, kwargs))
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 873, in aot_compile
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return aot_compile_fullgraph(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/aot_compile.py", line 368, in aot_compile_fullgraph
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] compiled_fn = backend(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/__init__.py", line 2535, in __call__
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return self.compiler_fn(model, inputs, **self.kwargs)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/lib/python3.12/contextlib.py", line 81, in inner
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return func(*args, **kwds)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/backends.py", line 1196, in __call__
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] PiecewiseCompileInterpreter(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return func(*args, **kwargs)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/backends.py", line 722, in run
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return super().run(*args)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/fx/interpreter.py", line 200, in run
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] self.env[node] = self.run_node(node)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/fx/interpreter.py", line 297, in run_node
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return getattr(self, n.op)(n.target, args, kwargs)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/backends.py", line 749, in call_module
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] piecewise_backend = PiecewiseBackend(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/piecewise_backend.py", line 189, in __init__
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] self.compile_all_ranges()
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/piecewise_backend.py", line 265, in compile_all_ranges
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] range_entry.runnable = self.vllm_backend.compiler_manager.compile(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return func(*args, **kwargs)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/backends.py", line 348, in compile
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] compiled_graph, handle = self.compiler.compile(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/compiler_interface.py", line 351, in compile
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] compiled_graph = standalone_compile(graph, example_inputs, **compile_kwargs)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/__init__.py", line 444, in standalone_compile
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return standalone_compile(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/standalone_compile.py", line 444, in standalone_compile
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] compiled_fn = compile_fx(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_fx.py", line 2527, in compile_fx
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return compile_fx(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_fx.py", line 2578, in compile_fx
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return _maybe_wrap_and_compile_fx_main(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_fx.py", line 2655, in _maybe_wrap_and_compile_fx_main
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return _compile_fx_main(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_fx.py", line 2864, in _compile_fx_main
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] raise e.remove_dynamo_frames() from None # see TORCHDYNAMO_VERBOSE=1
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_fx.py", line 1053, in _compile_fx_inner
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] raise InductorError(e, currentframe()).with_traceback(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_fx.py", line 1037, in _compile_fx_inner
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] mb_compiled_graph = fx_codegen_and_compile(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_fx.py", line 1798, in fx_codegen_and_compile
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return scheme.codegen_and_compile(gm, example_inputs, inputs_to_check, graph_kwargs)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_fx.py", line 1344, in codegen_and_compile
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] _recursive_post_grad_passes(gm, is_inference=is_inference)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/compile_fx.py", line 583, in _recursive_post_grad_passes
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] post_grad_passes(gm, is_inference)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/fx_passes/post_grad.py", line 358, in post_grad_passes
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] GraphTransformObserver(gm, "decompose_auto_functionalized").apply_graph_pass(
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/fx/passes/graph_transform_observer.py", line 103, in apply_graph_pass
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] return pass_fn(self.gm.graph)
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] ^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] File "/usr/local/lib/python3.12/dist-packages/torch/_inductor/fx_passes/post_grad.py", line 1392, in decompose_auto_functionalized
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] raise AssertionError("auto_functionalized was not removed")
(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] torch._inductor.exc.InductorError: AssertionError: auto_functionalized was not removed

Alternatives

No response

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

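The worker fails with torch._inductor.exc.InductorError: AssertionError: auto_functionalized was not removed. The traceback shows the assertion is raised by Inductor's decompose_auto_functionalized post-grad pass while vLLM compiles the DeepSeek V4 model (deepseek_v4.py) during the startup profile run, so this is a torch.compile failure at startup rather than an error while serving requests.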
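To find the exception summary in a log like the one above, the per-record worker prefix can be stripped first. The following is a minimal sketch, not part of vLLM: the helper name final_exceptions and both regular expressions are assumptions written against the prefix format seen in this particular log.

```python
import re

# Prefix that precedes every record in the pasted log, for example:
# "(Worker_TP0_EP0 pid=408) ERROR 04-24 10:50:25 [multiproc_executor.py:971] "
# The pattern is an assumption based on this log only.
PREFIX = re.compile(
    r"\(Worker_\w+ pid=\d+\) ERROR \d{2}-\d{2} \d{2}:\d{2}:\d{2} \[[\w.]+:\d+\]\s*"
)

# An exception summary looks like "package.module.SomeError: message".
EXCEPTION = re.compile(r"^[\w.]+(Error|Exception): .+")


def final_exceptions(log_text: str) -> list[str]:
    """Return the exception summary records found in a prefixed worker log."""
    # Splitting on the prefix also works when the whole log was flattened
    # onto a single line, because the prefix itself marks record boundaries.
    records = (part.strip() for part in PREFIX.split(log_text))
    return [record for record in records if EXCEPTION.match(record)]
```

Applied to the traceback in this issue, the only record that should match is the torch._inductor.exc.InductorError line, which is the one worth searching for in the PyTorch and vLLM issue trackers.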

Guidance

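  1. Check PyTorch and vLLM versions: TorchDynamo and Inductor ship inside PyTorch (torch._dynamo and torch._inductor) and have no separate version, so confirm that the installed torch build is the one your vLLM release expects. A mismatch can lead to compilation errors.
  2. Bypass compilation: Try starting vLLM with --enforce-eager, which runs the model in eager mode and is expected to skip the Inductor step that raises the assertion. Setting torch._dynamo.config.verify_correctness = False is unlikely to help, because that option does not turn compilation off.
  3. Update PyTorch and vLLM: If you are on an older build, try a newer vLLM release together with the PyTorch version it pins, as newer versions may have fixed this compilation issue.
  4. Check custom ops and hardware support: auto_functionalized nodes are how torch.compile wraps custom ops that mutate their inputs, and the failing graph comes from deepseek_v4.py. The issue title also asks for SM120 support (RTX 5090, RTX PRO 6000), so check whether the kernels this model uses are supported on that architecture in your build.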
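The version and hardware checks above can be gathered in one place before filing or updating the issue. This is a minimal sketch under stated assumptions: the helper names are hypothetical, the PyPI distribution names "torch" and "vllm" are assumed, and the "sm_XY" label is simply formatted from the compute capability that PyTorch reports.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def gpu_capabilities() -> list[str]:
    """Return one "sm_XY" label per visible CUDA device, or [] if none."""
    try:
        import torch  # imported lazily so the report works without PyTorch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    labels = []
    for index in range(torch.cuda.device_count()):
        major, minor = torch.cuda.get_device_capability(index)
        labels.append(f"sm_{major}{minor}")
    return labels


def environment_report() -> dict[str, object]:
    """Collect the versions and GPU architectures relevant to this issue."""
    return {
        "torch": installed_version("torch"),
        "vllm": installed_version("vllm"),
        "gpus": gpu_capabilities(),
    }


if __name__ == "__main__":
    print(environment_report())
```

On the GPUs named in the issue title, the "gpus" entry would be expected to list sm_120; pasting the whole report into the issue gives maintainers the version pairing they need.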

Example

The issue does not include the launch command or model configuration, so no exact reproduction can be given. A reasonable first experiment is to add --enforce-eager to the vllm serve command that failed and check whether startup gets past the profile run. If compilation is left enabled, the traceback itself points to TORCHDYNAMO_VERBOSE=1 for a fuller stack.
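A minimal sketch of an eager-mode launch, assuming the server is started with vllm serve: the helper names are hypothetical and the model identifier is a placeholder, since the issue does not state the original command. The code only builds the command line and environment; it does not start the server.

```python
import os
import shlex
from typing import Mapping, Optional


def eager_serve_command(model: str, tensor_parallel_size: int = 1) -> list[str]:
    """Build a `vllm serve` command line that requests eager execution."""
    return [
        "vllm", "serve", model,
        "--enforce-eager",  # run in eager mode instead of compiling
        "--tensor-parallel-size", str(tensor_parallel_size),
    ]


def debug_environment(base: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Copy an environment and enable the verbose flag named in the traceback."""
    env = dict(os.environ if base is None else base)
    # Only useful for a rerun WITHOUT --enforce-eager, when compilation runs.
    env["TORCHDYNAMO_VERBOSE"] = "1"
    return env


if __name__ == "__main__":
    # "<your-model-id>" is a placeholder, not a value taken from the issue.
    command = eager_serve_command("<your-model-id>", tensor_parallel_size=2)
    print(shlex.join(command))
    # To launch for real (requires vLLM to be installed):
    #   import subprocess
    #   subprocess.run(command, check=True)
```

If the server starts in eager mode, that points at the compilation path as the failing piece; if it fails differently, the new error is more likely to reflect missing SM120 support.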

Notes

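The assertion is raised inside PyTorch's Inductor post-grad passes, so the cause may lie in PyTorch or in how vLLM prepares the DeepSeek V4 graph for compilation; the log alone does not say which. Because the issue title asks for SM120 support, running in eager mode might only move the failure elsewhere if kernels for that architecture are missing. Without the exact launch command and version information, a more specific solution is difficult to provide.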

Recommendation

Apply a workaround by running in eager mode (--enforce-eager) or by updating to a newer vLLM and PyTorch pairing, as either may get past the compilation error. If the issue persists, rerun with TORCHDYNAMO_VERBOSE=1 and report the PyTorch version, vLLM version, and GPU model on the issue so SM120 support can be investigated.
