vllm - How to fix [Bug]: deepseek v4 failed to work on R6000 GPU

Error Message

(deepseekv4vllm) root@ubuntugpu-h2-96G-01:~/wayne/deepseekv4vllm/vllm# CUDA_VISIBLE_DEVICES=0 VLLM_ENGINE_READY_TIMEOUT_S=3600 vllm serve deepseek-ai/DeepSeek-V4-Flash --host 0.0.0.0 --port 8000 --trust-remote-code --tensor-parallel-size 1 --max-model-len 4096 --kv-cache-dtype fp8 --block-size 256 --gpu-memory-utilization 0.70 --cpu-offload-gb 128 --max-num-seqs 1 --max-num-batched-tokens 1024 --enforce-eager --tokenizer-mode deepseek_v4 --reasoning-parser deepseek_v4
(APIServer pid=396885) INFO 05-12 14:40:18 [utils.py:306]
(APIServer pid=396885) INFO 05-12 14:40:18 [utils.py:306]        █     █     █▄   ▄█
(APIServer pid=396885) INFO 05-12 14:40:18 [utils.py:306]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.20.2rc1.dev254+ge1c8776e9
(APIServer pid=396885) INFO 05-12 14:40:18 [utils.py:306]   █▄█▀ █     █     █     █  model   deepseek-ai/DeepSeek-V4-Flash
(APIServer pid=396885) INFO 05-12 14:40:18 [utils.py:306]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=396885) INFO 05-12 14:40:18 [utils.py:306]
(APIServer pid=396885) INFO 05-12 14:40:18 [utils.py:240] non-default args: {'model_tag': 'deepseek-ai/DeepSeek-V4-Flash', 'host': '0.0.0.0', 'model': 'deepseek-ai/DeepSeek-V4-Flash', 'tokenizer_mode': 'deepseek_v4', 'trust_remote_code': True, 'max_model_len': 4096, 'enforce_eager': True, 'reasoning_parser': 'deepseek_v4', 'block_size': 256, 'gpu_memory_utilization': 0.7, 'kv_cache_dtype': 'fp8', 'cpu_offload_gb': 128.0, 'max_num_batched_tokens': 1024, 'max_num_seqs': 1}
(APIServer pid=396885) INFO 05-12 14:40:18 [config.py:800] Detected quantization_config.scale_fmt=ue8m0; enabling UE8M0 for DeepGEMM.
(APIServer pid=396885) INFO 05-12 14:40:19 [model.py:568] Resolved architecture: DeepseekV4ForCausalLM
(APIServer pid=396885) INFO 05-12 14:40:19 [model.py:1697] Using max model len 4096
(APIServer pid=396885) INFO 05-12 14:40:20 [cache.py:261] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
(APIServer pid=396885) INFO 05-12 14:40:20 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=1024.
(APIServer pid=396885) INFO 05-12 14:40:20 [vllm.py:899] Asynchronous scheduling is enabled.
(APIServer pid=396885) WARNING 05-12 14:40:20 [vllm.py:955] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(APIServer pid=396885) WARNING 05-12 14:40:20 [vllm.py:973] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(APIServer pid=396885) INFO 05-12 14:40:20 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
(APIServer pid=396885) INFO 05-12 14:40:20 [vllm.py:1148] Cudagraph is disabled under eager mode
(APIServer pid=396885) WARNING 05-12 14:40:21 [vllm.py:1313] Auto-initialization of reasoning token IDs failed. Please check whether your reasoning parser has implementedthe reasoning_start_str and reasoning_end_str.
(APIServer pid=396885) INFO 05-12 14:40:21 [compilation.py:312] Enabled custom fusions: norm_quant, act_quant
(EngineCore pid=396927) INFO 05-12 14:40:26 [core.py:109] Initializing a V1 LLM engine (v0.20.2rc1.dev254+ge1c8776e9) with config: model='deepseek-ai/DeepSeek-V4-Flash', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-V4-Flash', skip_tokenizer_init=False, tokenizer_mode=deepseek_v4, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=deepseek_v4_fp8, quantization_config=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='deepseek_v4', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=deepseek-ai/DeepSeek-V4-Flash, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'all', '+quant_fp8'], 'ir_enable_torch_wrap': False, 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [1024], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_rope_kvcache_cat_mla': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native']), enable_flashinfer_autotune=False, moe_backend='auto')
(EngineCore pid=396927) INFO 05-12 14:40:26 [parallel_state.py:1410] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.74.124.32:44933 backend=nccl
(EngineCore pid=396927) INFO 05-12 14:40:26 [parallel_state.py:1723] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore pid=396927) INFO 05-12 14:40:27 [topk_topp_sampler.py:45] Using FlashInfer for top-p & top-k sampling.
(EngineCore pid=396927) INFO 05-12 14:40:27 [base.py:123] Offloader set to UVAOffloader
(EngineCore pid=396927) INFO 05-12 14:40:27 [gpu_model_runner.py:4863] Starting to load model deepseek-ai/DeepSeek-V4-Flash...
(EngineCore pid=396927) INFO 05-12 14:40:28 [deepseek_v4.py:168] DeepSeek V4 expert_dtype resolved to 'fp4'
(EngineCore pid=396927) INFO 05-12 14:40:28 [__init__.py:393] Selected CutlassFp8BlockScaledMMKernel for Fp8LinearMethod
(EngineCore pid=396927) INFO 05-12 14:40:28 [deepseek_v4_attention.py:710] Using DeepSeek's fp8_ds_mla KV cache format.
(EngineCore pid=396927) INFO 05-12 14:40:28 [mxfp4.py:551] Using 'MARLIN' Mxfp4 MoE backend.
(EngineCore pid=396927) INFO 05-12 14:40:34 [deepseek_v4_attention.py:1092] Using FP8 indexer cache for Lightning Indexer.
(EngineCore pid=396927) INFO 05-12 14:43:27 [uva.py:58] Total CPU offloaded parameters: 128.8
(EngineCore pid=396927) INFO 05-12 14:43:29 [weight_utils.py:938] Filesystem type for checkpoints: EXT4. Checkpoint size: 148.66 GiB. Available RAM: 114.56 GiB.
(EngineCore pid=396927) INFO 05-12 14:43:29 [weight_utils.py:968] Auto-prefetch is disabled because the filesystem (EXT4) is not a recognized network FS (NFS/Lustre) and the checkpoint size (148.66 GiB) exceeds 90% of available RAM (114.56 GiB).
Loading safetensors checkpoint shards: 0% Completed | 0/46 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 2% Completed | 1/46 [00:02<01:32, 2.06s/it]
Loading safetensors checkpoint shards: 4% Completed | 2/46 [00:09<03:55, 5.35s/it]
Loading safetensors checkpoint shards: 7% Completed | 3/46 [00:17<04:32, 6.34s/it]
Loading safetensors checkpoint shards: 9% Completed | 4/46 [00:24<04:47, 6.85s/it]
Loading safetensors checkpoint shards: 11% Completed | 5/46 [00:32<04:48, 7.03s/it]
Loading safetensors checkpoint shards: 13% Completed | 6/46 [00:39<04:44, 7.12s/it]
Loading safetensors checkpoint shards: 15% Completed | 7/46 [00:46<04:39, 7.16s/it]
Loading safetensors checkpoint shards: 17% Completed | 8/46 [00:49<03:43, 5.89s/it]
Loading safetensors checkpoint shards: 20% Completed | 9/46 [00:53<03:11, 5.17s/it]
Loading safetensors checkpoint shards: 22% Completed | 10/46 [00:56<02:45, 4.59s/it]
Loading safetensors checkpoint shards: 24% Completed | 11/46 [01:00<02:26, 4.19s/it]
Loading safetensors checkpoint shards: 26% Completed | 12/46 [01:03<02:14, 3.95s/it]
Loading safetensors checkpoint shards: 28% Completed | 13/46 [01:07<02:11, 3.99s/it]
Loading safetensors checkpoint shards: 30% Completed | 14/46 [01:11<02:06, 3.95s/it]
Loading safetensors checkpoint shards: 33% Completed | 15/46 [01:16<02:15, 4.37s/it]
Loading safetensors checkpoint shards: 35% Completed | 16/46 [01:22<02:19, 4.65s/it]
Loading safetensors checkpoint shards: 37% Completed | 17/46 [01:27<02:21, 4.87s/it]
Loading safetensors checkpoint shards: 39% Completed | 18/46 [01:32<02:14, 4.80s/it]
Loading safetensors checkpoint shards: 41% Completed | 19/46 [01:36<02:02, 4.55s/it]
Loading safetensors checkpoint shards: 43% Completed | 20/46 [01:40<02:00, 4.64s/it]
Loading safetensors checkpoint shards: 46% Completed | 21/46 [01:44<01:49, 4.38s/it]
Loading safetensors checkpoint shards: 48% Completed | 22/46 [01:48<01:41, 4.21s/it]
Loading safetensors checkpoint shards: 50% Completed | 23/46 [01:53<01:39, 4.33s/it]
Loading safetensors checkpoint shards: 52% Completed | 24/46 [01:57<01:35, 4.35s/it]
Loading safetensors checkpoint shards: 54% Completed | 25/46 [02:01<01:30, 4.32s/it]
Loading safetensors checkpoint shards: 57% Completed | 26/46 [02:05<01:25, 4.27s/it]
Loading safetensors checkpoint shards: 59% Completed | 27/46 [02:10<01:24, 4.44s/it]
Loading safetensors checkpoint shards: 61% Completed | 28/46 [02:15<01:22, 4.56s/it]
Loading safetensors checkpoint shards: 63% Completed | 29/46 [02:20<01:17, 4.57s/it]
Loading safetensors checkpoint shards: 65% Completed | 30/46 [02:24<01:10, 4.39s/it]
Loading safetensors checkpoint shards: 67% Completed | 31/46 [02:27<01:02, 4.15s/it]
Loading safetensors checkpoint shards: 70% Completed | 32/46 [02:32<01:00, 4.35s/it]
Loading safetensors checkpoint shards: 72% Completed | 33/46 [02:36<00:56, 4.38s/it]
Loading safetensors checkpoint shards: 74% Completed | 34/46 [02:40<00:50, 4.22s/it]
Loading safetensors checkpoint shards: 76% Completed | 35/46 [02:45<00:47, 4.36s/it]
Loading safetensors checkpoint shards: 78% Completed | 36/46 [02:55<01:01, 6.17s/it]
Loading safetensors checkpoint shards: 80% Completed | 37/46 [03:00<00:52, 5.79s/it]
Loading safetensors checkpoint shards: 83% Completed | 38/46 [03:06<00:45, 5.68s/it]
Loading safetensors checkpoint shards: 85% Completed | 39/46 [03:11<00:39, 5.66s/it]
Loading safetensors checkpoint shards: 87% Completed | 40/46 [03:17<00:33, 5.53s/it]
Loading safetensors checkpoint shards: 89% Completed | 41/46 [03:21<00:26, 5.30s/it]
Loading safetensors checkpoint shards: 91% Completed | 42/46 [03:27<00:21, 5.37s/it]
Loading safetensors checkpoint shards: 93% Completed | 43/46 [03:31<00:14, 4.88s/it]
Loading safetensors checkpoint shards: 96% Completed | 44/46 [03:34<00:08, 4.43s/it]
Loading safetensors checkpoint shards: 98% Completed | 45/46 [03:36<00:03, 3.73s/it]
Loading safetensors checkpoint shards: 100% Completed | 46/46 [03:37<00:00, 3.01s/it]
Loading safetensors checkpoint shards: 100% Completed | 46/46 [03:37<00:00, 4.74s/it]
(EngineCore pid=396927)
(EngineCore pid=396927) INFO 05-12 14:47:07 [default_loader.py:397] Loading weights took 217.96 seconds
(EngineCore pid=396927) INFO 05-12 14:47:11 [mxfp4.py:1497] Using MoEPrepareAndFinalizeNoDPEPModular
(EngineCore pid=396927) INFO 05-12 14:50:19 [gpu_model_runner.py:4965] Model loading took 17.03 GiB memory and 571.286208 seconds
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] EngineCore failed to start.
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] Traceback (most recent call last):
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 1114, in run_engine_core
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] return func(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 880, in __init__
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] super().__init__(
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 128, in __init__
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] return func(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 250, in _initialize_kv_caches
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/executor/abstract.py", line 147, in determine_available_memory
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] return self.collective_rpc("determine_available_memory")
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/executor/uniproc_executor.py", line 93, in collective_rpc
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] return func(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] return func(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/worker/gpu_worker.py", line 392, in determine_available_memory
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] self.model_runner.profile_run()
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/worker/gpu_model_runner.py", line 5954, in profile_run
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] hidden_states, last_hidden_states = self._dummy_run(
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] return func(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/worker/gpu_model_runner.py", line 5622, in _dummy_run
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] outputs = self.model(
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] return self._call_impl(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] return forward_call(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1669, in forward
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] hidden_states = self.model(
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/vllm/vllm/compilation/decorators.py", line 507, in __call__
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] return self.forward(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1405, in forward
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] hidden_states, residual, post_mix, res_mix = layer(
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] return self._call_impl(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] return forward_call(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1213, in forward
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] x, post_mix, res_mix = self.hc_pre(
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1179, in hc_pre
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] post_mix, res_mix, layer_input = torch.ops.vllm.mhc_pre(
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/_ops.py", line 1269, in __call__
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] return self._op(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/layers/mhc.py", line 310, in mhc_pre
(EngineCore pid=396927) ERROR
(EngineCore pid=396927) File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=396927) self._target(*self._args, **self._kwargs)
(EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 1144, in run_engine_core
(EngineCore pid=396927) raise e
(EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 1114, in run_engine_core
(EngineCore pid=396927) engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=396927) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=396927) return func(*args, **kwargs)
(EngineCore pid=396927) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 880, in __init__
(EngineCore pid=396927) super().__init__(
(EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 128, in __init__
(EngineCore pid=396927) kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=396927) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=396927) return func(*args, **kwargs)
(EngineCore pid=396927) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 250, in _initialize_kv_caches
(EngineCore pid=396927) available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=396927) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/executor/abstract.py", line 147, in determine_available_memory
(EngineCore pid=396927) return self.collective_rpc("determine_available_memory")
(EngineCore pid=396927) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/executor/uniproc_executor.py", line 93, in collective_rpc
(EngineCore pid=396927) result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=396927) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=396927) return func(*args, **kwarg
(EngineCore pid=396927) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/worker/gpu_model_runner.py", line 5622, in _dummy_run
(EngineCore pid=396927) outputs = self.model(
(EngineCore pid=396927) ^^^^^^^^^^^
(EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(EngineCore pid=396927) return self._call_impl(*args, **kwargs)
(EngineCore pid=396927) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(EngineCore pid=396927) return forward_call(*args, **kwargs)
(EngineCore pid=396927) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1669, in forward
(EngineCore pid=396927) hidden_states = self.model(
(EngineCore pid=396927) ^^^^^^^^^^^
(EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/vllm/vllm/compilation/decorators.py", line 507, in __call__
(EngineCore pid=396927) return self.forward(*args, **kwargs)
(EngineCore pid=396927) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1405, in forward
(EngineCore pid=396927) hidden_states, residual, post_mix, res_mix = layer(
(EngineCore pid=396927) ^^^^^^
(EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(EngineCore pid=396927) return self._call_impl(*args, **kwargs)
(EngineCore pid=396927) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(EngineCore pid=396927) return forward_call(*args, **kwargs)
(EngineCore pid=396927) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1213, in forward
(EngineCore pid=396927) x, post_mix, res_mix = self.hc_pre(
(EngineCore pid=396927) ^^^^^^^^^^^^
(EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1179, in hc_pre
(EngineCore pid=396927) post_mix, res_mix, layer_input = torch.ops.vllm.mhc_pre(
(EngineCore pid=396927) ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/_ops.py", line 1269, in __call__
(EngineCore pid=396927) return self._op(*args, **kwargs)
(EngineCore pid=396927) ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/layers/mhc.py", line 310, in mhc_pre
(EngineCore pid=396927) tf32_hc_prenorm_gemm(
(EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/vllm/vllm/utils/deep_gemm.py", line 477, in tf32_hc_prenorm_gemm
(EngineCore pid=396927) return _tf32_hc_prenorm_gemm_impl(
(EngineCore pid=396927) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) RuntimeError: Assertion error (csrc/apis/hyperconnection.hpp:56): Unsupported architecture
[rank0]:[W512 14:50:24.024626144 ProcessGroupNCCL.cpp:1575] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=396885) Traceback (most recent call last):
(APIServer pid=396885) File "/root/wayne/deepseekv4vllm/bin/vllm", line 8, in <module>
(APIServer pid=396885) sys.exit(main())
(APIServer pid=396885) ^^^^^^
(APIServer pid=396885) File "/root/wayne/deepseekv4vllm/vllm/vllm/entrypoints/cli/main.py", line 92, in main
(APIServer pid=396885) args.dispatch_function(args)
(APIServer pid=396885) File "/root/wayne/deepseekv4vllm/vllm/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=396885) uvloop.run(run_server(args))
(APIServer pid=396885) File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=396885) return __asyncio.run(
(APIServer pid=396885) ^^^^^^^^^^^^^^
(APIServer pid=396885) File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
(APIServer pid=396885) return runner.run(main)
(APIServer pid=396885) ^^^^^^^^^^^^^^^^
(APIServer pid=396885) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=396885) return self._loop.run_until_complete(task)
(APIServer pid=396885) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=396885) File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=396885) return await main
(APIServer pid=396885) ^^^^^^^^^^
(APIServer pid=396885) File "/root/wayne/deepseekv4vllm/vllm/vllm/entrypoints/openai/api_server.py", line 679, in run_server
(APIServer pid=396885) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=396885) File "/root/wayne/deepseekv4vllm/vllm/vllm/entrypoints/openai/api_server.py", line 693, in run_server_worker
(APIServer pid=396885) async with build_async_engine_client(
(APIServer pid=396885) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=396885) return await anext(self.gen)
(APIServer pid=396885) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885) File "/root/wayne/deepseekv4vllm/vllm/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=396885) async with build_async_engine_client_from_engine_args(
(APIServer pid=396885) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=396885) return await anext(self.gen)
(APIServer pid=396885) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885) File "/root/wayne/deepseekv4vllm/vllm/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=396885) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=396885) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885) File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/async_llm.py", line 217, in from_vllm_config
(APIServer pid=396885) return cls(
(APIServer pid=396885) ^^^^
(APIServer pid=396885) File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/async_llm.py", line 146, in __init__
(APIServer pid=396885) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=396885) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885) File "/root/wayne/deepseekv4vllm/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=396885) return func(*args, **kwargs)
(APIServer pid=396885) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885) File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
(APIServer pid=396885) return AsyncMPClient(*client_args)
(APIServer pid=396885) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885) File "/root/wayne/deepseekv4vllm/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=396885) return func(*args, **kwargs)
(APIServer pid=396885) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885) File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core_client.py", line 900, in __init__
(APIServer pid=396885) super().__init__(
(APIServer pid=396885) File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core_client.py", line 535, in __init__
(APIServer pid=396885) with launch_core_engines(
(APIServer pid=396885) File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=396885) next(self.gen)
(APIServer pid=396885) File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/utils.py", line 1133, in launch_core_engines
(APIServer pid=396885) wait_for_engine_startup(
(APIServer pid=396885) File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/utils.py", line 1192, in wait_for_engine_startup
(APIServer pid=396885) raise RuntimeError(
(APIServer pid=396885) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
(deepseekv4vllm) root@ubuntugpu-h2-96G-01:~/wayne/deepseekv4vllm/vllm#
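
The traceback above ends in the assertion "Unsupported architecture", raised at csrc/apis/hyperconnection.hpp:56 and reached through tf32_hc_prenorm_gemm in vllm/utils/deep_gemm.py. A reasonable first check is which compute capability the GPU actually reports. The sketch below is only an illustration and is not part of vLLM: ASSUMED_SUPPORTED, describe_capability, and local_capability are hypothetical names, and the set of supported architectures is a placeholder assumption (the log does not state it) that should be confirmed against the installed DeepGEMM build. The only real library calls used are torch.cuda.is_available and torch.cuda.get_device_capability.

```python
from typing import Optional, Tuple

# ASSUMPTION: illustrative placeholder only. The log does not say which
# architectures the failing kernel accepts; confirm this set against the
# DeepGEMM version that is actually installed.
ASSUMED_SUPPORTED = {(9, 0), (10, 0)}


def describe_capability(cap: Tuple[int, int]) -> str:
    """Return a one-line verdict for a (major, minor) compute capability."""
    major, minor = cap
    verdict = "in" if cap in ASSUMED_SUPPORTED else "NOT in"
    return f"sm_{major}{minor} is {verdict} the assumed supported set"


def local_capability(device: int = 0) -> Optional[Tuple[int, int]]:
    """Query the local GPU through PyTorch; return None if that is impossible."""
    try:
        import torch  # imported lazily so the helper also loads without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    major, minor = torch.cuda.get_device_capability(device)
    return (major, minor)


if __name__ == "__main__":
    cap = local_capability()
    print("no CUDA device visible" if cap is None else describe_capability(cap))
```

Run it in the same virtual environment and with the same CUDA_VISIBLE_DEVICES value as the failing vllm serve command, so that device 0 refers to the same GPU.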

Root Cause

Here is the full error log:

(deepseekv4vllm) root@ubuntugpu-h2-96G-01:~/wayne/deepseekv4vllm/vllm# CUDA_VISIBLE_DEVICES=0 VLLM_ENGINE_READY_TIMEOUT_S=3600 vllm serve deepseek-ai/DeepSeek-V4-Flash   --host 0.0.0.0   --port 8000   --trust-remote-code   --tensor-parallel-size 1   --max-model-len 4096   --kv-cache-dtype fp8   --block-size 256   --gpu-memory-utilization 0.70   --cpu-offload-gb 128     --max-num-seqs 1   --max-num-batched-tokens 1024   --enforce-eager   --tokenizer-mode deepseek_v4   --reasoning-parser deepseek_v4
(APIServer pid=396885) INFO 05-12 14:40:18 [utils.py:306]
(APIServer pid=396885) INFO 05-12 14:40:18 [utils.py:306]        █     █     █▄   ▄█
(APIServer pid=396885) INFO 05-12 14:40:18 [utils.py:306]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.20.2rc1.dev254+ge1c8776e9
(APIServer pid=396885) INFO 05-12 14:40:18 [utils.py:306]   █▄█▀ █     █     █     █  model   deepseek-ai/DeepSeek-V4-Flash
(APIServer pid=396885) INFO 05-12 14:40:18 [utils.py:306]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=396885) INFO 05-12 14:40:18 [utils.py:306]
(APIServer pid=396885) INFO 05-12 14:40:18 [utils.py:240] non-default args: {'model_tag': 'deepseek-ai/DeepSeek-V4-Flash', 'host': '0.0.0.0', 'model': 'deepseek-ai/DeepSeek-V4-Flash', 'tokenizer_mode': 'deepseek_v4', 'trust_remote_code': True, 'max_model_len': 4096, 'enforce_eager': True, 'reasoning_parser': 'deepseek_v4', 'block_size': 256, 'gpu_memory_utilization': 0.7, 'kv_cache_dtype': 'fp8', 'cpu_offload_gb': 128.0, 'max_num_batched_tokens': 1024, 'max_num_seqs': 1}
(APIServer pid=396885) INFO 05-12 14:40:18 [config.py:800] Detected quantization_config.scale_fmt=ue8m0; enabling UE8M0 for DeepGEMM.
(APIServer pid=396885) INFO 05-12 14:40:19 [model.py:568] Resolved architecture: DeepseekV4ForCausalLM
(APIServer pid=396885) INFO 05-12 14:40:19 [model.py:1697] Using max model len 4096
(APIServer pid=396885) INFO 05-12 14:40:20 [cache.py:261] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
(APIServer pid=396885) INFO 05-12 14:40:20 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=1024.
(APIServer pid=396885) INFO 05-12 14:40:20 [vllm.py:899] Asynchronous scheduling is enabled.
(APIServer pid=396885) WARNING 05-12 14:40:20 [vllm.py:955] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(APIServer pid=396885) WARNING 05-12 14:40:20 [vllm.py:973] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(APIServer pid=396885) INFO 05-12 14:40:20 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
(APIServer pid=396885) INFO 05-12 14:40:20 [vllm.py:1148] Cudagraph is disabled under eager mode
(APIServer pid=396885) WARNING 05-12 14:40:21 [vllm.py:1313] Auto-initialization of reasoning token IDs failed. Please check whether your reasoning parser has implementedthe `reasoning_start_str` and `reasoning_end_str`.
(APIServer pid=396885) INFO 05-12 14:40:21 [compilation.py:312] Enabled custom fusions: norm_quant, act_quant
(EngineCore pid=396927) INFO 05-12 14:40:26 [core.py:109] Initializing a V1 LLM engine (v0.20.2rc1.dev254+ge1c8776e9) with config: model='deepseek-ai/DeepSeek-V4-Flash', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-V4-Flash', skip_tokenizer_init=False, tokenizer_mode=deepseek_v4, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=deepseek_v4_fp8, quantization_config=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='deepseek_v4', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=deepseek-ai/DeepSeek-V4-Flash, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'all', '+quant_fp8'], 'ir_enable_torch_wrap': False, 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [1024], 'inductor_compile_config': 
{'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_rope_kvcache_cat_mla': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native']), enable_flashinfer_autotune=False, moe_backend='auto')
(EngineCore pid=396927) INFO 05-12 14:40:26 [parallel_state.py:1410] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.74.124.32:44933 backend=nccl
(EngineCore pid=396927) INFO 05-12 14:40:26 [parallel_state.py:1723] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore pid=396927) INFO 05-12 14:40:27 [topk_topp_sampler.py:45] Using FlashInfer for top-p & top-k sampling.
(EngineCore pid=396927) INFO 05-12 14:40:27 [base.py:123] Offloader set to UVAOffloader
(EngineCore pid=396927) INFO 05-12 14:40:27 [gpu_model_runner.py:4863] Starting to load model deepseek-ai/DeepSeek-V4-Flash...
(EngineCore pid=396927) INFO 05-12 14:40:28 [deepseek_v4.py:168] DeepSeek V4 expert_dtype resolved to 'fp4'
(EngineCore pid=396927) INFO 05-12 14:40:28 [__init__.py:393] Selected CutlassFp8BlockScaledMMKernel for Fp8LinearMethod
(EngineCore pid=396927) INFO 05-12 14:40:28 [deepseek_v4_attention.py:710] Using DeepSeek's fp8_ds_mla KV cache format.
(EngineCore pid=396927) INFO 05-12 14:40:28 [mxfp4.py:551] Using 'MARLIN' Mxfp4 MoE backend.
(EngineCore pid=396927) INFO 05-12 14:40:34 [deepseek_v4_attention.py:1092] Using FP8 indexer cache for Lightning Indexer.
(EngineCore pid=396927) INFO 05-12 14:43:27 [uva.py:58] Total CPU offloaded parameters: 128.8
(EngineCore pid=396927) INFO 05-12 14:43:29 [weight_utils.py:938] Filesystem type for checkpoints: EXT4. Checkpoint size: 148.66 GiB. Available RAM: 114.56 GiB.
(EngineCore pid=396927) INFO 05-12 14:43:29 [weight_utils.py:968] Auto-prefetch is disabled because the filesystem (EXT4) is not a recognized network FS (NFS/Lustre) and the checkpoint size (148.66 GiB) exceeds 90% of available RAM (114.56 GiB).
Loading safetensors checkpoint shards:   0% Completed | 0/46 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   2% Completed | 1/46 [00:02<01:32,  2.06s/it]
Loading safetensors checkpoint shards:   4% Completed | 2/46 [00:09<03:55,  5.35s/it]
Loading safetensors checkpoint shards:   7% Completed | 3/46 [00:17<04:32,  6.34s/it]
Loading safetensors checkpoint shards:   9% Completed | 4/46 [00:24<04:47,  6.85s/it]
Loading safetensors checkpoint shards:  11% Completed | 5/46 [00:32<04:48,  7.03s/it]
Loading safetensors checkpoint shards:  13% Completed | 6/46 [00:39<04:44,  7.12s/it]
Loading safetensors checkpoint shards:  15% Completed | 7/46 [00:46<04:39,  7.16s/it]
Loading safetensors checkpoint shards:  17% Completed | 8/46 [00:49<03:43,  5.89s/it]
Loading safetensors checkpoint shards:  20% Completed | 9/46 [00:53<03:11,  5.17s/it]
Loading safetensors checkpoint shards:  22% Completed | 10/46 [00:56<02:45,  4.59s/it]
Loading safetensors checkpoint shards:  24% Completed | 11/46 [01:00<02:26,  4.19s/it]
Loading safetensors checkpoint shards:  26% Completed | 12/46 [01:03<02:14,  3.95s/it]
Loading safetensors checkpoint shards:  28% Completed | 13/46 [01:07<02:11,  3.99s/it]
Loading safetensors checkpoint shards:  30% Completed | 14/46 [01:11<02:06,  3.95s/it]
Loading safetensors checkpoint shards:  33% Completed | 15/46 [01:16<02:15,  4.37s/it]
Loading safetensors checkpoint shards:  35% Completed | 16/46 [01:22<02:19,  4.65s/it]
Loading safetensors checkpoint shards:  37% Completed | 17/46 [01:27<02:21,  4.87s/it]
Loading safetensors checkpoint shards:  39% Completed | 18/46 [01:32<02:14,  4.80s/it]
Loading safetensors checkpoint shards:  41% Completed | 19/46 [01:36<02:02,  4.55s/it]
Loading safetensors checkpoint shards:  43% Completed | 20/46 [01:40<02:00,  4.64s/it]
Loading safetensors checkpoint shards:  46% Completed | 21/46 [01:44<01:49,  4.38s/it]
Loading safetensors checkpoint shards:  48% Completed | 22/46 [01:48<01:41,  4.21s/it]
Loading safetensors checkpoint shards:  50% Completed | 23/46 [01:53<01:39,  4.33s/it]
Loading safetensors checkpoint shards:  52% Completed | 24/46 [01:57<01:35,  4.35s/it]
Loading safetensors checkpoint shards:  54% Completed | 25/46 [02:01<01:30,  4.32s/it]
Loading safetensors checkpoint shards:  57% Completed | 26/46 [02:05<01:25,  4.27s/it]
Loading safetensors checkpoint shards:  59% Completed | 27/46 [02:10<01:24,  4.44s/it]
Loading safetensors checkpoint shards:  61% Completed | 28/46 [02:15<01:22,  4.56s/it]
Loading safetensors checkpoint shards:  63% Completed | 29/46 [02:20<01:17,  4.57s/it]
Loading safetensors checkpoint shards:  65% Completed | 30/46 [02:24<01:10,  4.39s/it]
Loading safetensors checkpoint shards:  67% Completed | 31/46 [02:27<01:02,  4.15s/it]
Loading safetensors checkpoint shards:  70% Completed | 32/46 [02:32<01:00,  4.35s/it]
Loading safetensors checkpoint shards:  72% Completed | 33/46 [02:36<00:56,  4.38s/it]
Loading safetensors checkpoint shards:  74% Completed | 34/46 [02:40<00:50,  4.22s/it]
Loading safetensors checkpoint shards:  76% Completed | 35/46 [02:45<00:47,  4.36s/it]
Loading safetensors checkpoint shards:  78% Completed | 36/46 [02:55<01:01,  6.17s/it]
Loading safetensors checkpoint shards:  80% Completed | 37/46 [03:00<00:52,  5.79s/it]
Loading safetensors checkpoint shards:  83% Completed | 38/46 [03:06<00:45,  5.68s/it]
Loading safetensors checkpoint shards:  85% Completed | 39/46 [03:11<00:39,  5.66s/it]
Loading safetensors checkpoint shards:  87% Completed | 40/46 [03:17<00:33,  5.53s/it]
Loading safetensors checkpoint shards:  89% Completed | 41/46 [03:21<00:26,  5.30s/it]
Loading safetensors checkpoint shards:  91% Completed | 42/46 [03:27<00:21,  5.37s/it]
Loading safetensors checkpoint shards:  93% Completed | 43/46 [03:31<00:14,  4.88s/it]
Loading safetensors checkpoint shards:  96% Completed | 44/46 [03:34<00:08,  4.43s/it]
Loading safetensors checkpoint shards:  98% Completed | 45/46 [03:36<00:03,  3.73s/it]
Loading safetensors checkpoint shards: 100% Completed | 46/46 [03:37<00:00,  3.01s/it]
Loading safetensors checkpoint shards: 100% Completed | 46/46 [03:37<00:00,  4.74s/it]
(EngineCore pid=396927)
(EngineCore pid=396927) INFO 05-12 14:47:07 [default_loader.py:397] Loading weights took 217.96 seconds
(EngineCore pid=396927) INFO 05-12 14:47:11 [mxfp4.py:1497] Using MoEPrepareAndFinalizeNoDPEPModular
(EngineCore pid=396927) INFO 05-12 14:50:19 [gpu_model_runner.py:4965] Model loading took 17.03 GiB memory and 571.286208 seconds
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] EngineCore failed to start.
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] Traceback (most recent call last):
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 1114, in run_engine_core
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return func(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 880, in __init__
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     super().__init__(
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 128, in __init__
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return func(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 250, in _initialize_kv_caches
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/executor/abstract.py", line 147, in determine_available_memory
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return self.collective_rpc("determine_available_memory")
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/executor/uniproc_executor.py", line 93, in collective_rpc
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return func(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return func(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/worker/gpu_worker.py", line 392, in determine_available_memory
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     self.model_runner.profile_run()
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/worker/gpu_model_runner.py", line 5954, in profile_run
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     hidden_states, last_hidden_states = self._dummy_run(
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]                                         ^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return func(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/worker/gpu_model_runner.py", line 5622, in _dummy_run
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     outputs = self.model(
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]               ^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return self._call_impl(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return forward_call(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1669, in forward
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     hidden_states = self.model(
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]                     ^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/compilation/decorators.py", line 507, in __call__
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return self.forward(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1405, in forward
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     hidden_states, residual, post_mix, res_mix = layer(
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]                                                  ^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return self._call_impl(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return forward_call(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1213, in forward
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     x, post_mix, res_mix = self.hc_pre(
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]                            ^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1179, in hc_pre
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     post_mix, res_mix, layer_input = torch.ops.vllm.mhc_pre(
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]                                      ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/_ops.py", line 1269, in __call__
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return self._op(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/layers/mhc.py", line 310, in mhc_pre
(EngineCore pid=396927) ERROR
(EngineCore pid=396927)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=396927)     self._target(*self._args, **self._kwargs)
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 1144, in run_engine_core
(EngineCore pid=396927)     raise e
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 1114, in run_engine_core
(EngineCore pid=396927)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=396927)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=396927)     return func(*args, **kwargs)
(EngineCore pid=396927)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 880, in __init__
(EngineCore pid=396927)     super().__init__(
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 128, in __init__
(EngineCore pid=396927)     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=396927)                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=396927)     return func(*args, **kwargs)
(EngineCore pid=396927)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 250, in _initialize_kv_caches
(EngineCore pid=396927)     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=396927)                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/executor/abstract.py", line 147, in determine_available_memory
(EngineCore pid=396927)     return self.collective_rpc("determine_available_memory")
(EngineCore pid=396927)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/executor/uniproc_executor.py", line 93, in collective_rpc
(EngineCore pid=396927)     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=396927)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=396927)     return func(*args, **kwarg
(EngineCore pid=396927)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/worker/gpu_model_runner.py", line 5622, in _dummy_run
(EngineCore pid=396927)     outputs = self.model(
(EngineCore pid=396927)               ^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(EngineCore pid=396927)     return self._call_impl(*args, **kwargs)
(EngineCore pid=396927)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(EngineCore pid=396927)     return forward_call(*args, **kwargs)
(EngineCore pid=396927)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1669, in forward
(EngineCore pid=396927)     hidden_states = self.model(
(EngineCore pid=396927)                     ^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/compilation/decorators.py", line 507, in __call__
(EngineCore pid=396927)     return self.forward(*args, **kwargs)
(EngineCore pid=396927)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1405, in forward
(EngineCore pid=396927)     hidden_states, residual, post_mix, res_mix = layer(
(EngineCore pid=396927)                                                  ^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(EngineCore pid=396927)     return self._call_impl(*args, **kwargs)
(EngineCore pid=396927)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(EngineCore pid=396927)     return forward_call(*args, **kwargs)
(EngineCore pid=396927)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1213, in forward
(EngineCore pid=396927)     x, post_mix, res_mix = self.hc_pre(
(EngineCore pid=396927)                            ^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1179, in hc_pre
(EngineCore pid=396927)     post_mix, res_mix, layer_input = torch.ops.vllm.mhc_pre(
(EngineCore pid=396927)                                      ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/_ops.py", line 1269, in __call__
(EngineCore pid=396927)     return self._op(*args, **kwargs)
(EngineCore pid=396927)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/layers/mhc.py", line 310, in mhc_pre
(EngineCore pid=396927)     tf32_hc_prenorm_gemm(
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/utils/deep_gemm.py", line 477, in tf32_hc_prenorm_gemm
(EngineCore pid=396927)     return _tf32_hc_prenorm_gemm_impl(
(EngineCore pid=396927)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) RuntimeError: Assertion error (csrc/apis/hyperconnection.hpp:56): Unsupported architecture
[rank0]:[W512 14:50:24.024626144 ProcessGroupNCCL.cpp:1575] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=396885) Traceback (most recent call last):
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/bin/vllm", line 8, in <module>
(APIServer pid=396885)     sys.exit(main())
(APIServer pid=396885)              ^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/entrypoints/cli/main.py", line 92, in main
(APIServer pid=396885)     args.dispatch_function(args)
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=396885)     uvloop.run(run_server(args))
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=396885)     return __asyncio.run(
(APIServer pid=396885)            ^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
(APIServer pid=396885)     return runner.run(main)
(APIServer pid=396885)            ^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=396885)     return self._loop.run_until_complete(task)
(APIServer pid=396885)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=396885)     return await main
(APIServer pid=396885)            ^^^^^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/entrypoints/openai/api_server.py", line 679, in run_server
(APIServer pid=396885)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/entrypoints/openai/api_server.py", line 693, in run_server_worker
(APIServer pid=396885)     async with build_async_engine_client(
(APIServer pid=396885)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=396885)     return await anext(self.gen)
(APIServer pid=396885)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=396885)     async with build_async_engine_client_from_engine_args(
(APIServer pid=396885)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=396885)     return await anext(self.gen)
(APIServer pid=396885)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=396885)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=396885)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/async_llm.py", line 217, in from_vllm_config
(APIServer pid=396885)     return cls(
(APIServer pid=396885)            ^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/async_llm.py", line 146, in __init__
(APIServer pid=396885)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=396885)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=396885)     return func(*args, **kwargs)
(APIServer pid=396885)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
(APIServer pid=396885)     return AsyncMPClient(*client_args)
(APIServer pid=396885)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=396885)     return func(*args, **kwargs)
(APIServer pid=396885)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core_client.py", line 900, in __init__
(APIServer pid=396885)     super().__init__(
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core_client.py", line 535, in __init__
(APIServer pid=396885)     with launch_core_engines(
(APIServer pid=396885)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=396885)     next(self.gen)
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/utils.py", line 1133, in launch_core_engines
(APIServer pid=396885)     wait_for_engine_startup(
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/utils.py", line 1192, in wait_for_engine_startup
(APIServer pid=396885)     raise RuntimeError(
(APIServer pid=396885) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
(deepseekv4vllm) root@ubuntugpu-h2-96G-01:~/wayne/deepseekv4vllm/vllm#
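
The closing `RuntimeError` from the APIServer process only points back to the EngineCore failure ("See root cause above"). In a log this long, one way to surface the underlying exception is to keep only the exception lines emitted by the EngineCore process. The sketch below is a hypothetical helper (`find_root_causes` is not part of vLLM) and assumes the `(Proc pid=N)` line prefix seen in this log.

```python
import re

# Matches vLLM log prefixes such as "(EngineCore pid=396927) ", optionally
# followed by "ERROR <date> <time> [file.py:123] ". The prefix format is an
# assumption taken from the log in this issue.
_PREFIX = re.compile(
    r"^\((?P<proc>\w+) pid=\d+\)\s*"
    r"(?:(?:INFO|WARNING|ERROR)\s+\S+\s+\S+\s+\[[^\]]+\]\s?)?"
)
# Matches a Python exception summary line such as "RuntimeError: message".
_EXC = re.compile(r"^(?P<type>[A-Za-z_][\w.]*(?:Error|Exception)): (?P<msg>.+)$")


def find_root_causes(log_text, proc="EngineCore"):
    """Return unique (exception type, message) pairs logged by `proc`, in order."""
    seen = []
    for line in log_text.splitlines():
        prefix = _PREFIX.match(line)
        if not prefix or prefix.group("proc") != proc:
            continue
        body = line[prefix.end():].strip()
        exc = _EXC.match(body)
        if exc:
            pair = (exc.group("type"), exc.group("msg"))
            if pair not in seen:
                seen.append(pair)
    return seen


if __name__ == "__main__":
    import sys

    # Usage: python find_root_causes.py vllm.log
    if len(sys.argv) > 1:
        with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
            for exc_type, message in find_root_causes(fh.read()):
                print(f"{exc_type}: {message}")
```

Run against this log, the EngineCore filter should leave the DeepGEMM assertion (`Unsupported architecture`) rather than the generic "Engine core initialization failed" message from the APIServer.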

Fix Action

Fix / Workaround
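
The EngineCore traceback in the log ends in DeepGEMM's `tf32_hc_prenorm_gemm` with `Assertion error (csrc/apis/hyperconnection.hpp:56): Unsupported architecture`, which suggests that this kernel has no build for the GPU's compute capability. A first diagnostic step is to confirm what the card reports. The sketch below is not a fix. It assumes DeepGEMM targets only compute-capability majors 9 and 10, so verify that set against the DeepGEMM source you have installed. The names `ASSUMED_DEEPGEMM_MAJORS` and `deepgemm_arch_supported` are illustrative only.

```python
# Assumption: the installed DeepGEMM only ships kernels for these CUDA
# compute-capability majors (9 = Hopper, 10 = datacenter Blackwell).
# Verify against the DeepGEMM source before relying on this set.
ASSUMED_DEEPGEMM_MAJORS = frozenset({9, 10})


def deepgemm_arch_supported(capability, supported_majors=ASSUMED_DEEPGEMM_MAJORS):
    """Return True if a (major, minor) capability is in the assumed set."""
    major, _minor = capability
    return major in supported_majors


def local_capability():
    """Return (major, minor) for CUDA device 0, or None if unavailable."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    cap = local_capability()
    if cap is None:
        print("No CUDA device visible to PyTorch")
    else:
        ok = deepgemm_arch_supported(cap)
        status = "in" if ok else "NOT in"
        print(f"compute capability {cap[0]}.{cap[1]} is {status} the assumed supported set")
```

If the "R6000" in the title is an RTX PRO 6000 Blackwell card, it would be expected to report major version 12, which falls outside the assumed set. That would be consistent with the assertion, but it should be confirmed on the affected machine.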

BIOS Model name: INTEL(R) XEON(R) GOLD 6548Y+ CPU @ 2.3GHz
BIOS CPU family: 2
CPU family: 6
Model: 207
Thread(s) per core: 1
Core(s) per socket: 16
Socket(s): 1
Stepping: 2
BogoMIPS: 5000.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscplm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves user_shstk avx_vnni avx512_bf16 wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid cldemote movdiri movdir64b fsrm md_clear serialize amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 768 KiB (16 instances)
L1i cache: 512 KiB (16 instances)
L2 cache: 32 MiB (16 instances)
L3 cache: 60 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-15
Vulnerability Gather data sampling: Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdow
ROCM Version : Could not collect
vLLM Version : 0.20.2rc1.dev254+ge1c8776e9 (git sha: e1c8776e9)
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; XPU: Disabled
GPU Topology:
        GPU0    CPU Affinity    NUMA Affinity    GPU NUMA ID
GPU0     X      0-15            0                N/A

Here is the error log:

(deepseekv4vllm) root@ubuntugpu-h2-96G-01:~/wayne/deepseekv4vllm/vllm# CUDA_VISIBLE_DEVICES=0 VLLM_ENGINE_READY_TIMEOUT_S=3600 vllm serve deepseek-ai/DeepSeek-V4-Flash   --host 0.0.0.0   --port 8000   --trust-remote-code   --tensor-parallel-size 1   --max-model-len 4096   --kv-cache-dtype fp8   --block-size 256   --gpu-memory-utilization 0.70   --cpu-offload-gb 128     --max-num-seqs 1   --max-num-batched-tokens 1024   --enforce-eager   --tokenizer-mode deepseek_v4   --reasoning-parser deepseek_v4
(APIServer pid=396885) INFO 05-12 14:40:18 [utils.py:306]
(APIServer pid=396885) INFO 05-12 14:40:18 [utils.py:306]        █     █     █▄   ▄█
(APIServer pid=396885) INFO 05-12 14:40:18 [utils.py:306]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.20.2rc1.dev254+ge1c8776e9
(APIServer pid=396885) INFO 05-12 14:40:18 [utils.py:306]   █▄█▀ █     █     █     █  model   deepseek-ai/DeepSeek-V4-Flash
(APIServer pid=396885) INFO 05-12 14:40:18 [utils.py:306]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=396885) INFO 05-12 14:40:18 [utils.py:306]
(APIServer pid=396885) INFO 05-12 14:40:18 [utils.py:240] non-default args: {'model_tag': 'deepseek-ai/DeepSeek-V4-Flash', 'host': '0.0.0.0', 'model': 'deepseek-ai/DeepSeek-V4-Flash', 'tokenizer_mode': 'deepseek_v4', 'trust_remote_code': True, 'max_model_len': 4096, 'enforce_eager': True, 'reasoning_parser': 'deepseek_v4', 'block_size': 256, 'gpu_memory_utilization': 0.7, 'kv_cache_dtype': 'fp8', 'cpu_offload_gb': 128.0, 'max_num_batched_tokens': 1024, 'max_num_seqs': 1}
(APIServer pid=396885) INFO 05-12 14:40:18 [config.py:800] Detected quantization_config.scale_fmt=ue8m0; enabling UE8M0 for DeepGEMM.
(APIServer pid=396885) INFO 05-12 14:40:19 [model.py:568] Resolved architecture: DeepseekV4ForCausalLM
(APIServer pid=396885) INFO 05-12 14:40:19 [model.py:1697] Using max model len 4096
(APIServer pid=396885) INFO 05-12 14:40:20 [cache.py:261] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
(APIServer pid=396885) INFO 05-12 14:40:20 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=1024.
(APIServer pid=396885) INFO 05-12 14:40:20 [vllm.py:899] Asynchronous scheduling is enabled.
(APIServer pid=396885) WARNING 05-12 14:40:20 [vllm.py:955] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(APIServer pid=396885) WARNING 05-12 14:40:20 [vllm.py:973] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(APIServer pid=396885) INFO 05-12 14:40:20 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
(APIServer pid=396885) INFO 05-12 14:40:20 [vllm.py:1148] Cudagraph is disabled under eager mode
(APIServer pid=396885) WARNING 05-12 14:40:21 [vllm.py:1313] Auto-initialization of reasoning token IDs failed. Please check whether your reasoning parser has implementedthe `reasoning_start_str` and `reasoning_end_str`.
(APIServer pid=396885) INFO 05-12 14:40:21 [compilation.py:312] Enabled custom fusions: norm_quant, act_quant
(EngineCore pid=396927) INFO 05-12 14:40:26 [core.py:109] Initializing a V1 LLM engine (v0.20.2rc1.dev254+ge1c8776e9) with config: model='deepseek-ai/DeepSeek-V4-Flash', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-V4-Flash', skip_tokenizer_init=False, tokenizer_mode=deepseek_v4, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=deepseek_v4_fp8, quantization_config=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='deepseek_v4', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=deepseek-ai/DeepSeek-V4-Flash, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'all', '+quant_fp8'], 'ir_enable_torch_wrap': False, 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [1024], 'inductor_compile_config': 
{'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_rope_kvcache_cat_mla': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native']), enable_flashinfer_autotune=False, moe_backend='auto')
(EngineCore pid=396927) INFO 05-12 14:40:26 [parallel_state.py:1410] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.74.124.32:44933 backend=nccl
(EngineCore pid=396927) INFO 05-12 14:40:26 [parallel_state.py:1723] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore pid=396927) INFO 05-12 14:40:27 [topk_topp_sampler.py:45] Using FlashInfer for top-p & top-k sampling.
(EngineCore pid=396927) INFO 05-12 14:40:27 [base.py:123] Offloader set to UVAOffloader
(EngineCore pid=396927) INFO 05-12 14:40:27 [gpu_model_runner.py:4863] Starting to load model deepseek-ai/DeepSeek-V4-Flash...
(EngineCore pid=396927) INFO 05-12 14:40:28 [deepseek_v4.py:168] DeepSeek V4 expert_dtype resolved to 'fp4'
(EngineCore pid=396927) INFO 05-12 14:40:28 [__init__.py:393] Selected CutlassFp8BlockScaledMMKernel for Fp8LinearMethod
(EngineCore pid=396927) INFO 05-12 14:40:28 [deepseek_v4_attention.py:710] Using DeepSeek's fp8_ds_mla KV cache format.
(EngineCore pid=396927) INFO 05-12 14:40:28 [mxfp4.py:551] Using 'MARLIN' Mxfp4 MoE backend.
(EngineCore pid=396927) INFO 05-12 14:40:34 [deepseek_v4_attention.py:1092] Using FP8 indexer cache for Lightning Indexer.
(EngineCore pid=396927) INFO 05-12 14:43:27 [uva.py:58] Total CPU offloaded parameters: 128.8
(EngineCore pid=396927) INFO 05-12 14:43:29 [weight_utils.py:938] Filesystem type for checkpoints: EXT4. Checkpoint size: 148.66 GiB. Available RAM: 114.56 GiB.
(EngineCore pid=396927) INFO 05-12 14:43:29 [weight_utils.py:968] Auto-prefetch is disabled because the filesystem (EXT4) is not a recognized network FS (NFS/Lustre) and the checkpoint size (148.66 GiB) exceeds 90% of available RAM (114.56 GiB).
Loading safetensors checkpoint shards:   0% Completed | 0/46 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   2% Completed | 1/46 [00:02<01:32,  2.06s/it]
Loading safetensors checkpoint shards:   4% Completed | 2/46 [00:09<03:55,  5.35s/it]
Loading safetensors checkpoint shards:   7% Completed | 3/46 [00:17<04:32,  6.34s/it]
Loading safetensors checkpoint shards:   9% Completed | 4/46 [00:24<04:47,  6.85s/it]
Loading safetensors checkpoint shards:  11% Completed | 5/46 [00:32<04:48,  7.03s/it]
Loading safetensors checkpoint shards:  13% Completed | 6/46 [00:39<04:44,  7.12s/it]
Loading safetensors checkpoint shards:  15% Completed | 7/46 [00:46<04:39,  7.16s/it]
Loading safetensors checkpoint shards:  17% Completed | 8/46 [00:49<03:43,  5.89s/it]
Loading safetensors checkpoint shards:  20% Completed | 9/46 [00:53<03:11,  5.17s/it]
Loading safetensors checkpoint shards:  22% Completed | 10/46 [00:56<02:45,  4.59s/it]
Loading safetensors checkpoint shards:  24% Completed | 11/46 [01:00<02:26,  4.19s/it]
Loading safetensors checkpoint shards:  26% Completed | 12/46 [01:03<02:14,  3.95s/it]
Loading safetensors checkpoint shards:  28% Completed | 13/46 [01:07<02:11,  3.99s/it]
Loading safetensors checkpoint shards:  30% Completed | 14/46 [01:11<02:06,  3.95s/it]
Loading safetensors checkpoint shards:  33% Completed | 15/46 [01:16<02:15,  4.37s/it]
Loading safetensors checkpoint shards:  35% Completed | 16/46 [01:22<02:19,  4.65s/it]
Loading safetensors checkpoint shards:  37% Completed | 17/46 [01:27<02:21,  4.87s/it]
Loading safetensors checkpoint shards:  39% Completed | 18/46 [01:32<02:14,  4.80s/it]
Loading safetensors checkpoint shards:  41% Completed | 19/46 [01:36<02:02,  4.55s/it]
Loading safetensors checkpoint shards:  43% Completed | 20/46 [01:40<02:00,  4.64s/it]
Loading safetensors checkpoint shards:  46% Completed | 21/46 [01:44<01:49,  4.38s/it]
Loading safetensors checkpoint shards:  48% Completed | 22/46 [01:48<01:41,  4.21s/it]
Loading safetensors checkpoint shards:  50% Completed | 23/46 [01:53<01:39,  4.33s/it]
Loading safetensors checkpoint shards:  52% Completed | 24/46 [01:57<01:35,  4.35s/it]
Loading safetensors checkpoint shards:  54% Completed | 25/46 [02:01<01:30,  4.32s/it]
Loading safetensors checkpoint shards:  57% Completed | 26/46 [02:05<01:25,  4.27s/it]
Loading safetensors checkpoint shards:  59% Completed | 27/46 [02:10<01:24,  4.44s/it]
Loading safetensors checkpoint shards:  61% Completed | 28/46 [02:15<01:22,  4.56s/it]
Loading safetensors checkpoint shards:  63% Completed | 29/46 [02:20<01:17,  4.57s/it]
Loading safetensors checkpoint shards:  65% Completed | 30/46 [02:24<01:10,  4.39s/it]
Loading safetensors checkpoint shards:  67% Completed | 31/46 [02:27<01:02,  4.15s/it]
Loading safetensors checkpoint shards:  70% Completed | 32/46 [02:32<01:00,  4.35s/it]
Loading safetensors checkpoint shards:  72% Completed | 33/46 [02:36<00:56,  4.38s/it]
Loading safetensors checkpoint shards:  74% Completed | 34/46 [02:40<00:50,  4.22s/it]
Loading safetensors checkpoint shards:  76% Completed | 35/46 [02:45<00:47,  4.36s/it]
Loading safetensors checkpoint shards:  78% Completed | 36/46 [02:55<01:01,  6.17s/it]
Loading safetensors checkpoint shards:  80% Completed | 37/46 [03:00<00:52,  5.79s/it]
Loading safetensors checkpoint shards:  83% Completed | 38/46 [03:06<00:45,  5.68s/it]
Loading safetensors checkpoint shards:  85% Completed | 39/46 [03:11<00:39,  5.66s/it]
Loading safetensors checkpoint shards:  87% Completed | 40/46 [03:17<00:33,  5.53s/it]
Loading safetensors checkpoint shards:  89% Completed | 41/46 [03:21<00:26,  5.30s/it]
Loading safetensors checkpoint shards:  91% Completed | 42/46 [03:27<00:21,  5.37s/it]
Loading safetensors checkpoint shards:  93% Completed | 43/46 [03:31<00:14,  4.88s/it]
Loading safetensors checkpoint shards:  96% Completed | 44/46 [03:34<00:08,  4.43s/it]
Loading safetensors checkpoint shards:  98% Completed | 45/46 [03:36<00:03,  3.73s/it]
Loading safetensors checkpoint shards: 100% Completed | 46/46 [03:37<00:00,  3.01s/it]
Loading safetensors checkpoint shards: 100% Completed | 46/46 [03:37<00:00,  4.74s/it]
(EngineCore pid=396927)
(EngineCore pid=396927) INFO 05-12 14:47:07 [default_loader.py:397] Loading weights took 217.96 seconds
(EngineCore pid=396927) INFO 05-12 14:47:11 [mxfp4.py:1497] Using MoEPrepareAndFinalizeNoDPEPModular
(EngineCore pid=396927) INFO 05-12 14:50:19 [gpu_model_runner.py:4965] Model loading took 17.03 GiB memory and 571.286208 seconds
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] EngineCore failed to start.
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] Traceback (most recent call last):
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 1114, in run_engine_core
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return func(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 880, in __init__
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     super().__init__(
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 128, in __init__
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return func(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 250, in _initialize_kv_caches
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/executor/abstract.py", line 147, in determine_available_memory
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return self.collective_rpc("determine_available_memory")
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/executor/uniproc_executor.py", line 93, in collective_rpc
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return func(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return func(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/worker/gpu_worker.py", line 392, in determine_available_memory
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     self.model_runner.profile_run()
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/worker/gpu_model_runner.py", line 5954, in profile_run
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     hidden_states, last_hidden_states = self._dummy_run(
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]                                         ^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return func(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/worker/gpu_model_runner.py", line 5622, in _dummy_run
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     outputs = self.model(
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]               ^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return self._call_impl(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return forward_call(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1669, in forward
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     hidden_states = self.model(
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]                     ^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/compilation/decorators.py", line 507, in __call__
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return self.forward(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1405, in forward
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     hidden_states, residual, post_mix, res_mix = layer(
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]                                                  ^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return self._call_impl(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return forward_call(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1213, in forward
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     x, post_mix, res_mix = self.hc_pre(
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]                            ^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1179, in hc_pre
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     post_mix, res_mix, layer_input = torch.ops.vllm.mhc_pre(
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]                                      ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/_ops.py", line 1269, in __call__
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return self._op(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/layers/mhc.py", line 310, in mhc_pre
(EngineCore pid=396927) ERROR
(EngineCore pid=396927)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=396927)     self._target(*self._args, **self._kwargs)
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 1144, in run_engine_core
(EngineCore pid=396927)     raise e
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 1114, in run_engine_core
(EngineCore pid=396927)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=396927)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=396927)     return func(*args, **kwargs)
(EngineCore pid=396927)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 880, in __init__
(EngineCore pid=396927)     super().__init__(
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 128, in __init__
(EngineCore pid=396927)     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=396927)                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=396927)     return func(*args, **kwargs)
(EngineCore pid=396927)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 250, in _initialize_kv_caches
(EngineCore pid=396927)     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=396927)                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/executor/abstract.py", line 147, in determine_available_memory
(EngineCore pid=396927)     return self.collective_rpc("determine_available_memory")
(EngineCore pid=396927)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/executor/uniproc_executor.py", line 93, in collective_rpc
(EngineCore pid=396927)     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=396927)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=396927)     return func(*args, **kwarg
(EngineCore pid=396927)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/worker/gpu_model_runner.py", line 5622, in _dummy_run
(EngineCore pid=396927)     outputs = self.model(
(EngineCore pid=396927)               ^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(EngineCore pid=396927)     return self._call_impl(*args, **kwargs)
(EngineCore pid=396927)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(EngineCore pid=396927)     return forward_call(*args, **kwargs)
(EngineCore pid=396927)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1669, in forward
(EngineCore pid=396927)     hidden_states = self.model(
(EngineCore pid=396927)                     ^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/compilation/decorators.py", line 507, in __call__
(EngineCore pid=396927)     return self.forward(*args, **kwargs)
(EngineCore pid=396927)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1405, in forward
(EngineCore pid=396927)     hidden_states, residual, post_mix, res_mix = layer(
(EngineCore pid=396927)                                                  ^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(EngineCore pid=396927)     return self._call_impl(*args, **kwargs)
(EngineCore pid=396927)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(EngineCore pid=396927)     return forward_call(*args, **kwargs)
(EngineCore pid=396927)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1213, in forward
(EngineCore pid=396927)     x, post_mix, res_mix = self.hc_pre(
(EngineCore pid=396927)                            ^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1179, in hc_pre
(EngineCore pid=396927)     post_mix, res_mix, layer_input = torch.ops.vllm.mhc_pre(
(EngineCore pid=396927)                                      ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/_ops.py", line 1269, in __call__
(EngineCore pid=396927)     return self._op(*args, **kwargs)
(EngineCore pid=396927)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/layers/mhc.py", line 310, in mhc_pre
(EngineCore pid=396927)     tf32_hc_prenorm_gemm(
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/utils/deep_gemm.py", line 477, in tf32_hc_prenorm_gemm
(EngineCore pid=396927)     return _tf32_hc_prenorm_gemm_impl(
(EngineCore pid=396927)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) RuntimeError: Assertion error (csrc/apis/hyperconnection.hpp:56): Unsupported architecture
[rank0]:[W512 14:50:24.024626144 ProcessGroupNCCL.cpp:1575] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=396885) Traceback (most recent call last):
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/bin/vllm", line 8, in <module>
(APIServer pid=396885)     sys.exit(main())
(APIServer pid=396885)              ^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/entrypoints/cli/main.py", line 92, in main
(APIServer pid=396885)     args.dispatch_function(args)
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=396885)     uvloop.run(run_server(args))
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=396885)     return __asyncio.run(
(APIServer pid=396885)            ^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
(APIServer pid=396885)     return runner.run(main)
(APIServer pid=396885)            ^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=396885)     return self._loop.run_until_complete(task)
(APIServer pid=396885)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=396885)     return await main
(APIServer pid=396885)            ^^^^^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/entrypoints/openai/api_server.py", line 679, in run_server
(APIServer pid=396885)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/entrypoints/openai/api_server.py", line 693, in run_server_worker
(APIServer pid=396885)     async with build_async_engine_client(
(APIServer pid=396885)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=396885)     return await anext(self.gen)
(APIServer pid=396885)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=396885)     async with build_async_engine_client_from_engine_args(
(APIServer pid=396885)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=396885)     return await anext(self.gen)
(APIServer pid=396885)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=396885)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=396885)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/async_llm.py", line 217, in from_vllm_config
(APIServer pid=396885)     return cls(
(APIServer pid=396885)            ^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/async_llm.py", line 146, in __init__
(APIServer pid=396885)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=396885)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=396885)     return func(*args, **kwargs)
(APIServer pid=396885)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
(APIServer pid=396885)     return AsyncMPClient(*client_args)
(APIServer pid=396885)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=396885)     return func(*args, **kwargs)
(APIServer pid=396885)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core_client.py", line 900, in __init__
(APIServer pid=396885)     super().__init__(
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core_client.py", line 535, in __init__
(APIServer pid=396885)     with launch_core_engines(
(APIServer pid=396885)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=396885)     next(self.gen)
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/utils.py", line 1133, in launch_core_engines
(APIServer pid=396885)     wait_for_engine_startup(
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/utils.py", line 1192, in wait_for_engine_startup
(APIServer pid=396885)     raise RuntimeError(
(APIServer pid=396885) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
(deepseekv4vllm) root@ubuntugpu-h2-96G-01:~/wayne/deepseekv4vllm/vllm#
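
The traceback above ends in a DeepGEMM assertion, `Unsupported architecture`, raised from `tf32_hc_prenorm_gemm`. This suggests the kernel was not built for the compute capability of this GPU. A minimal sketch for checking what PyTorch reports for the device follows. The helper names (`detect_capability`, `is_assumed_supported`) and the `ASSUMED_SUPPORTED_MAJORS` set are hypothetical, and the set of supported architectures is an assumption that is not confirmed by the log; verify it against the DeepGEMM source before relying on it.

```python
from typing import Optional, Tuple

# ASSUMPTION (not taken from the log): the failing DeepGEMM kernel is built
# only for these compute-capability major versions. Check DeepGEMM itself.
ASSUMED_SUPPORTED_MAJORS = {9, 10}


def detect_capability() -> Optional[Tuple[int, int]]:
    """Return (major, minor) for CUDA device 0, or None if it cannot be read."""
    try:
        import torch  # imported lazily so the helper below works without it
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


def is_assumed_supported(capability: Tuple[int, int]) -> bool:
    """True if the major version is in the assumed supported set."""
    major, _minor = capability
    return major in ASSUMED_SUPPORTED_MAJORS


if __name__ == "__main__":
    cap = detect_capability()
    if cap is None:
        print("No CUDA device visible to PyTorch")
    else:
        print(f"sm_{cap[0]}{cap[1]}: assumed supported = {is_assumed_supported(cap)}")
```

If the reported capability falls outside the set the kernel was compiled for, the assertion would be expected on every forward pass, including the profiling run shown in the traceback, regardless of the memory or offload flags on the command line.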

Code Example


Collecting environment information...

BIOS Model name:                         INTEL(R) XEON(R) GOLD 6548Y+  CPU @ 2.3GHz
BIOS CPU family:                         2
CPU family:                              6
Model:                                   207
Thread(s) per core:                      1
Core(s) per socket:                      16
Socket(s):                               1
Stepping:                                2
BogoMIPS:                                5000.00
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves user_shstk avx_vnni avx512_bf16 wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid cldemote movdiri movdir64b fsrm md_clear serialize amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Hypervisor vendor:                       VMware
Virtualization type:                     full
L1d cache:                               768 KiB (16 instances)
L1i cache:                               512 KiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                60 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             KVM: Mitigation: VMX unsupported
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdow
ROCM Version                 : Could not collect
vLLM Version                 : 0.20.2rc1.dev254+ge1c8776e9 (git sha: e1c8776e9)
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; XPU: Disabled
GPU Topology:
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-15    0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

==============================
     Environment Variables
==============================
LD_LIBRARY_PATH=/usr/local/cuda-13.0/lib64:
CUDA_HOME=/usr/local/cuda-13.0
CUDA_HOME=/usr/local/cuda-13.0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_root

---

(deepseekv4vllm) root@ubuntugpu-h2-96G-01:~/wayne/deepseekv4vllm/vllm# CUDA_VISIBLE_DEVICES=0 VLLM_ENGINE_READY_TIMEOUT_S=3600 vllm serve deepseek-ai/DeepSeek-V4-Flash   --host 0.0.0.0   --port 8000   --trust-remote-code   --tensor-parallel-size 1   --max-model-len 4096   --kv-cache-dtype fp8   --block-size 256   --gpu-memory-utilization 0.70   --cpu-offload-gb 128     --max-num-seqs 1   --max-num-batched-tokens 1024   --enforce-eager   --tokenizer-mode deepseek_v4   --reasoning-parser deepseek_v4
(APIServer pid=396885) INFO 05-12 14:40:18 [utils.py:306]
(APIServer pid=396885) INFO 05-12 14:40:18 [utils.py:306]        █     █     █▄   ▄█
(APIServer pid=396885) INFO 05-12 14:40:18 [utils.py:306]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.20.2rc1.dev254+ge1c8776e9
(APIServer pid=396885) INFO 05-12 14:40:18 [utils.py:306]   █▄█▀ █     █     █     █  model   deepseek-ai/DeepSeek-V4-Flash
(APIServer pid=396885) INFO 05-12 14:40:18 [utils.py:306]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=396885) INFO 05-12 14:40:18 [utils.py:306]
(APIServer pid=396885) INFO 05-12 14:40:18 [utils.py:240] non-default args: {'model_tag': 'deepseek-ai/DeepSeek-V4-Flash', 'host': '0.0.0.0', 'model': 'deepseek-ai/DeepSeek-V4-Flash', 'tokenizer_mode': 'deepseek_v4', 'trust_remote_code': True, 'max_model_len': 4096, 'enforce_eager': True, 'reasoning_parser': 'deepseek_v4', 'block_size': 256, 'gpu_memory_utilization': 0.7, 'kv_cache_dtype': 'fp8', 'cpu_offload_gb': 128.0, 'max_num_batched_tokens': 1024, 'max_num_seqs': 1}
(APIServer pid=396885) INFO 05-12 14:40:18 [config.py:800] Detected quantization_config.scale_fmt=ue8m0; enabling UE8M0 for DeepGEMM.
(APIServer pid=396885) INFO 05-12 14:40:19 [model.py:568] Resolved architecture: DeepseekV4ForCausalLM
(APIServer pid=396885) INFO 05-12 14:40:19 [model.py:1697] Using max model len 4096
(APIServer pid=396885) INFO 05-12 14:40:20 [cache.py:261] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
(APIServer pid=396885) INFO 05-12 14:40:20 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=1024.
(APIServer pid=396885) INFO 05-12 14:40:20 [vllm.py:899] Asynchronous scheduling is enabled.
(APIServer pid=396885) WARNING 05-12 14:40:20 [vllm.py:955] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(APIServer pid=396885) WARNING 05-12 14:40:20 [vllm.py:973] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(APIServer pid=396885) INFO 05-12 14:40:20 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
(APIServer pid=396885) INFO 05-12 14:40:20 [vllm.py:1148] Cudagraph is disabled under eager mode
(APIServer pid=396885) WARNING 05-12 14:40:21 [vllm.py:1313] Auto-initialization of reasoning token IDs failed. Please check whether your reasoning parser has implementedthe `reasoning_start_str` and `reasoning_end_str`.
(APIServer pid=396885) INFO 05-12 14:40:21 [compilation.py:312] Enabled custom fusions: norm_quant, act_quant
(EngineCore pid=396927) INFO 05-12 14:40:26 [core.py:109] Initializing a V1 LLM engine (v0.20.2rc1.dev254+ge1c8776e9) with config: model='deepseek-ai/DeepSeek-V4-Flash', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-V4-Flash', skip_tokenizer_init=False, tokenizer_mode=deepseek_v4, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=deepseek_v4_fp8, quantization_config=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='deepseek_v4', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=deepseek-ai/DeepSeek-V4-Flash, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'all', '+quant_fp8'], 'ir_enable_torch_wrap': False, 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [1024], 'inductor_compile_config': 
{'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_rope_kvcache_cat_mla': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native']), enable_flashinfer_autotune=False, moe_backend='auto')
(EngineCore pid=396927) INFO 05-12 14:40:26 [parallel_state.py:1410] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.74.124.32:44933 backend=nccl
(EngineCore pid=396927) INFO 05-12 14:40:26 [parallel_state.py:1723] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore pid=396927) INFO 05-12 14:40:27 [topk_topp_sampler.py:45] Using FlashInfer for top-p & top-k sampling.
(EngineCore pid=396927) INFO 05-12 14:40:27 [base.py:123] Offloader set to UVAOffloader
(EngineCore pid=396927) INFO 05-12 14:40:27 [gpu_model_runner.py:4863] Starting to load model deepseek-ai/DeepSeek-V4-Flash...
(EngineCore pid=396927) INFO 05-12 14:40:28 [deepseek_v4.py:168] DeepSeek V4 expert_dtype resolved to 'fp4'
(EngineCore pid=396927) INFO 05-12 14:40:28 [__init__.py:393] Selected CutlassFp8BlockScaledMMKernel for Fp8LinearMethod
(EngineCore pid=396927) INFO 05-12 14:40:28 [deepseek_v4_attention.py:710] Using DeepSeek's fp8_ds_mla KV cache format.
(EngineCore pid=396927) INFO 05-12 14:40:28 [mxfp4.py:551] Using 'MARLIN' Mxfp4 MoE backend.
(EngineCore pid=396927) INFO 05-12 14:40:34 [deepseek_v4_attention.py:1092] Using FP8 indexer cache for Lightning Indexer.
(EngineCore pid=396927) INFO 05-12 14:43:27 [uva.py:58] Total CPU offloaded parameters: 128.8
(EngineCore pid=396927) INFO 05-12 14:43:29 [weight_utils.py:938] Filesystem type for checkpoints: EXT4. Checkpoint size: 148.66 GiB. Available RAM: 114.56 GiB.
(EngineCore pid=396927) INFO 05-12 14:43:29 [weight_utils.py:968] Auto-prefetch is disabled because the filesystem (EXT4) is not a recognized network FS (NFS/Lustre) and the checkpoint size (148.66 GiB) exceeds 90% of available RAM (114.56 GiB).
Loading safetensors checkpoint shards:   0% Completed | 0/46 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:   2% Completed | 1/46 [00:02<01:32,  2.06s/it]
Loading safetensors checkpoint shards:   4% Completed | 2/46 [00:09<03:55,  5.35s/it]
Loading safetensors checkpoint shards:   7% Completed | 3/46 [00:17<04:32,  6.34s/it]
Loading safetensors checkpoint shards:   9% Completed | 4/46 [00:24<04:47,  6.85s/it]
Loading safetensors checkpoint shards:  11% Completed | 5/46 [00:32<04:48,  7.03s/it]
Loading safetensors checkpoint shards:  13% Completed | 6/46 [00:39<04:44,  7.12s/it]
Loading safetensors checkpoint shards:  15% Completed | 7/46 [00:46<04:39,  7.16s/it]
Loading safetensors checkpoint shards:  17% Completed | 8/46 [00:49<03:43,  5.89s/it]
Loading safetensors checkpoint shards:  20% Completed | 9/46 [00:53<03:11,  5.17s/it]
Loading safetensors checkpoint shards:  22% Completed | 10/46 [00:56<02:45,  4.59s/it]
Loading safetensors checkpoint shards:  24% Completed | 11/46 [01:00<02:26,  4.19s/it]
Loading safetensors checkpoint shards:  26% Completed | 12/46 [01:03<02:14,  3.95s/it]
Loading safetensors checkpoint shards:  28% Completed | 13/46 [01:07<02:11,  3.99s/it]
Loading safetensors checkpoint shards:  30% Completed | 14/46 [01:11<02:06,  3.95s/it]
Loading safetensors checkpoint shards:  33% Completed | 15/46 [01:16<02:15,  4.37s/it]
Loading safetensors checkpoint shards:  35% Completed | 16/46 [01:22<02:19,  4.65s/it]
Loading safetensors checkpoint shards:  37% Completed | 17/46 [01:27<02:21,  4.87s/it]
Loading safetensors checkpoint shards:  39% Completed | 18/46 [01:32<02:14,  4.80s/it]
Loading safetensors checkpoint shards:  41% Completed | 19/46 [01:36<02:02,  4.55s/it]
Loading safetensors checkpoint shards:  43% Completed | 20/46 [01:40<02:00,  4.64s/it]
Loading safetensors checkpoint shards:  46% Completed | 21/46 [01:44<01:49,  4.38s/it]
Loading safetensors checkpoint shards:  48% Completed | 22/46 [01:48<01:41,  4.21s/it]
Loading safetensors checkpoint shards:  50% Completed | 23/46 [01:53<01:39,  4.33s/it]
Loading safetensors checkpoint shards:  52% Completed | 24/46 [01:57<01:35,  4.35s/it]
Loading safetensors checkpoint shards:  54% Completed | 25/46 [02:01<01:30,  4.32s/it]
Loading safetensors checkpoint shards:  57% Completed | 26/46 [02:05<01:25,  4.27s/it]
Loading safetensors checkpoint shards:  59% Completed | 27/46 [02:10<01:24,  4.44s/it]
Loading safetensors checkpoint shards:  61% Completed | 28/46 [02:15<01:22,  4.56s/it]
Loading safetensors checkpoint shards:  63% Completed | 29/46 [02:20<01:17,  4.57s/it]
Loading safetensors checkpoint shards:  65% Completed | 30/46 [02:24<01:10,  4.39s/it]
Loading safetensors checkpoint shards:  67% Completed | 31/46 [02:27<01:02,  4.15s/it]
Loading safetensors checkpoint shards:  70% Completed | 32/46 [02:32<01:00,  4.35s/it]
Loading safetensors checkpoint shards:  72% Completed | 33/46 [02:36<00:56,  4.38s/it]
Loading safetensors checkpoint shards:  74% Completed | 34/46 [02:40<00:50,  4.22s/it]
Loading safetensors checkpoint shards:  76% Completed | 35/46 [02:45<00:47,  4.36s/it]
Loading safetensors checkpoint shards:  78% Completed | 36/46 [02:55<01:01,  6.17s/it]
Loading safetensors checkpoint shards:  80% Completed | 37/46 [03:00<00:52,  5.79s/it]
Loading safetensors checkpoint shards:  83% Completed | 38/46 [03:06<00:45,  5.68s/it]
Loading safetensors checkpoint shards:  85% Completed | 39/46 [03:11<00:39,  5.66s/it]
Loading safetensors checkpoint shards:  87% Completed | 40/46 [03:17<00:33,  5.53s/it]
Loading safetensors checkpoint shards:  89% Completed | 41/46 [03:21<00:26,  5.30s/it]
Loading safetensors checkpoint shards:  91% Completed | 42/46 [03:27<00:21,  5.37s/it]
Loading safetensors checkpoint shards:  93% Completed | 43/46 [03:31<00:14,  4.88s/it]
Loading safetensors checkpoint shards:  96% Completed | 44/46 [03:34<00:08,  4.43s/it]
Loading safetensors checkpoint shards:  98% Completed | 45/46 [03:36<00:03,  3.73s/it]
Loading safetensors checkpoint shards: 100% Completed | 46/46 [03:37<00:00,  3.01s/it]
Loading safetensors checkpoint shards: 100% Completed | 46/46 [03:37<00:00,  4.74s/it]
(EngineCore pid=396927)
(EngineCore pid=396927) INFO 05-12 14:47:07 [default_loader.py:397] Loading weights took 217.96 seconds
(EngineCore pid=396927) INFO 05-12 14:47:11 [mxfp4.py:1497] Using MoEPrepareAndFinalizeNoDPEPModular
(EngineCore pid=396927) INFO 05-12 14:50:19 [gpu_model_runner.py:4965] Model loading took 17.03 GiB memory and 571.286208 seconds
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] EngineCore failed to start.
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] Traceback (most recent call last):
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 1114, in run_engine_core
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return func(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 880, in __init__
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     super().__init__(
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 128, in __init__
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return func(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 250, in _initialize_kv_caches
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/executor/abstract.py", line 147, in determine_available_memory
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return self.collective_rpc("determine_available_memory")
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/executor/uniproc_executor.py", line 93, in collective_rpc
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return func(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return func(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/worker/gpu_worker.py", line 392, in determine_available_memory
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     self.model_runner.profile_run()
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/worker/gpu_model_runner.py", line 5954, in profile_run
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     hidden_states, last_hidden_states = self._dummy_run(
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]                                         ^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return func(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/worker/gpu_model_runner.py", line 5622, in _dummy_run
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     outputs = self.model(
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]               ^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return self._call_impl(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return forward_call(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1669, in forward
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     hidden_states = self.model(
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]                     ^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/compilation/decorators.py", line 507, in __call__
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return self.forward(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1405, in forward
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     hidden_states, residual, post_mix, res_mix = layer(
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]                                                  ^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return self._call_impl(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return forward_call(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1213, in forward
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     x, post_mix, res_mix = self.hc_pre(
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]                            ^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1179, in hc_pre
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     post_mix, res_mix, layer_input = torch.ops.vllm.mhc_pre(
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]                                      ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/_ops.py", line 1269, in __call__
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]     return self._op(*args, **kwargs)
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140]   File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/layers/mhc.py", line 310, in mhc_pre
(EngineCore pid=396927) ERROR
(EngineCore pid=396927)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=396927)     self._target(*self._args, **self._kwargs)
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 1144, in run_engine_core
(EngineCore pid=396927)     raise e
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 1114, in run_engine_core
(EngineCore pid=396927)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=396927)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=396927)     return func(*args, **kwargs)
(EngineCore pid=396927)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 880, in __init__
(EngineCore pid=396927)     super().__init__(
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 128, in __init__
(EngineCore pid=396927)     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=396927)                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=396927)     return func(*args, **kwargs)
(EngineCore pid=396927)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 250, in _initialize_kv_caches
(EngineCore pid=396927)     available_gpu_memory = self.model_executor.determine_available_memory()
(EngineCore pid=396927)                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/executor/abstract.py", line 147, in determine_available_memory
(EngineCore pid=396927)     return self.collective_rpc("determine_available_memory")
(EngineCore pid=396927)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/executor/uniproc_executor.py", line 93, in collective_rpc
(EngineCore pid=396927)     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=396927)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/serial_utils.py", line 510, in run_method
(EngineCore pid=396927)     return func(*args, **kwarg
(EngineCore pid=396927)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/worker/gpu_model_runner.py", line 5622, in _dummy_run
(EngineCore pid=396927)     outputs = self.model(
(EngineCore pid=396927)               ^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(EngineCore pid=396927)     return self._call_impl(*args, **kwargs)
(EngineCore pid=396927)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(EngineCore pid=396927)     return forward_call(*args, **kwargs)
(EngineCore pid=396927)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1669, in forward
(EngineCore pid=396927)     hidden_states = self.model(
(EngineCore pid=396927)                     ^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/compilation/decorators.py", line 507, in __call__
(EngineCore pid=396927)     return self.forward(*args, **kwargs)
(EngineCore pid=396927)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1405, in forward
(EngineCore pid=396927)     hidden_states, residual, post_mix, res_mix = layer(
(EngineCore pid=396927)                                                  ^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl
(EngineCore pid=396927)     return self._call_impl(*args, **kwargs)
(EngineCore pid=396927)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl
(EngineCore pid=396927)     return forward_call(*args, **kwargs)
(EngineCore pid=396927)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1213, in forward
(EngineCore pid=396927)     x, post_mix, res_mix = self.hc_pre(
(EngineCore pid=396927)                            ^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1179, in hc_pre
(EngineCore pid=396927)     post_mix, res_mix, layer_input = torch.ops.vllm.mhc_pre(
(EngineCore pid=396927)                                      ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/_ops.py", line 1269, in __call__
(EngineCore pid=396927)     return self._op(*args, **kwargs)
(EngineCore pid=396927)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/layers/mhc.py", line 310, in mhc_pre
(EngineCore pid=396927)     tf32_hc_prenorm_gemm(
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/utils/deep_gemm.py", line 477, in tf32_hc_prenorm_gemm
(EngineCore pid=396927)     return _tf32_hc_prenorm_gemm_impl(
(EngineCore pid=396927)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) RuntimeError: Assertion error (csrc/apis/hyperconnection.hpp:56): Unsupported architecture
[rank0]:[W512 14:50:24.024626144 ProcessGroupNCCL.cpp:1575] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=396885) Traceback (most recent call last):
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/bin/vllm", line 8, in <module>
(APIServer pid=396885)     sys.exit(main())
(APIServer pid=396885)              ^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/entrypoints/cli/main.py", line 92, in main
(APIServer pid=396885)     args.dispatch_function(args)
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=396885)     uvloop.run(run_server(args))
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=396885)     return __asyncio.run(
(APIServer pid=396885)            ^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
(APIServer pid=396885)     return runner.run(main)
(APIServer pid=396885)            ^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=396885)     return self._loop.run_until_complete(task)
(APIServer pid=396885)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=396885)     return await main
(APIServer pid=396885)            ^^^^^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/entrypoints/openai/api_server.py", line 679, in run_server
(APIServer pid=396885)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/entrypoints/openai/api_server.py", line 693, in run_server_worker
(APIServer pid=396885)     async with build_async_engine_client(
(APIServer pid=396885)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=396885)     return await anext(self.gen)
(APIServer pid=396885)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=396885)     async with build_async_engine_client_from_engine_args(
(APIServer pid=396885)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=396885)     return await anext(self.gen)
(APIServer pid=396885)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=396885)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=396885)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/async_llm.py", line 217, in from_vllm_config
(APIServer pid=396885)     return cls(
(APIServer pid=396885)            ^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/async_llm.py", line 146, in __init__
(APIServer pid=396885)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=396885)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=396885)     return func(*args, **kwargs)
(APIServer pid=396885)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
(APIServer pid=396885)     return AsyncMPClient(*client_args)
(APIServer pid=396885)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=396885)     return func(*args, **kwargs)
(APIServer pid=396885)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core_client.py", line 900, in __init__
(APIServer pid=396885)     super().__init__(
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core_client.py", line 535, in __init__
(APIServer pid=396885)     with launch_core_engines(
(APIServer pid=396885)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=396885)     next(self.gen)
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/utils.py", line 1133, in launch_core_engines
(APIServer pid=396885)     wait_for_engine_startup(
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/utils.py", line 1192, in wait_for_engine_startup
(APIServer pid=396885)     raise RuntimeError(
(APIServer pid=396885) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
(deepseekv4vllm) root@ubuntugpu-h2-96G-01:~/wayne/deepseekv4vllm/vllm#
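The last APIServer error only says "See root cause above", so the line that matters is the final exception raised by the EngineCore process. The following is a minimal sketch for pulling that line out of a saved log. It assumes each line carries the `(Process pid=N)` prefix seen in this log. The helper names `find_exceptions` and `root_cause` are illustrative and are not part of vLLM.

```python
import re

# Process prefix that appears on each log line, e.g. "(EngineCore pid=396927) ".
PREFIX = re.compile(r"^\((?P<proc>\w+) pid=(?P<pid>\d+)\)\s?(?P<rest>.*)$")
# Final line of a Python traceback: "<SomethingError>: <message>".
EXC = re.compile(r"^(?P<exc>[A-Za-z_][\w.]*(?:Error|Exception)): (?P<msg>.*)$")


def find_exceptions(log_text):
    """Return (process, exception, message) for every exception line in the log."""
    found = []
    for line in log_text.splitlines():
        prefix = PREFIX.match(line)
        if not prefix:
            continue
        exc = EXC.match(prefix.group("rest").strip())
        if exc:
            found.append((prefix.group("proc"), exc.group("exc"), exc.group("msg")))
    return found


def root_cause(log_text, worker="EngineCore"):
    """Return (exception, message) of the first exception from the worker, or None."""
    for proc, exc, msg in find_exceptions(log_text):
        if proc == worker:
            return exc, msg
    return None


# Two lines copied from the log above, plus one ordinary traceback frame.
SAMPLE = """\
(EngineCore pid=396927)     return _tf32_hc_prenorm_gemm_impl(
(EngineCore pid=396927) RuntimeError: Assertion error (csrc/apis/hyperconnection.hpp:56): Unsupported architecture
(APIServer pid=396885) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
"""

if __name__ == "__main__":
    print(root_cause(SAMPLE))
```

Run against the full log, this should point at the `Unsupported architecture` assertion rather than the generic APIServer startup failure.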

### Your current environment

<details>
<summary>The output of <code>python collect_env.py</code></summary>

```
Collecting environment information...

BIOS Model name: INTEL(R) XEON(R) GOLD 6548Y+ CPU @ 2.3GHz
BIOS CPU family: 2
CPU family: 6
Model: 207
Thread(s) per core: 1
Core(s) per socket: 16
Socket(s): 1
Stepping: 2
BogoMIPS: 5000.00
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscplm constant_tsc arch_perfmon rep_good nopl xtopology tsc_reliable nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves user_shstk avx_vnni avx512_bf16 wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid cldemote movdiri movdir64b fsrm md_clear serialize amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Hypervisor vendor: VMware
Virtualization type: full
L1d cache: 768 KiB (16 instances)
L1i cache: 512 KiB (16 instances)
L2 cache: 32 MiB (16 instances)
L3 cache: 60 MiB (1 instance)
NUMA node(s): 1
NUMA node0 CPU(s): 0-15
Vulnerability Gather data sampling: Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit: KVM: Mitigation: VMX unsupported
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdow
ROCM Version : Could not collect
vLLM Version : 0.20.2rc1.dev254+ge1c8776e9 (git sha: e1c8776e9)
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; XPU: Disabled
GPU Topology:
        GPU0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      0-15            0               N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

============================== Environment Variables

LD_LIBRARY_PATH=/usr/local/cuda-13.0/lib64:
CUDA_HOME=/usr/local/cuda-13.0
CUDA_HOME=/usr/local/cuda-13.0
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_root
```

</details>
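The environment report above does not include the GPU's compute capability, which is probably what the `Unsupported architecture` assertion depends on (this is an assumption; the log does not say which check failed). The sketch below reads the capability through PyTorch's `torch.cuda.get_device_capability` and compares it with a placeholder set of supported major versions. The set `ASSUMED_SUPPORTED_MAJORS` and the helpers `query_capability` and `check` are hypothetical names introduced for illustration; the real supported list has to come from the DeepGEMM source.

```python
# ASSUMPTION: placeholder set of compute-capability major versions accepted by
# the failing kernel. It is NOT taken from the issue; replace it with the
# values found in the DeepGEMM source before drawing conclusions.
ASSUMED_SUPPORTED_MAJORS = frozenset({9, 10})


def query_capability(device_index=0):
    """Return (major, minor) for a CUDA device, or None if none is usable."""
    try:
        import torch  # imported lazily so the helper also runs without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(device_index)


def check(capability, supported_majors=ASSUMED_SUPPORTED_MAJORS):
    """Describe whether a (major, minor) capability is in the assumed set."""
    if capability is None:
        return "no CUDA device visible"
    major, minor = capability
    arch = f"sm_{major}{minor}"
    if major in supported_majors:
        return f"{arch}: in the assumed supported set"
    return f"{arch}: not in the assumed supported set"


if __name__ == "__main__":
    print(check(query_capability()))
```

Posting the printed `sm_XY` value alongside the traceback would make it easier for maintainers to confirm whether the kernel simply has no build for this GPU.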


### 🐛 Describe the bug

Here is the error log:

(deepseekv4vllm) root@ubuntugpu-h2-96G-01:/wayne/deepseekv4vllm/vllm# CUDA_VISIBLE_DEVICES=0 VLLM_ENGINE_READY_TIMEOUT_S=3600 vllm serve deepseek-ai/DeepSeek-V4-Flash --host 0.0.0.0 --port 8000 --trust-remote-code --tensor-parallel-size 1 --max-model-len 4096 --kv-cache-dtype fp8 --block-size 256 --gpu-memory-utilization 0.70 --cpu-offload-gb 128 --max-num-seqs 1 --max-num-batched-tokens 1024 --enforce-eager --tokenizer-mode deepseek_v4 --reasoning-parser deepseek_v4
(APIServer pid=396885) INFO 05-12 14:40:18 [utils.py:306]
(APIServer pid=396885) INFO 05-12 14:40:18 [utils.py:306] █ █ █▄ ▄█
(APIServer pid=396885) INFO 05-12 14:40:18 [utils.py:306] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.20.2rc1.dev254+ge1c8776e9
(APIServer pid=396885) INFO 05-12 14:40:18 [utils.py:306] █▄█▀ █ █ █ █ model deepseek-ai/DeepSeek-V4-Flash
(APIServer pid=396885) INFO 05-12 14:40:18 [utils.py:306] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=396885) INFO 05-12 14:40:18 [utils.py:306]
(APIServer pid=396885) INFO 05-12 14:40:18 [utils.py:240] non-default args: {'model_tag': 'deepseek-ai/DeepSeek-V4-Flash', 'host': '0.0.0.0', 'model': 'deepseek-ai/DeepSeek-V4-Flash', 'tokenizer_mode': 'deepseek_v4', 'trust_remote_code': True, 'max_model_len': 4096, 'enforce_eager': True, 'reasoning_parser': 'deepseek_v4', 'block_size': 256, 'gpu_memory_utilization': 0.7, 'kv_cache_dtype': 'fp8', 'cpu_offload_gb': 128.0, 'max_num_batched_tokens': 1024, 'max_num_seqs': 1}
(APIServer pid=396885) INFO 05-12 14:40:18 [config.py:800] Detected quantization_config.scale_fmt=ue8m0; enabling UE8M0 for DeepGEMM.
(APIServer pid=396885) INFO 05-12 14:40:19 [model.py:568] Resolved architecture: DeepseekV4ForCausalLM
(APIServer pid=396885) INFO 05-12 14:40:19 [model.py:1697] Using max model len 4096
(APIServer pid=396885) INFO 05-12 14:40:20 [cache.py:261] Using fp8 data type to store kv cache. It reduces the GPU memory footprint and boosts the performance. Meanwhile, it may cause accuracy drop without a proper scaling factor
(APIServer pid=396885) INFO 05-12 14:40:20 [scheduler.py:239] Chunked prefill is enabled with max_num_batched_tokens=1024.
(APIServer pid=396885) INFO 05-12 14:40:20 [vllm.py:899] Asynchronous scheduling is enabled.
(APIServer pid=396885) WARNING 05-12 14:40:20 [vllm.py:955] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(APIServer pid=396885) WARNING 05-12 14:40:20 [vllm.py:973] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(APIServer pid=396885) INFO 05-12 14:40:20 [kernel.py:212] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native'])
(APIServer pid=396885) INFO 05-12 14:40:20 [vllm.py:1148] Cudagraph is disabled under eager mode
(APIServer pid=396885) WARNING 05-12 14:40:21 [vllm.py:1313] Auto-initialization of reasoning token IDs failed. Please check whether your reasoning parser has implementedthe reasoning_start_str and reasoning_end_str.
(APIServer pid=396885) INFO 05-12 14:40:21 [compilation.py:312] Enabled custom fusions: norm_quant, act_quant
(EngineCore pid=396927) INFO 05-12 14:40:26 [core.py:109] Initializing a V1 LLM engine (v0.20.2rc1.dev254+ge1c8776e9) with config: model='deepseek-ai/DeepSeek-V4-Flash', speculative_config=None, tokenizer='deepseek-ai/DeepSeek-V4-Flash', skip_tokenizer_init=False, tokenizer_mode=deepseek_v4, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=deepseek_v4_fp8, quantization_config=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=fp8, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='deepseek_v4', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=deepseek-ai/DeepSeek-V4-Flash, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'all', '+quant_fp8'], 'ir_enable_torch_wrap': False, 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_vision_items_per_batch': 0, 'encoder_cudagraph_max_frames_per_batch': None, 'compile_sizes': [], 'compile_ranges_endpoints': [1024], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False, 'fuse_rope_kvcache_cat_mla': False, 'fuse_act_padding': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['vllm_c', 'native'], fused_add_rms_norm=['vllm_c', 'native']), enable_flashinfer_autotune=False, moe_backend='auto')
(EngineCore pid=396927) INFO 05-12 14:40:26 [parallel_state.py:1410] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.74.124.32:44933 backend=nccl
(EngineCore pid=396927) INFO 05-12 14:40:26 [parallel_state.py:1723] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore pid=396927) INFO 05-12 14:40:27 [topk_topp_sampler.py:45] Using FlashInfer for top-p & top-k sampling.
(EngineCore pid=396927) INFO 05-12 14:40:27 [base.py:123] Offloader set to UVAOffloader
(EngineCore pid=396927) INFO 05-12 14:40:27 [gpu_model_runner.py:4863] Starting to load model deepseek-ai/DeepSeek-V4-Flash...
(EngineCore pid=396927) INFO 05-12 14:40:28 [deepseek_v4.py:168] DeepSeek V4 expert_dtype resolved to 'fp4'
(EngineCore pid=396927) INFO 05-12 14:40:28 [init.py:393] Selected CutlassFp8BlockScaledMMKernel for Fp8LinearMethod
(EngineCore pid=396927) INFO 05-12 14:40:28 [deepseek_v4_attention.py:710] Using DeepSeek's fp8_ds_mla KV cache format.
(EngineCore pid=396927) INFO 05-12 14:40:28 [mxfp4.py:551] Using 'MARLIN' Mxfp4 MoE backend.
(EngineCore pid=396927) INFO 05-12 14:40:34 [deepseek_v4_attention.py:1092] Using FP8 indexer cache for Lightning Indexer.
(EngineCore pid=396927) INFO 05-12 14:43:27 [uva.py:58] Total CPU offloaded parameters: 128.8
(EngineCore pid=396927) INFO 05-12 14:43:29 [weight_utils.py:938] Filesystem type for checkpoints: EXT4. Checkpoint size: 148.66 GiB. Available RAM: 114.56 GiB.
(EngineCore pid=396927) INFO 05-12 14:43:29 [weight_utils.py:968] Auto-prefetch is disabled because the filesystem (EXT4) is not a recognized network FS (NFS/Lustre) and the checkpoint size (148.66 GiB) exceeds 90% of available RAM (114.56 GiB).
Loading safetensors checkpoint shards: 0% Completed | 0/46 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 2% Completed | 1/46 [00:02<01:32, 2.06s/it]
Loading safetensors checkpoint shards: 4% Completed | 2/46 [00:09<03:55, 5.35s/it]
Loading safetensors checkpoint shards: 7% Completed | 3/46 [00:17<04:32, 6.34s/it]
Loading safetensors checkpoint shards: 9% Completed | 4/46 [00:24<04:47, 6.85s/it]
Loading safetensors checkpoint shards: 11% Completed | 5/46 [00:32<04:48, 7.03s/it]
Loading safetensors checkpoint shards: 13% Completed | 6/46 [00:39<04:44, 7.12s/it]
Loading safetensors checkpoint shards: 15% Completed | 7/46 [00:46<04:39, 7.16s/it]
Loading safetensors checkpoint shards: 17% Completed | 8/46 [00:49<03:43, 5.89s/it]
Loading safetensors checkpoint shards: 20% Completed | 9/46 [00:53<03:11, 5.17s/it]
Loading safetensors checkpoint shards: 22% Completed | 10/46 [00:56<02:45, 4.59s/it]
Loading safetensors checkpoint shards: 24% Completed | 11/46 [01:00<02:26, 4.19s/it]
Loading safetensors checkpoint shards: 26% Completed | 12/46 [01:03<02:14, 3.95s/it]
Loading safetensors checkpoint shards: 28% Completed | 13/46 [01:07<02:11, 3.99s/it]
Loading safetensors checkpoint shards: 30% Completed | 14/46 [01:11<02:06, 3.95s/it]
Loading safetensors checkpoint shards: 33% Completed | 15/46 [01:16<02:15, 4.37s/it]
Loading safetensors checkpoint shards: 35% Completed | 16/46 [01:22<02:19, 4.65s/it]
Loading safetensors checkpoint shards: 37% Completed | 17/46 [01:27<02:21, 4.87s/it]
Loading safetensors checkpoint shards: 39% Completed | 18/46 [01:32<02:14, 4.80s/it]
Loading safetensors checkpoint shards: 41% Completed | 19/46 [01:36<02:02, 4.55s/it]
Loading safetensors checkpoint shards: 43% Completed | 20/46 [01:40<02:00, 4.64s/it]
Loading safetensors checkpoint shards: 46% Completed | 21/46 [01:44<01:49, 4.38s/it]
Loading safetensors checkpoint shards: 48% Completed | 22/46 [01:48<01:41, 4.21s/it]
Loading safetensors checkpoint shards: 50% Completed | 23/46 [01:53<01:39, 4.33s/it]
Loading safetensors checkpoint shards: 52% Completed | 24/46 [01:57<01:35, 4.35s/it]
Loading safetensors checkpoint shards: 54% Completed | 25/46 [02:01<01:30, 4.32s/it]
Loading safetensors checkpoint shards: 57% Completed | 26/46 [02:05<01:25, 4.27s/it]
Loading safetensors checkpoint shards: 59% Completed | 27/46 [02:10<01:24, 4.44s/it]
Loading safetensors checkpoint shards: 61% Completed | 28/46 [02:15<01:22, 4.56s/it]
Loading safetensors checkpoint shards: 63% Completed | 29/46 [02:20<01:17, 4.57s/it]
Loading safetensors checkpoint shards: 65% Completed | 30/46 [02:24<01:10, 4.39s/it]
Loading safetensors checkpoint shards: 67% Completed | 31/46 [02:27<01:02, 4.15s/it]
Loading safetensors checkpoint shards: 70% Completed | 32/46 [02:32<01:00, 4.35s/it]
Loading safetensors checkpoint shards: 72% Completed | 33/46 [02:36<00:56, 4.38s/it]
Loading safetensors checkpoint shards: 74% Completed | 34/46 [02:40<00:50, 4.22s/it]
Loading safetensors checkpoint shards: 76% Completed | 35/46 [02:45<00:47, 4.36s/it]
Loading safetensors checkpoint shards: 78% Completed | 36/46 [02:55<01:01, 6.17s/it]
Loading safetensors checkpoint shards: 80% Completed | 37/46 [03:00<00:52, 5.79s/it]
Loading safetensors checkpoint shards: 83% Completed | 38/46 [03:06<00:45, 5.68s/it]
Loading safetensors checkpoint shards: 85% Completed | 39/46 [03:11<00:39, 5.66s/it]
Loading safetensors checkpoint shards: 87% Completed | 40/46 [03:17<00:33, 5.53s/it]
Loading safetensors checkpoint shards: 89% Completed | 41/46 [03:21<00:26, 5.30s/it]
Loading safetensors checkpoint shards: 91% Completed | 42/46 [03:27<00:21, 5.37s/it]
Loading safetensors checkpoint shards: 93% Completed | 43/46 [03:31<00:14, 4.88s/it]
Loading safetensors checkpoint shards: 96% Completed | 44/46 [03:34<00:08, 4.43s/it]
Loading safetensors checkpoint shards: 98% Completed | 45/46 [03:36<00:03, 3.73s/it]
Loading safetensors checkpoint shards: 100% Completed | 46/46 [03:37<00:00, 3.01s/it]
Loading safetensors 
checkpoint shards: 100% Completed | 46/46 [03:37<00:00, 4.74s/it] (EngineCore pid=396927) (EngineCore pid=396927) INFO 05-12 14:47:07 [default_loader.py:397] Loading weights took 217.96 seconds (EngineCore pid=396927) INFO 05-12 14:47:11 [mxfp4.py:1497] Using MoEPrepareAndFinalizeNoDPEPModular (EngineCore pid=396927) INFO 05-12 14:50:19 [gpu_model_runner.py:4965] Model loading took 17.03 GiB memory and 571.286208 seconds (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] EngineCore failed to start. (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] Traceback (most recent call last): (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 1114, in run_engine_core (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs) (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] return func(*args, **kwargs) (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 880, in init (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] super().init( (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 128, in init (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] kv_cache_config = self._initialize_kv_caches(vllm_config) (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File 
"/root/wayne/deepseekv4vllm/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] return func(*args, **kwargs) (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 250, in _initialize_kv_caches (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] available_gpu_memory = self.model_executor.determine_available_memory() (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/executor/abstract.py", line 147, in determine_available_memory (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] return self.collective_rpc("determine_available_memory") (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/executor/uniproc_executor.py", line 93, in collective_rpc (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] result = run_method(self.driver_worker, method, args, kwargs) (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/serial_utils.py", line 510, in run_method (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] return func(*args, **kwargs) (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context (EngineCore pid=396927) ERROR 05-12 14:50:20 
[core.py:1140] return func(*args, **kwargs) (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/worker/gpu_worker.py", line 392, in determine_available_memory (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] self.model_runner.profile_run() (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/worker/gpu_model_runner.py", line 5954, in profile_run (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] hidden_states, last_hidden_states = self._dummy_run( (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^ (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] return func(*args, **kwargs) (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/worker/gpu_model_runner.py", line 5622, in _dummy_run (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] outputs = self.model( (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^ (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] return self._call_impl(*args, **kwargs) (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl (EngineCore pid=396927) ERROR 05-12 
14:50:20 [core.py:1140] return forward_call(*args, **kwargs) (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1669, in forward (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] hidden_states = self.model( (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^ (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/vllm/vllm/compilation/decorators.py", line 507, in call (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] return self.forward(*args, **kwargs) (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1405, in forward (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] hidden_states, residual, post_mix, res_mix = layer( (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^ (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] return self._call_impl(*args, **kwargs) (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] return forward_call(*args, **kwargs) (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File 
"/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1213, in forward (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] x, post_mix, res_mix = self.hc_pre( (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^ (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1179, in hc_pre (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] post_mix, res_mix, layer_input = torch.ops.vllm.mhc_pre( (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/_ops.py", line 1269, in call (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] return self._op(*args, **kwargs) (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/layers/mhc.py", line 310, in mhc_pre (EngineCore pid=396927) ERROR (EngineCore pid=396927) File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run (EngineCore pid=396927) self._target(*self._args, **self._kwargs) (EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 1144, in run_engine_core (EngineCore pid=396927) raise e (EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 1114, in run_engine_core (EngineCore pid=396927) engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs) (EngineCore pid=396927) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore pid=396927) return func(*args, **kwargs) (EngineCore pid=396927) ^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=396927) File 
"/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 880, in init (EngineCore pid=396927) super().init( (EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 128, in init (EngineCore pid=396927) kv_cache_config = self._initialize_kv_caches(vllm_config) (EngineCore pid=396927) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore pid=396927) return func(*args, **kwargs) (EngineCore pid=396927) ^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core.py", line 250, in _initialize_kv_caches (EngineCore pid=396927) available_gpu_memory = self.model_executor.determine_available_memory() (EngineCore pid=396927) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/executor/abstract.py", line 147, in determine_available_memory (EngineCore pid=396927) return self.collective_rpc("determine_available_memory") (EngineCore pid=396927) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/executor/uniproc_executor.py", line 93, in collective_rpc (EngineCore pid=396927) result = run_method(self.driver_worker, method, args, kwargs) (EngineCore pid=396927) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/serial_utils.py", line 510, in run_method (EngineCore pid=396927) return func(*args, **kwarg (EngineCore pid=396927) ^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/worker/gpu_model_runner.py", line 5622, in _dummy_run (EngineCore pid=396927) outputs = self.model( (EngineCore pid=396927) ^^^^^^^^^^^ (EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in 
_wrapped_call_impl (EngineCore pid=396927) return self._call_impl(*args, **kwargs) (EngineCore pid=396927) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl (EngineCore pid=396927) return forward_call(*args, **kwargs) (EngineCore pid=396927) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1669, in forward (EngineCore pid=396927) hidden_states = self.model( (EngineCore pid=396927) ^^^^^^^^^^^ (EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/vllm/vllm/compilation/decorators.py", line 507, in call (EngineCore pid=396927) return self.forward(*args, **kwargs) (EngineCore pid=396927) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1405, in forward (EngineCore pid=396927) hidden_states, residual, post_mix, res_mix = layer( (EngineCore pid=396927) ^^^^^^ (EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1779, in _wrapped_call_impl (EngineCore pid=396927) return self._call_impl(*args, **kwargs) (EngineCore pid=396927) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1790, in _call_impl (EngineCore pid=396927) return forward_call(*args, **kwargs) (EngineCore pid=396927) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1213, in forward (EngineCore pid=396927) x, post_mix, res_mix = self.hc_pre( (EngineCore pid=396927) ^^^^^^^^^^^^ (EngineCore pid=396927) File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/models/deepseek_v4.py", line 1179, in hc_pre (EngineCore pid=396927) post_mix, res_mix, 
layer_input = torch.ops.vllm.mhc_pre(
(EngineCore pid=396927)     ^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/torch/_ops.py", line 1269, in __call__
(EngineCore pid=396927)     return self._op(*args, **kwargs)
(EngineCore pid=396927)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/model_executor/layers/mhc.py", line 310, in mhc_pre
(EngineCore pid=396927)     tf32_hc_prenorm_gemm(
(EngineCore pid=396927)   File "/root/wayne/deepseekv4vllm/vllm/vllm/utils/deep_gemm.py", line 477, in tf32_hc_prenorm_gemm
(EngineCore pid=396927)     return _tf32_hc_prenorm_gemm_impl(
(EngineCore pid=396927)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=396927) RuntimeError: Assertion error (csrc/apis/hyperconnection.hpp:56): Unsupported architecture
[rank0]:[W512 14:50:24.024626144 ProcessGroupNCCL.cpp:1575] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=396885) Traceback (most recent call last):
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/bin/vllm", line 8, in <module>
(APIServer pid=396885)     sys.exit(main())
(APIServer pid=396885)              ^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/entrypoints/cli/main.py", line 92, in main
(APIServer pid=396885)     args.dispatch_function(args)
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/entrypoints/cli/serve.py", line 122, in cmd
(APIServer pid=396885)     uvloop.run(run_server(args))
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=396885)     return __asyncio.run(
(APIServer pid=396885)            ^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/usr/lib/python3.12/asyncio/runners.py", line 194, in run
(APIServer pid=396885)     return runner.run(main)
(APIServer pid=396885)            ^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=396885)     return self._loop.run_until_complete(task)
(APIServer pid=396885)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/lib/python3.12/site-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=396885)     return await main
(APIServer pid=396885)            ^^^^^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/entrypoints/openai/api_server.py", line 679, in run_server
(APIServer pid=396885)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/entrypoints/openai/api_server.py", line 693, in run_server_worker
(APIServer pid=396885)     async with build_async_engine_client(
(APIServer pid=396885)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=396885)     return await anext(self.gen)
(APIServer pid=396885)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/entrypoints/openai/api_server.py", line 100, in build_async_engine_client
(APIServer pid=396885)     async with build_async_engine_client_from_engine_args(
(APIServer pid=396885)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=396885)     return await anext(self.gen)
(APIServer pid=396885)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/entrypoints/openai/api_server.py", line 136, in build_async_engine_client_from_engine_args
(APIServer pid=396885)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=396885)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/async_llm.py", line 217, in from_vllm_config
(APIServer pid=396885)     return cls(
(APIServer pid=396885)            ^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/async_llm.py", line 146, in __init__
(APIServer pid=396885)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=396885)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=396885)     return func(*args, **kwargs)
(APIServer pid=396885)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core_client.py", line 130, in make_async_mp_client
(APIServer pid=396885)     return AsyncMPClient(*client_args)
(APIServer pid=396885)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=396885)     return func(*args, **kwargs)
(APIServer pid=396885)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core_client.py", line 900, in __init__
(APIServer pid=396885)     super().__init__(
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/core_client.py", line 535, in __init__
(APIServer pid=396885)     with launch_core_engines(
(APIServer pid=396885)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=396885)     next(self.gen)
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/utils.py", line 1133, in launch_core_engines
(APIServer pid=396885)     wait_for_engine_startup(
(APIServer pid=396885)   File "/root/wayne/deepseekv4vllm/vllm/vllm/v1/engine/utils.py", line 1192, in wait_for_engine_startup
(APIServer pid=396885)     raise RuntimeError(
(APIServer pid=396885) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
(deepseekv4vllm) root@ubuntugpu-h2-96G-01:/wayne/deepseekv4vllm/vllm#


### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.
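
A note on reading the log above: the interleaved output contains two `RuntimeError` lines. The one from the `EngineCore` process (`Assertion error (csrc/apis/hyperconnection.hpp:56): Unsupported architecture`, raised under `tf32_hc_prenorm_gemm` while `determine_available_memory` runs `_dummy_run`) is the root cause. The `APIServer` one (`Engine core initialization failed. See root cause above.`) only reports it. As a rough sketch for pulling those lines out of a saved log, assuming the raw terminal output with one record per line, something like the following Python helper may help. `find_exceptions` and both regular expressions are hypothetical names written for this illustration; they are not part of vLLM.

```python
import re

# Per-line prefix as seen in the log, e.g.
# "(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] ".
# The level/timestamp/location part is optional because the same
# traceback is also printed without it.
PREFIX = re.compile(
    r"^\((?P<proc>\w+) pid=(?P<pid>\d+)\)\s*"
    r"(?:(?:ERROR|WARNING|INFO|DEBUG) \d{2}-\d{2} \d{2}:\d{2}:\d{2} \[[^\]]+\]\s?)?"
)
# Exception summary line such as "RuntimeError: Assertion error (...)".
EXC = re.compile(r"^(?P<type>[A-Za-z_][\w.]*(?:Error|Exception)): (?P<msg>.+)$")


def find_exceptions(log_text):
    """Return unique (process, exception type, message) tuples in log order."""
    found = []
    for raw in log_text.splitlines():
        prefix = PREFIX.match(raw)
        if not prefix:
            continue  # e.g. the "[rank0]:[W512 ...]" NCCL warning line
        body = raw[prefix.end():].strip()
        exc = EXC.match(body)
        if exc:
            item = (prefix.group("proc"), exc.group("type"), exc.group("msg"))
            if item not in found:  # the traceback is logged twice
                found.append(item)
    return found


# A few records copied from the log above, one per line (assumed raw layout).
sample = """\
(EngineCore pid=396927) ERROR 05-12 14:50:20 [core.py:1140] RuntimeError: Assertion error (csrc/apis/hyperconnection.hpp:56): Unsupported architecture
(EngineCore pid=396927)     return _tf32_hc_prenorm_gemm_impl(
(EngineCore pid=396927) RuntimeError: Assertion error (csrc/apis/hyperconnection.hpp:56): Unsupported architecture
(APIServer pid=396885)     raise RuntimeError(
(APIServer pid=396885) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
"""

for proc, exc_type, msg in find_exceptions(sample):
    print(f"{proc}: {exc_type}: {msg}")
```

The first `EngineCore` entry returned is the one to look up. The `APIServer` entry can usually be ignored, since it only says that engine startup failed.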
