vllm - 💡(How to fix) Fix [Bug]: vllm 0.17.0 部署 Qwen3.5 397b-fp8版本运行过程中异常崩溃(vllm 0.17.0 crashed unexpectedly during deployment of Qwen3.5 397b-fp8 version.) [6 comments, 4 participants]

vllm2026-03-09 11:29:51

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

vllm-project/vllm#36489•Fetched 2026-04-08 00:36:37

View on GitHub

Comments

Participants

Timeline

Reactions

Author

Participants

Timeline (top)

commented ×6subscribed ×3mentioned ×2renamed ×2

Error Message

(EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.17.0) with config: model='/models/Qwen3.5-397B-A17B-FP8', speculative_config=SpeculativeConfig(method='mtp', model='/models/Qwen3.5-397B-A17B-FP8', num_spec_tokens=2), tokenizer='/models/Qwen3.5-397B-A17B-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=8, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=nuwa, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'none', '+quant_fp8', '+quant_fp8', '+quant_fp8'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=chatcmpl-af8de0e56abe572d-873617d0,prompt_token_ids_len=6965,prefill_token_ids_len=None,mm_features=[],sampling_params=SamplingParams(n=1, presence_penalty=1.5, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[248044], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=4096, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, structured_outputs=None, extra_args=None),block_ids=([0, 0, 1069, 1070, 1071], [0, 0, 1072, 1073, 1074], [0, 0, 1075, 1076, 1077], [1078, 1079, 1080]),num_computed_tokens=0,lora_request=None,prompt_embeds_shape=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[],num_computed_tokens=[],num_output_tokens=[]), num_scheduled_tokens={chatcmpl-af8de0e56abe572d-873617d0: 1584}, total_num_scheduled_tokens=1584, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 3], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null) (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.009591115598182709, encoder_cache_usage=0.0, prefix_cache_stats=PrefixCacheStats(reset=False, requests=1, queries=6965, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None) (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] EngineCore encountered a fatal error. (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] Traceback (most recent call last): (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] File "/root/miniconda3/envs/wlvllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1093, in run_engine_core (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] engine_core.run_busy_loop() (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] File "/root/miniconda3/envs/wlvllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1128, in run_busy_loop (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] self._process_engine_step() (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] File "/root/miniconda3/envs/wlvllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1165, in _process_engine_step (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] outputs, model_executed = self.step_fn() (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] ^^^^^^^^^^^^^^ (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] File "/root/miniconda3/envs/wlvllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 497, in step_with_batch_queue (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] model_output = future.result() (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] ^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] File "/root/miniconda3/envs/wlvllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 83, in result (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] return super().result() (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] ^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] File "/root/miniconda3/envs/wlvllm/lib/python3.12/concurrent/futures/_base.py", line 449, in result (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] return self.__get_result() (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] ^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] File "/root/miniconda3/envs/wlvllm/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] raise self._exception (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] File "/root/miniconda3/envs/wlvllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 87, in wait_for_response (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] response = self.aggregate(get_response()) (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] ^^^^^^^^^^^^^^ (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] File "/root/miniconda3/envs/wlvllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 366, in get_response (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] status, result = mq.dequeue( (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] ^^^^^^^^^^^ (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] File "/root/miniconda3/envs/wlvllm/lib/python3.12/site-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 622, in dequeue

Root Cause

Below is the crash log: APIServer pid=2682023) INFO: 172.25.177.216:40184 - "GET /v1/models HTTP/1.1" 200 OK (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.17.0) with config: model='/models/Qwen3.5-397B-A17B-FP8', speculative_config=SpeculativeConfig(method='mtp', model='/models/Qwen3.5-397B-A17B-FP8', num_spec_tokens=2), tokenizer='/models/Qwen3.5-397B-A17B-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=8, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=nuwa, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['+quant_fp8', 'none', '+quant_fp8', '+quant_fp8', '+quant_fp8'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=chatcmpl-af8de0e56abe572d-873617d0,prompt_token_ids_len=6965,prefill_token_ids_len=None,mm_features=[],sampling_params=SamplingParams(n=1, presence_penalty=1.5, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[248044], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=4096, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, structured_outputs=None, extra_args=None),block_ids=([0, 0, 1069, 1070, 1071], [0, 0, 1072, 1073, 1074], [0, 0, 1075, 1076, 1077], [1078, 1079, 1080]),num_computed_tokens=0,lora_request=None,prompt_embeds_shape=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[],num_computed_tokens=[],num_output_tokens=[]), num_scheduled_tokens={chatcmpl-af8de0e56abe572d-873617d0: 1584}, total_num_scheduled_tokens=1584, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 3], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null) (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.009591115598182709, encoder_cache_usage=0.0, prefix_cache_stats=PrefixCacheStats(reset=False, requests=1, queries=6965, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None) (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] EngineCore encountered a fatal error. (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] Traceback (most recent call last): (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] File "/root/miniconda3/envs/wlvllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1093, in run_engine_core (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] engine_core.run_busy_loop() (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] File "/root/miniconda3/envs/wlvllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1128, in run_busy_loop (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] self._process_engine_step() (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] File "/root/miniconda3/envs/wlvllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1165, in _process_engine_step (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] outputs, model_executed = self.step_fn() (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] ^^^^^^^^^^^^^^ (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] File "/root/miniconda3/envs/wlvllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 497, in step_with_batch_queue (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] model_output = future.result() (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] ^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] File "/root/miniconda3/envs/wlvllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 83, in result (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] return super().result() (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] ^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] File "/root/miniconda3/envs/wlvllm/lib/python3.12/concurrent/futures/_base.py", line 449, in result (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] return self.__get_result() (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] ^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] File "/root/miniconda3/envs/wlvllm/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] raise self._exception (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] File "/root/miniconda3/envs/wlvllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 87, in wait_for_response (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] response = self.aggregate(get_response()) (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] ^^^^^^^^^^^^^^ (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] File "/root/miniconda3/envs/wlvllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 366, in get_response (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] status, result = mq.dequeue( (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] ^^^^^^^^^^^ (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] File "/root/miniconda3/envs/wlvllm/lib/python3.12/site-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 622, in dequeue (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] with self.acquire_read(timeout, cancel, indefinite) as buf: (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] File "/root/miniconda3/envs/wlvllm/lib/python3.12/contextlib.py", line 137, in enter (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] return next(self.gen) (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] ^^^^^^^^^^^^^^ (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] File "/root/miniconda3/envs/wlvllm/lib/python3.12/site-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 537, in acquire_read (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] raise RuntimeError("cancelled") (EngineCore_DP0 pid=2682128) ERROR 03-09 19:11:40 [core.py:1102] RuntimeError: cancelled (EngineCore_DP0 pid=2682128) Process EngineCore_DP0: (APIServer pid=2682023) ERROR 03-09 19:11:40 [async_llm.py:708] AsyncLLM output_handler failed. (APIServer pid=2682023) ERROR 03-09 19:11:40 [async_llm.py:708] Traceback (most recent call last): (APIServer pid=2682023) ERROR 03-09 19:11:40 [async_llm.py:708] File "/root/miniconda3/envs/wlvllm/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 664, in output_handler (APIServer pid=2682023) ERROR 03-09 19:11:40 [async_llm.py:708] outputs = await engine_core.get_output_async() (APIServer pid=2682023) ERROR 03-09 19:11:40 [async_llm.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (APIServer pid=2682023) ERROR 03-09 19:11:40 [async_llm.py:708] File "/root/miniconda3/envs/wlvllm/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 1009, in get_output_async (APIServer pid=2682023) ERROR 03-09 19:11:40 [async_llm.py:708] raise self._format_exception(outputs) from None (APIServer pid=2682023) ERROR 03-09 19:11:40 [async_llm.py:708] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause. (EngineCore_DP0 pid=2682128) Traceback (most recent call last): (EngineCore_DP0 pid=2682128) File "/root/miniconda3/envs/wlvllm/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap (EngineCore_DP0 pid=2682128) self.run() (EngineCore_DP0 pid=2682128) File "/root/miniconda3/envs/wlvllm/lib/python3.12/multiprocessing/process.py", line 108, in run (EngineCore_DP0 pid=2682128) self._target(*self._args, **self._kwargs) (EngineCore_DP0 pid=2682128) File "/root/miniconda3/envs/wlvllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1104, in run_engine_core (EngineCore_DP0 pid=2682128) raise e (EngineCore_DP0 pid=2682128) File "/root/miniconda3/envs/wlvllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1093, in run_engine_core (EngineCore_DP0 pid=2682128) engine_core.run_busy_loop() (EngineCore_DP0 pid=2682128) File "/root/miniconda3/envs/wlvllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1128, in run_busy_loop (EngineCore_DP0 pid=2682128) self._process_engine_step() (EngineCore_DP0 pid=2682128) File "/root/miniconda3/envs/wlvllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1165, in _process_engine_step (EngineCore_DP0 pid=2682128) outputs, model_executed = self.step_fn() (EngineCore_DP0 pid=2682128) ^^^^^^^^^^^^^^ (EngineCore_DP0 pid=2682128) File "/root/miniconda3/envs/wlvllm/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 497, in step_with_batch_queue (EngineCore_DP0 pid=2682128) model_output = future.result() (EngineCore_DP0 pid=2682128) ^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=2682128) File "/root/miniconda3/envs/wlvllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 83, in result (EngineCore_DP0 pid=2682128) return super().result() (EngineCore_DP0 pid=2682128) ^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=2682128) File "/root/miniconda3/envs/wlvllm/lib/python3.12/concurrent/futures/_base.py", line 449, in result (EngineCore_DP0 pid=2682128) return self.__get_result() (EngineCore_DP0 pid=2682128) ^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=2682128) File "/root/miniconda3/envs/wlvllm/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result (EngineCore_DP0 pid=2682128) raise self._exception (EngineCore_DP0 pid=2682128) File "/root/miniconda3/envs/wlvllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 87, in wait_for_response (EngineCore_DP0 pid=2682128) response = self.aggregate(get_response()) (EngineCore_DP0 pid=2682128) ^^^^^^^^^^^^^^ (EngineCore_DP0 pid=2682128) File "/root/miniconda3/envs/wlvllm/lib/python3.12/site-packages/vllm/v1/executor/multiproc_executor.py", line 366, in get_response (EngineCore_DP0 pid=2682128) status, result = mq.dequeue( (EngineCore_DP0 pid=2682128) ^^^^^^^^^^^ (EngineCore_DP0 pid=2682128) File "/root/miniconda3/envs/wlvllm/lib/python3.12/site-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 622, in dequeue (EngineCore_DP0 pid=2682128) with self.acquire_read(timeout, cancel, indefinite) as buf: (EngineCore_DP0 pid=2682128) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=2682128) File "/root/miniconda3/envs/wlvllm/lib/python3.12/contextlib.py", line 137, in enter (EngineCore_DP0 pid=2682128) return next(self.gen) (EngineCore_DP0 pid=2682128) ^^^^^^^^^^^^^^ (EngineCore_DP0 pid=2682128) File "/root/miniconda3/envs/wlvllm/lib/python3.12/site-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 537, in acquire_read (EngineCore_DP0 pid=2682128) raise RuntimeError("cancelled") (EngineCore_DP0 pid=2682128) RuntimeError: cancelled (APIServer pid=2682023) ERROR 03-09 19:11:40 [serving.py:1390] Error in chat completion stream generator. (APIServer pid=2682023) ERROR 03-09 19:11:40 [serving.py:1390] Traceback (most recent call last): (APIServer pid=2682023) ERROR 03-09 19:11:40 [serving.py:1390] File "/root/miniconda3/envs/wlvllm/lib/python3.12/site-packages/vllm/entrypoints/openai/chat_completion/serving.py", line 714, in chat_completion_stream_generator (APIServer pid=2682023) ERROR 03-09 19:11:40 [serving.py:1390] async for res in result_generator: (APIServer pid=2682023) ERROR 03-09 19:11:40 [serving.py:1390] File "/root/miniconda3/envs/wlvllm/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 583, in generate (APIServer pid=2682023) ERROR 03-09 19:11:40 [serving.py:1390] out = q.get_nowait() or await q.get() (APIServer pid=2682023) ERROR 03-09 19:11:40 [serving.py:1390] ^^^^^^^^^^^^^ (APIServer pid=2682023) ERROR 03-09 19:11:40 [serving.py:1390] File "/root/miniconda3/envs/wlvllm/lib/python3.12/site-packages/vllm/v1/engine/output_processor.py", line 85, in get (APIServer pid=2682023) ERROR 03-09 19:11:40 [serving.py:1390] raise output (APIServer pid=2682023) ERROR 03-09 19:11:40 [serving.py:1390] File "/root/miniconda3/envs/wlvllm/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 664, in output_handler (APIServer pid=2682023) ERROR 03-09 19:11:40 [serving.py:1390] outputs = await engine_core.get_output_async() (APIServer pid=2682023) ERROR 03-09 19:11:40 [serving.py:1390] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (APIServer pid=2682023) ERROR 03-09 19:11:40 [serving.py:1390] File "/root/miniconda3/envs/wlvllm/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 1009, in get_output_async (APIServer pid=2682023) ERROR 03-09 19:11:40 [serving.py:1390] raise self._format_exception(outputs) from None (APIServer pid=2682023) ERROR 03-09 19:11:40 [serving.py:1390] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause. (APIServer pid=2682023) INFO: 172.25.177.217:33862 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error (APIServer pid=2682023) INFO: 172.25.177.217:33864 - "POST /v1/chat/completions HTTP/1.1" 500 Internal Server Error (APIServer pid=2682023) INFO: Shutting down (APIServer pid=2682023) INFO: Waiting for application shutdown. (APIServer pid=2682023) INFO: Application shutdown complete. (APIServer pid=2682023) INFO: Finished server process [2682023] /root/miniconda3/envs/wlvllm/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 8 leaked semaphore objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d ' /root/miniconda3/envs/wlvllm/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 9 leaked shared_memory objects to clean up at shutdown warnings.warn('resource_tracker: There appear to be %d '

Code Example

Your output of `python collect_env.py` here

RAW_BUFFERClick to expand / collapse

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

Your output of `python collect_env.py` here

</details>

🐛 Describe the bug

运行一段时间后，出现下面日志后异常退出。运行设备A100，运行命令如下，vllm版本0.17.0版本。 After running for a period of time, the system crashed due to the following log entry. The command used was to run device A100, with vllm version 0.17.0.

cmd： CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 nohup python3 -m vllm.entrypoints.openai.api_server --model "/models/Qwen3.5-397B-A17B-FP8" --served-model-name nuwa --port 60001 --enable-prefix-caching --tensor-parallel-size 8 --enable-auto-tool-choice --tool-call-parser qwen3_coder --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' --reasoning-parser qwen3 --gpu-memory-utilization 0.85 --language-model-only > vllmQwen35-397B.log 2>&1 &

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

The error message indicates a RuntimeError: cancelled exception, which suggests that there's an issue with the asynchronous processing of requests. To fix this, we can try the following steps:

Increase the timeout value: In the shm_broadcast.py file, increase the timeout value in the acquire_read method to allow for more time to process the requests.
Check for resource leaks: The warning messages indicate that there are leaked semaphore and shared memory objects. Make sure to properly clean up these resources to prevent memory leaks.
Update the vllm library: Ensure that you're using the latest version of the vllm library, as this issue might have been fixed in a newer version.

Here's an example code snippet that demonstrates how to increase the timeout value:

# In shm_broadcast.py
def acquire_read(self, timeout=60, cancel=False, indefinite=False):
    # Increase the timeout value to 300 seconds
    timeout = 300
    # ... rest of the method remains the same ...

Additionally, you can try to add error handling to catch and handle the RuntimeError: cancelled exception:

try:
    # Code that might raise the exception
    outputs = await engine_core.get_output_async()
except RuntimeError as e:
    if "cancelled" in str(e):
        # Handle the cancelled exception
        print("Request was cancelled")
    else:
        raise

Verification

To verify that the fix worked, you can try running the same command that caused the error and check if the issue persists. You can also add logging statements to track the execution of the code and identify any potential issues.

Extra Tips

Make sure to properly clean up resources to prevent memory leaks.
Consider adding retry logic to handle temporary errors and improve the robustness of your code.
If you're still experiencing issues, try to isolate the problem by running a minimal example that reproduces the error.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #ssr #installation #tensor shape #autograd error #batch processing #GPU compatibility #latency issue #model loading

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

vllm - 💡(How to fix) Fix [Bug]: vllm 0.17.0 部署 Qwen3.5 397b-fp8版本运行过程中异常崩溃(vllm 0.17.0 crashed unexpectedly during deployment of Qwen3.5 397b-fp8 version.) [6 comments, 4 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Code Example

Your current environment

🐛 Describe the bug

Before submitting a new issue...

extent analysis

Fix Plan

Verification

Extra Tips

Still need to ship something?

TRENDING

vllm - 💡(How to fix) Fix [Bug]: vllm 0.17.0 部署 Qwen3.5 397b-fp8版本运行过程中异常崩溃(vllm 0.17.0 crashed unexpectedly during deployment of Qwen3.5 397b-fp8 version.) [6 comments, 4 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Code Example

Your current environment

🐛 Describe the bug

Before submitting a new issue...

extent analysis

Fix Plan

Verification

Extra Tips

Still need to ship something?

RELATED_DISCOVERY

TRENDING