vllm - 💡(How to fix) Fix [Bug]: v0.17.0 4*2080ti 22G Qwen3.5 RPC call to sample_tokens timed out. [7 comments, 4 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
vllm-project/vllm#36631Fetched 2026-04-08 00:35:48
View on GitHub
Comments
7
Participants
4
Timeline
21
Reactions
0
Timeline (top)
commented ×7mentioned ×5subscribed ×5renamed ×2

Error Message

(APIServer pid=1) INFO: 192.168.5.200:61737 - "POST /v1/chat/completions HTTP/1.1" 200 OK (EngineCore_DP0 pid=188) INFO 03-10 09:22:02 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization). (EngineCore_DP0 pid=188) INFO 03-10 09:23:02 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization). (EngineCore_DP0 pid=188) INFO 03-10 09:24:02 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization). (EngineCore_DP0 pid=188) INFO 03-10 09:25:02 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization). (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.17.0) with config: model='/models/Qwen/Qwen3.5-35B-A3B-AWQ-4bit', speculative_config=None, tokenizer='/models/Qwen/Qwen3.5-35B-A3B-AWQ-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen3.5-35B-A3B, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 8, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}, (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=chatcmpl-aa3f62a4eb5bd12f-a29b0d57,prompt_token_ids_len=17,prefill_token_ids_len=None,mm_features=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.2, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[248044], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=131055, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, structured_outputs=None, extra_args=None),block_ids=([1], [2], [3], [4]),num_computed_tokens=0,lora_request=None,prompt_embeds_shape=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[],num_computed_tokens=[],num_output_tokens=[]), num_scheduled_tokens={chatcmpl-aa3f62a4eb5bd12f-a29b0d57: 17}, total_num_scheduled_tokens=17, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 0], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null) (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.0017937219730941312, encoder_cache_usage=0.0, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None) (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] EngineCore encountered a fatal error. (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] Traceback (most recent call last): (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 366, in get_response (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] status, result = mq.dequeue( (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] ^^^^^^^^^^^ (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 622, in dequeue (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] with self.acquire_read(timeout, cancel, indefinite) as buf: (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] File "/usr/lib/python3.12/contextlib.py", line 137, in enter (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] return next(self.gen) (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] ^^^^^^^^^^^^^^ (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 542, in acquire_read (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] raise TimeoutError (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] TimeoutError (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] The above exception was the direct cause of the following exception: (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] Traceback (most recent call last): (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1093, in run_engine_core (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] engine_core.run_busy_loop() (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1128, in run_busy_loop (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] self._process_engine_step() (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1165, in _process_engine_step (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] outputs, model_executed = self.step_fn() (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] ^^^^^^^^^^^^^^ (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 497, in step_with_batch_queue (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] model_output = future.result() (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] ^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 83, in result (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] return super().result() (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] ^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] return self.__get_result() (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] ^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] raise self._exception (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 87, in wait_for_response (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] response = self.aggregate(get_response()) (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] ^^^^^^^^^^^^^^ (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 370, in get_response (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] raise TimeoutError(f"RPC call to {method} timed out.") from e (EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] TimeoutError: RPC call to sample_tokens timed out. (Worker pid=275) (Worker_TP0 pid=275) INFO 03-10 09:26:02 [multiproc_executor.py:749] Parent process exited, terminating worker (Worker pid=295) (Worker_TP2 pid=295) INFO 03-10 09:26:02 [multiproc_executor.py:749] Parent process exited, terminating worker (Worker pid=279) (Worker_TP1 pid=279) INFO 03-10 09:26:02 [multiproc_executor.py:749] Parent process exited, terminating worker (Worker pid=315) (Worker_TP3 pid=315) INFO 03-10 09:26:02 [multiproc_executor.py:749] Parent process exited, terminating worker (APIServer pid=1) ERROR 03-10 09:26:02 [async_llm.py:708] AsyncLLM output_handler failed. (APIServer pid=1) ERROR 03-10 09:26:02 [async_llm.py:708] Traceback (most recent call last): (APIServer pid=1) ERROR 03-10 09:26:02 [async_llm.py:708] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 664, in output_handler (APIServer pid=1) ERROR 03-10 09:26:02 [async_llm.py:708] outputs = await engine_core.get_output_async() (APIServer pid=1) ERROR 03-10 09:26:02 [async_llm.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (APIServer pid=1) ERROR 03-10 09:26:02 [async_llm.py:708] File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 1009, in get_output_async (APIServer pid=1) ERROR 03-10 09:26:02 [async_llm.py:708] raise self._format_exception(outputs) from None (APIServer pid=1) ERROR 03-10 09:26:02 [async_llm.py:708] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause. (Worker pid=279) (Worker_TP1 pid=279) WARNING 03-10 09:26:06 [multiproc_executor.py:814] WorkerProc was terminated (Worker pid=295) (Worker_TP2 pid=295) WARNING 03-10 09:26:06 [multiproc_executor.py:814] WorkerProc was terminated (Worker pid=275) (Worker_TP0 pid=275) WARNING 03-10 09:26:06 [multiproc_executor.py:814] WorkerProc was terminated (Worker pid=315) (Worker_TP3 pid=315) WARNING 03-10 09:26:06 [multiproc_executor.py:814] WorkerProc was terminated (APIServer pid=1) INFO: Shutting down (APIServer pid=1) INFO: Waiting for application shutdown. (APIServer pid=1) INFO: Application shutdown complete. (APIServer pid=1) INFO: Finished server process [1]

Root Cause

(APIServer pid=1) INFO:     192.168.5.200:61737 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(EngineCore_DP0 pid=188) INFO 03-10 09:22:02 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=188) INFO 03-10 09:23:02 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=188) INFO 03-10 09:24:02 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=188) INFO 03-10 09:25:02 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.17.0) with config: model='/models/Qwen/Qwen3.5-35B-A3B-AWQ-4bit', speculative_config=None, tokenizer='/models/Qwen/Qwen3.5-35B-A3B-AWQ-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen3.5-35B-A3B, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 8, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}, 
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=chatcmpl-aa3f62a4eb5bd12f-a29b0d57,prompt_token_ids_len=17,prefill_token_ids_len=None,mm_features=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.2, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[248044], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=131055, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, structured_outputs=None, extra_args=None),block_ids=([1], [2], [3], [4]),num_computed_tokens=0,lora_request=None,prompt_embeds_shape=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[],num_computed_tokens=[],num_output_tokens=[]), num_scheduled_tokens={chatcmpl-aa3f62a4eb5bd12f-a29b0d57: 17}, total_num_scheduled_tokens=17, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 0], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null)
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.0017937219730941312, encoder_cache_usage=0.0, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] Traceback (most recent call last):
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 366, in get_response
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     status, result = mq.dequeue(
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]                      ^^^^^^^^^^^
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 622, in dequeue
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     with self.acquire_read(timeout, cancel, indefinite) as buf:
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/lib/python3.12/contextlib.py", line 137, in __enter__
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     return next(self.gen)
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]            ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 542, in acquire_read
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     raise TimeoutError
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] TimeoutError
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] 
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] The above exception was the direct cause of the following exception:
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] 
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] Traceback (most recent call last):
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1093, in run_engine_core
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     engine_core.run_busy_loop()
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1128, in run_busy_loop
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     self._process_engine_step()
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1165, in _process_engine_step
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 497, in step_with_batch_queue
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     model_output = future.result()
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]                    ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 83, in result
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     return super().result()
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]            ^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     return self.__get_result()
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     raise self._exception
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 87, in wait_for_response
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     response = self.aggregate(get_response())
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 370, in get_response
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     raise TimeoutError(f"RPC call to {method} timed out.") from e
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] TimeoutError: RPC call to sample_tokens timed out.
(Worker pid=275) (Worker_TP0 pid=275) INFO 03-10 09:26:02 [multiproc_executor.py:749] Parent process exited, terminating worker
(Worker pid=295) (Worker_TP2 pid=295) INFO 03-10 09:26:02 [multiproc_executor.py:749] Parent process exited, terminating worker
(Worker pid=279) (Worker_TP1 pid=279) INFO 03-10 09:26:02 [multiproc_executor.py:749] Parent process exited, terminating worker
(Worker pid=315) (Worker_TP3 pid=315) INFO 03-10 09:26:02 [multiproc_executor.py:749] Parent process exited, terminating worker
(APIServer pid=1) ERROR 03-10 09:26:02 [async_llm.py:708] AsyncLLM output_handler failed.
(APIServer pid=1) ERROR 03-10 09:26:02 [async_llm.py:708] Traceback (most recent call last):
(APIServer pid=1) ERROR 03-10 09:26:02 [async_llm.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 664, in output_handler
(APIServer pid=1) ERROR 03-10 09:26:02 [async_llm.py:708]     outputs = await engine_core.get_output_async()
(APIServer pid=1) ERROR 03-10 09:26:02 [async_llm.py:708]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 03-10 09:26:02 [async_llm.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 1009, in get_output_async
(APIServer pid=1) ERROR 03-10 09:26:02 [async_llm.py:708]     raise self._format_exception(outputs) from None
(APIServer pid=1) ERROR 03-10 09:26:02 [async_llm.py:708] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(Worker pid=279) (Worker_TP1 pid=279) WARNING 03-10 09:26:06 [multiproc_executor.py:814] WorkerProc was terminated
(Worker pid=295) (Worker_TP2 pid=295) WARNING 03-10 09:26:06 [multiproc_executor.py:814] WorkerProc was terminated
(Worker pid=275) (Worker_TP0 pid=275) WARNING 03-10 09:26:06 [multiproc_executor.py:814] WorkerProc was terminated
(Worker pid=315) (Worker_TP3 pid=315) WARNING 03-10 09:26:06 [multiproc_executor.py:814] WorkerProc was terminated
(APIServer pid=1) INFO:     Shutting down
(APIServer pid=1) INFO:     Waiting for application shutdown.
(APIServer pid=1) INFO:     Application shutdown complete.
(APIServer pid=1) INFO:     Finished server process [1]

Code Example

docker run -d --name vllm-Qwen3.5 \
  --gpus all \
  --ipc=host \
  --shm-size=64g \
  -p 11435:8000 \
  -v /home/ubuntu/models:/models \
  vllm/vllm-openai:v0.17.0 \
  --model /models/Qwen/Qwen3.5-35B-A3B-AWQ-4bit \
  --served-model-name Qwen3.5-35B-A3B \
  --api-key token-llm \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90 \
  --dtype half \
  --kv-cache-dtype auto \
  --block-size 16 \
  --swap-space 4 \
  --max-model-len 32768 \
  --max-num-seqs 4 \
  --max-cudagraph-capture-size 8 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --default-chat-template-kwargs '{"enable_thinking": false}' \
  --override-generation-config '{"temperature":0.7,"top_p":0.8,"top_k":20,"min_p":0.0,"presence_penalty":1.5,"repetition_penalty":1.0}'

---

(APIServer pid=1) INFO:     192.168.5.200:61737 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(EngineCore_DP0 pid=188) INFO 03-10 09:22:02 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=188) INFO 03-10 09:23:02 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=188) INFO 03-10 09:24:02 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=188) INFO 03-10 09:25:02 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.17.0) with config: model='/models/Qwen/Qwen3.5-35B-A3B-AWQ-4bit', speculative_config=None, tokenizer='/models/Qwen/Qwen3.5-35B-A3B-AWQ-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen3.5-35B-A3B, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 8, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}, 
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=chatcmpl-aa3f62a4eb5bd12f-a29b0d57,prompt_token_ids_len=17,prefill_token_ids_len=None,mm_features=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.2, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[248044], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=131055, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, structured_outputs=None, extra_args=None),block_ids=([1], [2], [3], [4]),num_computed_tokens=0,lora_request=None,prompt_embeds_shape=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[],num_computed_tokens=[],num_output_tokens=[]), num_scheduled_tokens={chatcmpl-aa3f62a4eb5bd12f-a29b0d57: 17}, total_num_scheduled_tokens=17, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 0], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null)
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.0017937219730941312, encoder_cache_usage=0.0, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] Traceback (most recent call last):
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 366, in get_response
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     status, result = mq.dequeue(
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]                      ^^^^^^^^^^^
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 622, in dequeue
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     with self.acquire_read(timeout, cancel, indefinite) as buf:
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/lib/python3.12/contextlib.py", line 137, in __enter__
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     return next(self.gen)
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]            ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 542, in acquire_read
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     raise TimeoutError
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] TimeoutError
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] 
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] The above exception was the direct cause of the following exception:
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] 
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] Traceback (most recent call last):
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1093, in run_engine_core
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     engine_core.run_busy_loop()
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1128, in run_busy_loop
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     self._process_engine_step()
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1165, in _process_engine_step
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 497, in step_with_batch_queue
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     model_output = future.result()
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]                    ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 83, in result
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     return super().result()
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]            ^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     return self.__get_result()
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     raise self._exception
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 87, in wait_for_response
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     response = self.aggregate(get_response())
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 370, in get_response
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     raise TimeoutError(f"RPC call to {method} timed out.") from e
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] TimeoutError: RPC call to sample_tokens timed out.
(Worker pid=275) (Worker_TP0 pid=275) INFO 03-10 09:26:02 [multiproc_executor.py:749] Parent process exited, terminating worker
(Worker pid=295) (Worker_TP2 pid=295) INFO 03-10 09:26:02 [multiproc_executor.py:749] Parent process exited, terminating worker
(Worker pid=279) (Worker_TP1 pid=279) INFO 03-10 09:26:02 [multiproc_executor.py:749] Parent process exited, terminating worker
(Worker pid=315) (Worker_TP3 pid=315) INFO 03-10 09:26:02 [multiproc_executor.py:749] Parent process exited, terminating worker
(APIServer pid=1) ERROR 03-10 09:26:02 [async_llm.py:708] AsyncLLM output_handler failed.
(APIServer pid=1) ERROR 03-10 09:26:02 [async_llm.py:708] Traceback (most recent call last):
(APIServer pid=1) ERROR 03-10 09:26:02 [async_llm.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 664, in output_handler
(APIServer pid=1) ERROR 03-10 09:26:02 [async_llm.py:708]     outputs = await engine_core.get_output_async()
(APIServer pid=1) ERROR 03-10 09:26:02 [async_llm.py:708]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 03-10 09:26:02 [async_llm.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 1009, in get_output_async
(APIServer pid=1) ERROR 03-10 09:26:02 [async_llm.py:708]     raise self._format_exception(outputs) from None
(APIServer pid=1) ERROR 03-10 09:26:02 [async_llm.py:708] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(Worker pid=279) (Worker_TP1 pid=279) WARNING 03-10 09:26:06 [multiproc_executor.py:814] WorkerProc was terminated
(Worker pid=295) (Worker_TP2 pid=295) WARNING 03-10 09:26:06 [multiproc_executor.py:814] WorkerProc was terminated
(Worker pid=275) (Worker_TP0 pid=275) WARNING 03-10 09:26:06 [multiproc_executor.py:814] WorkerProc was terminated
(Worker pid=315) (Worker_TP3 pid=315) WARNING 03-10 09:26:06 [multiproc_executor.py:814] WorkerProc was terminated
(APIServer pid=1) INFO:     Shutting down
(APIServer pid=1) INFO:     Waiting for application shutdown.
(APIServer pid=1) INFO:     Application shutdown complete.
(APIServer pid=1) INFO:     Finished server process [1]
RAW_BUFFERClick to expand / collapse

Your current environment

docker run -d --name vllm-Qwen3.5 \
  --gpus all \
  --ipc=host \
  --shm-size=64g \
  -p 11435:8000 \
  -v /home/ubuntu/models:/models \
  vllm/vllm-openai:v0.17.0 \
  --model /models/Qwen/Qwen3.5-35B-A3B-AWQ-4bit \
  --served-model-name Qwen3.5-35B-A3B \
  --api-key token-llm \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90 \
  --dtype half \
  --kv-cache-dtype auto \
  --block-size 16 \
  --swap-space 4 \
  --max-model-len 32768 \
  --max-num-seqs 4 \
  --max-cudagraph-capture-size 8 \
  --reasoning-parser qwen3 \
  --tool-call-parser qwen3_coder \
  --enable-auto-tool-choice \
  --default-chat-template-kwargs '{"enable_thinking": false}' \
  --override-generation-config '{"temperature":0.7,"top_p":0.8,"top_k":20,"min_p":0.0,"presence_penalty":1.5,"repetition_penalty":1.0}'

🐛 Describe the bug

Log:

(APIServer pid=1) INFO:     192.168.5.200:61737 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(EngineCore_DP0 pid=188) INFO 03-10 09:22:02 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=188) INFO 03-10 09:23:02 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=188) INFO 03-10 09:24:02 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=188) INFO 03-10 09:25:02 [shm_broadcast.py:548] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.17.0) with config: model='/models/Qwen/Qwen3.5-35B-A3B-AWQ-4bit', speculative_config=None, tokenizer='/models/Qwen/Qwen3.5-35B-A3B-AWQ-4bit', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=4, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='qwen3', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=Qwen3.5-35B-A3B, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 8, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}, 
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=chatcmpl-aa3f62a4eb5bd12f-a29b0d57,prompt_token_ids_len=17,prefill_token_ids_len=None,mm_features=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.2, top_p=0.95, top_k=20, min_p=0.0, seed=None, stop=[], stop_token_ids=[248044], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=131055, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, structured_outputs=None, extra_args=None),block_ids=([1], [2], [3], [4]),num_computed_tokens=0,lora_request=None,prompt_embeds_shape=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[],num_computed_tokens=[],num_output_tokens=[]), num_scheduled_tokens={chatcmpl-aa3f62a4eb5bd12f-a29b0d57: 17}, total_num_scheduled_tokens=17, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 0], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null)
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.0017937219730941312, encoder_cache_usage=0.0, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] Traceback (most recent call last):
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 366, in get_response
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     status, result = mq.dequeue(
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]                      ^^^^^^^^^^^
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 622, in dequeue
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     with self.acquire_read(timeout, cancel, indefinite) as buf:
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/lib/python3.12/contextlib.py", line 137, in __enter__
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     return next(self.gen)
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]            ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/distributed/device_communicators/shm_broadcast.py", line 542, in acquire_read
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     raise TimeoutError
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] TimeoutError
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] 
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] The above exception was the direct cause of the following exception:
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] 
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] Traceback (most recent call last):
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1093, in run_engine_core
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     engine_core.run_busy_loop()
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1128, in run_busy_loop
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     self._process_engine_step()
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 1165, in _process_engine_step
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core.py", line 497, in step_with_batch_queue
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     model_output = future.result()
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]                    ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 83, in result
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     return super().result()
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]            ^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     return self.__get_result()
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     raise self._exception
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 87, in wait_for_response
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     response = self.aggregate(get_response())
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 370, in get_response
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102]     raise TimeoutError(f"RPC call to {method} timed out.") from e
(EngineCore_DP0 pid=188) ERROR 03-10 09:26:02 [core.py:1102] TimeoutError: RPC call to sample_tokens timed out.
(Worker pid=275) (Worker_TP0 pid=275) INFO 03-10 09:26:02 [multiproc_executor.py:749] Parent process exited, terminating worker
(Worker pid=295) (Worker_TP2 pid=295) INFO 03-10 09:26:02 [multiproc_executor.py:749] Parent process exited, terminating worker
(Worker pid=279) (Worker_TP1 pid=279) INFO 03-10 09:26:02 [multiproc_executor.py:749] Parent process exited, terminating worker
(Worker pid=315) (Worker_TP3 pid=315) INFO 03-10 09:26:02 [multiproc_executor.py:749] Parent process exited, terminating worker
(APIServer pid=1) ERROR 03-10 09:26:02 [async_llm.py:708] AsyncLLM output_handler failed.
(APIServer pid=1) ERROR 03-10 09:26:02 [async_llm.py:708] Traceback (most recent call last):
(APIServer pid=1) ERROR 03-10 09:26:02 [async_llm.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 664, in output_handler
(APIServer pid=1) ERROR 03-10 09:26:02 [async_llm.py:708]     outputs = await engine_core.get_output_async()
(APIServer pid=1) ERROR 03-10 09:26:02 [async_llm.py:708]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 03-10 09:26:02 [async_llm.py:708]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/core_client.py", line 1009, in get_output_async
(APIServer pid=1) ERROR 03-10 09:26:02 [async_llm.py:708]     raise self._format_exception(outputs) from None
(APIServer pid=1) ERROR 03-10 09:26:02 [async_llm.py:708] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(Worker pid=279) (Worker_TP1 pid=279) WARNING 03-10 09:26:06 [multiproc_executor.py:814] WorkerProc was terminated
(Worker pid=295) (Worker_TP2 pid=295) WARNING 03-10 09:26:06 [multiproc_executor.py:814] WorkerProc was terminated
(Worker pid=275) (Worker_TP0 pid=275) WARNING 03-10 09:26:06 [multiproc_executor.py:814] WorkerProc was terminated
(Worker pid=315) (Worker_TP3 pid=315) WARNING 03-10 09:26:06 [multiproc_executor.py:814] WorkerProc was terminated
(APIServer pid=1) INFO:     Shutting down
(APIServer pid=1) INFO:     Waiting for application shutdown.
(APIServer pid=1) INFO:     Application shutdown complete.
(APIServer pid=1) INFO:     Finished server process [1]

Frequently sees 'No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization)', and an exception appears after waiting.

All Passed for 4 * L20 And Passed for 4 * 2080ti 22g version: 0.16.1rc1.dev123+ga60985b07

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

The issue seems to be related to a timeout error in the shm_broadcast.py module, which is likely caused by a deadlock or a process hanging. To fix this issue, we can try the following steps:

  • Increase the timeout value in the shm_broadcast.py module to give the processes more time to complete their tasks.
  • Check for any deadlocks or processes that are hanging and causing the timeout error.
  • Update the vllm library to the latest version, as the issue seems to be fixed in version 0.16.1rc1.dev123+ga60985b07.

Here is an example of how to increase the timeout value:

# In shm_broadcast.py, increase the timeout value
def acquire_read(self, timeout=120, cancel=False, indefinite=False):
    # Increase the timeout value from 60 to 120 seconds
    timeout = 120
    ...

Alternatively, you can try to update the vllm library to the latest version using the following command:

pip install --upgrade vllm

Verification

To verify that the fix worked, you can try running the same command that caused the error and check if the error still occurs. If the error does not occur, it means that the fix was successful.

You can also try to monitor the processes and check if there are any deadlocks or processes that are hanging. You can use tools like top or htop to monitor the processes and check their status.

Extra Tips

  • Make sure to check the documentation and the issues page for any known issues or fixes related to the error you are experiencing.
  • Try to reproduce the error and provide as much information as possible to help the developers debug and fix the issue.
  • If you are using a GPU, make sure that it is properly configured and that the drivers are up to date.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING