vllm - [Bug]: TimeoutError: RPC call to sample_tokens timed out when pipeline parallelism (PP) is enabled in an XPU environment

vllm · 2026-04-04 10:25:39

vllm-project/vllm#38976 • Fetched 2026-04-08 02:44:37

Author

zwh20081

Participants

zwh20081

Timeline (top)

labeled ×1, renamed ×1

Error Message

WARNING 04-04 18:12:14 [argparse_utils.py:191] With vllm serve, you should provide the model as a positional argument or in a config file instead of via the --model option. The --model option will be removed in v0.13. (APIServer pid=335596) INFO 04-04 18:12:14 [utils.py:299] (APIServer pid=335596) INFO 04-04 18:12:14 [utils.py:299] █ █ █▄ ▄█ (APIServer pid=335596) INFO 04-04 18:12:14 [utils.py:299] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.19.1rc1.dev29+g93726b2a1 (APIServer pid=335596) INFO 04-04 18:12:14 [utils.py:299] █▄█▀ █ █ █ █ model /data/models/gpt-oss-20b/ (APIServer pid=335596) INFO 04-04 18:12:14 [utils.py:299] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀ (APIServer pid=335596) INFO 04-04 18:12:14 [utils.py:299] (APIServer pid=335596) INFO 04-04 18:12:14 [utils.py:233] non-default args: {'model_tag': '/data/models/gpt-oss-20b/', 'host': '0.0.0.0', 'model': '/data/models/gpt-oss-20b/', 'trust_remote_code': True, 'max_model_len': 8192, 'enforce_eager': True, 'disable_sliding_window': True, 'distributed_executor_backend': 'mp', 'pipeline_parallel_size': 2, 'tensor_parallel_size': 2, 'block_size': 64, 'gpu_memory_utilization': 0.87, 'max_num_batched_tokens': 8192} (APIServer pid=335596) INFO 04-04 18:12:14 [model.py:554] Resolved architecture: GptOssForCausalLM (APIServer pid=335596) INFO 04-04 18:12:14 [model.py:1685] Using max model len 8192 (APIServer pid=335596) INFO 04-04 18:12:14 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=8192. (APIServer pid=335596) INFO 04-04 18:12:14 [config.py:126] Overriding max cuda graph capture size to 1024 for performance. (APIServer pid=335596) INFO 04-04 18:12:14 [vllm.py:799] Asynchronous scheduling is enabled. (APIServer pid=335596) WARNING 04-04 18:12:14 [vllm.py:857] Enforce eager set, disabling torch.compile and CUDAGraphs. 
This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none (APIServer pid=335596) WARNING 04-04 18:12:14 [vllm.py:868] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored. (APIServer pid=335596) INFO 04-04 18:12:14 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['xpu_kernels', 'native']) (APIServer pid=335596) INFO 04-04 18:12:14 [vllm.py:1046] Cudagraph is disabled under eager mode (APIServer pid=335596) WARNING 04-04 18:12:14 [xpu.py:190] XPU Graph doesn't support capture communication ops, disabling cudagraph_mode. (APIServer pid=335596) INFO 04-04 18:12:14 [compilation.py:290] Enabled custom fusions: norm_quant, act_quant (EngineCore pid=336188) INFO 04-04 18:12:19 [core.py:105] Initializing a V1 LLM engine (v0.19.1rc1.dev29+g93726b2a1) with config: model='/data/models/gpt-oss-20b/', speculative_config=None, tokenizer='/data/models/gpt-oss-20b/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=2, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=mxfp4, quantization_config=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=xpu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='openai_gptoss', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, 
enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/data/models/gpt-oss-20b/, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'ir_enable_torch_wrap': False, 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['xpu_kernels', 'native']), enable_flashinfer_autotune=True, moe_backend='auto') (EngineCore pid=336188) INFO 04-04 18:12:19 [multiproc_executor.py:137] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=192.168.5.175 (local), world_size=4, local_world_size=4 (Worker pid=336700) INFO 04-04 18:12:24 [parallel_state.py:1400] world_size=4 rank=3 local_rank=3 
distributed_init_method=tcp://127.0.0.1:40271 backend=xccl (Worker pid=336699) INFO 04-04 18:12:24 [parallel_state.py:1400] world_size=4 rank=2 local_rank=2 distributed_init_method=tcp://127.0.0.1:40271 backend=xccl (Worker pid=336697) INFO 04-04 18:12:24 [parallel_state.py:1400] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:40271 backend=xccl (Worker pid=336698) INFO 04-04 18:12:24 [parallel_state.py:1400] world_size=4 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:40271 backend=xccl (Worker pid=336697) INFO 04-04 18:12:25 [parallel_state.py:1712] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A 2026:04:04-18:12:25:336697 |CCL_WARN| value of CCL_ATL_TRANSPORT changed to be ofi (default:mpi) 2026:04:04-18:12:25:336697 |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL 2026:04:04-18:12:25:336698 |CCL_WARN| value of CCL_ATL_TRANSPORT changed to be ofi (default:mpi) 2026:04:04-18:12:25:336698 |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL 2026:04:04-18:12:25:336700 |CCL_WARN| value of CCL_ATL_TRANSPORT changed to be ofi (default:mpi) 2026:04:04-18:12:25:336700 |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL 2026:04:04-18:12:25:336699 |CCL_WARN| value of CCL_ATL_TRANSPORT changed to be ofi (default:mpi) 2026:04:04-18:12:25:336699 |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL 2026:04:04-18:12:26:336698:[1] |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices 2026:04:04-18:12:26:336699:[2] |CCL_WARN| topology recognition shows PCIe connection between devices. 
If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices 2026:04:04-18:12:26:336697:[0] |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices 2026:04:04-18:12:26:336700:[3] |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices (Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:26 [gpu_model_runner.py:4735] Starting to load model /data/models/gpt-oss-20b/... (Worker_PP0_TP1 pid=336698) INFO 04-04 18:12:27 [xpu.py:59] Setting VLLM_KV_CACHE_LAYOUT to 'NHD' for XPU; only NHD layout is supported by XPU attention kernels. (Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:27 [xpu.py:59] Setting VLLM_KV_CACHE_LAYOUT to 'NHD' for XPU; only NHD layout is supported by XPU attention kernels. (Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:27 [xpu.py:81] Using Flash Attention backend. (Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:27 [flash_attn.py:622] Using FlashAttention version 2 (Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:27 [mxfp4.py:352] Using 'XPU' Mxfp4 MoE backend. (Worker_PP1_TP0 pid=336699) INFO 04-04 18:12:27 [xpu.py:59] Setting VLLM_KV_CACHE_LAYOUT to 'NHD' for XPU; only NHD layout is supported by XPU attention kernels. (Worker_PP1_TP1 pid=336700) INFO 04-04 18:12:27 [xpu.py:59] Setting VLLM_KV_CACHE_LAYOUT to 'NHD' for XPU; only NHD layout is supported by XPU attention kernels. 
Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s] Loading safetensors checkpoint shards: 33% Completed | 1/3 [00:00<00:00, 2.23it/s] Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:00<00:00, 2.29it/s] Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00, 1.64it/s] Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00, 1.77it/s] (Worker_PP0_TP0 pid=336697) (Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:29 [default_loader.py:384] Loading weights took 1.75 seconds (Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:29 [mxfp4.py:836] Using MoEPrepareAndFinalizeNoDPEPModular (Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:29 [gpu_model_runner.py:4820] Model loading took 3.85 GiB memory and 2.185793 seconds (Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:32 [gpu_worker.py:436] Available KV cache memory: 14.86 GiB (EngineCore pid=336188) INFO 04-04 18:12:32 [kv_cache_utils.py:1319] GPU KV cache size: 1,222,720 tokens (EngineCore pid=336188) INFO 04-04 18:12:32 [kv_cache_utils.py:1324] Maximum concurrency for 8,192 tokens per request: 149.26x (Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:32 [utils.py:60] _KV_CACHE_LAYOUT_OVERRIDE variable detected. Setting KV cache layout to NHD. (EngineCore pid=336188) INFO 04-04 18:12:32 [core.py:283] init engine (profile, create kv cache, warmup model) took 3.13 seconds (EngineCore pid=336188) INFO 04-04 18:12:33 [vllm.py:799] Asynchronous scheduling is enabled. (EngineCore pid=336188) WARNING 04-04 18:12:33 [vllm.py:857] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none (EngineCore pid=336188) WARNING 04-04 18:12:33 [vllm.py:868] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored. 
(EngineCore pid=336188) INFO 04-04 18:12:33 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['xpu_kernels', 'native']) (EngineCore pid=336188) INFO 04-04 18:12:33 [vllm.py:1046] Cudagraph is disabled under eager mode (EngineCore pid=336188) WARNING 04-04 18:12:33 [xpu.py:190] XPU Graph doesn't support capture communication ops, disabling cudagraph_mode. (EngineCore pid=336188) INFO 04-04 18:12:33 [compilation.py:290] Enabled custom fusions: norm_quant, act_quant (APIServer pid=335596) INFO 04-04 18:12:33 [api_server.py:604] Supported tasks: ['generate'] (APIServer pid=335596) WARNING 04-04 18:12:33 [serving.py:233] For gpt-oss, we ignore --enable-auto-tool-choice and always enable tool use. (APIServer pid=335596) INFO 04-04 18:12:34 [hf.py:314] Detected the chat template content format to be 'string'. You can set --chat-template-content-format to override this. (APIServer pid=335596) INFO 04-04 18:12:34 [api_server.py:608] Starting vLLM server on http://0.0.0.0:8000 (APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:37] Available routes are: (APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET (APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /docs, Methods: HEAD, GET (APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET (APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /redoc, Methods: HEAD, GET (APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /tokenize, Methods: POST (APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /detokenize, Methods: POST (APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /load, Methods: GET (APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /version, Methods: GET (APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /health, Methods: GET (APIServer pid=335596) INFO 04-04 
18:12:34 [launcher.py:46] Route: /metrics, Methods: GET (APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/models, Methods: GET (APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /ping, Methods: GET (APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /ping, Methods: POST (APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /invocations, Methods: POST (APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/chat/completions, Methods: POST (APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST (APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/responses, Methods: POST (APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET (APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST (APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/completions, Methods: POST (APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/messages, Methods: POST (APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST (APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /inference/v1/generate, Methods: POST (APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST (APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST (APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST (APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/completions/render, Methods: POST (APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /generative_scoring, Methods: POST (APIServer pid=335596) INFO: Started server process [335596] (APIServer pid=335596) INFO: Waiting for application 
startup.
(APIServer pid=335596) INFO: Application startup complete.
(APIServer pid=335596) INFO: 127.0.0.1:34608 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=335596) INFO: 127.0.0.1:34610 - "POST /v1/completions HTTP/1.1" 200 OK
(Worker_PP0_TP0 pid=336697) /home/cjai/vllm/vllm/distributed/parallel_state.py:664: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:1586.)
(Worker_PP0_TP0 pid=336697)   object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)
(Worker_PP0_TP1 pid=336698) /home/cjai/vllm/vllm/distributed/parallel_state.py:664: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:1586.)
(Worker_PP0_TP1 pid=336698)   object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)
(APIServer pid=335596) INFO 04-04 18:13:04 [loggers.py:259] Engine 000: Avg prompt throughput: 102.4 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=335596) INFO 04-04 18:13:14 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(EngineCore pid=336188) INFO 04-04 18:13:57 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=336188) INFO 04-04 18:14:57 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=336188) INFO 04-04 18:15:57 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=336188) INFO 04-04 18:16:57 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=336188) ERROR 04-04 18:17:54 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.19.1rc1.dev29+g93726b2a1) with config: model='/data/models/gpt-oss-20b/', speculative_config=None, tokenizer='/data/models/gpt-oss-20b/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=2, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=mxfp4, quantization_config=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=xpu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='openai_gptoss', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/data/models/gpt-oss-20b/, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'ir_enable_torch_wrap': False, 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 
'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['xpu_kernels', 'native']), enable_flashinfer_autotune=True, moe_backend='auto'), (EngineCore pid=336188) ERROR 04-04 18:17:54 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=['cmpl-bench-dd7f173a-0-0-8a43c102'],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[([17],)],num_computed_tokens=[1024],num_output_tokens=[1]), num_scheduled_tokens={cmpl-bench-dd7f173a-0-0-8a43c102: 1}, total_num_scheduled_tokens=1, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[17], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null, new_block_ids_to_zero=null) (EngineCore pid=336188) ERROR 04-04 18:17:54 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.0008898659966498634, encoder_cache_usage=0.0, prefix_cache_stats=PrefixCacheStats(reset=False, 
requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110] EngineCore encountered a fatal error.
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110] Traceback (most recent call last):
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/v1/executor/multiproc_executor.py", line 394, in get_response
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     status, result = mq.dequeue(timeout=dequeue_timeout)
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/distributed/device_communicators/shm_broadcast.py", line 755, in dequeue
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     with self.acquire_read(timeout, indefinite) as buf:
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/miniforge3/envs/vllm/lib/python3.12/contextlib.py", line 137, in __enter__
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     return next(self.gen)
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]            ^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/distributed/device_communicators/shm_broadcast.py", line 674, in acquire_read
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     self._spin_condition.wait(timeout_ms=read_timeout.timeout_ms())
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]                                          ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/distributed/device_communicators/shm_broadcast.py", line 631, in timeout_ms
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     raise TimeoutError
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110] TimeoutError
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110] The above exception was the direct cause of the following exception:
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110] Traceback (most recent call last):
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/v1/engine/core.py", line 1101, in run_engine_core
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     engine_core.run_busy_loop()
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/v1/engine/core.py", line 1142, in run_busy_loop
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     self._process_engine_step()
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/v1/engine/core.py", line 1181, in _process_engine_step
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     outputs, model_executed = self.step_fn()
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]                               ^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/v1/engine/core.py", line 499, in step_with_batch_queue
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     model_output = future.result()
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]                    ^^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/v1/executor/multiproc_executor.py", line 88, in result
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     return super().result()
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]            ^^^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/miniforge3/envs/vllm/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     return self.__get_result()
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/miniforge3/envs/vllm/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     raise self._exception
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/v1/executor/multiproc_executor.py", line 92, in _wait_for_response
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     response = self.aggregate(self.get_response())
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]                               ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/v1/executor/multiproc_executor.py", line 396, in get_response
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     raise TimeoutError(f"RPC call to {method} timed out.") from e
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110] TimeoutError: RPC call to sample_tokens timed out.
(APIServer pid=335596) ERROR 04-04 18:17:54 [async_llm.py:707] AsyncLLM output_handler failed.
(APIServer pid=335596) ERROR 04-04 18:17:54 [async_llm.py:707] Traceback (most recent call last):
(APIServer pid=335596) ERROR 04-04 18:17:54 [async_llm.py:707]   File "/home/cjai/vllm/vllm/v1/engine/async_llm.py", line 663, in output_handler
(APIServer pid=335596) ERROR 04-04 18:17:54 [async_llm.py:707]     outputs = await engine_core.get_output_async()
(APIServer pid=335596) ERROR 04-04 18:17:54 [async_llm.py:707]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=335596) ERROR 04-04 18:17:54 [async_llm.py:707]   File "/home/cjai/vllm/vllm/v1/engine/core_client.py", line 970, in get_output_async
(APIServer pid=335596) ERROR 04-04 18:17:54 [async_llm.py:707]     raise self._format_exception(outputs) from None
(APIServer pid=335596) ERROR 04-04 18:17:54 [async_llm.py:707] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=335596) INFO: Shutting down
(APIServer pid=335596) INFO: Waiting for application shutdown.
(APIServer pid=335596) INFO: Application shutdown complete.
(APIServer pid=335596) INFO: Finished server process [335596]
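
The traceback above contains two chained exceptions: a bare TimeoutError raised while waiting on the message queue, and a second TimeoutError that names the RPC method and is raised "from" the first. The sketch below reproduces only that chaining pattern so the two parts of the traceback are easier to tell apart. It is not vLLM's code: `dequeue`, `get_response`, and `describe_chain` here are hypothetical stand-ins, and a plain Python list takes the place of the shared-memory message queue.

```python
import time


def dequeue(replies: list, timeout: float):
    """Hypothetical stand-in for the message-queue read: poll until a reply arrives."""
    deadline = time.monotonic() + timeout
    while not replies:
        if time.monotonic() >= deadline:
            # Bare TimeoutError with no message, like the first exception in the log.
            raise TimeoutError
        time.sleep(0.01)
    return replies.pop(0)


def get_response(replies: list, method: str, timeout: float):
    """Wait for a worker reply; on timeout, re-raise with the RPC method name attached."""
    try:
        return dequeue(replies, timeout)
    except TimeoutError as e:
        # `from e` records the queue timeout as the direct cause (__cause__).
        raise TimeoutError(f"RPC call to {method} timed out.") from e


def describe_chain(exc: BaseException) -> list:
    """Walk __cause__ links from the outermost exception down to the original one."""
    parts = []
    while exc is not None:
        name = type(exc).__name__
        parts.append(f"{name}: {exc}" if str(exc) else name)
        exc = exc.__cause__
    return parts


if __name__ == "__main__":
    try:
        # No worker ever replies, so the wait expires.
        get_response([], "sample_tokens", timeout=0.05)
    except TimeoutError as exc:
        for line in describe_chain(exc):
            print(line)
```

Read this way, the outer message only reports which RPC was pending when the wait expired. The inner exception shows that no worker reply reached the queue in time, which is consistent with the repeated "No available shared memory broadcast block found in 60 seconds" messages earlier in the log.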

Root Cause

WARNING 04-04 18:12:14 [argparse_utils.py:191] With `vllm serve`, you should provide the model as a positional argument or in a config file instead of via the `--model` option. The `--model` option will be removed in v0.13.
(APIServer pid=335596) INFO 04-04 18:12:14 [utils.py:299]
(APIServer pid=335596) INFO 04-04 18:12:14 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=335596) INFO 04-04 18:12:14 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.19.1rc1.dev29+g93726b2a1
(APIServer pid=335596) INFO 04-04 18:12:14 [utils.py:299]   █▄█▀ █     █     █     █  model   /data/models/gpt-oss-20b/
(APIServer pid=335596) INFO 04-04 18:12:14 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=335596) INFO 04-04 18:12:14 [utils.py:299]
(APIServer pid=335596) INFO 04-04 18:12:14 [utils.py:233] non-default args: {'model_tag': '/data/models/gpt-oss-20b/', 'host': '0.0.0.0', 'model': '/data/models/gpt-oss-20b/', 'trust_remote_code': True, 'max_model_len': 8192, 'enforce_eager': True, 'disable_sliding_window': True, 'distributed_executor_backend': 'mp', 'pipeline_parallel_size': 2, 'tensor_parallel_size': 2, 'block_size': 64, 'gpu_memory_utilization': 0.87, 'max_num_batched_tokens': 8192}
(APIServer pid=335596) INFO 04-04 18:12:14 [model.py:554] Resolved architecture: GptOssForCausalLM
(APIServer pid=335596) INFO 04-04 18:12:14 [model.py:1685] Using max model len 8192
(APIServer pid=335596) INFO 04-04 18:12:14 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=335596) INFO 04-04 18:12:14 [config.py:126] Overriding max cuda graph capture size to 1024 for performance.
(APIServer pid=335596) INFO 04-04 18:12:14 [vllm.py:799] Asynchronous scheduling is enabled.
(APIServer pid=335596) WARNING 04-04 18:12:14 [vllm.py:857] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(APIServer pid=335596) WARNING 04-04 18:12:14 [vllm.py:868] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(APIServer pid=335596) INFO 04-04 18:12:14 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['xpu_kernels', 'native'])
(APIServer pid=335596) INFO 04-04 18:12:14 [vllm.py:1046] Cudagraph is disabled under eager mode
(APIServer pid=335596) WARNING 04-04 18:12:14 [xpu.py:190] XPU Graph doesn't support capture communication ops, disabling cudagraph_mode.
(APIServer pid=335596) INFO 04-04 18:12:14 [compilation.py:290] Enabled custom fusions: norm_quant, act_quant
(EngineCore pid=336188) INFO 04-04 18:12:19 [core.py:105] Initializing a V1 LLM engine (v0.19.1rc1.dev29+g93726b2a1) with config: model='/data/models/gpt-oss-20b/', speculative_config=None, tokenizer='/data/models/gpt-oss-20b/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=2, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=mxfp4, quantization_config=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=xpu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='openai_gptoss', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/data/models/gpt-oss-20b/, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'ir_enable_torch_wrap': False, 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['xpu_kernels', 'native']), enable_flashinfer_autotune=True, moe_backend='auto')
(EngineCore pid=336188) INFO 04-04 18:12:19 [multiproc_executor.py:137] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=192.168.5.175 (local), world_size=4, local_world_size=4
(Worker pid=336700) INFO 04-04 18:12:24 [parallel_state.py:1400] world_size=4 rank=3 local_rank=3 distributed_init_method=tcp://127.0.0.1:40271 backend=xccl
(Worker pid=336699) INFO 04-04 18:12:24 [parallel_state.py:1400] world_size=4 rank=2 local_rank=2 distributed_init_method=tcp://127.0.0.1:40271 backend=xccl
(Worker pid=336697) INFO 04-04 18:12:24 [parallel_state.py:1400] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:40271 backend=xccl
(Worker pid=336698) INFO 04-04 18:12:24 [parallel_state.py:1400] world_size=4 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:40271 backend=xccl
(Worker pid=336697) INFO 04-04 18:12:25 [parallel_state.py:1712] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
2026:04:04-18:12:25:336697 |CCL_WARN| value of CCL_ATL_TRANSPORT changed to be ofi (default:mpi)
2026:04:04-18:12:25:336697 |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2026:04:04-18:12:25:336698 |CCL_WARN| value of CCL_ATL_TRANSPORT changed to be ofi (default:mpi)
2026:04:04-18:12:25:336698 |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2026:04:04-18:12:25:336700 |CCL_WARN| value of CCL_ATL_TRANSPORT changed to be ofi (default:mpi)
2026:04:04-18:12:25:336700 |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2026:04:04-18:12:25:336699 |CCL_WARN| value of CCL_ATL_TRANSPORT changed to be ofi (default:mpi)
2026:04:04-18:12:25:336699 |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2026:04:04-18:12:26:336698:[1] |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices
2026:04:04-18:12:26:336699:[2] |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices
2026:04:04-18:12:26:336697:[0] |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices
2026:04:04-18:12:26:336700:[3] |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices
(Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:26 [gpu_model_runner.py:4735] Starting to load model /data/models/gpt-oss-20b/...
(Worker_PP0_TP1 pid=336698) INFO 04-04 18:12:27 [xpu.py:59] Setting VLLM_KV_CACHE_LAYOUT to 'NHD' for XPU; only NHD layout is supported by XPU attention kernels.
(Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:27 [xpu.py:59] Setting VLLM_KV_CACHE_LAYOUT to 'NHD' for XPU; only NHD layout is supported by XPU attention kernels.
(Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:27 [xpu.py:81] Using Flash Attention backend.
(Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:27 [flash_attn.py:622] Using FlashAttention version 2
(Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:27 [mxfp4.py:352] Using 'XPU' Mxfp4 MoE backend.
(Worker_PP1_TP0 pid=336699) INFO 04-04 18:12:27 [xpu.py:59] Setting VLLM_KV_CACHE_LAYOUT to 'NHD' for XPU; only NHD layout is supported by XPU attention kernels.
(Worker_PP1_TP1 pid=336700) INFO 04-04 18:12:27 [xpu.py:59] Setting VLLM_KV_CACHE_LAYOUT to 'NHD' for XPU; only NHD layout is supported by XPU attention kernels.
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:00<00:00,  2.23it/s]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:00<00:00,  2.29it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00,  1.64it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00,  1.77it/s]
(Worker_PP0_TP0 pid=336697)
(Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:29 [default_loader.py:384] Loading weights took 1.75 seconds
(Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:29 [mxfp4.py:836] Using MoEPrepareAndFinalizeNoDPEPModular
(Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:29 [gpu_model_runner.py:4820] Model loading took 3.85 GiB memory and 2.185793 seconds
(Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:32 [gpu_worker.py:436] Available KV cache memory: 14.86 GiB
(EngineCore pid=336188) INFO 04-04 18:12:32 [kv_cache_utils.py:1319] GPU KV cache size: 1,222,720 tokens
(EngineCore pid=336188) INFO 04-04 18:12:32 [kv_cache_utils.py:1324] Maximum concurrency for 8,192 tokens per request: 149.26x
(Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:32 [utils.py:60] `_KV_CACHE_LAYOUT_OVERRIDE` variable detected. Setting KV cache layout to NHD.
(EngineCore pid=336188) INFO 04-04 18:12:32 [core.py:283] init engine (profile, create kv cache, warmup model) took 3.13 seconds
(EngineCore pid=336188) INFO 04-04 18:12:33 [vllm.py:799] Asynchronous scheduling is enabled.
(EngineCore pid=336188) WARNING 04-04 18:12:33 [vllm.py:857] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(EngineCore pid=336188) WARNING 04-04 18:12:33 [vllm.py:868] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore pid=336188) INFO 04-04 18:12:33 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['xpu_kernels', 'native'])
(EngineCore pid=336188) INFO 04-04 18:12:33 [vllm.py:1046] Cudagraph is disabled under eager mode
(EngineCore pid=336188) WARNING 04-04 18:12:33 [xpu.py:190] XPU Graph doesn't support capture communication ops, disabling cudagraph_mode.
(EngineCore pid=336188) INFO 04-04 18:12:33 [compilation.py:290] Enabled custom fusions: norm_quant, act_quant
(APIServer pid=335596) INFO 04-04 18:12:33 [api_server.py:604] Supported tasks: ['generate']
(APIServer pid=335596) WARNING 04-04 18:12:33 [serving.py:233] For gpt-oss, we ignore --enable-auto-tool-choice and always enable tool use.
(APIServer pid=335596) INFO 04-04 18:12:34 [hf.py:314] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=335596) INFO 04-04 18:12:34 [api_server.py:608] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:37] Available routes are:
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /generative_scoring, Methods: POST
(APIServer pid=335596) INFO:     Started server process [335596]
(APIServer pid=335596) INFO:     Waiting for application startup.
(APIServer pid=335596) INFO:     Application startup complete.
(APIServer pid=335596) INFO:     127.0.0.1:34608 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=335596) INFO:     127.0.0.1:34610 - "POST /v1/completions HTTP/1.1" 200 OK
(Worker_PP0_TP0 pid=336697) /home/cjai/vllm/vllm/distributed/parallel_state.py:664: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:1586.)
(Worker_PP0_TP0 pid=336697)   object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)
(Worker_PP0_TP1 pid=336698) /home/cjai/vllm/vllm/distributed/parallel_state.py:664: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:1586.)
(Worker_PP0_TP1 pid=336698)   object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)
(APIServer pid=335596) INFO 04-04 18:13:04 [loggers.py:259] Engine 000: Avg prompt throughput: 102.4 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=335596) INFO 04-04 18:13:14 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(EngineCore pid=336188) INFO 04-04 18:13:57 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=336188) INFO 04-04 18:14:57 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=336188) INFO 04-04 18:15:57 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=336188) INFO 04-04 18:16:57 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=336188) ERROR 04-04 18:17:54 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.19.1rc1.dev29+g93726b2a1) with config: model='/data/models/gpt-oss-20b/', speculative_config=None, tokenizer='/data/models/gpt-oss-20b/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=2, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=mxfp4, quantization_config=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=xpu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='openai_gptoss', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/data/models/gpt-oss-20b/, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'ir_enable_torch_wrap': False, 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['xpu_kernels', 'native']), enable_flashinfer_autotune=True, moe_backend='auto'),
(EngineCore pid=336188) ERROR 04-04 18:17:54 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=['cmpl-bench-dd7f173a-0-0-8a43c102'],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[([17],)],num_computed_tokens=[1024],num_output_tokens=[1]), num_scheduled_tokens={cmpl-bench-dd7f173a-0-0-8a43c102: 1}, total_num_scheduled_tokens=1, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[17], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null, new_block_ids_to_zero=null)
(EngineCore pid=336188) ERROR 04-04 18:17:54 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.0008898659966498634, encoder_cache_usage=0.0, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110] EngineCore encountered a fatal error.
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110] Traceback (most recent call last):
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/v1/executor/multiproc_executor.py", line 394, in get_response
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     status, result = mq.dequeue(timeout=dequeue_timeout)
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/distributed/device_communicators/shm_broadcast.py", line 755, in dequeue
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     with self.acquire_read(timeout, indefinite) as buf:
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/miniforge3/envs/vllm/lib/python3.12/contextlib.py", line 137, in __enter__
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     return next(self.gen)
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]            ^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/distributed/device_communicators/shm_broadcast.py", line 674, in acquire_read
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     self._spin_condition.wait(timeout_ms=read_timeout.timeout_ms())
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]                                          ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/distributed/device_communicators/shm_broadcast.py", line 631, in timeout_ms
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     raise TimeoutError
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110] TimeoutError
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110] The above exception was the direct cause of the following exception:
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110] Traceback (most recent call last):
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/v1/engine/core.py", line 1101, in run_engine_core
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     engine_core.run_busy_loop()
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/v1/engine/core.py", line 1142, in run_busy_loop
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     self._process_engine_step()
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/v1/engine/core.py", line 1181, in _process_engine_step
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     outputs, model_executed = self.step_fn()
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]                               ^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/v1/engine/core.py", line 499, in step_with_batch_queue
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     model_output = future.result()
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]                    ^^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/v1/executor/multiproc_executor.py", line 88, in result
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     return super().result()
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]            ^^^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/miniforge3/envs/vllm/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     return self.__get_result()
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/miniforge3/envs/vllm/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     raise self._exception
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/v1/executor/multiproc_executor.py", line 92, in _wait_for_response
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     response = self.aggregate(self.get_response())
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]                               ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/v1/executor/multiproc_executor.py", line 396, in get_response
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     raise TimeoutError(f"RPC call to {method} timed out.") from e
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110] TimeoutError: RPC call to sample_tokens timed out.
(APIServer pid=335596) ERROR 04-04 18:17:54 [async_llm.py:707] AsyncLLM output_handler failed.
(APIServer pid=335596) ERROR 04-04 18:17:54 [async_llm.py:707] Traceback (most recent call last):
(APIServer pid=335596) ERROR 04-04 18:17:54 [async_llm.py:707]   File "/home/cjai/vllm/vllm/v1/engine/async_llm.py", line 663, in output_handler
(APIServer pid=335596) ERROR 04-04 18:17:54 [async_llm.py:707]     outputs = await engine_core.get_output_async()
(APIServer pid=335596) ERROR 04-04 18:17:54 [async_llm.py:707]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=335596) ERROR 04-04 18:17:54 [async_llm.py:707]   File "/home/cjai/vllm/vllm/v1/engine/core_client.py", line 970, in get_output_async
(APIServer pid=335596) ERROR 04-04 18:17:54 [async_llm.py:707]     raise self._format_exception(outputs) from None
(APIServer pid=335596) ERROR 04-04 18:17:54 [async_llm.py:707] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=335596) INFO:     Shutting down
(APIServer pid=335596) INFO:     Waiting for application shutdown.
(APIServer pid=335596) INFO:     Application shutdown complete.
(APIServer pid=335596) INFO:     Finished server process [335596]
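
The log above shows the engine repeatedly reporting a stalled shared-memory broadcast before the `sample_tokens` RPC finally times out. To make that timeline easier to read from a saved log, here is a minimal sketch that assumes the line prefix seen above (process name, pid, level, month-day timestamp, source location). The function name `summarize_hang`, its `year` parameter, and the regular expressions are illustrative and are not part of vLLM.

```python
import re
from datetime import datetime

# Prefix format of the vLLM log lines above, e.g.
# "(EngineCore pid=336188) INFO 04-04 18:13:57 [shm_broadcast.py:681] ..."
LINE_RE = re.compile(
    r"^\((?P<proc>\S+) pid=(?P<pid>\d+)\) "
    r"(?P<level>INFO|WARNING|ERROR) "
    r"(?P<ts>\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<src>[^\]]+)\]\s?(?P<msg>.*)$"
)
STALL_TEXT = "No available shared memory broadcast block found"
RPC_RE = re.compile(r"TimeoutError: RPC call to (\w+) timed out")


def summarize_hang(log_text, year=2026):
    """Summarize stall warnings and the first RPC timeout in a vLLM log."""
    stalls = []        # timestamps of shm_broadcast stall messages
    rpc = None         # name of the RPC that timed out, if any
    timeout_at = None  # timestamp of that timeout
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            continue  # skip CCL, uvicorn, and progress-bar lines
        # These log lines omit the year, so the caller supplies it (assumption).
        ts = datetime.strptime(f"{year}-{m['ts']}", "%Y-%m-%d %H:%M:%S")
        msg = m["msg"]
        if STALL_TEXT in msg:
            stalls.append(ts)
        hit = RPC_RE.search(msg)
        if hit and rpc is None:
            rpc, timeout_at = hit.group(1), ts
    gap = None
    if stalls and timeout_at is not None:
        gap = int((timeout_at - stalls[0]).total_seconds())
    return {
        "stall_warnings": len(stalls),
        "first_stall": stalls[0].strftime("%H:%M:%S") if stalls else None,
        "timed_out_rpc": rpc,
        "seconds_from_first_stall_to_timeout": gap,
    }
```

Given the server output saved to a file (any name), `summarize_hang(open(path).read())` returns a small dictionary. For the log above it should report four stall warnings starting at 18:13:57, with `sample_tokens` timing out 237 seconds later. Because the first warning already covers 60 seconds of waiting, the workers had been unresponsive for roughly five minutes when the engine gave up.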

Fix Action

Fix / Workaround


Code Example

Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 25.10 (x86_64)
GCC version                  : (Ubuntu 15.2.0-4ubuntu4) 15.2.0
Clang version                : Could not collect
CMake version                : version 4.3.1
Libc version                 : glibc-2.42

==============================
       PyTorch Info
==============================
PyTorch version              : 2.11.0+xpu
Is debug build               : False
CUDA used to build PyTorch   : None
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.13 | packaged by Anaconda, Inc. | (main, Mar 19 2026, 20:20:58) [GCC 14.3.0] (64-bit runtime)
Python platform              : Linux-6.17.0-20-generic-x86_64-with-glibc2.42

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : False
CUDA runtime version         : No CUDA
CUDA_MODULE_LOADING set to   : N/A
GPU models and configuration : No CUDA
Nvidia driver version        : No CUDA
cuDNN version                : No CUDA
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           52 bits physical, 57 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  48
On-line CPU(s) list:                     0-47
Vendor ID:                               GenuineIntel
Model name:                              Intel(R) Xeon(R) w7-3455
CPU family:                              6
Model:                                   143
Thread(s) per core:                      2
Core(s) per socket:                      24
Socket(s):                               1
Stepping:                                5
CPU(s) scaling MHz:                      25%
CPU max MHz:                             4800.0000
CPU min MHz:                             800.0000
BogoMIPS:                                4992.00
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect user_shstk avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req vnmi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr ibt amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization:                          VT-x
L1d cache:                               1.1 MiB (24 instances)
L1i cache:                               768 KiB (24 instances)
L2 cache:                                48 MiB (24 instances)
L3 cache:                                67.5 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-47
Vulnerability Gather data sampling:      Not affected
Vulnerability Ghostwrite:                Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Old microcode:             Vulnerable
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; PBRSB-eIBRS SW sequence; BHI BHI_DIS_S
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Mitigation; IBPB before exit to userspace

==============================
Versions of relevant libraries
==============================
[pip3] numpy==2.4.3
[pip3] pyzmq==27.1.0
[pip3] torch==2.11.0+xpu
[pip3] torchao==0.17.0+xpu
[pip3] torchaudio==2.11.0+xpu
[pip3] torchvision==0.26.0+xpu
[pip3] transformers==4.57.6
[pip3] triton==3.6.0
[pip3] triton-xpu==3.7.0
[conda] mkl                                         2025.3.1                        pypi_0           pypi
[conda] numpy                                       2.4.3                           pypi_0           pypi
[conda] onemkl-license                              2025.3.1                        pypi_0           pypi
[conda] onemkl-sycl-blas                            2025.3.1                        pypi_0           pypi
[conda] onemkl-sycl-dft                             2025.3.1                        pypi_0           pypi
[conda] onemkl-sycl-lapack                          2025.3.1                        pypi_0           pypi
[conda] onemkl-sycl-rng                             2025.3.1                        pypi_0           pypi
[conda] onemkl-sycl-sparse                          2025.3.1                        pypi_0           pypi
[conda] pyzmq                                       27.1.0                          pypi_0           pypi
[conda] torch                                       2.11.0+xpu                      pypi_0           pypi
[conda] torchao                                     0.17.0+xpu                      pypi_0           pypi
[conda] torchaudio                                  2.11.0+xpu                      pypi_0           pypi
[conda] torchvision                                 0.26.0+xpu                      pypi_0           pypi
[conda] transformers                                4.57.6                          pypi_0           pypi
[conda] triton                                      3.6.0                           pypi_0           pypi
[conda] triton-xpu                                  3.7.0                           pypi_0           pypi

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.19.1rc1.dev29+g93726b2a1 (git sha: 93726b2a1)
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
  Could not collect

==============================
     Environment Variables
==============================
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
VLLM_WORKER_MULTIPROC_METHOD=spawn
VLLM_XPU_ENABLE_XPU_GRAPH=1
LD_LIBRARY_PATH=/opt/intel/oneapi/tcm/1.4/lib:/opt/intel/oneapi/umf/1.0/lib:/opt/intel/oneapi/tbb/2022.3/env/../lib/intel64/gcc4.8:/opt/intel/oneapi/mpi/2021.17/opt/mpi/libfabric/lib:/opt/intel/oneapi/mpi/2021.17/lib:/opt/intel/oneapi/mkl/2025.3/lib:/opt/intel/oneapi/ippcp/2025.3/lib/:/opt/intel/oneapi/ipp/2022.3/lib:/opt/intel/oneapi/dnnl/2025.3/lib:/opt/intel/oneapi/debugger/2025.3/opt/debugger/lib:/opt/intel/oneapi/dal/2025.10/lib:/opt/intel/oneapi/compiler/2025.3/opt/compiler/lib:/opt/intel/oneapi/compiler/2025.3/lib:/opt/intel/oneapi/ccl/2021.17/lib/:/home/cjai/miniforge3/envs/vllm/lib:/opt/intel/oneapi/tcm/1.4/lib:/opt/intel/oneapi/umf/1.0/lib:/opt/intel/oneapi/tbb/2022.3/env/../lib/intel64/gcc4.8:/opt/intel/oneapi/mpi/2021.17/opt/mpi/libfabric/lib:/opt/intel/oneapi/mpi/2021.17/lib:/opt/intel/oneapi/mkl/2025.3/lib:/opt/intel/oneapi/ippcp/2025.3/lib/:/opt/intel/oneapi/ipp/2022.3/lib:/opt/intel/oneapi/dnnl/2025.3/lib:/opt/intel/oneapi/debugger/2025.3/opt/debugger/lib:/opt/intel/oneapi/dal/2025.10/lib:/opt/intel/oneapi/compiler/2025.3/opt/compiler/lib:/opt/intel/oneapi/compiler/2025.3/lib:/opt/intel/oneapi/ccl/2021.17/lib/
OMP_NUM_THREADS=48
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_cjai

---

vllm serve --model /data/models/gpt-oss-20b/ --enforce-eager --port 8000 --host 0.0.0.0 --trust-remote-code --disable-sliding-window --gpu-memory-util=0.87 --max-num-batched-tokens=8192 --max-model-len=8192 --block-size 64 -tp=2 -pp 2 --distributed-executor-backend=mp

---

WARNING 04-04 18:12:14 [argparse_utils.py:191] With `vllm serve`, you should provide the model as a positional argument or in a config file instead of via the `--model` option. The `--model` option will be removed in v0.13.
(APIServer pid=335596) INFO 04-04 18:12:14 [utils.py:299]
(APIServer pid=335596) INFO 04-04 18:12:14 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=335596) INFO 04-04 18:12:14 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.19.1rc1.dev29+g93726b2a1
(APIServer pid=335596) INFO 04-04 18:12:14 [utils.py:299]   █▄█▀ █     █     █     █  model   /data/models/gpt-oss-20b/
(APIServer pid=335596) INFO 04-04 18:12:14 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=335596) INFO 04-04 18:12:14 [utils.py:299]
(APIServer pid=335596) INFO 04-04 18:12:14 [utils.py:233] non-default args: {'model_tag': '/data/models/gpt-oss-20b/', 'host': '0.0.0.0', 'model': '/data/models/gpt-oss-20b/', 'trust_remote_code': True, 'max_model_len': 8192, 'enforce_eager': True, 'disable_sliding_window': True, 'distributed_executor_backend': 'mp', 'pipeline_parallel_size': 2, 'tensor_parallel_size': 2, 'block_size': 64, 'gpu_memory_utilization': 0.87, 'max_num_batched_tokens': 8192}
(APIServer pid=335596) INFO 04-04 18:12:14 [model.py:554] Resolved architecture: GptOssForCausalLM
(APIServer pid=335596) INFO 04-04 18:12:14 [model.py:1685] Using max model len 8192
(APIServer pid=335596) INFO 04-04 18:12:14 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=335596) INFO 04-04 18:12:14 [config.py:126] Overriding max cuda graph capture size to 1024 for performance.
(APIServer pid=335596) INFO 04-04 18:12:14 [vllm.py:799] Asynchronous scheduling is enabled.
(APIServer pid=335596) WARNING 04-04 18:12:14 [vllm.py:857] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(APIServer pid=335596) WARNING 04-04 18:12:14 [vllm.py:868] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(APIServer pid=335596) INFO 04-04 18:12:14 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['xpu_kernels', 'native'])
(APIServer pid=335596) INFO 04-04 18:12:14 [vllm.py:1046] Cudagraph is disabled under eager mode
(APIServer pid=335596) WARNING 04-04 18:12:14 [xpu.py:190] XPU Graph doesn't support capture communication ops, disabling cudagraph_mode.
(APIServer pid=335596) INFO 04-04 18:12:14 [compilation.py:290] Enabled custom fusions: norm_quant, act_quant
(EngineCore pid=336188) INFO 04-04 18:12:19 [core.py:105] Initializing a V1 LLM engine (v0.19.1rc1.dev29+g93726b2a1) with config: model='/data/models/gpt-oss-20b/', speculative_config=None, tokenizer='/data/models/gpt-oss-20b/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=2, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=mxfp4, quantization_config=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=xpu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='openai_gptoss', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/data/models/gpt-oss-20b/, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'ir_enable_torch_wrap': False, 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['xpu_kernels', 'native']), enable_flashinfer_autotune=True, moe_backend='auto')
(EngineCore pid=336188) INFO 04-04 18:12:19 [multiproc_executor.py:137] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=192.168.5.175 (local), world_size=4, local_world_size=4
(Worker pid=336700) INFO 04-04 18:12:24 [parallel_state.py:1400] world_size=4 rank=3 local_rank=3 distributed_init_method=tcp://127.0.0.1:40271 backend=xccl
(Worker pid=336699) INFO 04-04 18:12:24 [parallel_state.py:1400] world_size=4 rank=2 local_rank=2 distributed_init_method=tcp://127.0.0.1:40271 backend=xccl
(Worker pid=336697) INFO 04-04 18:12:24 [parallel_state.py:1400] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:40271 backend=xccl
(Worker pid=336698) INFO 04-04 18:12:24 [parallel_state.py:1400] world_size=4 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:40271 backend=xccl
(Worker pid=336697) INFO 04-04 18:12:25 [parallel_state.py:1712] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
2026:04:04-18:12:25:336697 |CCL_WARN| value of CCL_ATL_TRANSPORT changed to be ofi (default:mpi)
2026:04:04-18:12:25:336697 |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2026:04:04-18:12:25:336698 |CCL_WARN| value of CCL_ATL_TRANSPORT changed to be ofi (default:mpi)
2026:04:04-18:12:25:336698 |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2026:04:04-18:12:25:336700 |CCL_WARN| value of CCL_ATL_TRANSPORT changed to be ofi (default:mpi)
2026:04:04-18:12:25:336700 |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2026:04:04-18:12:25:336699 |CCL_WARN| value of CCL_ATL_TRANSPORT changed to be ofi (default:mpi)
2026:04:04-18:12:25:336699 |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2026:04:04-18:12:26:336698:[1] |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices
2026:04:04-18:12:26:336699:[2] |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices
2026:04:04-18:12:26:336697:[0] |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices
2026:04:04-18:12:26:336700:[3] |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices
(Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:26 [gpu_model_runner.py:4735] Starting to load model /data/models/gpt-oss-20b/...
(Worker_PP0_TP1 pid=336698) INFO 04-04 18:12:27 [xpu.py:59] Setting VLLM_KV_CACHE_LAYOUT to 'NHD' for XPU; only NHD layout is supported by XPU attention kernels.
(Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:27 [xpu.py:59] Setting VLLM_KV_CACHE_LAYOUT to 'NHD' for XPU; only NHD layout is supported by XPU attention kernels.
(Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:27 [xpu.py:81] Using Flash Attention backend.
(Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:27 [flash_attn.py:622] Using FlashAttention version 2
(Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:27 [mxfp4.py:352] Using 'XPU' Mxfp4 MoE backend.
(Worker_PP1_TP0 pid=336699) INFO 04-04 18:12:27 [xpu.py:59] Setting VLLM_KV_CACHE_LAYOUT to 'NHD' for XPU; only NHD layout is supported by XPU attention kernels.
(Worker_PP1_TP1 pid=336700) INFO 04-04 18:12:27 [xpu.py:59] Setting VLLM_KV_CACHE_LAYOUT to 'NHD' for XPU; only NHD layout is supported by XPU attention kernels.
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:00<00:00,  2.23it/s]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:00<00:00,  2.29it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00,  1.64it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00,  1.77it/s]
(Worker_PP0_TP0 pid=336697)
(Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:29 [default_loader.py:384] Loading weights took 1.75 seconds
(Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:29 [mxfp4.py:836] Using MoEPrepareAndFinalizeNoDPEPModular
(Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:29 [gpu_model_runner.py:4820] Model loading took 3.85 GiB memory and 2.185793 seconds
(Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:32 [gpu_worker.py:436] Available KV cache memory: 14.86 GiB
(EngineCore pid=336188) INFO 04-04 18:12:32 [kv_cache_utils.py:1319] GPU KV cache size: 1,222,720 tokens
(EngineCore pid=336188) INFO 04-04 18:12:32 [kv_cache_utils.py:1324] Maximum concurrency for 8,192 tokens per request: 149.26x
(Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:32 [utils.py:60] `_KV_CACHE_LAYOUT_OVERRIDE` variable detected. Setting KV cache layout to NHD.
(EngineCore pid=336188) INFO 04-04 18:12:32 [core.py:283] init engine (profile, create kv cache, warmup model) took 3.13 seconds
(EngineCore pid=336188) INFO 04-04 18:12:33 [vllm.py:799] Asynchronous scheduling is enabled.
(EngineCore pid=336188) WARNING 04-04 18:12:33 [vllm.py:857] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(EngineCore pid=336188) WARNING 04-04 18:12:33 [vllm.py:868] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore pid=336188) INFO 04-04 18:12:33 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['xpu_kernels', 'native'])
(EngineCore pid=336188) INFO 04-04 18:12:33 [vllm.py:1046] Cudagraph is disabled under eager mode
(EngineCore pid=336188) WARNING 04-04 18:12:33 [xpu.py:190] XPU Graph doesn't support capture communication ops, disabling cudagraph_mode.
(EngineCore pid=336188) INFO 04-04 18:12:33 [compilation.py:290] Enabled custom fusions: norm_quant, act_quant
(APIServer pid=335596) INFO 04-04 18:12:33 [api_server.py:604] Supported tasks: ['generate']
(APIServer pid=335596) WARNING 04-04 18:12:33 [serving.py:233] For gpt-oss, we ignore --enable-auto-tool-choice and always enable tool use.
(APIServer pid=335596) INFO 04-04 18:12:34 [hf.py:314] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=335596) INFO 04-04 18:12:34 [api_server.py:608] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:37] Available routes are:
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /generative_scoring, Methods: POST
(APIServer pid=335596) INFO:     Started server process [335596]
(APIServer pid=335596) INFO:     Waiting for application startup.
(APIServer pid=335596) INFO:     Application startup complete.
(APIServer pid=335596) INFO:     127.0.0.1:34608 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=335596) INFO:     127.0.0.1:34610 - "POST /v1/completions HTTP/1.1" 200 OK
(Worker_PP0_TP0 pid=336697) /home/cjai/vllm/vllm/distributed/parallel_state.py:664: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:1586.)
(Worker_PP0_TP0 pid=336697)   object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)
(Worker_PP0_TP1 pid=336698) /home/cjai/vllm/vllm/distributed/parallel_state.py:664: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:1586.)
(Worker_PP0_TP1 pid=336698)   object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)
(APIServer pid=335596) INFO 04-04 18:13:04 [loggers.py:259] Engine 000: Avg prompt throughput: 102.4 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=335596) INFO 04-04 18:13:14 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(EngineCore pid=336188) INFO 04-04 18:13:57 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=336188) INFO 04-04 18:14:57 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=336188) INFO 04-04 18:15:57 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=336188) INFO 04-04 18:16:57 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=336188) ERROR 04-04 18:17:54 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.19.1rc1.dev29+g93726b2a1) with config: model='/data/models/gpt-oss-20b/', speculative_config=None, tokenizer='/data/models/gpt-oss-20b/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=2, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=mxfp4, quantization_config=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=xpu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='openai_gptoss', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/data/models/gpt-oss-20b/, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'ir_enable_torch_wrap': False, 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['xpu_kernels', 'native']), enable_flashinfer_autotune=True, moe_backend='auto'),
(EngineCore pid=336188) ERROR 04-04 18:17:54 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=['cmpl-bench-dd7f173a-0-0-8a43c102'],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[([17],)],num_computed_tokens=[1024],num_output_tokens=[1]), num_scheduled_tokens={cmpl-bench-dd7f173a-0-0-8a43c102: 1}, total_num_scheduled_tokens=1, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[17], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null, new_block_ids_to_zero=null)
(EngineCore pid=336188) ERROR 04-04 18:17:54 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.0008898659966498634, encoder_cache_usage=0.0, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110] EngineCore encountered a fatal error.
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110] Traceback (most recent call last):
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/v1/executor/multiproc_executor.py", line 394, in get_response
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     status, result = mq.dequeue(timeout=dequeue_timeout)
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/distributed/device_communicators/shm_broadcast.py", line 755, in dequeue
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     with self.acquire_read(timeout, indefinite) as buf:
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/miniforge3/envs/vllm/lib/python3.12/contextlib.py", line 137, in __enter__
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     return next(self.gen)
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]            ^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/distributed/device_communicators/shm_broadcast.py", line 674, in acquire_read
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     self._spin_condition.wait(timeout_ms=read_timeout.timeout_ms())
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]                                          ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/distributed/device_communicators/shm_broadcast.py", line 631, in timeout_ms
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     raise TimeoutError
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110] TimeoutError
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110] The above exception was the direct cause of the following exception:
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110] Traceback (most recent call last):
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/v1/engine/core.py", line 1101, in run_engine_core
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     engine_core.run_busy_loop()
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/v1/engine/core.py", line 1142, in run_busy_loop
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     self._process_engine_step()
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/v1/engine/core.py", line 1181, in _process_engine_step
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     outputs, model_executed = self.step_fn()
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]                               ^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/v1/engine/core.py", line 499, in step_with_batch_queue
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     model_output = future.result()
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]                    ^^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/v1/executor/multiproc_executor.py", line 88, in result
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     return super().result()
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]            ^^^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/miniforge3/envs/vllm/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     return self.__get_result()
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/miniforge3/envs/vllm/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     raise self._exception
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/v1/executor/multiproc_executor.py", line 92, in _wait_for_response
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     response = self.aggregate(self.get_response())
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]                               ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/v1/executor/multiproc_executor.py", line 396, in get_response
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     raise TimeoutError(f"RPC call to {method} timed out.") from e
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110] TimeoutError: RPC call to sample_tokens timed out.
(APIServer pid=335596) ERROR 04-04 18:17:54 [async_llm.py:707] AsyncLLM output_handler failed.
(APIServer pid=335596) ERROR 04-04 18:17:54 [async_llm.py:707] Traceback (most recent call last):
(APIServer pid=335596) ERROR 04-04 18:17:54 [async_llm.py:707]   File "/home/cjai/vllm/vllm/v1/engine/async_llm.py", line 663, in output_handler
(APIServer pid=335596) ERROR 04-04 18:17:54 [async_llm.py:707]     outputs = await engine_core.get_output_async()
(APIServer pid=335596) ERROR 04-04 18:17:54 [async_llm.py:707]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=335596) ERROR 04-04 18:17:54 [async_llm.py:707]   File "/home/cjai/vllm/vllm/v1/engine/core_client.py", line 970, in get_output_async
(APIServer pid=335596) ERROR 04-04 18:17:54 [async_llm.py:707]     raise self._format_exception(outputs) from None
(APIServer pid=335596) ERROR 04-04 18:17:54 [async_llm.py:707] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=335596) INFO:     Shutting down
(APIServer pid=335596) INFO:     Waiting for application shutdown.
(APIServer pid=335596) INFO:     Application shutdown complete.
(APIServer pid=335596) INFO:     Finished server process [335596]
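
A note for readers of the traceback above: the line "The above exception was the direct cause of the following exception" is what Python prints when an exception is re-raised with `raise ... from e`, as `get_response` does in `multiproc_executor.py`. The bare `TimeoutError` from `shm_broadcast.py` is therefore the underlying cause, and "RPC call to sample_tokens timed out." is the wrapper placed around it. The following is a minimal sketch of that pattern only; `wait_for_block`, the simplified `get_response`, and `describe_chain` are hypothetical stand-ins, not vLLM code.

```python
def wait_for_block(timeout_ms: int) -> None:
    # Hypothetical stand-in for the low-level wait that gives up
    # and raises a bare TimeoutError with no message.
    raise TimeoutError


def get_response(method: str) -> None:
    # Simplified stand-in: wrap the low-level timeout in a more
    # descriptive one, keeping the original as the explicit cause.
    try:
        wait_for_block(timeout_ms=0)
    except TimeoutError as e:
        raise TimeoutError(f"RPC call to {method} timed out.") from e


def describe_chain(exc: BaseException) -> list[str]:
    # Walk the explicit cause chain (__cause__) from the outermost
    # exception down to the root cause.
    chain = []
    current = exc
    while current is not None:
        name = type(current).__name__
        msg = str(current)
        chain.append(f"{name}: {msg}" if msg else name)
        current = current.__cause__
    return chain


try:
    get_response("sample_tokens")
except TimeoutError as err:
    caught = err
    chain = describe_chain(err)

for entry in chain:
    print(entry)
```

The first entry printed is the wrapper seen at the bottom of the traceback and the last is the root cause, so when reading such logs the earliest traceback in the chain is usually the place to start.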

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 25.10 (x86_64)
GCC version                  : (Ubuntu 15.2.0-4ubuntu4) 15.2.0
Clang version                : Could not collect
CMake version                : version 4.3.1
Libc version                 : glibc-2.42

==============================
       PyTorch Info
==============================
PyTorch version              : 2.11.0+xpu
Is debug build               : False
CUDA used to build PyTorch   : None
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.13 | packaged by Anaconda, Inc. | (main, Mar 19 2026, 20:20:58) [GCC 14.3.0] (64-bit runtime)
Python platform              : Linux-6.17.0-20-generic-x86_64-with-glibc2.42

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : False
CUDA runtime version         : No CUDA
CUDA_MODULE_LOADING set to   : N/A
GPU models and configuration : No CUDA
Nvidia driver version        : No CUDA
cuDNN version                : No CUDA
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           52 bits physical, 57 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  48
On-line CPU(s) list:                     0-47
Vendor ID:                               GenuineIntel
Model name:                              Intel(R) Xeon(R) w7-3455
CPU family:                              6
Model:                                   143
Thread(s) per core:                      2
Core(s) per socket:                      24
Socket(s):                               1
Stepping:                                5
CPU(s) scaling MHz:                      25%
CPU max MHz:                             4800.0000
CPU min MHz:                             800.0000
BogoMIPS:                                4992.00
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect user_shstk avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req vnmi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr ibt amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
Virtualization:                          VT-x
L1d cache:                               1.1 MiB (24 instances)
L1i cache:                               768 KiB (24 instances)
L2 cache:                                48 MiB (24 instances)
L3 cache:                                67.5 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-47
Vulnerability Gather data sampling:      Not affected
Vulnerability Ghostwrite:                Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Old microcode:             Vulnerable
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; PBRSB-eIBRS SW sequence; BHI BHI_DIS_S
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Mitigation; IBPB before exit to userspace

==============================
Versions of relevant libraries
==============================
[pip3] numpy==2.4.3
[pip3] pyzmq==27.1.0
[pip3] torch==2.11.0+xpu
[pip3] torchao==0.17.0+xpu
[pip3] torchaudio==2.11.0+xpu
[pip3] torchvision==0.26.0+xpu
[pip3] transformers==4.57.6
[pip3] triton==3.6.0
[pip3] triton-xpu==3.7.0
[conda] mkl                                         2025.3.1                        pypi_0           pypi
[conda] numpy                                       2.4.3                           pypi_0           pypi
[conda] onemkl-license                              2025.3.1                        pypi_0           pypi
[conda] onemkl-sycl-blas                            2025.3.1                        pypi_0           pypi
[conda] onemkl-sycl-dft                             2025.3.1                        pypi_0           pypi
[conda] onemkl-sycl-lapack                          2025.3.1                        pypi_0           pypi
[conda] onemkl-sycl-rng                             2025.3.1                        pypi_0           pypi
[conda] onemkl-sycl-sparse                          2025.3.1                        pypi_0           pypi
[conda] pyzmq                                       27.1.0                          pypi_0           pypi
[conda] torch                                       2.11.0+xpu                      pypi_0           pypi
[conda] torchao                                     0.17.0+xpu                      pypi_0           pypi
[conda] torchaudio                                  2.11.0+xpu                      pypi_0           pypi
[conda] torchvision                                 0.26.0+xpu                      pypi_0           pypi
[conda] transformers                                4.57.6                          pypi_0           pypi
[conda] triton                                      3.6.0                           pypi_0           pypi
[conda] triton-xpu                                  3.7.0                           pypi_0           pypi

==============================
         vLLM Info
==============================
ROCM Version                 : Could not collect
vLLM Version                 : 0.19.1rc1.dev29+g93726b2a1 (git sha: 93726b2a1)
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
  Could not collect

==============================
     Environment Variables
==============================
VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
VLLM_WORKER_MULTIPROC_METHOD=spawn
VLLM_XPU_ENABLE_XPU_GRAPH=1
LD_LIBRARY_PATH=/opt/intel/oneapi/tcm/1.4/lib:/opt/intel/oneapi/umf/1.0/lib:/opt/intel/oneapi/tbb/2022.3/env/../lib/intel64/gcc4.8:/opt/intel/oneapi/mpi/2021.17/opt/mpi/libfabric/lib:/opt/intel/oneapi/mpi/2021.17/lib:/opt/intel/oneapi/mkl/2025.3/lib:/opt/intel/oneapi/ippcp/2025.3/lib/:/opt/intel/oneapi/ipp/2022.3/lib:/opt/intel/oneapi/dnnl/2025.3/lib:/opt/intel/oneapi/debugger/2025.3/opt/debugger/lib:/opt/intel/oneapi/dal/2025.10/lib:/opt/intel/oneapi/compiler/2025.3/opt/compiler/lib:/opt/intel/oneapi/compiler/2025.3/lib:/opt/intel/oneapi/ccl/2021.17/lib/:/home/cjai/miniforge3/envs/vllm/lib:/opt/intel/oneapi/tcm/1.4/lib:/opt/intel/oneapi/umf/1.0/lib:/opt/intel/oneapi/tbb/2022.3/env/../lib/intel64/gcc4.8:/opt/intel/oneapi/mpi/2021.17/opt/mpi/libfabric/lib:/opt/intel/oneapi/mpi/2021.17/lib:/opt/intel/oneapi/mkl/2025.3/lib:/opt/intel/oneapi/ippcp/2025.3/lib/:/opt/intel/oneapi/ipp/2022.3/lib:/opt/intel/oneapi/dnnl/2025.3/lib:/opt/intel/oneapi/debugger/2025.3/opt/debugger/lib:/opt/intel/oneapi/dal/2025.10/lib:/opt/intel/oneapi/compiler/2025.3/opt/compiler/lib:/opt/intel/oneapi/compiler/2025.3/lib:/opt/intel/oneapi/ccl/2021.17/lib/
OMP_NUM_THREADS=48
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_cjai

</details>

🐛 Describe the bug

vllm serve --model /data/models/gpt-oss-20b/ --enforce-eager --port 8000 --host 0.0.0.0 --trust-remote-code --disable-sliding-window --gpu-memory-util=0.87 --max-num-batched-tokens=8192 --max-model-len=8192 --block-size 64 -tp=2 -pp 2 --distributed-executor-backend=mp

WARNING 04-04 18:12:14 [argparse_utils.py:191] With `vllm serve`, you should provide the model as a positional argument or in a config file instead of via the `--model` option. The `--model` option will be removed in v0.13.
(APIServer pid=335596) INFO 04-04 18:12:14 [utils.py:299]
(APIServer pid=335596) INFO 04-04 18:12:14 [utils.py:299]        █     █     █▄   ▄█
(APIServer pid=335596) INFO 04-04 18:12:14 [utils.py:299]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.19.1rc1.dev29+g93726b2a1
(APIServer pid=335596) INFO 04-04 18:12:14 [utils.py:299]   █▄█▀ █     █     █     █  model   /data/models/gpt-oss-20b/
(APIServer pid=335596) INFO 04-04 18:12:14 [utils.py:299]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=335596) INFO 04-04 18:12:14 [utils.py:299]
(APIServer pid=335596) INFO 04-04 18:12:14 [utils.py:233] non-default args: {'model_tag': '/data/models/gpt-oss-20b/', 'host': '0.0.0.0', 'model': '/data/models/gpt-oss-20b/', 'trust_remote_code': True, 'max_model_len': 8192, 'enforce_eager': True, 'disable_sliding_window': True, 'distributed_executor_backend': 'mp', 'pipeline_parallel_size': 2, 'tensor_parallel_size': 2, 'block_size': 64, 'gpu_memory_utilization': 0.87, 'max_num_batched_tokens': 8192}
(APIServer pid=335596) INFO 04-04 18:12:14 [model.py:554] Resolved architecture: GptOssForCausalLM
(APIServer pid=335596) INFO 04-04 18:12:14 [model.py:1685] Using max model len 8192
(APIServer pid=335596) INFO 04-04 18:12:14 [scheduler.py:238] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=335596) INFO 04-04 18:12:14 [config.py:126] Overriding max cuda graph capture size to 1024 for performance.
(APIServer pid=335596) INFO 04-04 18:12:14 [vllm.py:799] Asynchronous scheduling is enabled.
(APIServer pid=335596) WARNING 04-04 18:12:14 [vllm.py:857] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(APIServer pid=335596) WARNING 04-04 18:12:14 [vllm.py:868] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(APIServer pid=335596) INFO 04-04 18:12:14 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['xpu_kernels', 'native'])
(APIServer pid=335596) INFO 04-04 18:12:14 [vllm.py:1046] Cudagraph is disabled under eager mode
(APIServer pid=335596) WARNING 04-04 18:12:14 [xpu.py:190] XPU Graph doesn't support capture communication ops, disabling cudagraph_mode.
(APIServer pid=335596) INFO 04-04 18:12:14 [compilation.py:290] Enabled custom fusions: norm_quant, act_quant
(EngineCore pid=336188) INFO 04-04 18:12:19 [core.py:105] Initializing a V1 LLM engine (v0.19.1rc1.dev29+g93726b2a1) with config: model='/data/models/gpt-oss-20b/', speculative_config=None, tokenizer='/data/models/gpt-oss-20b/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=2, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=mxfp4, quantization_config=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=xpu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='openai_gptoss', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/data/models/gpt-oss-20b/, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'ir_enable_torch_wrap': False, 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': 
False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['xpu_kernels', 'native']), enable_flashinfer_autotune=True, moe_backend='auto')
(EngineCore pid=336188) INFO 04-04 18:12:19 [multiproc_executor.py:137] DP group leader: node_rank=0, node_rank_within_dp=0, master_addr=127.0.0.1, mq_connect_ip=192.168.5.175 (local), world_size=4, local_world_size=4
(Worker pid=336700) INFO 04-04 18:12:24 [parallel_state.py:1400] world_size=4 rank=3 local_rank=3 distributed_init_method=tcp://127.0.0.1:40271 backend=xccl
(Worker pid=336699) INFO 04-04 18:12:24 [parallel_state.py:1400] world_size=4 rank=2 local_rank=2 distributed_init_method=tcp://127.0.0.1:40271 backend=xccl
(Worker pid=336697) INFO 04-04 18:12:24 [parallel_state.py:1400] world_size=4 rank=0 local_rank=0 distributed_init_method=tcp://127.0.0.1:40271 backend=xccl
(Worker pid=336698) INFO 04-04 18:12:24 [parallel_state.py:1400] world_size=4 rank=1 local_rank=1 distributed_init_method=tcp://127.0.0.1:40271 backend=xccl
(Worker pid=336697) INFO 04-04 18:12:25 [parallel_state.py:1712] rank 0 in world size 4 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
2026:04:04-18:12:25:336697 |CCL_WARN| value of CCL_ATL_TRANSPORT changed to be ofi (default:mpi)
2026:04:04-18:12:25:336697 |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2026:04:04-18:12:25:336698 |CCL_WARN| value of CCL_ATL_TRANSPORT changed to be ofi (default:mpi)
2026:04:04-18:12:25:336698 |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2026:04:04-18:12:25:336700 |CCL_WARN| value of CCL_ATL_TRANSPORT changed to be ofi (default:mpi)
2026:04:04-18:12:25:336700 |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2026:04:04-18:12:25:336699 |CCL_WARN| value of CCL_ATL_TRANSPORT changed to be ofi (default:mpi)
2026:04:04-18:12:25:336699 |CCL_WARN| could not get local_idx/count from environment variables, trying to get them from ATL
2026:04:04-18:12:26:336698:[1] |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices
2026:04:04-18:12:26:336699:[2] |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices
2026:04:04-18:12:26:336697:[0] |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices
2026:04:04-18:12:26:336700:[3] |CCL_WARN| topology recognition shows PCIe connection between devices. If this is not correct, you can disable topology recognition, with CCL_TOPO_FABRIC_VERTEX_CONNECTION_CHECK=0. This will assume XeLinks across devices
(Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:26 [gpu_model_runner.py:4735] Starting to load model /data/models/gpt-oss-20b/...
(Worker_PP0_TP1 pid=336698) INFO 04-04 18:12:27 [xpu.py:59] Setting VLLM_KV_CACHE_LAYOUT to 'NHD' for XPU; only NHD layout is supported by XPU attention kernels.
(Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:27 [xpu.py:59] Setting VLLM_KV_CACHE_LAYOUT to 'NHD' for XPU; only NHD layout is supported by XPU attention kernels.
(Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:27 [xpu.py:81] Using Flash Attention backend.
(Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:27 [flash_attn.py:622] Using FlashAttention version 2
(Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:27 [mxfp4.py:352] Using 'XPU' Mxfp4 MoE backend.
(Worker_PP1_TP0 pid=336699) INFO 04-04 18:12:27 [xpu.py:59] Setting VLLM_KV_CACHE_LAYOUT to 'NHD' for XPU; only NHD layout is supported by XPU attention kernels.
(Worker_PP1_TP1 pid=336700) INFO 04-04 18:12:27 [xpu.py:59] Setting VLLM_KV_CACHE_LAYOUT to 'NHD' for XPU; only NHD layout is supported by XPU attention kernels.
Loading safetensors checkpoint shards:   0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  33% Completed | 1/3 [00:00<00:00,  2.23it/s]
Loading safetensors checkpoint shards:  67% Completed | 2/3 [00:00<00:00,  2.29it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00,  1.64it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00,  1.77it/s]
(Worker_PP0_TP0 pid=336697)
(Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:29 [default_loader.py:384] Loading weights took 1.75 seconds
(Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:29 [mxfp4.py:836] Using MoEPrepareAndFinalizeNoDPEPModular
(Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:29 [gpu_model_runner.py:4820] Model loading took 3.85 GiB memory and 2.185793 seconds
(Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:32 [gpu_worker.py:436] Available KV cache memory: 14.86 GiB
(EngineCore pid=336188) INFO 04-04 18:12:32 [kv_cache_utils.py:1319] GPU KV cache size: 1,222,720 tokens
(EngineCore pid=336188) INFO 04-04 18:12:32 [kv_cache_utils.py:1324] Maximum concurrency for 8,192 tokens per request: 149.26x
(Worker_PP0_TP0 pid=336697) INFO 04-04 18:12:32 [utils.py:60] `_KV_CACHE_LAYOUT_OVERRIDE` variable detected. Setting KV cache layout to NHD.
(EngineCore pid=336188) INFO 04-04 18:12:32 [core.py:283] init engine (profile, create kv cache, warmup model) took 3.13 seconds
(EngineCore pid=336188) INFO 04-04 18:12:33 [vllm.py:799] Asynchronous scheduling is enabled.
(EngineCore pid=336188) WARNING 04-04 18:12:33 [vllm.py:857] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(EngineCore pid=336188) WARNING 04-04 18:12:33 [vllm.py:868] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore pid=336188) INFO 04-04 18:12:33 [kernel.py:199] Final IR op priority after setting platform defaults: IrOpPriorityConfig(rms_norm=['xpu_kernels', 'native'])
(EngineCore pid=336188) INFO 04-04 18:12:33 [vllm.py:1046] Cudagraph is disabled under eager mode
(EngineCore pid=336188) WARNING 04-04 18:12:33 [xpu.py:190] XPU Graph doesn't support capture communication ops, disabling cudagraph_mode.
(EngineCore pid=336188) INFO 04-04 18:12:33 [compilation.py:290] Enabled custom fusions: norm_quant, act_quant
(APIServer pid=335596) INFO 04-04 18:12:33 [api_server.py:604] Supported tasks: ['generate']
(APIServer pid=335596) WARNING 04-04 18:12:33 [serving.py:233] For gpt-oss, we ignore --enable-auto-tool-choice and always enable tool use.
(APIServer pid=335596) INFO 04-04 18:12:34 [hf.py:314] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=335596) INFO 04-04 18:12:34 [api_server.py:608] Starting vLLM server on http://0.0.0.0:8000
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:37] Available routes are:
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /openapi.json, Methods: HEAD, GET
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /docs, Methods: HEAD, GET
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: HEAD, GET
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /redoc, Methods: HEAD, GET
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=335596) INFO 04-04 18:12:34 [launcher.py:46] Route: /generative_scoring, Methods: POST
(APIServer pid=335596) INFO:     Started server process [335596]
(APIServer pid=335596) INFO:     Waiting for application startup.
(APIServer pid=335596) INFO:     Application startup complete.
(APIServer pid=335596) INFO:     127.0.0.1:34608 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=335596) INFO:     127.0.0.1:34610 - "POST /v1/completions HTTP/1.1" 200 OK
(Worker_PP0_TP0 pid=336697) /home/cjai/vllm/vllm/distributed/parallel_state.py:664: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:1586.)
(Worker_PP0_TP0 pid=336697)   object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)
(Worker_PP0_TP1 pid=336698) /home/cjai/vllm/vllm/distributed/parallel_state.py:664: UserWarning: The given buffer is not writable, and PyTorch does not support non-writable tensors. This means you can write to the underlying (supposedly non-writable) buffer using the tensor. You may want to copy the buffer to protect its data or make it writable before converting it to a tensor. This type of warning will be suppressed for the rest of this program. (Triggered internally at /pytorch/torch/csrc/utils/tensor_new.cpp:1586.)
(Worker_PP0_TP1 pid=336698)   object_tensor = torch.frombuffer(pickle.dumps(obj), dtype=torch.uint8)
(APIServer pid=335596) INFO 04-04 18:13:04 [loggers.py:259] Engine 000: Avg prompt throughput: 102.4 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=335596) INFO 04-04 18:13:14 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(EngineCore pid=336188) INFO 04-04 18:13:57 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=336188) INFO 04-04 18:14:57 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=336188) INFO 04-04 18:15:57 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=336188) INFO 04-04 18:16:57 [shm_broadcast.py:681] No available shared memory broadcast block found in 60 seconds. This typically happens when some processes are hanging or doing some time-consuming work (e.g. compilation, weight/kv cache quantization).
(EngineCore pid=336188) ERROR 04-04 18:17:54 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.19.1rc1.dev29+g93726b2a1) with config: model='/data/models/gpt-oss-20b/', speculative_config=None, tokenizer='/data/models/gpt-oss-20b/', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=2, pipeline_parallel_size=2, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=True, quantization=mxfp4, quantization_config=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=xpu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='openai_gptoss', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/data/models/gpt-oss-20b/, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'ir_enable_torch_wrap': False, 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [8192], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 
'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': False, 'static_all_moe_layers': []}, kernel_config=KernelConfig(ir_op_priority=IrOpPriorityConfig(rms_norm=['xpu_kernels', 'native']), enable_flashinfer_autotune=True, moe_backend='auto'),
(EngineCore pid=336188) ERROR 04-04 18:17:54 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[], scheduled_cached_reqs=CachedRequestData(req_ids=['cmpl-bench-dd7f173a-0-0-8a43c102'],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[([17],)],num_computed_tokens=[1024],num_output_tokens=[1]), num_scheduled_tokens={cmpl-bench-dd7f173a-0-0-8a43c102: 1}, total_num_scheduled_tokens=1, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[17], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null, new_block_ids_to_zero=null)
(EngineCore pid=336188) ERROR 04-04 18:17:54 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.0008898659966498634, encoder_cache_usage=0.0, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110] EngineCore encountered a fatal error.
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110] Traceback (most recent call last):
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/v1/executor/multiproc_executor.py", line 394, in get_response
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     status, result = mq.dequeue(timeout=dequeue_timeout)
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/distributed/device_communicators/shm_broadcast.py", line 755, in dequeue
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     with self.acquire_read(timeout, indefinite) as buf:
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/miniforge3/envs/vllm/lib/python3.12/contextlib.py", line 137, in __enter__
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     return next(self.gen)
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]            ^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/distributed/device_communicators/shm_broadcast.py", line 674, in acquire_read
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     self._spin_condition.wait(timeout_ms=read_timeout.timeout_ms())
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]                                          ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/distributed/device_communicators/shm_broadcast.py", line 631, in timeout_ms
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     raise TimeoutError
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110] TimeoutError
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110] The above exception was the direct cause of the following exception:
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110] Traceback (most recent call last):
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/v1/engine/core.py", line 1101, in run_engine_core
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     engine_core.run_busy_loop()
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/v1/engine/core.py", line 1142, in run_busy_loop
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     self._process_engine_step()
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/v1/engine/core.py", line 1181, in _process_engine_step
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     outputs, model_executed = self.step_fn()
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]                               ^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/v1/engine/core.py", line 499, in step_with_batch_queue
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     model_output = future.result()
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]                    ^^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/v1/executor/multiproc_executor.py", line 88, in result
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     return super().result()
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]            ^^^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/miniforge3/envs/vllm/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     return self.__get_result()
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/miniforge3/envs/vllm/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     raise self._exception
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/v1/executor/multiproc_executor.py", line 92, in _wait_for_response
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     response = self.aggregate(self.get_response())
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]                               ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]   File "/home/cjai/vllm/vllm/v1/executor/multiproc_executor.py", line 396, in get_response
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110]     raise TimeoutError(f"RPC call to {method} timed out.") from e
(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110] TimeoutError: RPC call to sample_tokens timed out.
(APIServer pid=335596) ERROR 04-04 18:17:54 [async_llm.py:707] AsyncLLM output_handler failed.
(APIServer pid=335596) ERROR 04-04 18:17:54 [async_llm.py:707] Traceback (most recent call last):
(APIServer pid=335596) ERROR 04-04 18:17:54 [async_llm.py:707]   File "/home/cjai/vllm/vllm/v1/engine/async_llm.py", line 663, in output_handler
(APIServer pid=335596) ERROR 04-04 18:17:54 [async_llm.py:707]     outputs = await engine_core.get_output_async()
(APIServer pid=335596) ERROR 04-04 18:17:54 [async_llm.py:707]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=335596) ERROR 04-04 18:17:54 [async_llm.py:707]   File "/home/cjai/vllm/vllm/v1/engine/core_client.py", line 970, in get_output_async
(APIServer pid=335596) ERROR 04-04 18:17:54 [async_llm.py:707]     raise self._format_exception(outputs) from None
(APIServer pid=335596) ERROR 04-04 18:17:54 [async_llm.py:707] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=335596) INFO:     Shutting down
(APIServer pid=335596) INFO:     Waiting for application shutdown.
(APIServer pid=335596) INFO:     Application shutdown complete.
(APIServer pid=335596) INFO:     Finished server process [335596]

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

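EngineCore timed out waiting for the workers to return the result of the sample_tokens RPC over the shared-memory message queue, then shut the server down with EngineDeadError. The run used pipeline_parallel_size=2 and tensor_parallel_size=2 on XPU, and the log does not show which worker stalled or why.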

Guidance

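Check the model execution: Establish whether the workers are slow or hung. shm_broadcast.py logs "No available shared memory broadcast block found in 60 seconds" at 18:16:57, about a minute before the fatal error at 18:17:54, which suggests a worker that stopped responding rather than one that was merely slow.
Investigate communication issues: The failing call is mq.dequeue(timeout=dequeue_timeout) in multiproc_executor.get_response, which waits on the shared-memory queue implemented in shm_broadcast.py. Look at the worker-side log lines to find the rank that never posted its sample_tokens result.
Adjust timeout parameters: dequeue_timeout appears in the traceback as an internal value inside get_response, not as a server option. The RPC timeout is likely controlled through vLLM's environment variables (VLLM_EXECUTE_MODEL_TIMEOUT_SECONDS is the probable candidate; confirm the name in vllm/envs.py of your checkout). Raising it only helps if the workers are slow, not if they are hung.
Verify process synchronization: With pipeline_parallel_size=2 and tensor_parallel_size=2 there are four worker processes. A pipeline stage waiting on data that its peer never sends would produce this symptom. Re-run with pipeline_parallel_size=1 to confirm that the hang is specific to pipeline parallelism, as the issue title states.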
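The two chained tracebacks in the log follow a common pattern: a low-level wait times out, and the caller re-raises it with the RPC name attached. The sketch below reproduces only that pattern with the standard library. `queue.Queue` is a stand-in for vLLM's shared-memory queue, and this `get_response` is a hypothetical helper, not vLLM's implementation.

```python
import queue


def get_response(mq: "queue.Queue", method: str, timeout_s: float):
    """Wait for a worker reply; re-raise a bare timeout with the RPC name attached."""
    try:
        return mq.get(timeout=timeout_s)
    except queue.Empty as e:
        # Same chaining shape as the traceback: the low-level timeout becomes
        # the __cause__ of a TimeoutError that names the RPC method.
        raise TimeoutError(f"RPC call to {method} timed out.") from e


if __name__ == "__main__":
    replies = queue.Queue()  # stand-in queue; no worker ever posts a reply
    try:
        get_response(replies, "sample_tokens", timeout_s=0.1)
    except TimeoutError as err:
        print(err, "| cause:", type(err.__cause__).__name__)
```

The point of the pattern is that the outer message tells you which RPC was pending, while the inner exception only tells you that the queue stayed empty. Neither says why the worker never replied.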

Example

The issue does not include a minimal reproduction beyond the server arguments shown in the log. The traceback localizes the failure to the executor waiting on the shared-memory queue (shm_broadcast.py) for the sample_tokens result, so a useful reproduction keeps the logged arguments fixed and changes one parallelism setting at a time.
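One way to change a single setting at a time is to generate the launch commands from the non-default args printed in the log. This is a minimal sketch: the flag names assume the usual underscore-to-hyphen mapping of those argument names, the model is passed positionally as the startup warning requests, and the choice of variants is illustrative. The script only prints the commands and does not start a server.

```python
import shlex

# Non-default arguments copied from the server log in this issue.
MODEL = "/data/models/gpt-oss-20b/"
BASE_ARGS = {
    "host": "0.0.0.0",
    "max-model-len": 8192,
    "block-size": 64,
    "gpu-memory-utilization": 0.87,
    "max-num-batched-tokens": 8192,
    "distributed-executor-backend": "mp",
}
BASE_FLAGS = ["trust-remote-code", "enforce-eager", "disable-sliding-window"]

# (pipeline_parallel_size, tensor_parallel_size): the failing setup first,
# then one variant without PP and one without TP (illustrative choices).
VARIANTS = [(2, 2), (1, 2), (2, 1)]


def build_command(pp: int, tp: int) -> list[str]:
    """Return the argv for one variant, keeping every other setting fixed."""
    cmd = ["vllm", "serve", MODEL]
    for name, value in BASE_ARGS.items():
        cmd += [f"--{name}", str(value)]
    cmd += [f"--{flag}" for flag in BASE_FLAGS]
    cmd += ["--pipeline-parallel-size", str(pp), "--tensor-parallel-size", str(tp)]
    return cmd


if __name__ == "__main__":
    for pp, tp in VARIANTS:
        print(shlex.join(build_command(pp, tp)))
```

If the variant with pipeline_parallel_size=1 completes the same benchmark request and the others do not, that supports the title's claim that the hang is tied to pipeline parallelism on XPU.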

Notes

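The scheduler dump shows a single running request with num_computed_tokens=1024, num_output_tokens=1, and one scheduled token, so the 1024-token prefill had finished and the stall happened on a subsequent single-token decode step. The run used quantization=mxfp4, enforce_eager=True, and asynchronous scheduling (the failing frame is step_with_batch_queue). The log does not identify the root cause; the TimeoutError and the EngineDeadError in the API server are consequences of the stalled worker, not the cause.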
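To line up the shared-memory warning with the fatal error, the log prefixes can be parsed into timestamps. The sketch below assumes the prefix layout seen in this log, `(Process pid=N) LEVEL MM-DD HH:MM:SS [file.py:line] message`. The year is not present in the log, so `ASSUMED_YEAR` is an assumption, and the sample lines are abbreviated copies of two lines from the issue.

```python
import re
from datetime import datetime

# Prefix layout observed in this log; other vLLM versions may differ.
LINE_RE = re.compile(
    r"^\((?P<proc>\w+) pid=(?P<pid>\d+)\) (?P<level>[A-Z]+) "
    r"(?P<ts>\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(?P<src>[^\]]+)\] ?(?P<msg>.*)$"
)
ASSUMED_YEAR = 2026  # assumption: the log lines carry no year


def parse_line(line, year=ASSUMED_YEAR):
    """Return a dict of prefix fields, or None if the line has a different shape."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["ts"] = datetime.strptime(f"{year}-{rec['ts']}", "%Y-%m-%d %H:%M:%S")
    return rec


def stall_seconds(lines):
    """Seconds between the first shm warning and the first RPC TimeoutError line."""
    warned = failed = None
    for line in lines:
        rec = parse_line(line)
        if rec is None:
            continue
        if warned is None and "No available shared memory broadcast block" in rec["msg"]:
            warned = rec["ts"]
        if failed is None and rec["msg"].startswith("TimeoutError: RPC call"):
            failed = rec["ts"]
    if warned is None or failed is None:
        return None
    return (failed - warned).total_seconds()


SAMPLE = [
    "(EngineCore pid=336188) INFO 04-04 18:16:57 [shm_broadcast.py:681] "
    "No available shared memory broadcast block found in 60 seconds.",
    "(EngineCore pid=336188) ERROR 04-04 18:17:54 [core.py:1110] "
    "TimeoutError: RPC call to sample_tokens timed out.",
]

if __name__ == "__main__":
    print(stall_seconds(SAMPLE))
```

Because the warning itself reports 60 seconds without progress, the stall began at least a minute before the warning's timestamp. Running the same parser over the worker log lines can show which rank logged last.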

Recommendation

First confirm the scope: run the same request with pipeline_parallel_size=1 and check whether it completes. If it does, report the pipeline-parallel hang on XPU upstream with the worker-side logs attached. Increasing the RPC timeout is a reasonable workaround only if the workers turn out to be slow; if a worker is hung, a longer timeout just delays the same failure.


#api #optimization #memory optimization #model loading #environment variable
