vllm - ✅ (Solved) Fix [Bug]: CPU backend crashes with `TypeError: 'function' object is not subscriptable` on first inference request [1 pull request, 1 participant]

GitHub stats
vllm-project/vllm#37546 (fetched 2026-04-08 01:02:16)
Comments: 0
Participants: 1
Timeline: 10
Reactions: 0
Timeline (top): project_v2_item_status_changed ×3, labeled ×2, added_to_project_v2 ×1, closed ×1

Error Message

+ MODEL_DIR=/mnt/models/Qwen3.5-0.8B
+ '[' '!' -d /mnt/models/Qwen3.5-0.8B ']'
+ MODEL_DIR=/mnt/models
+ echo '[WARNING] Model directory /mnt/models/Qwen3.5-0.8B not found, using /mnt/models instead'
+ python3 -m vllm.entrypoints.openai.api_server --port 8080 --served-model-name qwen35-08b mlops-demo-ai-test/qwen35-08b --model /mnt/models --dtype half --enforce-eager --no-enable-prefix-caching
[WARNING] Model directory /mnt/models/Qwen3.5-0.8B not found, using /mnt/models instead
INFO 03-19 09:54:15 [importing.py:44] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
INFO 03-19 09:54:15 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/chat_completion/protocol.py:346: SyntaxWarning: invalid escape sequence '\e'
  "(e.g. 'abcdabcdabcd...' or '\emoji \emoji \emoji ...'). This feature "
/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/completion/protocol.py:176: SyntaxWarning: invalid escape sequence '\e'
  "(e.g. 'abcdabcdabcd...' or '\emoji \emoji \emoji ...'). This feature "
(APIServer pid=1) INFO 03-19 09:54:20 [utils.py:302]
(APIServer pid=1) INFO 03-19 09:54:20 [utils.py:302]        █     █     █▄   ▄█
(APIServer pid=1) INFO 03-19 09:54:20 [utils.py:302]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.17.1
(APIServer pid=1) INFO 03-19 09:54:20 [utils.py:302]   █▄█▀ █     █     █     █  model   /mnt/models
(APIServer pid=1) INFO 03-19 09:54:20 [utils.py:302]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 03-19 09:54:20 [utils.py:302]
(APIServer pid=1) INFO 03-19 09:54:20 [utils.py:238] non-default args: {'port': 8080, 'model': '/mnt/models', 'dtype': 'half', 'enforce_eager': True, 'served_model_name': ['qwen35-08b', 'mlops-demo-ai-test/qwen35-08b'], 'enable_prefix_caching': False}
(APIServer pid=1) WARNING 03-19 09:54:20 [envs.py:1710] Unknown vLLM environment variable detected: VLLM_COMPILE_LEVEL
(APIServer pid=1) INFO 03-19 09:55:03 [model.py:531] Resolved architecture: Qwen3_5ForConditionalGeneration
(APIServer pid=1) WARNING 03-19 09:55:03 [model.py:1892] Casting torch.bfloat16 to
torch.float16. (APIServer pid=1) INFO 03-19 09:55:03 [model.py:1554] Using max model len 262144 (APIServer pid=1) INFO 03-19 09:55:03 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048. (APIServer pid=1) INFO 03-19 09:55:03 [config.py:544] Setting attention block size to 544 tokens to ensure that attention page size is >= mamba page size. (APIServer pid=1) INFO 03-19 09:55:03 [config.py:575] Padding mamba page size by 2.64% to ensure that mamba page size and attention page size are exactly equal. (APIServer pid=1) INFO 03-19 09:55:03 [vllm.py:747] Asynchronous scheduling is enabled. (APIServer pid=1) WARNING 03-19 09:55:03 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none (APIServer pid=1) WARNING 03-19 09:55:03 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored. INFO 03-19 09:55:43 [importing.py:44] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors. INFO 03-19 09:55:43 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available. /opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/chat_completion/protocol.py:346: SyntaxWarning: invalid escape sequence '\e' "(e.g. 'abcdabcdabcd...' or '\emoji \emoji \emoji ...'). This feature " /opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/completion/protocol.py:176: SyntaxWarning: invalid escape sequence '\e' "(e.g. 'abcdabcdabcd...' or '\emoji \emoji \emoji ...'). 
This feature " (EngineCore_DP0 pid=157) INFO 03-19 09:55:49 [core.py:101] Initializing a V1 LLM engine (v0.17.1) with config: model='/mnt/models', speculative_config=None, tokenizer='/mnt/models', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cpu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=qwen35-08b, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': None, 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 
'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': None, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []} (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:210] auto thread-binding list (id, physical core): [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8), (9, 9), (10, 10), (11, 11), (12, 12), (13, 13), (14, 14), (15, 15)] get_mempolicy: Operation not permitted [W319 09:55:58.427354717 utils.cpp:76] Warning: numa_migrate_pages failed. errno: 1 (function init_cpu_threads_env) set_mempolicy: Operation not permitted [W319 09:55:58.427423335 utils.cpp:100] Warning: numa_set_membind failed. errno: 1 (function init_cpu_threads_env) (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP threads binding of Process 157: (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 157, core 0 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 243, core 1 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 244, core 2 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 245, core 3 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 246, core 4 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 247, core 5 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 248, core 6 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 249, core 7 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 250, core 8 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 251, core 9 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 252, core 10 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 
[cpu_worker.py:90] OMP tid: 253, core 11 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 254, core 12 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 255, core 13 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 256, core 14 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 257, core 15 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [parallel_state.py:1393] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.3.4.82:57107 backend=gloo [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A (EngineCore_DP0 pid=157) INFO 03-19 09:56:11 [base.py:106] Offloader set to NoopOffloader (EngineCore_DP0 pid=157) INFO 03-19 09:56:11 [cpu_model_runner.py:62] Starting to load model /mnt/models... (EngineCore_DP0 pid=157) INFO 03-19 09:56:11 [interface.py:272] Using default backend AttentionBackendEnum.TORCH_SDPA for vit attention (EngineCore_DP0 pid=157) INFO 03-19 09:56:11 [mm_encoder_attention.py:215] Using AttentionBackendEnum.TORCH_SDPA for MMEncoderAttention. 
(EngineCore_DP0 pid=157) Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s] (EngineCore_DP0 pid=157) Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:03<00:00, 3.09s/it] (EngineCore_DP0 pid=157) Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:03<00:00, 3.09s/it] (EngineCore_DP0 pid=157) (EngineCore_DP0 pid=157) INFO 03-19 09:56:15 [default_loader.py:293] Loading weights took 3.20 seconds (EngineCore_DP0 pid=157) WARNING 03-19 09:56:15 [utils.py:256] Failed to create oneDNN linear, fallback to torch linear. Exception: could not create a primitive descriptor for the matmul primitive. Run workload with environment variable ONEDNN_VERBOSE=all to get additional diagnostic information. (EngineCore_DP0 pid=157) INFO 03-19 09:56:15 [kv_cache_utils.py:1314] GPU KV cache size: 87,040 tokens (EngineCore_DP0 pid=157) INFO 03-19 09:56:15 [kv_cache_utils.py:1319] Maximum concurrency for 262,144 tokens per request: 1.32x (EngineCore_DP0 pid=157) INFO 03-19 09:56:18 [cpu_model_runner.py:73] Warming up model for the compilation... (EngineCore_DP0 pid=157) INFO 03-19 09:56:25 [cpu_model_runner.py:83] Warming up done. (EngineCore_DP0 pid=157) INFO 03-19 09:56:25 [core.py:282] init engine (profile, create kv cache, warmup model) took 9.99 seconds (EngineCore_DP0 pid=157) INFO 03-19 09:56:26 [vllm.py:747] Asynchronous scheduling is disabled. (EngineCore_DP0 pid=157) WARNING 03-19 09:56:26 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none (EngineCore_DP0 pid=157) WARNING 03-19 09:56:26 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored. (APIServer pid=1) INFO 03-19 09:56:26 [api_server.py:495] Supported tasks: ['generate'] (APIServer pid=1) INFO 03-19 09:56:27 [serving.py:185] Warming up chat template processing... 
(APIServer pid=1) INFO 03-19 09:56:30 [hf.py:318] Detected the chat template content format to be 'string'. You can set --chat-template-content-format to override this. (APIServer pid=1) INFO 03-19 09:56:30 [serving.py:210] Chat template warmup completed in 3482.8ms (APIServer pid=1) INFO 03-19 09:56:30 [api_server.py:500] Starting vLLM API server 0 on http://0.0.0.0:8080 (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:38] Available routes are: (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /openapi.json, Methods: GET, HEAD (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /docs, Methods: GET, HEAD (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /docs/oauth2-redirect, Methods: GET, HEAD (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /redoc, Methods: GET, HEAD (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /tokenize, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /detokenize, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /load, Methods: GET (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /version, Methods: GET (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /health, Methods: GET (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /metrics, Methods: GET (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/models, Methods: GET (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /ping, Methods: GET (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /ping, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /invocations, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/chat/completions, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/chat/completions/render, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/responses, Methods: POST (APIServer pid=1) INFO 
03-19 09:56:30 [launcher.py:47] Route: /v1/responses/{response_id}, Methods: GET (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/responses/{response_id}/cancel, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/completions, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/completions/render, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/messages, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/messages/count_tokens, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /inference/v1/generate, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /scale_elastic_ep, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /is_scaling_elastic_ep, Methods: POST (APIServer pid=1) INFO: Started server process [1] (APIServer pid=1) INFO: Waiting for application startup. (APIServer pid=1) INFO: Application startup complete. 
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.17.1) with config: model='/mnt/models', speculative_config=None, tokenizer='/mnt/models', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cpu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=qwen35-08b, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': None, 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 
'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': None, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}, (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=cmpl-a13cefe28c6051ba-0-8061de83,prompt_token_ids_len=1,prefill_token_ids_len=None,mm_features=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[248044], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=16, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, structured_outputs=None, extra_args=None),block_ids=([1], [2], [3], [4]),num_computed_tokens=0,lora_request=None,prompt_embeds_shape=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[],num_computed_tokens=[],num_output_tokens=[]), num_scheduled_tokens={cmpl-a13cefe28c6051ba-0-8061de83: 1}, total_num_scheduled_tokens=1, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 0], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null, new_block_ids_to_zero=[4]) (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.006240249609984372, 
encoder_cache_usage=0.0, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] Traceback (most recent call last):
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1093, in run_engine_core
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     engine_core.run_busy_loop()
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1128, in run_busy_loop
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     self._process_engine_step()
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1165, in _process_engine_step
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 397, in step
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     model_output = future.result()
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]                    ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     return self.__get_result()
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     raise self._exception
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 80, in collective_rpc
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     return func(*args, **kwargs)
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/worker_base.py", line 365, in execute_model
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     return self.worker.execute_model(scheduler_output)
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     return func(*args, **kwargs)
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 728, in execute_model
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     output = self.model_runner.execute_model(
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     return func(*args, **kwargs)
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3436, in execute_model
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     self._update_states(scheduler_output)
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 978, in _update_states
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     self._zero_block_ids(scheduler_output.new_block_ids_to_zero)
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 940, in _zero_block_ids
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     self._kv_block_zeroer.zero_block_ids(block_ids)
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/utils.py", line 210, in zero_block_ids
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     _zero_kv_blocks_kernel[grid](
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     ~~~~~~~~~~~~~~~~~~~~~~^^^^^^
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] TypeError: 'function' object is not subscriptable
(EngineCore_DP0 pid=157) Process EngineCore_DP0:
(EngineCore_DP0 pid=157) Traceback (most recent call last):
(EngineCore_DP0 pid=157)   File "/opt/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=157)     self.run()
(EngineCore_DP0 pid=157)   File "/opt/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=157)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1104, in run_engine_core
(EngineCore_DP0 pid=157)     raise e
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1093, in run_engine_core
(EngineCore_DP0 pid=157)     engine_core.run_busy_loop()
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1128, in run_busy_loop
(EngineCore_DP0 pid=157)     self._process_engine_step()
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1165, in _process_engine_step
(EngineCore_DP0 pid=157)     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=157)                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 397, in step
(EngineCore_DP0 pid=157)     model_output = future.result()
(EngineCore_DP0 pid=157)                    ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157)   File "/opt/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore_DP0 pid=157)     return self.__get_result()
(EngineCore_DP0 pid=157)            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157)   File "/opt/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=157)     raise self._exception
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 80, in collective_rpc
(EngineCore_DP0 pid=157)     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=157)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore_DP0 pid=157)     return func(*args, **kwargs)
(EngineCore_DP0 pid=157)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/worker_base.py", line 365, in execute_model
(EngineCore_DP0 pid=157)     return self.worker.execute_model(scheduler_output)
(EngineCore_DP0 pid=157)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=157)     return func(*args, **kwargs)
(EngineCore_DP0 pid=157)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 728, in execute_model
(EngineCore_DP0 pid=157)     output = self.model_runner.execute_model(
(EngineCore_DP0 pid=157)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=157)     return func(*args, **kwargs)
(EngineCore_DP0 pid=157)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3436, in execute_model
(EngineCore_DP0 pid=157)     self._update_states(scheduler_output)
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 978, in _update_states
(EngineCore_DP0 pid=157)     self._zero_block_ids(scheduler_output.new_block_ids_to_zero)
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 940, in _zero_block_ids
(EngineCore_DP0 pid=157)     self._kv_block_zeroer.zero_block_ids(block_ids)
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/utils.py", line 210, in zero_block_ids
(EngineCore_DP0 pid=157)     _zero_kv_blocks_kernel[grid](
(EngineCore_DP0 pid=157)     ~~~~~~~~~~~~~~~~~~~~~~^^^^^^
(EngineCore_DP0 pid=157) TypeError: 'function' object is not subscriptable
(APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708] AsyncLLM output_handler failed.
(APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708] Traceback (most recent call last):
(APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 664, in output_handler
(APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708]     outputs = await engine_core.get_output_async()
(APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 1009, in get_output_async
(APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708]     raise self._format_exception(outputs) from None
(APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=1) INFO: 127.0.0.1:47458 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=1) INFO: Shutting down
(APIServer pid=1) INFO: Waiting for application shutdown.
(APIServer pid=1) INFO: Application shutdown complete.
(APIServer pid=1) INFO: Finished server process [1]
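
The traceback ends at `_zero_kv_blocks_kernel[grid](`, and the startup log reports that Triton was disabled because no active driver was found. One plausible reading, which nothing on this page confirms, is that on the CPU backend the kernel name ends up bound to a plain Python function, so the `kernel[grid](...)` launch syntax fails. The sketch below reproduces that failure mode in plain Python and shows one possible guard. `placeholder_jit`, `zero_blocks_kernel`, `launch`, and `launch_with_fallback` are hypothetical names made up for illustration; they are not vLLM or Triton APIs, and the guard is not necessarily what the linked pull request does.

```python
# Illustrative sketch only: every name here is hypothetical, not vLLM or Triton code.


def placeholder_jit(fn):
    """No-op stand-in for a JIT decorator when no compiler backend is active."""
    return fn


@placeholder_jit
def zero_blocks_kernel(buffer, block_ids):
    """Plain-Python body that zeroes the selected entries of `buffer`."""
    for block_id in block_ids:
        buffer[block_id] = 0


def launch(kernel, grid, *args):
    """Launch with the `kernel[grid](...)` syntax that a compiled kernel object supports."""
    return kernel[grid](*args)


def launch_with_fallback(kernel, grid, *args):
    """Use the `[grid]` launch only when the object supports subscripting."""
    if hasattr(kernel, "__getitem__"):
        return kernel[grid](*args)
    # Plain function: call it directly and ignore the launch grid.
    return kernel(*args)


buffer = [7, 7, 7, 7]

try:
    launch(zero_blocks_kernel, (1,), buffer, [3])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)  # 'function' object is not subscriptable

launch_with_fallback(zero_blocks_kernel, (1,), buffer, [3])
print(buffer)  # [7, 7, 7, 0]
```

If this reading is right, the crash would only surface on the first request, because that is the first time the scheduler asks the worker to zero newly allocated blocks (`new_block_ids_to_zero=[4]` in the dump above).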

Root Cause

<details>
<summary>TypeError: 'function' object is not subscriptable</summary>

```bash
+ python3 -m vllm.entrypoints.openai.api_server --port 8080 --served-model-name qwen35-08b mlops-demo-ai-test/qwen35-08b --model /mnt/models --dtype half --enforce-eager --no-enable-prefix-caching
INFO 03-19 09:54:15 [importing.py:44] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
INFO 03-19 09:54:15 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
[...]
(EngineCore_DP0 pid=157) INFO 03-19 09:56:11 [cpu_model_runner.py:62] Starting to load model /mnt/models...
[...]
(APIServer pid=1) INFO: Application startup complete.
[...]
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] Traceback (most recent call last):
[...]
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3436, in execute_model
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     self._update_states(scheduler_output)
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 978, in _update_states
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     self._zero_block_ids(scheduler_output.new_block_ids_to_zero)
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 940, in _zero_block_ids
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     self._kv_block_zeroer.zero_block_ids(block_ids)
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/utils.py", line 210, in zero_block_ids
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     _zero_kv_blocks_kernel[grid](
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     ~~~~~~~~~~~~~~~~~~~~~~^^^^^^
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] TypeError: 'function' object is not subscriptable
[...]
(APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708] AsyncLLM output_handler failed.
[...]
(APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=1) INFO: 127.0.0.1:47458 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=1) INFO: Shutting down
(APIServer pid=1) INFO: Waiting for application shutdown.
(APIServer pid=1) INFO: Application shutdown complete.
(APIServer pid=1) INFO: Finished server process [1]
```

</details>

Fix Action

Fixed

PR fix notes

PR #37550: [Bugfix] Fix CPU backend crash in KV cache block zeroing

Description (problem / solution / changelog)

  • Override _zero_block_ids in CPUModelRunner with a pure PyTorch implementation to avoid calling the Triton GPU kernel (_zero_kv_blocks_kernel), which crashes on CPU nodes without an active GPU driver.

    • The Triton block-zeroing kernel was introduced in #35219 (March 10), but CPUModelRunner lacked a CPU-safe fallback. This caused a TypeError: 'function' object is not subscriptable on the first inference request for all models using the CPU backend.
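
The override described above can be sketched roughly as follows. This is not the code from the PR: `BaseRunner`, `CpuRunner`, and the `kv_caches` attribute are hypothetical stand-ins, and the sketch assumes each cache tensor is indexed by block ID along its first dimension. Only the idea (replace the kernel launch with in-place `zero_()` calls on CPU) comes from the PR notes.

```python
import torch


class BaseRunner:
    """Hypothetical stand-in for a GPU-oriented model runner."""

    def __init__(self, kv_caches):
        # Assumption: each tensor has shape (num_blocks, ...).
        self.kv_caches = kv_caches

    def _zero_block_ids(self, block_ids):
        # Stand-in for the GPU kernel launch, which is unusable without a GPU driver.
        raise RuntimeError("GPU kernel path is unavailable on this host")


class CpuRunner(BaseRunner):
    """Hypothetical CPU runner that overrides block zeroing with plain PyTorch."""

    def _zero_block_ids(self, block_ids):
        for cache in self.kv_caches:
            for block_id in block_ids:
                # Integer indexing returns a view, so zero_() clears the block in place.
                cache[block_id].zero_()


caches = [torch.ones(4, 2, 3) for _ in range(2)]
runner = CpuRunner(caches)
runner._zero_block_ids([1, 3])

for cache in caches:
    print(cache.sum(dim=(1, 2)).tolist())
```

Because the override lives in the subclass, the GPU path is left untouched, which fits the small diff reported for `cpu_model_runner.py`.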

Closes #37546

Test plan

  • Verified syntax and pre-commit hooks pass
  • Implemented a minimal override using PyTorch (tensor.zero_()) to replace the Triton kernel path only for CPU
  • Existing CPU CI tests cover the integration path

Changed files

  • vllm/v1/worker/cpu_model_runner.py (modified, +9/-0)

Code Example

+ MODEL_DIR=/mnt/models/Qwen3.5-0.8B
+ '[' '!' -d /mnt/models/Qwen3.5-0.8B ']'
+ MODEL_DIR=/mnt/models
+ echo '[WARNING] Model directory /mnt/models/Qwen3.5-0.8B not found, using /mnt/models instead'
+ python3 -m vllm.entrypoints.openai.api_server --port 8080 --served-model-name qwen35-08b mlops-demo-ai-test/qwen35-08b --model /mnt/models --dtype half --enforce-eager --no-enable-prefix-caching
[WARNING] Model directory /mnt/models/Qwen3.5-0.8B not found, using /mnt/models instead
INFO 03-19 09:54:15 [importing.py:44] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
INFO 03-19 09:54:15 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/chat_completion/protocol.py:346: SyntaxWarning: invalid escape sequence '\e'
  "(e.g. 'abcdabcdabcd...' or '\emoji \emoji \emoji ...'). This feature "
/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/completion/protocol.py:176: SyntaxWarning: invalid escape sequence '\e'
  "(e.g. 'abcdabcdabcd...' or '\emoji \emoji \emoji ...'). This feature "
(APIServer pid=1) INFO 03-19 09:54:20 [utils.py:302] 
(APIServer pid=1) INFO 03-19 09:54:20 [utils.py:302]        █     █     █▄   ▄█
(APIServer pid=1) INFO 03-19 09:54:20 [utils.py:302]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.17.1
(APIServer pid=1) INFO 03-19 09:54:20 [utils.py:302]   █▄█▀ █     █     █     █  model   /mnt/models
(APIServer pid=1) INFO 03-19 09:54:20 [utils.py:302]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 03-19 09:54:20 [utils.py:302] 
(APIServer pid=1) INFO 03-19 09:54:20 [utils.py:238] non-default args: {'port': 8080, 'model': '/mnt/models', 'dtype': 'half', 'enforce_eager': True, 'served_model_name': ['qwen35-08b', 'mlops-demo-ai-test/qwen35-08b'], 'enable_prefix_caching': False}
(APIServer pid=1) WARNING 03-19 09:54:20 [envs.py:1710] Unknown vLLM environment variable detected: VLLM_COMPILE_LEVEL
(APIServer pid=1) INFO 03-19 09:55:03 [model.py:531] Resolved architecture: Qwen3_5ForConditionalGeneration
(APIServer pid=1) WARNING 03-19 09:55:03 [model.py:1892] Casting torch.bfloat16 to torch.float16.
(APIServer pid=1) INFO 03-19 09:55:03 [model.py:1554] Using max model len 262144
(APIServer pid=1) INFO 03-19 09:55:03 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1) INFO 03-19 09:55:03 [config.py:544] Setting attention block size to 544 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=1) INFO 03-19 09:55:03 [config.py:575] Padding mamba page size by 2.64% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=1) INFO 03-19 09:55:03 [vllm.py:747] Asynchronous scheduling is enabled.
(APIServer pid=1) WARNING 03-19 09:55:03 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(APIServer pid=1) WARNING 03-19 09:55:03 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
INFO 03-19 09:55:43 [importing.py:44] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
INFO 03-19 09:55:43 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/chat_completion/protocol.py:346: SyntaxWarning: invalid escape sequence '\e'
  "(e.g. 'abcdabcdabcd...' or '\emoji \emoji \emoji ...'). This feature "
/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/completion/protocol.py:176: SyntaxWarning: invalid escape sequence '\e'
  "(e.g. 'abcdabcdabcd...' or '\emoji \emoji \emoji ...'). This feature "
(EngineCore_DP0 pid=157) INFO 03-19 09:55:49 [core.py:101] Initializing a V1 LLM engine (v0.17.1) with config: model='/mnt/models', speculative_config=None, tokenizer='/mnt/models', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cpu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=qwen35-08b, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': None, 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 
'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': None, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:210] auto thread-binding list (id, physical core): [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8), (9, 9), (10, 10), (11, 11), (12, 12), (13, 13), (14, 14), (15, 15)]
get_mempolicy: Operation not permitted
[W319 09:55:58.427354717 utils.cpp:76] Warning: numa_migrate_pages failed. errno: 1 (function init_cpu_threads_env)
set_mempolicy: Operation not permitted
[W319 09:55:58.427423335 utils.cpp:100] Warning: numa_set_membind failed. errno: 1 (function init_cpu_threads_env)
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP threads binding of Process 157:
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] 	OMP tid: 157, core 0
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] 	OMP tid: 243, core 1
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] 	OMP tid: 244, core 2
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] 	OMP tid: 245, core 3
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] 	OMP tid: 246, core 4
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] 	OMP tid: 247, core 5
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] 	OMP tid: 248, core 6
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] 	OMP tid: 249, core 7
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] 	OMP tid: 250, core 8
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] 	OMP tid: 251, core 9
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] 	OMP tid: 252, core 10
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] 	OMP tid: 253, core 11
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] 	OMP tid: 254, core 12
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] 	OMP tid: 255, core 13
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] 	OMP tid: 256, core 14
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] 	OMP tid: 257, core 15
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] 
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [parallel_state.py:1393] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.3.4.82:57107 backend=gloo
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore_DP0 pid=157) INFO 03-19 09:56:11 [base.py:106] Offloader set to NoopOffloader
(EngineCore_DP0 pid=157) INFO 03-19 09:56:11 [cpu_model_runner.py:62] Starting to load model /mnt/models...
(EngineCore_DP0 pid=157) INFO 03-19 09:56:11 [interface.py:272] Using default backend AttentionBackendEnum.TORCH_SDPA for vit attention
(EngineCore_DP0 pid=157) INFO 03-19 09:56:11 [mm_encoder_attention.py:215] Using AttentionBackendEnum.TORCH_SDPA for MMEncoderAttention.
(EngineCore_DP0 pid=157) 
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
(EngineCore_DP0 pid=157) 
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:03<00:00,  3.09s/it]
(EngineCore_DP0 pid=157) 
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:03<00:00,  3.09s/it]
(EngineCore_DP0 pid=157) 
(EngineCore_DP0 pid=157) INFO 03-19 09:56:15 [default_loader.py:293] Loading weights took 3.20 seconds
(EngineCore_DP0 pid=157) WARNING 03-19 09:56:15 [utils.py:256] Failed to create oneDNN linear, fallback to torch linear. Exception: could not create a primitive descriptor for the matmul primitive. Run workload with environment variable ONEDNN_VERBOSE=all to get additional diagnostic information.
(EngineCore_DP0 pid=157) INFO 03-19 09:56:15 [kv_cache_utils.py:1314] GPU KV cache size: 87,040 tokens
(EngineCore_DP0 pid=157) INFO 03-19 09:56:15 [kv_cache_utils.py:1319] Maximum concurrency for 262,144 tokens per request: 1.32x
(EngineCore_DP0 pid=157) INFO 03-19 09:56:18 [cpu_model_runner.py:73] Warming up model for the compilation...
(EngineCore_DP0 pid=157) INFO 03-19 09:56:25 [cpu_model_runner.py:83] Warming up done.
(EngineCore_DP0 pid=157) INFO 03-19 09:56:25 [core.py:282] init engine (profile, create kv cache, warmup model) took 9.99 seconds
(EngineCore_DP0 pid=157) INFO 03-19 09:56:26 [vllm.py:747] Asynchronous scheduling is disabled.
(EngineCore_DP0 pid=157) WARNING 03-19 09:56:26 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(EngineCore_DP0 pid=157) WARNING 03-19 09:56:26 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(APIServer pid=1) INFO 03-19 09:56:26 [api_server.py:495] Supported tasks: ['generate']
(APIServer pid=1) INFO 03-19 09:56:27 [serving.py:185] Warming up chat template processing...
(APIServer pid=1) INFO 03-19 09:56:30 [hf.py:318] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1) INFO 03-19 09:56:30 [serving.py:210] Chat template warmup completed in 3482.8ms
(APIServer pid=1) INFO 03-19 09:56:30 [api_server.py:500] Starting vLLM API server 0 on http://0.0.0.0:8080
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:38] Available routes are:
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /docs, Methods: GET, HEAD
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /redoc, Methods: GET, HEAD
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /load, Methods: GET
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /version, Methods: GET
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /health, Methods: GET
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /metrics, Methods: GET
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /ping, Methods: GET
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /ping, Methods: POST
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/completions/render, Methods: POST
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/messages, Methods: POST
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO:     Started server process [1]
(APIServer pid=1) INFO:     Waiting for application startup.
(APIServer pid=1) INFO:     Application startup complete.
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.17.1) with config: model='/mnt/models', speculative_config=None, tokenizer='/mnt/models', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cpu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=qwen35-08b, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': None, 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 
'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': None, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}, 
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=cmpl-a13cefe28c6051ba-0-8061de83,prompt_token_ids_len=1,prefill_token_ids_len=None,mm_features=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[248044], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=16, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, structured_outputs=None, extra_args=None),block_ids=([1], [2], [3], [4]),num_computed_tokens=0,lora_request=None,prompt_embeds_shape=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[],num_computed_tokens=[],num_output_tokens=[]), num_scheduled_tokens={cmpl-a13cefe28c6051ba-0-8061de83: 1}, total_num_scheduled_tokens=1, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 0], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null, new_block_ids_to_zero=[4])
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.006240249609984372, encoder_cache_usage=0.0, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] Traceback (most recent call last):
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1093, in run_engine_core
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     engine_core.run_busy_loop()
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1128, in run_busy_loop
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     self._process_engine_step()
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1165, in _process_engine_step
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 397, in step
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     model_output = future.result()
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]                    ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     return self.__get_result()
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     raise self._exception
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 80, in collective_rpc
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     return func(*args, **kwargs)
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/worker_base.py", line 365, in execute_model
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     return self.worker.execute_model(scheduler_output)
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     return func(*args, **kwargs)
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 728, in execute_model
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     output = self.model_runner.execute_model(
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     return func(*args, **kwargs)
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3436, in execute_model
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     self._update_states(scheduler_output)
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 978, in _update_states
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     self._zero_block_ids(scheduler_output.new_block_ids_to_zero)
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 940, in _zero_block_ids
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     self._kv_block_zeroer.zero_block_ids(block_ids)
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/utils.py", line 210, in zero_block_ids
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     _zero_kv_blocks_kernel[grid](
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     ~~~~~~~~~~~~~~~~~~~~~~^^^^^^
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] TypeError: 'function' object is not subscriptable
(EngineCore_DP0 pid=157) Process EngineCore_DP0:
(EngineCore_DP0 pid=157) Traceback (most recent call last):
(EngineCore_DP0 pid=157)   File "/opt/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=157)     self.run()
(EngineCore_DP0 pid=157)   File "/opt/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=157)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1104, in run_engine_core
(EngineCore_DP0 pid=157)     raise e
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1093, in run_engine_core
(EngineCore_DP0 pid=157)     engine_core.run_busy_loop()
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1128, in run_busy_loop
(EngineCore_DP0 pid=157)     self._process_engine_step()
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1165, in _process_engine_step
(EngineCore_DP0 pid=157)     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=157)                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 397, in step
(EngineCore_DP0 pid=157)     model_output = future.result()
(EngineCore_DP0 pid=157)                    ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157)   File "/opt/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore_DP0 pid=157)     return self.__get_result()
(EngineCore_DP0 pid=157)            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157)   File "/opt/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=157)     raise self._exception
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 80, in collective_rpc
(EngineCore_DP0 pid=157)     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=157)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore_DP0 pid=157)     return func(*args, **kwargs)
(EngineCore_DP0 pid=157)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/worker_base.py", line 365, in execute_model
(EngineCore_DP0 pid=157)     return self.worker.execute_model(scheduler_output)
(EngineCore_DP0 pid=157)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=157)     return func(*args, **kwargs)
(EngineCore_DP0 pid=157)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 728, in execute_model
(EngineCore_DP0 pid=157)     output = self.model_runner.execute_model(
(EngineCore_DP0 pid=157)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=157)     return func(*args, **kwargs)
(EngineCore_DP0 pid=157)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3436, in execute_model
(EngineCore_DP0 pid=157)     self._update_states(scheduler_output)
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 978, in _update_states
(EngineCore_DP0 pid=157)     self._zero_block_ids(scheduler_output.new_block_ids_to_zero)
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 940, in _zero_block_ids
(EngineCore_DP0 pid=157)     self._kv_block_zeroer.zero_block_ids(block_ids)
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/utils.py", line 210, in zero_block_ids
(EngineCore_DP0 pid=157)     _zero_kv_blocks_kernel[grid](
(EngineCore_DP0 pid=157)     ~~~~~~~~~~~~~~~~~~~~~~^^^^^^
(EngineCore_DP0 pid=157) TypeError: 'function' object is not subscriptable
(APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708] AsyncLLM output_handler failed.
(APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708] Traceback (most recent call last):
(APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 664, in output_handler
(APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708]     outputs = await engine_core.get_output_async()
(APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 1009, in get_output_async
(APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708]     raise self._format_exception(outputs) from None
(APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=1) INFO:     127.0.0.1:47458 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=1) INFO:     Shutting down
(APIServer pid=1) INFO:     Waiting for application shutdown.
(APIServer pid=1) INFO:     Application shutdown complete.
(APIServer pid=1) INFO:     Finished server process [1]
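
The log shows Triton being disabled at startup ("0 active driver(s) found") and the crash occurring at `_zero_kv_blocks_kernel[grid](...)`, reached through `gpu_model_runner.py` even though `device_config=cpu`. One plausible reading, which the log itself does not confirm, is that with Triton disabled the kernel name is bound to an ordinary Python function rather than a JIT kernel object, so the `[grid]` launch syntax fails. The sketch below reproduces that failure mode in plain Python. `FakeJitKernel`, `passthrough_jit`, `zero_blocks` and `launch` are hypothetical names for illustration only and are not vLLM or Triton APIs.

```python
class FakeJitKernel:
    """Hypothetical stand-in for a JIT-compiled kernel that supports kernel[grid](...)."""

    def __init__(self, fn):
        self.fn = fn

    def __getitem__(self, grid):
        # Subscripting with a launch grid returns a callable launcher.
        def launcher(*args, **kwargs):
            return self.fn(*args, **kwargs)
        return launcher


def passthrough_jit(fn):
    """Hypothetical placeholder decorator: returns the function unchanged (assumption)."""
    return fn


def zero_blocks(buf, block_ids, block_size):
    """Zero the given fixed-size blocks of a flat list, in place."""
    for block_id in block_ids:
        start = block_id * block_size
        buf[start:start + block_size] = [0] * block_size


def launch(kernel, grid, *args):
    """One possible guard: use kernel[grid] only when the object is subscriptable."""
    if hasattr(kernel, "__getitem__"):
        return kernel[grid](*args)
    return kernel(*args)


grid = (1,)
jit_kernel = FakeJitKernel(zero_blocks)
plain_kernel = passthrough_jit(zero_blocks)

buf = [1] * 8
jit_kernel[grid](buf, [1], 4)  # works: the object defines __getitem__

try:
    plain_kernel[grid](buf, [1], 4)  # a plain function has no __getitem__
except TypeError as exc:
    message = str(exc)

guarded = [1] * 8
launch(plain_kernel, grid, guarded, [0], 4)  # the guard falls back to a direct call
```

The guard in `launch` is only one way to avoid the subscript on a non-JIT object; it is not a description of the fix merged upstream.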

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Your output of `python collect_env.py` here
</details>

🐛 Describe the bug

Environment

  • vLLM version: 0.17.1
  • Python: 3.12
  • PyTorch:
  • Device: CPU (device_config=cpu)
  • Model: Qwen3_5ForConditionalGeneration (Qwen3.5-VL-0.8B)
  • OS: Linux (Kubernetes pod, no GPU)

Description

When running vLLM with the CPU backend, the engine crashes on the first inference request with:

<details> <summary>TypeError: 'function' object is not subscriptable</summary>

```bash
+ MODEL_DIR=/mnt/models/Qwen3.5-0.8B
+ '[' '!' -d /mnt/models/Qwen3.5-0.8B ']'
+ MODEL_DIR=/mnt/models
+ echo '[WARNING] Model directory /mnt/models/Qwen3.5-0.8B not found, using /mnt/models instead'
+ python3 -m vllm.entrypoints.openai.api_server --port 8080 --served-model-name qwen35-08b mlops-demo-ai-test/qwen35-08b --model /mnt/models --dtype half --enforce-eager --no-enable-prefix-caching
[WARNING] Model directory /mnt/models/Qwen3.5-0.8B not found, using /mnt/models instead
INFO 03-19 09:54:15 [importing.py:44] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
INFO 03-19 09:54:15 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/chat_completion/protocol.py:346: SyntaxWarning: invalid escape sequence '\e'
  "(e.g. 'abcdabcdabcd...' or '\emoji \emoji \emoji ...'). This feature "
/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/completion/protocol.py:176: SyntaxWarning: invalid escape sequence '\e'
  "(e.g. 'abcdabcdabcd...' or '\emoji \emoji \emoji ...'). This feature "
(APIServer pid=1) INFO 03-19 09:54:20 [utils.py:302] 
(APIServer pid=1) INFO 03-19 09:54:20 [utils.py:302]        █     █     █▄   ▄█
(APIServer pid=1) INFO 03-19 09:54:20 [utils.py:302]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.17.1
(APIServer pid=1) INFO 03-19 09:54:20 [utils.py:302]   █▄█▀ █     █     █     █  model   /mnt/models
(APIServer pid=1) INFO 03-19 09:54:20 [utils.py:302]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 03-19 09:54:20 [utils.py:302] 
(APIServer pid=1) INFO 03-19 09:54:20 [utils.py:238] non-default args: {'port': 8080, 'model': '/mnt/models', 'dtype': 'half', 'enforce_eager': True, 'served_model_name': ['qwen35-08b', 'mlops-demo-ai-test/qwen35-08b'], 'enable_prefix_caching': False}
(APIServer pid=1) WARNING 03-19 09:54:20 [envs.py:1710] Unknown vLLM environment variable detected: VLLM_COMPILE_LEVEL
(APIServer pid=1) INFO 03-19 09:55:03 [model.py:531] Resolved architecture: Qwen3_5ForConditionalGeneration
(APIServer pid=1) WARNING 03-19 09:55:03 [model.py:1892] Casting torch.bfloat16 to torch.float16.
(APIServer pid=1) INFO 03-19 09:55:03 [model.py:1554] Using max model len 262144
(APIServer pid=1) INFO 03-19 09:55:03 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1) INFO 03-19 09:55:03 [config.py:544] Setting attention block size to 544 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=1) INFO 03-19 09:55:03 [config.py:575] Padding mamba page size by 2.64% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=1) INFO 03-19 09:55:03 [vllm.py:747] Asynchronous scheduling is enabled.
(APIServer pid=1) WARNING 03-19 09:55:03 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(APIServer pid=1) WARNING 03-19 09:55:03 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
INFO 03-19 09:55:43 [importing.py:44] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
INFO 03-19 09:55:43 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/chat_completion/protocol.py:346: SyntaxWarning: invalid escape sequence '\e'
  "(e.g. 'abcdabcdabcd...' or '\emoji \emoji \emoji ...'). This feature "
/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/completion/protocol.py:176: SyntaxWarning: invalid escape sequence '\e'
  "(e.g. 'abcdabcdabcd...' or '\emoji \emoji \emoji ...'). This feature "
(EngineCore_DP0 pid=157) INFO 03-19 09:55:49 [core.py:101] Initializing a V1 LLM engine (v0.17.1) with config: model='/mnt/models', speculative_config=None, tokenizer='/mnt/models', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cpu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False,
enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=qwen35-08b, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': None, 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': None, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []} (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:210] auto thread-binding list (id, physical core): [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8), (9, 9), (10, 10), (11, 11), (12, 12), (13, 13), (14, 14), (15, 15)] get_mempolicy: Operation not permitted [W319 09:55:58.427354717 utils.cpp:76] Warning: numa_migrate_pages failed. errno: 1 (function init_cpu_threads_env) set_mempolicy: Operation not permitted [W319 09:55:58.427423335 utils.cpp:100] Warning: numa_set_membind failed. 
errno: 1 (function init_cpu_threads_env) (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP threads binding of Process 157: (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 157, core 0 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 243, core 1 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 244, core 2 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 245, core 3 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 246, core 4 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 247, core 5 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 248, core 6 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 249, core 7 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 250, core 8 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 251, core 9 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 252, core 10 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 253, core 11 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 254, core 12 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 255, core 13 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 256, core 14 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 257, core 15 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [parallel_state.py:1393] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.3.4.82:57107 backend=gloo [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 [Gloo] Rank 0 is connected to 0 peer ranks. 
Expected number of connected peer ranks is : 0 [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A (EngineCore_DP0 pid=157) INFO 03-19 09:56:11 [base.py:106] Offloader set to NoopOffloader (EngineCore_DP0 pid=157) INFO 03-19 09:56:11 [cpu_model_runner.py:62] Starting to load model /mnt/models... (EngineCore_DP0 pid=157) INFO 03-19 09:56:11 [interface.py:272] Using default backend AttentionBackendEnum.TORCH_SDPA for vit attention (EngineCore_DP0 pid=157) INFO 03-19 09:56:11 [mm_encoder_attention.py:215] Using AttentionBackendEnum.TORCH_SDPA for MMEncoderAttention. (EngineCore_DP0 pid=157) Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s] (EngineCore_DP0 pid=157) Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:03<00:00, 3.09s/it] (EngineCore_DP0 pid=157) Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:03<00:00, 3.09s/it] (EngineCore_DP0 pid=157) (EngineCore_DP0 pid=157) INFO 03-19 09:56:15 [default_loader.py:293] Loading weights took 3.20 seconds (EngineCore_DP0 pid=157) WARNING 03-19 09:56:15 [utils.py:256] Failed to create oneDNN linear, fallback to torch linear. Exception: could not create a primitive descriptor for the matmul primitive. Run workload with environment variable ONEDNN_VERBOSE=all to get additional diagnostic information. 
(EngineCore_DP0 pid=157) INFO 03-19 09:56:15 [kv_cache_utils.py:1314] GPU KV cache size: 87,040 tokens (EngineCore_DP0 pid=157) INFO 03-19 09:56:15 [kv_cache_utils.py:1319] Maximum concurrency for 262,144 tokens per request: 1.32x (EngineCore_DP0 pid=157) INFO 03-19 09:56:18 [cpu_model_runner.py:73] Warming up model for the compilation... (EngineCore_DP0 pid=157) INFO 03-19 09:56:25 [cpu_model_runner.py:83] Warming up done. (EngineCore_DP0 pid=157) INFO 03-19 09:56:25 [core.py:282] init engine (profile, create kv cache, warmup model) took 9.99 seconds (EngineCore_DP0 pid=157) INFO 03-19 09:56:26 [vllm.py:747] Asynchronous scheduling is disabled. (EngineCore_DP0 pid=157) WARNING 03-19 09:56:26 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none (EngineCore_DP0 pid=157) WARNING 03-19 09:56:26 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored. (APIServer pid=1) INFO 03-19 09:56:26 [api_server.py:495] Supported tasks: ['generate'] (APIServer pid=1) INFO 03-19 09:56:27 [serving.py:185] Warming up chat template processing... (APIServer pid=1) INFO 03-19 09:56:30 [hf.py:318] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this. 
(APIServer pid=1) INFO 03-19 09:56:30 [serving.py:210] Chat template warmup completed in 3482.8ms (APIServer pid=1) INFO 03-19 09:56:30 [api_server.py:500] Starting vLLM API server 0 on http://0.0.0.0:8080 (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:38] Available routes are: (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /openapi.json, Methods: GET, HEAD (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /docs, Methods: GET, HEAD (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /docs/oauth2-redirect, Methods: GET, HEAD (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /redoc, Methods: GET, HEAD (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /tokenize, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /detokenize, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /load, Methods: GET (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /version, Methods: GET (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /health, Methods: GET (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /metrics, Methods: GET (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/models, Methods: GET (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /ping, Methods: GET (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /ping, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /invocations, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/chat/completions, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/chat/completions/render, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/responses, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/responses/{response_id}, Methods: GET (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: 
/v1/responses/{response_id}/cancel, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/completions, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/completions/render, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/messages, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/messages/count_tokens, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /inference/v1/generate, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /scale_elastic_ep, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /is_scaling_elastic_ep, Methods: POST (APIServer pid=1) INFO: Started server process [1] (APIServer pid=1) INFO: Waiting for application startup. (APIServer pid=1) INFO: Application startup complete. (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.17.1) with config: model='/mnt/models', speculative_config=None, tokenizer='/mnt/models', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cpu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, 
enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=qwen35-08b, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': None, 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': None, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}, (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=cmpl-a13cefe28c6051ba-0-8061de83,prompt_token_ids_len=1,prefill_token_ids_len=None,mm_features=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[248044], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=16, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, structured_outputs=None, extra_args=None),block_ids=([1], [2], [3], 
[4]),num_computed_tokens=0,lora_request=None,prompt_embeds_shape=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[],num_computed_tokens=[],num_output_tokens=[]), num_scheduled_tokens={cmpl-a13cefe28c6051ba-0-8061de83: 1}, total_num_scheduled_tokens=1, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 0], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null, new_block_ids_to_zero=[4]) (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.006240249609984372, encoder_cache_usage=0.0, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None) (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] EngineCore encountered a fatal error. 
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] Traceback (most recent call last): (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1093, in run_engine_core (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] engine_core.run_busy_loop() (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1128, in run_busy_loop (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] self._process_engine_step() (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1165, in _process_engine_step (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] outputs, model_executed = self.step_fn() (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] ^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 397, in step (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] model_output = future.result() (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] ^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File "/opt/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py", line 449, in result (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] return self.__get_result() (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] ^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File "/opt/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] raise self._exception (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File 
"/opt/venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 80, in collective_rpc (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] result = run_method(self.driver_worker, method, args, kwargs) (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/serial_utils.py", line 459, in run_method (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] return func(*args, **kwargs) (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/worker_base.py", line 365, in execute_model (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] return self.worker.execute_model(scheduler_output) (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File "/opt/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] return func(*args, **kwargs) (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 728, in execute_model (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] output = self.model_runner.execute_model( (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File "/opt/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] return func(*args, 
**kwargs) (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3436, in execute_model (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] self._update_states(scheduler_output) (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 978, in _update_states (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] self._zero_block_ids(scheduler_output.new_block_ids_to_zero) (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 940, in _zero_block_ids (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] self._kv_block_zeroer.zero_block_ids(block_ids) (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/utils.py", line 210, in zero_block_ids (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] _zero_kv_blocks_kernel[grid]( (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] ~~~~~~~~~~~~~~~~~~~~~~^^^^^^ (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] TypeError: 'function' object is not subscriptable (EngineCore_DP0 pid=157) Process EngineCore_DP0: (EngineCore_DP0 pid=157) Traceback (most recent call last): (EngineCore_DP0 pid=157) File "/opt/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap (EngineCore_DP0 pid=157) self.run() (EngineCore_DP0 pid=157) File "/opt/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 108, in run (EngineCore_DP0 pid=157) self._target(*self._args, **self._kwargs) (EngineCore_DP0 pid=157) File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1104, in 
run_engine_core (EngineCore_DP0 pid=157) raise e (EngineCore_DP0 pid=157) File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1093, in run_engine_core (EngineCore_DP0 pid=157) engine_core.run_busy_loop() (EngineCore_DP0 pid=157) File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1128, in run_busy_loop (EngineCore_DP0 pid=157) self._process_engine_step() (EngineCore_DP0 pid=157) File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1165, in _process_engine_step (EngineCore_DP0 pid=157) outputs, model_executed = self.step_fn() (EngineCore_DP0 pid=157) ^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 397, in step (EngineCore_DP0 pid=157) model_output = future.result() (EngineCore_DP0 pid=157) ^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) File "/opt/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py", line 449, in result (EngineCore_DP0 pid=157) return self.__get_result() (EngineCore_DP0 pid=157) ^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) File "/opt/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result (EngineCore_DP0 pid=157) raise self._exception (EngineCore_DP0 pid=157) File "/opt/venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 80, in collective_rpc (EngineCore_DP0 pid=157) result = run_method(self.driver_worker, method, args, kwargs) (EngineCore_DP0 pid=157) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) File "/opt/venv/lib/python3.12/site-packages/vllm/v1/serial_utils.py", line 459, in run_method (EngineCore_DP0 pid=157) return func(*args, **kwargs) (EngineCore_DP0 pid=157) ^^^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/worker_base.py", line 365, in execute_model (EngineCore_DP0 pid=157) return 
self.worker.execute_model(scheduler_output) (EngineCore_DP0 pid=157) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) File "/opt/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context (EngineCore_DP0 pid=157) return func(*args, **kwargs) (EngineCore_DP0 pid=157) ^^^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 728, in execute_model (EngineCore_DP0 pid=157) output = self.model_runner.execute_model( (EngineCore_DP0 pid=157) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) File "/opt/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context (EngineCore_DP0 pid=157) return func(*args, **kwargs) (EngineCore_DP0 pid=157) ^^^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3436, in execute_model (EngineCore_DP0 pid=157) self._update_states(scheduler_output) (EngineCore_DP0 pid=157) File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 978, in _update_states (EngineCore_DP0 pid=157) self._zero_block_ids(scheduler_output.new_block_ids_to_zero) (EngineCore_DP0 pid=157) File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 940, in _zero_block_ids (EngineCore_DP0 pid=157) self._kv_block_zeroer.zero_block_ids(block_ids) (EngineCore_DP0 pid=157) File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/utils.py", line 210, in zero_block_ids (EngineCore_DP0 pid=157) _zero_kv_blocks_kernel[grid]( (EngineCore_DP0 pid=157) ~~~~~~~~~~~~~~~~~~~~~~^^^^^^ (EngineCore_DP0 pid=157) TypeError: 'function' object is not subscriptable (APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708] AsyncLLM output_handler failed. 
(APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708] Traceback (most recent call last): (APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 664, in output_handler (APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708] outputs = await engine_core.get_output_async() (APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 1009, in get_output_async (APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708] raise self._format_exception(outputs) from None (APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause. (APIServer pid=1) INFO: 127.0.0.1:47458 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error (APIServer pid=1) INFO: Shutting down (APIServer pid=1) INFO: Waiting for application shutdown. (APIServer pid=1) INFO: Application shutdown complete. (APIServer pid=1) INFO: Finished server process [1] ```

The server starts and completes warmup, but the engine core dies on the first request, when _update_states tries to zero newly allocated KV cache blocks (the scheduler output above lists new_block_ids_to_zero=[4]).

</details>

Steps to Reproduce

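On a CPU-only host, start the server (port and served model name are set explicitly so they match the request below):

python3 -m vllm.entrypoints.openai.api_server \
  --port 8080 \
  --served-model-name my-model \
  --model /path/to/qwen3.5-vl \
  --dtype half \
  --enforce-eager \
  --no-enable-prefix-caching

Then send any completion request:

curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "prompt": "hello", "max_tokens": 16}'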

Actual Behavior

(EngineCore_DP0) ERROR [core.py:1102]   File ".../vllm/v1/worker/gpu_model_runner.py", line 978, in _update_states
(EngineCore_DP0) ERROR [core.py:1102]     self._zero_block_ids(scheduler_output.new_block_ids_to_zero)
(EngineCore_DP0) ERROR [core.py:1102]   File ".../vllm/v1/worker/gpu_model_runner.py", line 940, in _zero_block_ids
(EngineCore_DP0) ERROR [core.py:1102]     self._kv_block_zeroer.zero_block_ids(block_ids)
(EngineCore_DP0) ERROR [core.py:1102]   File ".../vllm/v1/worker/utils.py", line 210, in zero_block_ids
(EngineCore_DP0) ERROR [core.py:1102]     _zero_kv_blocks_kernel[grid](
(EngineCore_DP0) ERROR [core.py:1102]     ~~~~~~~~~~~~~~~~~~~~~~^^^^^^
(EngineCore_DP0) ERROR [core.py:1102] TypeError: 'function' object is not subscriptable

Expected Behavior

The CPU backend should zero KV cache blocks without launching a Triton GPU kernel.

Root Cause Analysis

Two issues combine to cause this:

1. CPUModelRunner inherits _zero_block_ids from GPUModelRunner without overriding it:

The CPU worker delegates to gpu_worker.py / gpu_model_runner.py for model execution. _zero_block_ids is only implemented in GPUModelRunner using a Triton kernel (_zero_kv_blocks_kernel in utils.py), and CPUModelRunner does not override it with a CPU-safe fallback.

2. Triton is disabled on CPU nodes:

Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.

When Triton is disabled, @triton.jit becomes a no-op decorator returning a plain Python function. Calling _zero_kv_blocks_kernel[grid](...) on a plain function fails because regular functions do not implement __getitem__.
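The failure mode can be reproduced without vLLM or Triton installed. The sketch below is illustrative only: noop_jit is a hypothetical stand-in for the disabled @triton.jit as the issue describes it, the kernel body is a placeholder, and try_launch is a helper invented for this example.

```python
def noop_jit(fn):
    """Hypothetical stand-in for a disabled @triton.jit: returns fn unchanged."""
    return fn


@noop_jit
def _zero_kv_blocks_kernel(*args, **kwargs):
    """Placeholder body; the real kernel is defined in vllm/v1/worker/utils.py."""
    return None


def try_launch(kernel, grid):
    """Try the Triton launch syntax kernel[grid](...) and return the error text, if any."""
    try:
        kernel[grid]()
    except TypeError as exc:
        return str(exc)
    return None


# A plain function has no __getitem__, so subscripting it raises TypeError.
message = try_launch(_zero_kv_blocks_kernel, (1,))
print(message)
```

The printed message matches the last line of the traceback, which is consistent with the kernel having been left as an ordinary function on the CPU node.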


extent analysis

Fix Plan

To fix the issue, override _zero_block_ids in CPUModelRunner with a CPU-safe implementation so the inherited Triton path is never reached.

Here are the steps:

  • Locate the CPUModelRunner class (the startup log references cpu_model_runner.py).
  • Override _zero_block_ids there so it zeroes the requested KV cache blocks with plain PyTorch operations.
  • Confirm the override never reaches _zero_kv_blocks_kernel or any other Triton kernel.

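Example code (a sketch; the kv_caches attribute and the block axis are assumptions, not confirmed against the vLLM source):

import torch

from vllm.v1.worker.gpu_model_runner import GPUModelRunner


class CPUModelRunner(GPUModelRunner):
    # ...

    def _zero_block_ids(self, block_ids: list[int]) -> None:
        # CPU-safe sketch: zero the listed KV cache blocks with plain
        # PyTorch instead of launching the Triton kernel.
        index = torch.tensor(block_ids, dtype=torch.long)
        # ASSUMPTION: self.kv_caches is a list of CPU tensors whose
        # dimension 0 indexes blocks; adjust the axis to the real layout.
        for kv_cache in self.kv_caches:
            kv_cache.index_fill_(0, index, 0)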

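Alternatively, if zeroing newly allocated blocks is not required on the CPU backend, the override can simply return without doing anything. Raising NotImplementedError is not a fix: the method sits on the request path shown in the traceback, so the exception would kill the engine just as the TypeError does.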

Verification

To verify the fix, restart the API server on the CPU host and send a completion request:

python3 -m vllm.entrypoints.openai.api_server \
    --port 8080 \
    --served-model-name my-model \
    --model /path/to/qwen3.5-vl \
    --dtype half \
    --enforce-eager \
    --no-enable-prefix-caching

curl -X POST http://localhost:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "my-model", "prompt": "hello", "max_tokens": 16}'

The request should now return a completion instead of 500 Internal Server Error, and the EngineCore log should no longer show the TypeError.
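The same check can be scripted so it reports the HTTP status rather than relying on reading curl output. This is a minimal sketch using only the Python standard library; the helper names build_completion_request and completion_status are hypothetical, and the URL and model name are the placeholder values from the curl command.

```python
import json
import urllib.error
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=16):
    """Build the POST request that mirrors the curl command."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def completion_status(request, timeout=60.0):
    """Send the request and return the HTTP status code (for example 200 or 500)."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Error responses such as 500 arrive as HTTPError; report their code.
        return err.code
```

With the server running, completion_status(build_completion_request("http://localhost:8080", "my-model", "hello")) would be expected to return 500 before the fix, as in the log, and 200 after it.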

Extra Tips

  • When working with GPU-accelerated code on CPU nodes, ensure that all GPU-specific functionality is properly disabled or overridden.
  • Use logging and debugging tools to identify the root cause of issues and verify fixes.
  • Consider adding unit tests or integration tests to cover scenarios like this and prevent similar issues in the future.
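
As one way to act on the testing tip, a regression test could check that block zeroing falls back to a CPU path whenever the kernel object cannot be launched with kernel[grid](...). The sketch below is not vLLM code: is_launchable, zero_blocks, and FakeJitKernel are hypothetical names used only to illustrate the dispatch being tested.

```python
def is_launchable(kernel):
    """True if the object supports the kernel[grid](...) launch syntax."""
    return hasattr(type(kernel), "__getitem__")


def zero_blocks(kernel, block_ids, fallback):
    """Use the kernel when it is launchable, otherwise call the CPU fallback."""
    if is_launchable(kernel):
        kernel[(len(block_ids),)](block_ids)
        return "kernel"
    fallback(block_ids)
    return "fallback"


class FakeJitKernel:
    """Test double that mimics a compiled kernel's subscript-then-call protocol."""

    def __init__(self):
        self.launched_with = None

    def __getitem__(self, grid):
        def launch(block_ids):
            self.launched_with = (grid, list(block_ids))

        return launch


def plain_function(block_ids):
    """What a kernel looks like when the JIT decorator is a no-op."""
    raise AssertionError("must not be called directly")


zeroed = []
cpu_path = zero_blocks(plain_function, [4], zeroed.extend)
fake = FakeJitKernel()
gpu_path = zero_blocks(fake, [4], zeroed.extend)
```

Here the plain function is routed to the fallback, so zeroed ends up as [4], while the fake kernel is launched with grid (1,) and the fallback is left untouched.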
