vllm - ✅(Solved) Fix [Bug]: CPU backend crashes with `TypeError: 'function' object is not subscriptable` on first inference request [1 pull requests, 1 participants]

vllm • 2026-03-19 10:07:09

GitHub stats

vllm-project/vllm#37546•Fetched 2026-04-08 01:02:16

Author

fyuan1316

Participants

fyuan1316

Timeline (top)

project_v2_item_status_changed ×3, labeled ×2, added_to_project_v2 ×1, closed ×1

Error Message

MODEL_DIR=/mnt/models/Qwen3.5-0.8B
'[' '!' -d /mnt/models/Qwen3.5-0.8B ']'
MODEL_DIR=/mnt/models
echo '[WARNING] Model directory /mnt/models/Qwen3.5-0.8B not found, using /mnt/models instead'
python3 -m vllm.entrypoints.openai.api_server --port 8080 --served-model-name qwen35-08b mlops-demo-ai-test/qwen35-08b --model /mnt/models --dtype half --enforce-eager --no-enable-prefix-caching
[WARNING] Model directory /mnt/models/Qwen3.5-0.8B not found, using /mnt/models instead
INFO 03-19 09:54:15 [importing.py:44] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
INFO 03-19 09:54:15 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/chat_completion/protocol.py:346: SyntaxWarning: invalid escape sequence '\e'
  "(e.g. 'abcdabcdabcd...' or '\emoji \emoji \emoji ...'). This feature "
/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/completion/protocol.py:176: SyntaxWarning: invalid escape sequence '\e'
  "(e.g. 'abcdabcdabcd...' or '\emoji \emoji \emoji ...'). This feature "
(APIServer pid=1) INFO 03-19 09:54:20 [utils.py:302] version 0.17.1
(APIServer pid=1) INFO 03-19 09:54:20 [utils.py:302] model /mnt/models
(APIServer pid=1) INFO 03-19 09:54:20 [utils.py:238] non-default args: {'port': 8080, 'model': '/mnt/models', 'dtype': 'half', 'enforce_eager': True, 'served_model_name': ['qwen35-08b', 'mlops-demo-ai-test/qwen35-08b'], 'enable_prefix_caching': False}
(APIServer pid=1) WARNING 03-19 09:54:20 [envs.py:1710] Unknown vLLM environment variable detected: VLLM_COMPILE_LEVEL
(APIServer pid=1) INFO 03-19 09:55:03 [model.py:531] Resolved architecture: Qwen3_5ForConditionalGeneration
(APIServer pid=1) WARNING 03-19 09:55:03 [model.py:1892] Casting torch.bfloat16 to torch.float16.
(APIServer pid=1) INFO 03-19 09:55:03 [model.py:1554] Using max model len 262144 (APIServer pid=1) INFO 03-19 09:55:03 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048. (APIServer pid=1) INFO 03-19 09:55:03 [config.py:544] Setting attention block size to 544 tokens to ensure that attention page size is >= mamba page size. (APIServer pid=1) INFO 03-19 09:55:03 [config.py:575] Padding mamba page size by 2.64% to ensure that mamba page size and attention page size are exactly equal. (APIServer pid=1) INFO 03-19 09:55:03 [vllm.py:747] Asynchronous scheduling is enabled. (APIServer pid=1) WARNING 03-19 09:55:03 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none (APIServer pid=1) WARNING 03-19 09:55:03 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored. INFO 03-19 09:55:43 [importing.py:44] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors. INFO 03-19 09:55:43 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available. /opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/chat_completion/protocol.py:346: SyntaxWarning: invalid escape sequence '\e' "(e.g. 'abcdabcdabcd...' or '\emoji \emoji \emoji ...'). This feature " /opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/completion/protocol.py:176: SyntaxWarning: invalid escape sequence '\e' "(e.g. 'abcdabcdabcd...' or '\emoji \emoji \emoji ...'). 
This feature " (EngineCore_DP0 pid=157) INFO 03-19 09:55:49 [core.py:101] Initializing a V1 LLM engine (v0.17.1) with config: model='/mnt/models', speculative_config=None, tokenizer='/mnt/models', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cpu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=qwen35-08b, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': None, 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 
'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': None, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []} (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:210] auto thread-binding list (id, physical core): [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8), (9, 9), (10, 10), (11, 11), (12, 12), (13, 13), (14, 14), (15, 15)] get_mempolicy: Operation not permitted [W319 09:55:58.427354717 utils.cpp:76] Warning: numa_migrate_pages failed. errno: 1 (function init_cpu_threads_env) set_mempolicy: Operation not permitted [W319 09:55:58.427423335 utils.cpp:100] Warning: numa_set_membind failed. errno: 1 (function init_cpu_threads_env) (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP threads binding of Process 157: (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 157, core 0 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 243, core 1 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 244, core 2 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 245, core 3 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 246, core 4 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 247, core 5 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 248, core 6 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 249, core 7 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 250, core 8 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 251, core 9 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 252, core 10 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 
[cpu_worker.py:90] OMP tid: 253, core 11 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 254, core 12 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 255, core 13 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 256, core 14 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP tid: 257, core 15 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [parallel_state.py:1393] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.3.4.82:57107 backend=gloo [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 [Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0 (EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A (EngineCore_DP0 pid=157) INFO 03-19 09:56:11 [base.py:106] Offloader set to NoopOffloader (EngineCore_DP0 pid=157) INFO 03-19 09:56:11 [cpu_model_runner.py:62] Starting to load model /mnt/models... (EngineCore_DP0 pid=157) INFO 03-19 09:56:11 [interface.py:272] Using default backend AttentionBackendEnum.TORCH_SDPA for vit attention (EngineCore_DP0 pid=157) INFO 03-19 09:56:11 [mm_encoder_attention.py:215] Using AttentionBackendEnum.TORCH_SDPA for MMEncoderAttention. 
(EngineCore_DP0 pid=157) Loading safetensors checkpoint shards: 0% Completed | 0/1 [00:00<?, ?it/s] (EngineCore_DP0 pid=157) Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:03<00:00, 3.09s/it] (EngineCore_DP0 pid=157) Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:03<00:00, 3.09s/it] (EngineCore_DP0 pid=157) (EngineCore_DP0 pid=157) INFO 03-19 09:56:15 [default_loader.py:293] Loading weights took 3.20 seconds (EngineCore_DP0 pid=157) WARNING 03-19 09:56:15 [utils.py:256] Failed to create oneDNN linear, fallback to torch linear. Exception: could not create a primitive descriptor for the matmul primitive. Run workload with environment variable ONEDNN_VERBOSE=all to get additional diagnostic information. (EngineCore_DP0 pid=157) INFO 03-19 09:56:15 [kv_cache_utils.py:1314] GPU KV cache size: 87,040 tokens (EngineCore_DP0 pid=157) INFO 03-19 09:56:15 [kv_cache_utils.py:1319] Maximum concurrency for 262,144 tokens per request: 1.32x (EngineCore_DP0 pid=157) INFO 03-19 09:56:18 [cpu_model_runner.py:73] Warming up model for the compilation... (EngineCore_DP0 pid=157) INFO 03-19 09:56:25 [cpu_model_runner.py:83] Warming up done. (EngineCore_DP0 pid=157) INFO 03-19 09:56:25 [core.py:282] init engine (profile, create kv cache, warmup model) took 9.99 seconds (EngineCore_DP0 pid=157) INFO 03-19 09:56:26 [vllm.py:747] Asynchronous scheduling is disabled. (EngineCore_DP0 pid=157) WARNING 03-19 09:56:26 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none (EngineCore_DP0 pid=157) WARNING 03-19 09:56:26 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored. (APIServer pid=1) INFO 03-19 09:56:26 [api_server.py:495] Supported tasks: ['generate'] (APIServer pid=1) INFO 03-19 09:56:27 [serving.py:185] Warming up chat template processing... 
(APIServer pid=1) INFO 03-19 09:56:30 [hf.py:318] Detected the chat template content format to be 'string'. You can set --chat-template-content-format to override this. (APIServer pid=1) INFO 03-19 09:56:30 [serving.py:210] Chat template warmup completed in 3482.8ms (APIServer pid=1) INFO 03-19 09:56:30 [api_server.py:500] Starting vLLM API server 0 on http://0.0.0.0:8080 (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:38] Available routes are: (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /openapi.json, Methods: GET, HEAD (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /docs, Methods: GET, HEAD (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /docs/oauth2-redirect, Methods: GET, HEAD (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /redoc, Methods: GET, HEAD (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /tokenize, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /detokenize, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /load, Methods: GET (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /version, Methods: GET (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /health, Methods: GET (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /metrics, Methods: GET (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/models, Methods: GET (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /ping, Methods: GET (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /ping, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /invocations, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/chat/completions, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/chat/completions/render, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/responses, Methods: POST (APIServer pid=1) INFO 
03-19 09:56:30 [launcher.py:47] Route: /v1/responses/{response_id}, Methods: GET (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/responses/{response_id}/cancel, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/completions, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/completions/render, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/messages, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/messages/count_tokens, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /inference/v1/generate, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /scale_elastic_ep, Methods: POST (APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /is_scaling_elastic_ep, Methods: POST (APIServer pid=1) INFO: Started server process [1] (APIServer pid=1) INFO: Waiting for application startup. (APIServer pid=1) INFO: Application startup complete. 
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.17.1) with config: model='/mnt/models', speculative_config=None, tokenizer='/mnt/models', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cpu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=qwen35-08b, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': None, 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 
'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': None, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}, (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=cmpl-a13cefe28c6051ba-0-8061de83,prompt_token_ids_len=1,prefill_token_ids_len=None,mm_features=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[248044], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=16, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, structured_outputs=None, extra_args=None),block_ids=([1], [2], [3], [4]),num_computed_tokens=0,lora_request=None,prompt_embeds_shape=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[],num_computed_tokens=[],num_output_tokens=[]), num_scheduled_tokens={cmpl-a13cefe28c6051ba-0-8061de83: 1}, total_num_scheduled_tokens=1, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 0], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null, new_block_ids_to_zero=[4]) (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.006240249609984372, 
encoder_cache_usage=0.0, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None) (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] EngineCore encountered a fatal error. (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] Traceback (most recent call last): (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1093, in run_engine_core (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] engine_core.run_busy_loop() (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1128, in run_busy_loop (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] self._process_engine_step() (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1165, in _process_engine_step (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] outputs, model_executed = self.step_fn() (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] ^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 397, in step (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] model_output = future.result() (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] ^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File "/opt/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py", line 449, in result (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] return self.__get_result() 
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] ^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File "/opt/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] raise self._exception (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 80, in collective_rpc (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] result = run_method(self.driver_worker, method, args, kwargs) (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/serial_utils.py", line 459, in run_method (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] return func(*args, **kwargs) (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/worker_base.py", line 365, in execute_model (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] return self.worker.execute_model(scheduler_output) (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File "/opt/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] return func(*args, **kwargs) (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 728, in execute_model (EngineCore_DP0 pid=157) 
ERROR 03-19 09:56:41 [core.py:1102] output = self.model_runner.execute_model( (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File "/opt/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] return func(*args, **kwargs) (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] ^^^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3436, in execute_model (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] self._update_states(scheduler_output) (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 978, in _update_states (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] self._zero_block_ids(scheduler_output.new_block_ids_to_zero) (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 940, in _zero_block_ids (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] self._kv_block_zeroer.zero_block_ids(block_ids) (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/utils.py", line 210, in zero_block_ids (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] _zero_kv_blocks_kernel[grid]( (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] ~~~~~~~~~~~~~~~~~~~~~~^^^^^^ (EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] TypeError: 'function' object is not subscriptable (EngineCore_DP0 pid=157) Process EngineCore_DP0: (EngineCore_DP0 pid=157) Traceback (most recent call last): (EngineCore_DP0 pid=157) File 
"/opt/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap (EngineCore_DP0 pid=157) self.run() (EngineCore_DP0 pid=157) File "/opt/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 108, in run (EngineCore_DP0 pid=157) self._target(*self._args, **self._kwargs) (EngineCore_DP0 pid=157) File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1104, in run_engine_core (EngineCore_DP0 pid=157) raise e (EngineCore_DP0 pid=157) File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1093, in run_engine_core (EngineCore_DP0 pid=157) engine_core.run_busy_loop() (EngineCore_DP0 pid=157) File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1128, in run_busy_loop (EngineCore_DP0 pid=157) self._process_engine_step() (EngineCore_DP0 pid=157) File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1165, in _process_engine_step (EngineCore_DP0 pid=157) outputs, model_executed = self.step_fn() (EngineCore_DP0 pid=157) ^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 397, in step (EngineCore_DP0 pid=157) model_output = future.result() (EngineCore_DP0 pid=157) ^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) File "/opt/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py", line 449, in result (EngineCore_DP0 pid=157) return self.__get_result() (EngineCore_DP0 pid=157) ^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) File "/opt/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result (EngineCore_DP0 pid=157) raise self._exception (EngineCore_DP0 pid=157) File "/opt/venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 80, in collective_rpc (EngineCore_DP0 pid=157) result = run_method(self.driver_worker, method, args, kwargs) (EngineCore_DP0 
pid=157) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) File "/opt/venv/lib/python3.12/site-packages/vllm/v1/serial_utils.py", line 459, in run_method (EngineCore_DP0 pid=157) return func(*args, **kwargs) (EngineCore_DP0 pid=157) ^^^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/worker_base.py", line 365, in execute_model (EngineCore_DP0 pid=157) return self.worker.execute_model(scheduler_output) (EngineCore_DP0 pid=157) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) File "/opt/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context (EngineCore_DP0 pid=157) return func(*args, **kwargs) (EngineCore_DP0 pid=157) ^^^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 728, in execute_model (EngineCore_DP0 pid=157) output = self.model_runner.execute_model( (EngineCore_DP0 pid=157) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) File "/opt/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context (EngineCore_DP0 pid=157) return func(*args, **kwargs) (EngineCore_DP0 pid=157) ^^^^^^^^^^^^^^^^^^^^^ (EngineCore_DP0 pid=157) File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3436, in execute_model (EngineCore_DP0 pid=157) self._update_states(scheduler_output) (EngineCore_DP0 pid=157) File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 978, in _update_states (EngineCore_DP0 pid=157) self._zero_block_ids(scheduler_output.new_block_ids_to_zero) (EngineCore_DP0 pid=157) File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 940, in _zero_block_ids (EngineCore_DP0 pid=157) self._kv_block_zeroer.zero_block_ids(block_ids) (EngineCore_DP0 pid=157) File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/utils.py", line 
210, in zero_block_ids (EngineCore_DP0 pid=157) _zero_kv_blocks_kernel[grid]( (EngineCore_DP0 pid=157) ~~~~~~~~~~~~~~~~~~~~~~^^^^^^ (EngineCore_DP0 pid=157) TypeError: 'function' object is not subscriptable (APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708] AsyncLLM output_handler failed. (APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708] Traceback (most recent call last): (APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 664, in output_handler (APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708] outputs = await engine_core.get_output_async() (APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 1009, in get_output_async (APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708] raise self._format_exception(outputs) from None (APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause. (APIServer pid=1) INFO: 127.0.0.1:47458 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error (APIServer pid=1) INFO: Shutting down (APIServer pid=1) INFO: Waiting for application shutdown. (APIServer pid=1) INFO: Application shutdown complete. (APIServer pid=1) INFO: Finished server process [1]
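The traceback above ends at `_zero_kv_blocks_kernel[grid](...)` in `vllm/v1/worker/utils.py`, and the startup log reports that Triton was disabled because no active driver was found. A plausible reading, not confirmed by anything on this page, is that with Triton disabled the kernel name is bound to a plain Python function rather than a compiled kernel object, so the `kernel[grid]` launch syntax fails. The sketch below reproduces that failure mode in isolation. The names `placeholder_jit`, `zero_blocks_kernel`, `try_launch`, and `zero_blocks_fallback` are hypothetical and used only for illustration; they are not vLLM or Triton APIs.

```python
def placeholder_jit(fn):
    """Hypothetical no-op decorator: returns the function unchanged,
    standing in for a JIT decorator when no compiler backend is usable."""
    return fn


@placeholder_jit
def zero_blocks_kernel(buf, block_ids, block_size):
    """Would be a compiled kernel on a GPU; here it is just a function."""
    raise NotImplementedError("no compiled kernel available")


def try_launch(kernel, grid, *args):
    """Attempt the kernel[grid](...) launch syntax and report the outcome."""
    try:
        # Subscripting a plain function raises TypeError before any call happens.
        kernel[grid](*args)
    except TypeError as exc:
        return f"TypeError: {exc}"
    return "launched"


def zero_blocks_fallback(buf, block_ids, block_size):
    """Plain-Python fallback (an assumption): zero each listed block in a flat buffer."""
    for block_id in block_ids:
        start = block_id * block_size
        buf[start:start + block_size] = [0] * block_size
    return buf


if __name__ == "__main__":
    print(try_launch(zero_blocks_kernel, (1,), [1] * 8, [1], 4))
    print(zero_blocks_fallback([1] * 8, [1], 4))
```

Choosing a non-Triton path when the kernel object does not support the launch syntax is one way a crash of this shape could be avoided. The change that actually closed the issue is not reproduced on this page.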

Root Cause

<details>
<summary>TypeError: 'function' object is not subscriptable</summary>

```bash
INFO 03-19 09:54:15 [importing.py:44] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
INFO 03-19 09:54:15 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
[...]
(EngineCore_DP0 pid=157) INFO 03-19 09:56:11 [cpu_model_runner.py:62] Starting to load model /mnt/models...
[...]
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] Traceback (most recent call last):
[...]
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3436, in execute_model
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] self._update_states(scheduler_output)
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 978, in _update_states
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] self._zero_block_ids(scheduler_output.new_block_ids_to_zero)
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 940, in _zero_block_ids
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] self._kv_block_zeroer.zero_block_ids(block_ids)
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/utils.py", line 210, in zero_block_ids
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] _zero_kv_blocks_kernel[grid](
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] ~~~~~~~~~~~~~~~~~~~~~~^^^^^^
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] TypeError: 'function' object is not subscriptable
[...]
(APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=1) INFO: 127.0.0.1:47458 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
```

</details>

Fix Action

Fixed

Fixed by PR: [Bugfix] Fix CPU backend crash in KV cache block zeroing (https://github.com/vllm-project/vllm/pull/37550)

PR fix notes

PR #37550: [Bugfix] Fix CPU backend crash in KV cache block zeroing

Repository: vllm-project/vllm
Author: DorBernsohn
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/37550

Description (problem / solution / changelog)

- Override _zero_block_ids in CPUModelRunner with a pure PyTorch implementation to avoid calling the Triton GPU kernel (_zero_kv_blocks_kernel), which crashes on CPU nodes without an active GPU driver.
- The Triton block-zeroing kernel was introduced in #35219 (March 10), but CPUModelRunner lacked a CPU-safe fallback. This caused a TypeError: 'function' object is not subscriptable on the first inference request for all models using the CPU backend.
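
The error message itself can be reproduced without vLLM or Triton. The sketch below assumes that, once Triton is disabled (as the log reports), the decorator around the kernel leaves it a plain Python function; that detail is an inference from the traceback, not something the PR states. `dummy_jit`, `zero_blocks_kernel`, and `launch` are hypothetical names used only for illustration.

```python
def dummy_jit(fn):
    """Hypothetical stand-in for a disabled JIT decorator: returns fn unchanged."""
    return fn


@dummy_jit
def zero_blocks_kernel(cache, block_ids):
    """Placeholder kernel body; never reached in this sketch."""


def launch(kernel, grid):
    """Try the Triton-style launch syntax kernel[grid](...) and report the outcome."""
    try:
        kernel[grid](None, [])
    except TypeError as exc:
        # Plain functions do not support subscription, so indexing fails
        # before the call is ever attempted.
        return str(exc)
    return "launched"


message = launch(zero_blocks_kernel, (1,))
print(message)  # 'function' object is not subscriptable
```

A real JIT-compiled kernel object supports the `kernel[grid]` launch syntax, which is why the same line works on GPU nodes and only fails when the kernel has degraded to an ordinary function.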

Closes #37546

Test plan

- Verified syntax and pre-commit hooks pass
- Implemented a minimal override using PyTorch (tensor.zero_()) to replace the Triton kernel path only for CPU
- Existing CPU CI tests cover the integration path
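
The shape of the fix, a subclass override that bypasses the kernel launch, can be sketched as follows. This is not the code from the PR: the class names `BaseRunnerSketch` and `CPURunnerSketch` are hypothetical, and the KV cache is modeled as plain Python lists so the sketch runs without PyTorch, whereas the PR describes using `tensor.zero_()`.

```python
def kernel_without_triton(cache, block_ids):
    """Assumed state of the kernel on a CPU node: an ordinary function."""


class BaseRunnerSketch:
    """Hypothetical base runner whose block zeroing goes through a kernel launch."""

    def __init__(self, num_blocks, block_size):
        # One list per KV block, pre-filled with stale non-zero values.
        self.kv_cache = [[1.0] * block_size for _ in range(num_blocks)]

    def _zero_block_ids(self, block_ids):
        grid = (len(block_ids),)
        # Triton-style launch; raises TypeError when the kernel is a plain function.
        kernel_without_triton[grid](self.kv_cache, block_ids)


class CPURunnerSketch(BaseRunnerSketch):
    """Hypothetical CPU runner that overrides the zeroing step."""

    def _zero_block_ids(self, block_ids):
        # CPU-safe path: overwrite each requested block in place, no kernel involved.
        for block_id in block_ids:
            block = self.kv_cache[block_id]
            block[:] = [0.0] * len(block)


cpu_runner = CPURunnerSketch(num_blocks=6, block_size=4)
cpu_runner._zero_block_ids([4])  # block id taken from new_block_ids_to_zero=[4] in the log
```

Because only `_zero_block_ids` is overridden, the rest of the inherited request-handling path stays unchanged, which is consistent with the small `+9/-0` diff reported for the PR.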

Changed files

vllm/v1/worker/cpu_model_runner.py (modified, +9/-0)

Code Example

+ MODEL_DIR=/mnt/models/Qwen3.5-0.8B
+ '[' '!' -d /mnt/models/Qwen3.5-0.8B ']'
+ MODEL_DIR=/mnt/models
+ echo '[WARNING] Model directory /mnt/models/Qwen3.5-0.8B not found, using /mnt/models instead'
+ python3 -m vllm.entrypoints.openai.api_server --port 8080 --served-model-name qwen35-08b mlops-demo-ai-test/qwen35-08b --model /mnt/models --dtype half --enforce-eager --no-enable-prefix-caching
[WARNING] Model directory /mnt/models/Qwen3.5-0.8B not found, using /mnt/models instead
INFO 03-19 09:54:15 [importing.py:44] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
INFO 03-19 09:54:15 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/chat_completion/protocol.py:346: SyntaxWarning: invalid escape sequence '\e'
  "(e.g. 'abcdabcdabcd...' or '\emoji \emoji \emoji ...'). This feature "
/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/completion/protocol.py:176: SyntaxWarning: invalid escape sequence '\e'
  "(e.g. 'abcdabcdabcd...' or '\emoji \emoji \emoji ...'). This feature "
(APIServer pid=1) INFO 03-19 09:54:20 [utils.py:302] 
(APIServer pid=1) INFO 03-19 09:54:20 [utils.py:302]        █     █     █▄   ▄█
(APIServer pid=1) INFO 03-19 09:54:20 [utils.py:302]  ▄▄ ▄█ █     █     █ ▀▄▀ █  version 0.17.1
(APIServer pid=1) INFO 03-19 09:54:20 [utils.py:302]   █▄█▀ █     █     █     █  model   /mnt/models
(APIServer pid=1) INFO 03-19 09:54:20 [utils.py:302]    ▀▀  ▀▀▀▀▀ ▀▀▀▀▀ ▀     ▀
(APIServer pid=1) INFO 03-19 09:54:20 [utils.py:302] 
(APIServer pid=1) INFO 03-19 09:54:20 [utils.py:238] non-default args: {'port': 8080, 'model': '/mnt/models', 'dtype': 'half', 'enforce_eager': True, 'served_model_name': ['qwen35-08b', 'mlops-demo-ai-test/qwen35-08b'], 'enable_prefix_caching': False}
(APIServer pid=1) WARNING 03-19 09:54:20 [envs.py:1710] Unknown vLLM environment variable detected: VLLM_COMPILE_LEVEL
(APIServer pid=1) INFO 03-19 09:55:03 [model.py:531] Resolved architecture: Qwen3_5ForConditionalGeneration
(APIServer pid=1) WARNING 03-19 09:55:03 [model.py:1892] Casting torch.bfloat16 to torch.float16.
(APIServer pid=1) INFO 03-19 09:55:03 [model.py:1554] Using max model len 262144
(APIServer pid=1) INFO 03-19 09:55:03 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=1) INFO 03-19 09:55:03 [config.py:544] Setting attention block size to 544 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=1) INFO 03-19 09:55:03 [config.py:575] Padding mamba page size by 2.64% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=1) INFO 03-19 09:55:03 [vllm.py:747] Asynchronous scheduling is enabled.
(APIServer pid=1) WARNING 03-19 09:55:03 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(APIServer pid=1) WARNING 03-19 09:55:03 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
INFO 03-19 09:55:43 [importing.py:44] Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton to prevent runtime errors.
INFO 03-19 09:55:43 [importing.py:68] Triton not installed or not compatible; certain GPU-related functions will not be available.
/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/chat_completion/protocol.py:346: SyntaxWarning: invalid escape sequence '\e'
  "(e.g. 'abcdabcdabcd...' or '\emoji \emoji \emoji ...'). This feature "
/opt/venv/lib/python3.12/site-packages/vllm/entrypoints/openai/completion/protocol.py:176: SyntaxWarning: invalid escape sequence '\e'
  "(e.g. 'abcdabcdabcd...' or '\emoji \emoji \emoji ...'). This feature "
(EngineCore_DP0 pid=157) INFO 03-19 09:55:49 [core.py:101] Initializing a V1 LLM engine (v0.17.1) with config: model='/mnt/models', speculative_config=None, tokenizer='/mnt/models', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cpu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=qwen35-08b, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': None, 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 
'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': None, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:210] auto thread-binding list (id, physical core): [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), (6, 6), (7, 7), (8, 8), (9, 9), (10, 10), (11, 11), (12, 12), (13, 13), (14, 14), (15, 15)]
get_mempolicy: Operation not permitted
[W319 09:55:58.427354717 utils.cpp:76] Warning: numa_migrate_pages failed. errno: 1 (function init_cpu_threads_env)
set_mempolicy: Operation not permitted
[W319 09:55:58.427423335 utils.cpp:100] Warning: numa_set_membind failed. errno: 1 (function init_cpu_threads_env)
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] OMP threads binding of Process 157:
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] 	OMP tid: 157, core 0
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] 	OMP tid: 243, core 1
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] 	OMP tid: 244, core 2
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] 	OMP tid: 245, core 3
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] 	OMP tid: 246, core 4
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] 	OMP tid: 247, core 5
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] 	OMP tid: 248, core 6
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] 	OMP tid: 249, core 7
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] 	OMP tid: 250, core 8
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] 	OMP tid: 251, core 9
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] 	OMP tid: 252, core 10
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] 	OMP tid: 253, core 11
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] 	OMP tid: 254, core 12
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] 	OMP tid: 255, core 13
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] 	OMP tid: 256, core 14
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] 	OMP tid: 257, core 15
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [cpu_worker.py:90] 
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [parallel_state.py:1393] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.3.4.82:57107 backend=gloo
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=157) INFO 03-19 09:55:58 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore_DP0 pid=157) INFO 03-19 09:56:11 [base.py:106] Offloader set to NoopOffloader
(EngineCore_DP0 pid=157) INFO 03-19 09:56:11 [cpu_model_runner.py:62] Starting to load model /mnt/models...
(EngineCore_DP0 pid=157) INFO 03-19 09:56:11 [interface.py:272] Using default backend AttentionBackendEnum.TORCH_SDPA for vit attention
(EngineCore_DP0 pid=157) INFO 03-19 09:56:11 [mm_encoder_attention.py:215] Using AttentionBackendEnum.TORCH_SDPA for MMEncoderAttention.
(EngineCore_DP0 pid=157) 
Loading safetensors checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
(EngineCore_DP0 pid=157) 
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:03<00:00,  3.09s/it]
(EngineCore_DP0 pid=157) 
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:03<00:00,  3.09s/it]
(EngineCore_DP0 pid=157) 
(EngineCore_DP0 pid=157) INFO 03-19 09:56:15 [default_loader.py:293] Loading weights took 3.20 seconds
(EngineCore_DP0 pid=157) WARNING 03-19 09:56:15 [utils.py:256] Failed to create oneDNN linear, fallback to torch linear. Exception: could not create a primitive descriptor for the matmul primitive. Run workload with environment variable ONEDNN_VERBOSE=all to get additional diagnostic information.
(EngineCore_DP0 pid=157) INFO 03-19 09:56:15 [kv_cache_utils.py:1314] GPU KV cache size: 87,040 tokens
(EngineCore_DP0 pid=157) INFO 03-19 09:56:15 [kv_cache_utils.py:1319] Maximum concurrency for 262,144 tokens per request: 1.32x
(EngineCore_DP0 pid=157) INFO 03-19 09:56:18 [cpu_model_runner.py:73] Warming up model for the compilation...
(EngineCore_DP0 pid=157) INFO 03-19 09:56:25 [cpu_model_runner.py:83] Warming up done.
(EngineCore_DP0 pid=157) INFO 03-19 09:56:25 [core.py:282] init engine (profile, create kv cache, warmup model) took 9.99 seconds
(EngineCore_DP0 pid=157) INFO 03-19 09:56:26 [vllm.py:747] Asynchronous scheduling is disabled.
(EngineCore_DP0 pid=157) WARNING 03-19 09:56:26 [vllm.py:781] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(EngineCore_DP0 pid=157) WARNING 03-19 09:56:26 [vllm.py:792] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(APIServer pid=1) INFO 03-19 09:56:26 [api_server.py:495] Supported tasks: ['generate']
(APIServer pid=1) INFO 03-19 09:56:27 [serving.py:185] Warming up chat template processing...
(APIServer pid=1) INFO 03-19 09:56:30 [hf.py:318] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=1) INFO 03-19 09:56:30 [serving.py:210] Chat template warmup completed in 3482.8ms
(APIServer pid=1) INFO 03-19 09:56:30 [api_server.py:500] Starting vLLM API server 0 on http://0.0.0.0:8080
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:38] Available routes are:
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /docs, Methods: GET, HEAD
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /redoc, Methods: GET, HEAD
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /tokenize, Methods: POST
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /detokenize, Methods: POST
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /load, Methods: GET
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /version, Methods: GET
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /health, Methods: GET
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /metrics, Methods: GET
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/models, Methods: GET
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /ping, Methods: GET
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /ping, Methods: POST
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /invocations, Methods: POST
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/chat/completions, Methods: POST
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/responses, Methods: POST
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/completions, Methods: POST
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/completions/render, Methods: POST
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/messages, Methods: POST
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1) INFO 03-19 09:56:30 [launcher.py:47] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1) INFO:     Started server process [1]
(APIServer pid=1) INFO:     Waiting for application startup.
(APIServer pid=1) INFO:     Application startup complete.
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [dump_input.py:72] Dumping input data for V1 LLM engine (v0.17.1) with config: model='/mnt/models', speculative_config=None, tokenizer='/mnt/models', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cpu, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=qwen35-08b, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'compile_sizes': None, 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 
'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': None, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}, 
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [dump_input.py:79] Dumping scheduler output for model execution: SchedulerOutput(scheduled_new_reqs=[NewRequestData(req_id=cmpl-a13cefe28c6051ba-0-8061de83,prompt_token_ids_len=1,prefill_token_ids_len=None,mm_features=[],sampling_params=SamplingParams(n=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=1.0, top_p=1.0, top_k=0, min_p=0.0, seed=None, stop=[], stop_token_ids=[248044], bad_words=[], include_stop_str_in_output=False, ignore_eos=False, max_tokens=16, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, structured_outputs=None, extra_args=None),block_ids=([1], [2], [3], [4]),num_computed_tokens=0,lora_request=None,prompt_embeds_shape=None)], scheduled_cached_reqs=CachedRequestData(req_ids=[],resumed_req_ids=set(),new_token_ids_lens=[],all_token_ids_lens={},new_block_ids=[],num_computed_tokens=[],num_output_tokens=[]), num_scheduled_tokens={cmpl-a13cefe28c6051ba-0-8061de83: 1}, total_num_scheduled_tokens=1, scheduled_spec_decode_tokens={}, scheduled_encoder_inputs={}, num_common_prefix_blocks=[0, 0, 0, 0], finished_req_ids=[], free_encoder_mm_hashes=[], preempted_req_ids=[], has_structured_output_requests=false, pending_structured_output_tokens=false, num_invalid_spec_tokens=null, kv_connector_metadata=null, ec_connector_metadata=null, new_block_ids_to_zero=[4])
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [dump_input.py:81] Dumping scheduler stats: SchedulerStats(num_running_reqs=1, num_waiting_reqs=0, step_counter=0, current_wave=0, kv_cache_usage=0.006240249609984372, encoder_cache_usage=0.0, prefix_cache_stats=PrefixCacheStats(reset=False, requests=0, queries=0, hits=0, preempted_requests=0, preempted_queries=0, preempted_hits=0), connector_prefix_cache_stats=None, kv_cache_eviction_events=[], spec_decoding_stats=None, kv_connector_stats=None, waiting_lora_adapters={}, running_lora_adapters={}, cudagraph_stats=None, perf_stats=None)
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] EngineCore encountered a fatal error.
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] Traceback (most recent call last):
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1093, in run_engine_core
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     engine_core.run_busy_loop()
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1128, in run_busy_loop
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     self._process_engine_step()
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1165, in _process_engine_step
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 397, in step
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     model_output = future.result()
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]                    ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     return self.__get_result()
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     raise self._exception
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 80, in collective_rpc
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     return func(*args, **kwargs)
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/worker_base.py", line 365, in execute_model
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     return self.worker.execute_model(scheduler_output)
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     return func(*args, **kwargs)
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 728, in execute_model
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     output = self.model_runner.execute_model(
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     return func(*args, **kwargs)
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3436, in execute_model
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     self._update_states(scheduler_output)
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 978, in _update_states
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     self._zero_block_ids(scheduler_output.new_block_ids_to_zero)
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 940, in _zero_block_ids
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     self._kv_block_zeroer.zero_block_ids(block_ids)
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/utils.py", line 210, in zero_block_ids
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     _zero_kv_blocks_kernel[grid](
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102]     ~~~~~~~~~~~~~~~~~~~~~~^^^^^^
(EngineCore_DP0 pid=157) ERROR 03-19 09:56:41 [core.py:1102] TypeError: 'function' object is not subscriptable
(EngineCore_DP0 pid=157) Process EngineCore_DP0:
(EngineCore_DP0 pid=157) Traceback (most recent call last):
(EngineCore_DP0 pid=157)   File "/opt/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=157)     self.run()
(EngineCore_DP0 pid=157)   File "/opt/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=157)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1104, in run_engine_core
(EngineCore_DP0 pid=157)     raise e
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1093, in run_engine_core
(EngineCore_DP0 pid=157)     engine_core.run_busy_loop()
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1128, in run_busy_loop
(EngineCore_DP0 pid=157)     self._process_engine_step()
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1165, in _process_engine_step
(EngineCore_DP0 pid=157)     outputs, model_executed = self.step_fn()
(EngineCore_DP0 pid=157)                               ^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 397, in step
(EngineCore_DP0 pid=157)     model_output = future.result()
(EngineCore_DP0 pid=157)                    ^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157)   File "/opt/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py", line 449, in result
(EngineCore_DP0 pid=157)     return self.__get_result()
(EngineCore_DP0 pid=157)            ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157)   File "/opt/uv/python/cpython-3.12.13-linux-x86_64-gnu/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
(EngineCore_DP0 pid=157)     raise self._exception
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/executor/uniproc_executor.py", line 80, in collective_rpc
(EngineCore_DP0 pid=157)     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore_DP0 pid=157)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore_DP0 pid=157)     return func(*args, **kwargs)
(EngineCore_DP0 pid=157)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/worker_base.py", line 365, in execute_model
(EngineCore_DP0 pid=157)     return self.worker.execute_model(scheduler_output)
(EngineCore_DP0 pid=157)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=157)     return func(*args, **kwargs)
(EngineCore_DP0 pid=157)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_worker.py", line 728, in execute_model
(EngineCore_DP0 pid=157)     output = self.model_runner.execute_model(
(EngineCore_DP0 pid=157)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(EngineCore_DP0 pid=157)     return func(*args, **kwargs)
(EngineCore_DP0 pid=157)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 3436, in execute_model
(EngineCore_DP0 pid=157)     self._update_states(scheduler_output)
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 978, in _update_states
(EngineCore_DP0 pid=157)     self._zero_block_ids(scheduler_output.new_block_ids_to_zero)
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/gpu_model_runner.py", line 940, in _zero_block_ids
(EngineCore_DP0 pid=157)     self._kv_block_zeroer.zero_block_ids(block_ids)
(EngineCore_DP0 pid=157)   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/worker/utils.py", line 210, in zero_block_ids
(EngineCore_DP0 pid=157)     _zero_kv_blocks_kernel[grid](
(EngineCore_DP0 pid=157)     ~~~~~~~~~~~~~~~~~~~~~~^^^^^^
(EngineCore_DP0 pid=157) TypeError: 'function' object is not subscriptable
(APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708] AsyncLLM output_handler failed.
(APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708] Traceback (most recent call last):
(APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/async_llm.py", line 664, in output_handler
(APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708]     outputs = await engine_core.get_output_async()
(APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708]               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708]   File "/opt/venv/lib/python3.12/site-packages/vllm/v1/engine/core_client.py", line 1009, in get_output_async
(APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708]     raise self._format_exception(outputs) from None
(APIServer pid=1) ERROR 03-19 09:56:41 [async_llm.py:708] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue. See stack trace (above) for the root cause.
(APIServer pid=1) INFO:     127.0.0.1:47458 - "POST /v1/completions HTTP/1.1" 500 Internal Server Error
(APIServer pid=1) INFO:     Shutting down
(APIServer pid=1) INFO:     Waiting for application shutdown.
(APIServer pid=1) INFO:     Application shutdown complete.
(APIServer pid=1) INFO:     Finished server process [1]


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

Your output of `python collect_env.py` here

</details>

🐛 Describe the bug

Environment

vLLM version: 0.17.1
Python: 3.12
PyTorch:
Device: CPU (device_config=cpu)
Model: Qwen3_5ForConditionalGeneration (Qwen3.5-VL-0.8B)
OS: Linux (Kubernetes pod, no GPU)

Description

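When running vLLM with the CPU backend, the engine crashes on the first inference request with `TypeError: 'function' object is not subscriptable`.

The server starts successfully and completes warmup, but dies as soon as a request triggers zeroing of newly allocated KV cache blocks (new_block_ids_to_zero in the scheduler output).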


Steps to Reproduce

python3 -m vllm.entrypoints.openai.api_server \
  --model /path/to/qwen3.5-vl \
  --dtype half \
  --enforce-eager \
  --no-enable-prefix-caching

Then send any completion request:
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "my-model", "prompt": "hello", "max_tokens": 16}'

Actual Behavior

(EngineCore_DP0) ERROR [core.py:1102]   File ".../vllm/v1/worker/gpu_model_runner.py", line 978, in _update_states
(EngineCore_DP0) ERROR [core.py:1102]     self._zero_block_ids(scheduler_output.new_block_ids_to_zero)
(EngineCore_DP0) ERROR [core.py:1102]   File ".../vllm/v1/worker/gpu_model_runner.py", line 940, in _zero_block_ids
(EngineCore_DP0) ERROR [core.py:1102]     self._kv_block_zeroer.zero_block_ids(block_ids)
(EngineCore_DP0) ERROR [core.py:1102]   File ".../vllm/v1/worker/utils.py", line 210, in zero_block_ids
(EngineCore_DP0) ERROR [core.py:1102]     _zero_kv_blocks_kernel[grid](
(EngineCore_DP0) ERROR [core.py:1102]     ~~~~~~~~~~~~~~~~~~~~~~^^^^^^
(EngineCore_DP0) ERROR [core.py:1102] TypeError: 'function' object is not subscriptable

Expected Behavior

CPU backend should handle KV cache block zeroing without calling a Triton GPU kernel.

Root Cause Analysis

Two issues combine to cause this:

1. CPUModelRunner inherits _zero_block_ids from GPUModelRunner without overriding it:

The CPU worker delegates to gpu_worker.py / gpu_model_runner.py for model execution. _zero_block_ids is only implemented in GPUModelRunner using a Triton kernel (_zero_kv_blocks_kernel in utils.py), and CPUModelRunner does not override it with a CPU-safe fallback.

2. Triton is disabled on CPU nodes:

Triton is installed but 0 active driver(s) found (expected 1). Disabling Triton.

When Triton is disabled, @triton.jit becomes a no-op decorator returning a plain Python function.
Calling _zero_kv_blocks_kernel[grid](...) on a plain function fails because regular functions do not
implement __getitem__.

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.

extent analysis

Fix Plan

To fix the issue, we need to modify the CPUModelRunner class to override the _zero_block_ids method with a CPU-safe implementation.

Here are the steps:

1. Locate the CPUModelRunner class (vllm/v1/worker/cpu_model_runner.py, the file modified by the linked pull request).
2. Override the _zero_block_ids method in CPUModelRunner to provide a CPU-safe implementation.
3. Ensure the new implementation does not rely on Triton GPU kernels.

Example code:

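import torch

class CPUModelRunner(GPUModelRunner):
    # ...

    def _zero_block_ids(self, block_ids):
        # CPU-safe sketch: zero the selected KV cache blocks with plain
        # PyTorch indexing instead of launching the Triton kernel.
        # Assumption: self.kv_caches holds tensors whose first dimension
        # indexes blocks; adapt the indexing to the real cache layout.
        idx = torch.as_tensor(block_ids, dtype=torch.long)
        for kv_cache in self.kv_caches:
            kv_cache[idx] = 0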

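Alternatively, if block zeroing is not required for CPU execution, the override can simply return without performing any action. Raising NotImplementedError is not a fix here, because the method is called on the request path and the exception would only replace one crash with another.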

Verification

To verify the fix, run the same command to start the API server and send a completion request:

python3 -m vllm.entrypoints.openai.api_server \
    --model /path/to/qwen3.5-vl \
    --served-model-name my-model \
    --port 8080 \
    --dtype half \
    --enforce-eager \
    --no-enable-prefix-caching

curl -X POST http://localhost:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "my-model", "prompt": "hello", "max_tokens": 16}'

The API server should now handle the completion request without crashing due to the TypeError.
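For scripted checks, the same request can be built in Python with the standard library only. This is a sketch under assumptions: `build_completion_request` and `send_request` are illustrative helper names, and the URL, model name, and prompt mirror the curl example above.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=16):
    """Build a POST request for the /v1/completions endpoint."""
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_request(request, timeout=60):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())


req = build_completion_request("http://localhost:8080", "my-model", "hello")
print(req.get_method(), req.full_url)
```

With the server running, `send_request(req)` should return a JSON object instead of the connection dropping when the engine core dies.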

Extra Tips

When running GPU-oriented code on CPU-only nodes, make sure every GPU-specific code path is either disabled or overridden with a CPU-safe implementation.
Check the startup log for warnings such as "Triton is installed but 0 active driver(s) found"; they indicate that Triton kernels cannot be launched on that node.
Consider adding unit or integration tests that exercise the CPU backend with Triton disabled, so that regressions like this one are caught before release.
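One way to combine the first and last tip is a guard that only uses the kernel[grid](...) path when the object supports it, plus a small check that the fallback runs when it does not. The sketch below is a toy model, not vLLM's implementation: `supports_launch_syntax`, `zero_blocks`, and the bytearray "blocks" are hypothetical stand-ins for the real kernel object and KV cache.

```python
def supports_launch_syntax(kernel):
    """True if kernel[grid] is supported, i.e. its type defines __getitem__."""
    return hasattr(type(kernel), "__getitem__")


def zero_blocks(blocks, block_ids, kernel=None):
    """Zero the selected blocks, using the kernel only when it is launchable.

    blocks is a list of bytearray objects, a toy stand-in for KV cache blocks.
    Returns the name of the path taken so that a test can assert on it.
    """
    if kernel is not None and supports_launch_syntax(kernel):
        kernel[(len(block_ids),)](blocks, block_ids)
        return "kernel"
    # Fallback: overwrite each selected block in place with zero bytes.
    for block_id in block_ids:
        blocks[block_id][:] = bytes(len(blocks[block_id]))
    return "fallback"


def plain_function(blocks, block_ids):
    """What a disabled JIT decorator leaves behind: an ordinary function."""


blocks = [bytearray(b"\x01" * 4) for _ in range(3)]
path = zero_blocks(blocks, [0, 2], kernel=plain_function)
print(path, [bytes(block) for block in blocks])
```

A regression test for the real code base would follow the same shape: simulate a disabled Triton, call the zeroing entry point, and assert both that no exception is raised and that only the requested blocks were cleared.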


