vllm - ✅(Solved) Fix [Bug]: ImportError: flash_attn.ops.triton.rotary not found on older versions (< v2.1.2) [1 pull requests, 1 participants]

GitHub stats
vllm-project/vllm#38056
Fetched 2026-04-08 01:26:53
Comments: 0
Participants: 1
Timeline: 2
Reactions: 0
Timeline (top): cross-referenced ×1, labeled ×1

Error Message

(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] EngineCore failed to start.
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] Traceback (most recent call last):
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1090, in run_engine_core
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 834, in __init__
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]     super().__init__(
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 110, in __init__
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]     self._init_executor()
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 49, in _init_executor
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]     self.driver_worker.load_model()
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 337, in load_model
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]     self.model_runner.load_model(load_dummy_weights=dummy_weights)
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4297, in load_model
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]     self.model = model_loader.load_model(
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/model_loader/base_loader.py", line 54, in load_model
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]     model = initialize_model(
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/model_loader/utils.py", line 56, in initialize_model
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]     model = model_class(vllm_config=vllm_config, prefix=prefix)

Root Cause

The engine core crashes while the model is being initialized. ApplyRotaryEmb.__init__ (vllm/model_executor/layers/rotary_embedding/common.py, line 138) runs "from flash_attn.ops.triton.rotary import apply_rotary", and the installed flash_attn package has no flash_attn.ops.triton module, so the import raises ModuleNotFoundError. Per the issue title, this affects flash_attn versions older than v2.1.2. The full startup log follows.

vllm serve /home/xiongjie/run_vlm/Qwen/Qwen3.5-4B --port 19995
(APIServer pid=19) INFO 03-25 01:54:36 [utils.py:302]
(APIServer pid=19) INFO 03-25 01:54:36 [utils.py:302] █ █ █▄ ▄█
(APIServer pid=19) INFO 03-25 01:54:36 [utils.py:302] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.17.1
(APIServer pid=19) INFO 03-25 01:54:36 [utils.py:302] █▄█▀ █ █ █ █ model /home/xiongjie/run_vlm/Qwen/Qwen3.5-4B
(APIServer pid=19) INFO 03-25 01:54:36 [utils.py:302] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=19) INFO 03-25 01:54:36 [utils.py:302]
(APIServer pid=19) INFO 03-25 01:54:36 [utils.py:238] non-default args: {'model_tag': '/home/xiongjie/run_vlm/Qwen/Qwen3.5-4B', 'port': 19995, 'model': '/home/xiongjie/run_vlm/Qwen/Qwen3.5-4B'}
(APIServer pid=19) Unrecognized keys in rope_parameters for 'rope_type'='default': {'mrope_section', 'mrope_interleaved'}
(APIServer pid=19) Unrecognized keys in rope_parameters for 'rope_type'='default': {'mrope_section', 'mrope_interleaved'}
(APIServer pid=19) INFO 03-25 01:54:43 [model.py:531] Resolved architecture: Qwen3_5ForConditionalGeneration
(APIServer pid=19) INFO 03-25 01:54:43 [model.py:1554] Using max model len 262144
(APIServer pid=19) INFO 03-25 01:54:43 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=19) INFO 03-25 01:54:43 [config.py:544] Setting attention block size to 528 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=19) INFO 03-25 01:54:43 [config.py:575] Padding mamba page size by 0.76% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=19) INFO 03-25 01:54:43 [vllm.py:747] Asynchronous scheduling is enabled.
(APIServer pid=19) Unrecognized keys in rope_parameters for 'rope_type'='default': {'mrope_section', 'mrope_interleaved'}
(APIServer pid=19) Unrecognized keys in rope_parameters for 'rope_type'='default': {'mrope_section', 'mrope_interleaved'}
(EngineCore_DP0 pid=298) INFO 03-25 01:55:00 [core.py:101] Initializing a V1 LLM engine (v0.17.1) with config: model='/home/xiongjie/run_vlm/Qwen/Qwen3.5-4B', speculative_config=None, tokenizer='/home/xiongjie/run_vlm/Qwen/Qwen3.5-4B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/home/xiongjie/run_vlm/Qwen/Qwen3.5-4B, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore_DP0 pid=298) INFO 03-25 01:55:02 [parallel_state.py:1393] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.120.84.11:36731 backend=nccl
(EngineCore_DP0 pid=298) INFO 03-25 01:55:02 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore_DP0 pid=298) INFO 03-25 01:55:10 [base.py:106] Offloader set to NoopOffloader
(EngineCore_DP0 pid=298) INFO 03-25 01:55:10 [gpu_model_runner.py:4281] Starting to load model /home/xiongjie/run_vlm/Qwen/Qwen3.5-4B...
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] EngineCore failed to start.
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] Traceback (most recent call last):
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1090, in run_engine_core
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 834, in __init__
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]     super().__init__(
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 110, in __init__
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]     self._init_executor()
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 49, in _init_executor
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]     self.driver_worker.load_model()
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 337, in load_model
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]     self.model_runner.load_model(load_dummy_weights=dummy_weights)
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4297, in load_model
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]     self.model = model_loader.load_model(
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/model_loader/base_loader.py", line 54, in load_model
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]     model = initialize_model(
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]     return func(*args, **kwargs)
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/model_loader/utils.py", line 56, in initialize_model
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]     model = model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/models/qwen3_5.py", line 653, in __init__
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]     self.visual = Qwen3_VisionTransformer(
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/models/qwen3_vl.py", line 366, in __init__
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]     self.rotary_pos_emb = get_rope(
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/rotary_embedding/__init__.py", line 129, in get_rope
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]     rotary_emb = RotaryEmbedding(
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/rotary_embedding/base.py", line 129, in __init__
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]     super().__init__(
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/rotary_embedding/base.py", line 65, in __init__
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]     self.apply_rotary_emb = ApplyRotaryEmb(
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]   File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/rotary_embedding/common.py", line 138, in __init__
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100]     from flash_attn.ops.triton.rotary import apply_rotary
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] ModuleNotFoundError: No module named 'flash_attn.ops.triton'
(EngineCore_DP0 pid=298) Process EngineCore_DP0:
(EngineCore_DP0 pid=298) Traceback (most recent call last):
(EngineCore_DP0 pid=298)   File "/root/miniconda3/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore_DP0 pid=298)     self.run()
(EngineCore_DP0 pid=298)   File "/root/miniconda3/lib/python3.10/multiprocessing/process.py", line 108, in run
(EngineCore_DP0 pid=298)     self._target(*self._args, **self._kwargs)
(EngineCore_DP0 pid=298)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1104, in run_engine_core
(EngineCore_DP0 pid=298)     raise e
(EngineCore_DP0 pid=298)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1090, in run_engine_core
(EngineCore_DP0 pid=298)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore_DP0 pid=298)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=298)     return func(*args, **kwargs)
(EngineCore_DP0 pid=298)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 834, in __init__
(EngineCore_DP0 pid=298)     super().__init__(
(EngineCore_DP0 pid=298)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 110, in __init__
(EngineCore_DP0 pid=298)     self.model_executor = executor_class(vllm_config)
(EngineCore_DP0 pid=298)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=298)     return func(*args, **kwargs)
(EngineCore_DP0 pid=298)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 103, in __init__
(EngineCore_DP0 pid=298)     self._init_executor()
(EngineCore_DP0 pid=298)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 49, in _init_executor
(EngineCore_DP0 pid=298)     self.driver_worker.load_model()
(EngineCore_DP0 pid=298)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 337, in load_model
(EngineCore_DP0 pid=298)     self.model_runner.load_model(load_dummy_weights=dummy_weights)
(EngineCore_DP0 pid=298)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=298)     return func(*args, **kwargs)
(EngineCore_DP0 pid=298)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4297, in load_model
(EngineCore_DP0 pid=298)     self.model = model_loader.load_model(
(EngineCore_DP0 pid=298)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=298)     return func(*args, **kwargs)
(EngineCore_DP0 pid=298)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/model_loader/base_loader.py", line 54, in load_model
(EngineCore_DP0 pid=298)     model = initialize_model(
(EngineCore_DP0 pid=298)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore_DP0 pid=298)     return func(*args, **kwargs)
(EngineCore_DP0 pid=298)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/model_loader/utils.py", line 56, in initialize_model
(EngineCore_DP0 pid=298)     model = model_class(vllm_config=vllm_config, prefix=prefix)
(EngineCore_DP0 pid=298)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/models/qwen3_5.py", line 653, in __init__
(EngineCore_DP0 pid=298)     self.visual = Qwen3_VisionTransformer(
(EngineCore_DP0 pid=298)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/models/qwen3_vl.py", line 366, in __init__
(EngineCore_DP0 pid=298)     self.rotary_pos_emb = get_rope(
(EngineCore_DP0 pid=298)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/rotary_embedding/__init__.py", line 129, in get_rope
(EngineCore_DP0 pid=298)     rotary_emb = RotaryEmbedding(
(EngineCore_DP0 pid=298)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/rotary_embedding/base.py", line 129, in __init__
(EngineCore_DP0 pid=298)     super().__init__(
(EngineCore_DP0 pid=298)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/rotary_embedding/base.py", line 65, in __init__
(EngineCore_DP0 pid=298)     self.apply_rotary_emb = ApplyRotaryEmb(
(EngineCore_DP0 pid=298)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/rotary_embedding/common.py", line 138, in __init__
(EngineCore_DP0 pid=298)     from flash_attn.ops.triton.rotary import apply_rotary
(EngineCore_DP0 pid=298) ModuleNotFoundError: No module named 'flash_attn.ops.triton'
[rank0]:[W325 01:55:12.578465682 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=19) Traceback (most recent call last):
(APIServer pid=19)   File "/root/miniconda3/bin/vllm", line 8, in <module>
(APIServer pid=19)     sys.exit(main())
(APIServer pid=19)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/cli/main.py", line 73, in main
(APIServer pid=19)     args.dispatch_function(args)
(APIServer pid=19)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/cli/serve.py", line 112, in cmd
(APIServer pid=19)     uvloop.run(run_server(args))
(APIServer pid=19)   File "/root/miniconda3/lib/python3.10/site-packages/uvloop/__init__.py", line 82, in run
(APIServer pid=19)     return loop.run_until_complete(wrapper())
(APIServer pid=19)   File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
(APIServer pid=19)   File "/root/miniconda3/lib/python3.10/site-packages/uvloop/__init__.py", line 61, in wrapper
(APIServer pid=19)     return await main
(APIServer pid=19)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 471, in run_server
(APIServer pid=19)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=19)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 490, in run_server_worker
(APIServer pid=19)     async with build_async_engine_client(
(APIServer pid=19)   File "/root/miniconda3/lib/python3.10/contextlib.py", line 199, in __aenter__
(APIServer pid=19)     return await anext(self.gen)
(APIServer pid=19)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 96, in build_async_engine_client
(APIServer pid=19)     async with build_async_engine_client_from_engine_args(
(APIServer pid=19)   File "/root/miniconda3/lib/python3.10/contextlib.py", line 199, in __aenter__
(APIServer pid=19)     return await anext(self.gen)
(APIServer pid=19)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 137, in build_async_engine_client_from_engine_args
(APIServer pid=19)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=19)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=19)     return cls(
(APIServer pid=19)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=19)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=19)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=19)     return func(*args, **kwargs)
(APIServer pid=19)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 127, in make_async_mp_client
(APIServer pid=19)     return AsyncMPClient(*client_args)
(APIServer pid=19)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=19)     return func(*args, **kwargs)
(APIServer pid=19)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 911, in __init__
(APIServer pid=19)     super().__init__(
(APIServer pid=19)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 569, in __init__
(APIServer pid=19)     with launch_core_engines(
(APIServer pid=19)   File "/root/miniconda3/lib/python3.10/contextlib.py", line 142, in __exit__
(APIServer pid=19)     next(self.gen)
(APIServer pid=19)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 951, in launch_core_engines
(APIServer pid=19)     wait_for_engine_startup(
(APIServer pid=19)   File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 1010, in wait_for_engine_startup
(APIServer pid=19)     raise RuntimeError(
(APIServer pid=19) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
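
To tell whether an environment is likely affected, the installed flash_attn version can be compared with the v2.1.2 threshold named in the issue title. The following is a minimal sketch, not code from vLLM or flash_attn: parse_version and flash_attn_status are hypothetical helpers, and the distribution name "flash_attn" is an assumption about how the package was installed.

```python
from importlib import metadata

# Threshold taken from the issue title ("older versions (< v2.1.2)").
MIN_VERSION = (2, 1, 2)


def parse_version(text):
    """Turn '2.1.2.post3' into (2, 1, 2); non-numeric suffixes are ignored."""
    parts = []
    for piece in text.split(".")[:3]:
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits) if digits else 0)
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def flash_attn_status(dist_name="flash_attn"):
    """Describe the installed distribution relative to MIN_VERSION.

    dist_name is an assumption; pass the name your installer registered.
    """
    try:
        installed = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return "not installed"
    if parse_version(installed) < MIN_VERSION:
        return f"{installed} is older than 2.1.2"
    return f"{installed} is 2.1.2 or newer"


if __name__ == "__main__":
    print("flash_attn:", flash_attn_status())
```

A result of "older than 2.1.2" matches the situation in the log above, where flash_attn is importable but flash_attn.ops.triton is missing.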

Fix Action

Fix / Workaround
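
The linked pull request is not reproduced on this page, so the following is only a sketch of one way a guarded import could keep a missing optional module from crashing startup. optional_import is a hypothetical helper, not part of vLLM, and what a caller should fall back to when it returns None is left open here.

```python
import importlib


def optional_import(module_name, attr):
    """Return module_name.attr, or None if the module or attribute is missing.

    ModuleNotFoundError is a subclass of ImportError, so a missing
    subpackage such as 'flash_attn.ops.triton' is caught here as well.
    """
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, attr, None)


if __name__ == "__main__":
    # The module path and name come from the traceback in this issue.
    apply_rotary = optional_import("flash_attn.ops.triton.rotary", "apply_rotary")
    if apply_rotary is None:
        print("Triton rotary kernel unavailable; a fallback path would be needed")
    else:
        print("Triton rotary kernel available")
```

Upgrading flash_attn to v2.1.2 or later, the boundary given in the issue title, would be the other route for an affected environment.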

==============================
CPU Info

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 52 bits physical, 57 bits virtual
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 106
Model name: Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz
Stepping: 6
CPU MHz: 2900.000
CPU max MHz: 3500.0000
CPU min MHz: 800.0000
BogoMIPS: 5800.00
Virtualization: VT-x
L1d cache: 1.5 MiB
L1i cache: 1 MiB
L2 cache: 40 MiB
L3 cache: 48 MiB
NUMA node0 CPU(s): 0-15,32-47
NUMA node1 CPU(s): 16-31,48-63
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI Syscall hardening, KVM SW loop
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid fsrm md_clear pconfig flush_l1d arch_capabilities

"/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/models/qwen3_5.py", line 653, in init (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] self.visual = Qwen3_VisionTransformer( (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/models/qwen3_vl.py", line 366, in init (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] self.rotary_pos_emb = get_rope( (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/rotary_embedding/init.py", line 129, in get_rope (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] rotary_emb = RotaryEmbedding( (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/rotary_embedding/base.py", line 129, in init (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] super().init( (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/rotary_embedding/base.py", line 65, in init (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] self.apply_rotary_emb = ApplyRotaryEmb( (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/rotary_embedding/common.py", line 138, in init (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] from flash_attn.ops.triton.rotary import apply_rotary (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] ModuleNotFoundError: No module named 'flash_attn.ops.triton' (EngineCore_DP0 pid=298) Process EngineCore_DP0: (EngineCore_DP0 pid=298) Traceback (most recent call last): (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap (EngineCore_DP0 pid=298) self.run() (EngineCore_DP0 
pid=298) File "/root/miniconda3/lib/python3.10/multiprocessing/process.py", line 108, in run (EngineCore_DP0 pid=298) self._target(*self._args, **self._kwargs) (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1104, in run_engine_core (EngineCore_DP0 pid=298) raise e (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1090, in run_engine_core (EngineCore_DP0 pid=298) engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs) (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore_DP0 pid=298) return func(*args, **kwargs) (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 834, in init (EngineCore_DP0 pid=298) super().init( (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 110, in init (EngineCore_DP0 pid=298) self.model_executor = executor_class(vllm_config) (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore_DP0 pid=298) return func(*args, **kwargs) (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 103, in init (EngineCore_DP0 pid=298) self._init_executor() (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 49, in _init_executor (EngineCore_DP0 pid=298) self.driver_worker.load_model() (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 337, in load_model (EngineCore_DP0 pid=298) self.model_runner.load_model(load_dummy_weights=dummy_weights) (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore_DP0 
pid=298) return func(*args, **kwargs) (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4297, in load_model (EngineCore_DP0 pid=298) self.model = model_loader.load_model( (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore_DP0 pid=298) return func(*args, **kwargs) (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/model_loader/base_loader.py", line 54, in load_model (EngineCore_DP0 pid=298) model = initialize_model( (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore_DP0 pid=298) return func(*args, **kwargs) (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/model_loader/utils.py", line 56, in initialize_model (EngineCore_DP0 pid=298) model = model_class(vllm_config=vllm_config, prefix=prefix) (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/models/qwen3_5.py", line 653, in init (EngineCore_DP0 pid=298) self.visual = Qwen3_VisionTransformer( (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/models/qwen3_vl.py", line 366, in init (EngineCore_DP0 pid=298) self.rotary_pos_emb = get_rope( (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/rotary_embedding/init.py", line 129, in get_rope (EngineCore_DP0 pid=298) rotary_emb = RotaryEmbedding( (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/rotary_embedding/base.py", line 129, in init (EngineCore_DP0 pid=298) super().init( (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/rotary_embedding/base.py", line 65, in init (EngineCore_DP0 pid=298) 
self.apply_rotary_emb = ApplyRotaryEmb( (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/rotary_embedding/common.py", line 138, in init (EngineCore_DP0 pid=298) from flash_attn.ops.triton.rotary import apply_rotary (EngineCore_DP0 pid=298) ModuleNotFoundError: No module named 'flash_attn.ops.triton' [rank0]:[W325 01:55:12.578465682 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) (APIServer pid=19) Traceback (most recent call last): (APIServer pid=19) File "/root/miniconda3/bin/vllm", line 8, in <module> (APIServer pid=19) sys.exit(main()) (APIServer pid=19) File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/cli/main.py", line 73, in main (APIServer pid=19) args.dispatch_function(args) (APIServer pid=19) File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/cli/serve.py", line 112, in cmd (APIServer pid=19) uvloop.run(run_server(args)) (APIServer pid=19) File "/root/miniconda3/lib/python3.10/site-packages/uvloop/init.py", line 82, in run (APIServer pid=19) return loop.run_until_complete(wrapper()) (APIServer pid=19) File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete (APIServer pid=19) File "/root/miniconda3/lib/python3.10/site-packages/uvloop/init.py", line 61, in wrapper (APIServer pid=19) return await main (APIServer pid=19) File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 471, in run_server (APIServer pid=19) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs) (APIServer pid=19) File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 490, in run_server_worker (APIServer pid=19) async with build_async_engine_client( (APIServer pid=19) File 
"/root/miniconda3/lib/python3.10/contextlib.py", line 199, in aenter (APIServer pid=19) return await anext(self.gen) (APIServer pid=19) File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 96, in build_async_engine_client (APIServer pid=19) async with build_async_engine_client_from_engine_args( (APIServer pid=19) File "/root/miniconda3/lib/python3.10/contextlib.py", line 199, in aenter (APIServer pid=19) return await anext(self.gen) (APIServer pid=19) File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 137, in build_async_engine_client_from_engine_args (APIServer pid=19) async_llm = AsyncLLM.from_vllm_config( (APIServer pid=19) File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config (APIServer pid=19) return cls( (APIServer pid=19) File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 154, in init (APIServer pid=19) self.engine_core = EngineCoreClient.make_async_mp_client( (APIServer pid=19) File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (APIServer pid=19) return func(*args, **kwargs) (APIServer pid=19) File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 127, in make_async_mp_client (APIServer pid=19) return AsyncMPClient(*client_args) (APIServer pid=19) File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (APIServer pid=19) return func(*args, **kwargs) (APIServer pid=19) File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 911, in init (APIServer pid=19) super().init( (APIServer pid=19) File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 569, in init (APIServer pid=19) with launch_core_engines( (APIServer pid=19) File "/root/miniconda3/lib/python3.10/contextlib.py", line 142, in 
exit (APIServer pid=19) next(self.gen) (APIServer pid=19) File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 951, in launch_core_engines (APIServer pid=19) wait_for_engine_startup( (APIServer pid=19) File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 1010, in wait_for_engine_startup (APIServer pid=19) raise RuntimeError( (APIServer pid=19) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

PR fix notes

PR #38091: [Bugfix] Fix ImportError for flash_attn < v2.1.2 missing triton rotary module

Description (problem / solution / changelog)

Summary

Fix ImportError / ModuleNotFoundError when flash_attn is installed but the flash_attn.ops.triton.rotary submodule is not available (e.g., flash_attn < v2.1.2 or incomplete CUDA build).
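As a rough diagnostic for the version condition mentioned above, the installed flash_attn version can be read from package metadata and compared against 2.1.2. This is a sketch only: the helper names are hypothetical, the distribution names tried are an assumption, and the threshold is the one quoted in the summary. A version check is a hint, not proof, because the summary notes that an incomplete CUDA build can trigger the same failure on any version.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

# Threshold quoted in the PR summary; treat it as approximate.
MIN_VERSION_WITH_TRITON_ROTARY = (2, 1, 2)


def parse_release(version: str) -> Tuple[int, ...]:
    """Return the leading numeric components, e.g. '2.5.9.post1' -> (2, 5, 9)."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def installed_flash_attn_version() -> Optional[str]:
    """Return the installed flash_attn version string, or None if it is absent."""
    # Both spellings are tried; which one the metadata uses is an assumption.
    for dist_name in ("flash_attn", "flash-attn"):
        try:
            return metadata.version(dist_name)
        except metadata.PackageNotFoundError:
            continue
    return None


def likely_missing_triton_rotary(version: Optional[str]) -> bool:
    """True if a flash_attn version is installed and older than the threshold."""
    if version is None:
        return False  # not installed at all: a different situation
    return parse_release(version) < MIN_VERSION_WITH_TRITON_ROTARY


if __name__ == "__main__":
    found = installed_flash_attn_version()
    print("flash_attn version:", found)
    print("likely affected:", likely_missing_triton_rotary(found))
```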

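Root cause: The old code uses find_spec("flash_attn") to check whether the package exists, then unconditionally imports flash_attn.ops.triton.rotary. find_spec only confirms that the top-level package can be located; it says nothing about the submodule. When flash_attn is installed but the submodule is missing or cannot be imported, the import raises ModuleNotFoundError (a subclass of ImportError) and the engine fails to start.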
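The gap between "package found" and "submodule importable" can be reproduced with the standard library alone. In the sketch below, `json` stands in for an installed package and `json.no_such_submodule` is a deliberately made-up name; neither comes from the vLLM code.

```python
import importlib
from importlib.util import find_spec

# "json" stands in for an installed package; the submodule name is made up.
PACKAGE = "json"
SUBMODULE = "json.no_such_submodule"

# Step 1: the check the old code relied on. It succeeds for the package.
package_found = find_spec(PACKAGE) is not None

# Step 2: the import that follows. It fails because the submodule is absent.
error_type = None
try:
    importlib.import_module(SUBMODULE)
    submodule_imported = True
except ImportError as exc:  # ModuleNotFoundError is a subclass of ImportError
    submodule_imported = False
    error_type = type(exc).__name__

print(f"find_spec({PACKAGE!r}) found the package: {package_found}")
print(f"import of {SUBMODULE!r} succeeded: {submodule_imported} ({error_type})")
```

A positive find_spec result on the parent package is therefore not a safe guard for importing one of its submodules.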

Fix: Replace find_spec + direct import with try/except, falling back to vLLM's internal implementation (vllm.vllm_flash_attn.ops.triton.rotary) which already provides an equivalent apply_rotary function. This follows the same pattern used elsewhere in the codebase (e.g., vllm/v1/attention/backends/fa_utils.py).
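The try/except fallback described above can be sketched as follows. This is not the code from the PR: `resolve_callable` is a hypothetical helper, and only the two module paths and the `apply_rotary` name are taken from the description.

```python
import importlib
from typing import Callable, Iterable, Optional

# Module paths named in the PR description, in the order they would be tried.
CANDIDATE_MODULES = (
    "flash_attn.ops.triton.rotary",
    "vllm.vllm_flash_attn.ops.triton.rotary",
)


def resolve_callable(
    candidates: Iterable[str] = CANDIDATE_MODULES,
    attr: str = "apply_rotary",
) -> Optional[Callable]:
    """Return `attr` from the first importable candidate module, else None."""
    for module_name in candidates:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # Covers ModuleNotFoundError too: package absent, submodule absent,
            # or a compiled extension the package needs is missing.
            continue
        func = getattr(module, attr, None)
        if callable(func):
            return func
    return None
```

Calling `resolve_callable()` with the defaults returns a function when either implementation imports cleanly. A `None` result would correspond to the last step of the chain listed in the test plan, where the caller uses the PyTorch native path instead.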

Fixes #38056

Before / After

<details> <summary>Before (old code — crashes)</summary>
=== Before Fix: Old code path ===

find_spec("flash_attn"): ModuleSpec(name='flash_attn', ...)
  → flash_attn IS found, so old code proceeds to import...

Attempting: from flash_attn.ops.triton.rotary import apply_rotary
Traceback (most recent call last):
  File "<string>", line 15, in <module>
  File ".../flash_attn/__init__.py", line 3, in <module>
    from flash_attn.flash_attn_interface import (
  File ".../flash_attn/flash_attn_interface.py", line 15, in <module>
    import flash_attn_2_cuda as flash_attn_gpu
ModuleNotFoundError: No module named 'flash_attn_2_cuda'
</details> <details> <summary>After (fixed code — graceful fallback)</summary>
=== After Fix: New code path (try/except with fallback) ===

✗ flash_attn.ops.triton.rotary not available: No module named 'flash_attn_2_cuda'
✓ Fallback to vllm.vllm_flash_attn.ops.triton.rotary

Result: apply_rotary_emb_flash_attn = <function apply_rotary at 0x7f3df7b691c0>
  → Triton-based rotary embedding will be used (fast path)
</details>

Test plan

  • pre-commit run ruff-check — passed
  • pre-commit run ruff-format — passed
  • pre-commit run mypy-3.10 --hook-stage manual — passed
  • Verified: with flash_attn installed but submodule unavailable, old code crashes, new code falls back gracefully
  • Fallback chain: external flash_attn → vllm internal flash_attn → PyTorch native (forward_native)

AI Assistance Disclosure

This PR was developed with AI assistance (Claude). All changes have been manually reviewed and tested.

Changed files

  • vllm/model_executor/layers/rotary_embedding/common.py (modified, +8/-2)


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

Name: vllm
Version: 0.17.1
Summary: A high-throughput and memory-efficient inference and serving engine for LLMs
Home-page:
Author: vLLM Team
Author-email:
License:
Location: /root/miniconda3/lib/python3.10/site-packages
Requires: aiohttp, anthropic, blake3, cachetools, cbor2, cloudpickle, compressed-tensors, depyf, diskcache, einops, fastapi, filelock, flashinfer-python, gguf, grpcio, grpcio-reflection, ijson, kaldi-native-fbank, lark, llguidance, lm-format-enforcer, mcp, mistral_common, model-hosting-container-standards, msgspec, ninja, numba, numpy, nvidia-cudnn-frontend, nvidia-cutlass-dsl, openai, openai-harmony, opencv-python-headless, opentelemetry-api, opentelemetry-exporter-otlp, opentelemetry-sdk, opentelemetry-semantic-conventions-ai, outlines_core, partial-json-parser, pillow, prometheus-fastapi-instrumentator, prometheus_client, protobuf, psutil, py-cpuinfo, pybase64, pydantic, python-json-logger, pyyaml, pyzmq, quack-kernels, ray, regex, requests, sentencepiece, setproctitle, tiktoken, tokenizers, torch, torchaudio, torchvision, tqdm, transformers, typing_extensions, watchfiles, xgrammar
Required-by:

(base) root@real-347:/workspace# python3 /root/miniconda3/lib/python3.10/site-packages/vllm/co
collect_env.py compilation/ config/ connections.py
(base) root@real-347:/workspace# python3 /root/miniconda3/lib/python3.10/site-packages/vllm/co
collect_env.py compilation/ config/ connections.py
(base) root@real-347:/workspace# python3 /root/miniconda3/lib/python3.10/site-packages/vllm/collect_env.py
Collecting environment information...

============================== System Info ==============================

OS : Ubuntu 20.04.6 LTS (x86_64)
GCC version : (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version : Could not collect
CMake version : Could not collect
Libc version : glibc-2.31

============================== PyTorch Info

PyTorch version : 2.10.0+cu128
Is debug build : False
CUDA used to build PyTorch : 12.8
ROCM used to build PyTorch : N/A

============================== Python Environment

Python version : 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0] (64-bit runtime)
Python platform : Linux-5.15.0-113-generic-x86_64-with-glibc2.31

============================== CUDA / GPU Info

Is CUDA available : True
CUDA runtime version : Could not collect
CUDA_MODULE_LOADING set to :
GPU models and configuration :
GPU 0: NVIDIA GeForce RTX 4090
GPU 1: NVIDIA GeForce RTX 4090
GPU 2: NVIDIA GeForce RTX 4090
GPU 3: NVIDIA GeForce RTX 4090
GPU 4: NVIDIA GeForce RTX 4090
GPU 5: NVIDIA GeForce RTX 4090
GPU 6: NVIDIA GeForce RTX 4090
GPU 7: NVIDIA GeForce RTX 4090

Nvidia driver version : 550.90.07
cuDNN version : Could not collect
HIP runtime version : N/A
MIOpen runtime version : N/A
Is XNNPACK available : True

============================== CPU Info

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 52 bits physical, 57 bits virtual
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 106
Model name: Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz
Stepping: 6
CPU MHz: 2900.000
CPU max MHz: 3500.0000
CPU min MHz: 800.0000
BogoMIPS: 5800.00
Virtualization: VT-x
L1d cache: 1.5 MiB
L1i cache: 1 MiB
L2 cache: 40 MiB
L3 cache: 48 MiB
NUMA node0 CPU(s): 0-15,32-47
NUMA node1 CPU(s): 16-31,48-63
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS; IBPB conditional; RSB filling; PBRSB-eIBRS SW sequence; BHI Syscall hardening, KVM SW loop
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 invpcid_single ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid fsrm md_clear pconfig flush_l1d arch_capabilities

============================== Versions of relevant libraries

[pip3] flashinfer-python==0.6.4
[pip3] numpy==1.26.4
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cudnn-frontend==1.18.0
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-cufile-cu12==1.13.1.3
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-cutlass-dsl==4.4.2
[pip3] nvidia-cutlass-dsl-libs-base==4.4.2
[pip3] nvidia-ml-py==13.590.48
[pip3] nvidia-nccl-cu12==2.27.5
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvshmem-cu12==3.4.5
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] pynvml==11.5.3
[pip3] pyzmq==27.1.0
[pip3] torch==2.10.0
[pip3] torch_c_dlpack_ext==0.1.5
[pip3] torchaudio==2.10.0
[pip3] torchvision==0.25.0
[pip3] transformers==5.3.0
[pip3] triton==3.6.0
[conda] flashinfer-python 0.6.4 pypi_0 pypi
[conda] numpy 1.26.4 pypi_0 pypi
[conda] nvidia-cublas-cu12 12.8.4.1 pypi_0 pypi
[conda] nvidia-cuda-cupti-cu12 12.8.90 pypi_0 pypi
[conda] nvidia-cuda-nvrtc-cu12 12.8.93 pypi_0 pypi
[conda] nvidia-cuda-runtime-cu12 12.8.90 pypi_0 pypi
[conda] nvidia-cudnn-cu12 9.10.2.21 pypi_0 pypi
[conda] nvidia-cudnn-frontend 1.18.0 pypi_0 pypi
[conda] nvidia-cufft-cu12 11.3.3.83 pypi_0 pypi
[conda] nvidia-cufile-cu12 1.13.1.3 pypi_0 pypi
[conda] nvidia-curand-cu12 10.3.9.90 pypi_0 pypi
[conda] nvidia-cusolver-cu12 11.7.3.90 pypi_0 pypi
[conda] nvidia-cusparse-cu12 12.5.8.93 pypi_0 pypi
[conda] nvidia-cusparselt-cu12 0.7.1 pypi_0 pypi
[conda] nvidia-cutlass-dsl 4.4.2 pypi_0 pypi
[conda] nvidia-cutlass-dsl-libs-base 4.4.2 pypi_0 pypi
[conda] nvidia-ml-py 13.590.48 pypi_0 pypi
[conda] nvidia-nccl-cu12 2.27.5 pypi_0 pypi
[conda] nvidia-nvjitlink-cu12 12.8.93 pypi_0 pypi
[conda] nvidia-nvshmem-cu12 3.4.5 pypi_0 pypi
[conda] nvidia-nvtx-cu12 12.8.90 pypi_0 pypi
[conda] pynvml 11.5.3 pypi_0 pypi
[conda] pyzmq 27.1.0 pypi_0 pypi
[conda] torch 2.10.0 pypi_0 pypi
[conda] torch-c-dlpack-ext 0.1.5 pypi_0 pypi
[conda] torchaudio 2.10.0 pypi_0 pypi
[conda] torchvision 0.25.0 pypi_0 pypi
[conda] transformers 5.3.0 pypi_0 pypi
[conda] triton 3.6.0 pypi_0 pypi

============================== vLLM Info

ROCM Version : Could not collect
vLLM Version : 0.17.1
vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:

| | GPU0 | GPU1 | GPU2 | GPU3 | GPU4 | GPU5 | GPU6 | GPU7 | NIC0 | CPU Affinity | NUMA Affinity | GPU NUMA ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPU0 | X | PXB | PXB | PXB | SYS | SYS | SYS | SYS | SYS | 0-15,32-47 | 0 | N/A |
| GPU1 | PXB | X | PXB | PXB | SYS | SYS | SYS | SYS | SYS | 0-15,32-47 | 0 | N/A |
| GPU2 | PXB | PXB | X | PIX | SYS | SYS | SYS | SYS | SYS | 0-15,32-47 | 0 | N/A |
| GPU3 | PXB | PXB | PIX | X | SYS | SYS | SYS | SYS | SYS | 0-15,32-47 | 0 | N/A |
| GPU4 | SYS | SYS | SYS | SYS | X | PXB | PXB | PXB | NODE | 16-31,48-63 | 1 | N/A |
| GPU5 | SYS | SYS | SYS | SYS | PXB | X | PXB | PXB | NODE | 16-31,48-63 | 1 | N/A |
| GPU6 | SYS | SYS | SYS | SYS | PXB | PXB | X | PIX | NODE | 16-31,48-63 | 1 | N/A |
| GPU7 | SYS | SYS | SYS | SYS | PXB | PXB | PIX | X | NODE | 16-31,48-63 | 1 | N/A |
| NIC0 | SYS | SYS | SYS | SYS | NODE | NODE | NODE | NODE | X | | | |

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_bond_0

============================== Environment Variables

NVIDIA_VISIBLE_DEVICES=all
NVIDIA_REQUIRE_CUDA=cuda>=12.1 brand=tesla,driver>=470,driver<471 brand=unknown,driver>=470,driver<471 brand=nvidia,driver>=470,driver<471 brand=nvidiartx,driver>=470,driver<471 brand=geforce,driver>=470,driver<471 brand=geforcertx,driver>=470,driver<471 brand=quadro,driver>=470,driver<471 brand=quadrortx,driver>=470,driver<471 brand=titan,driver>=470,driver<471 brand=titanrtx,driver>=470,driver<471 brand=tesla,driver>=525,driver<526 brand=unknown,driver>=525,driver<526 brand=nvidia,driver>=525,driver<526 brand=nvidiartx,driver>=525,driver<526 brand=geforce,driver>=525,driver<526 brand=geforcertx,driver>=525,driver<526 brand=quadro,driver>=525,driver<526 brand=quadrortx,driver>=525,driver<526 brand=titan,driver>=525,driver<526 brand=titanrtx,driver>=525,driver<526
NVIDIA_DRIVER_CAPABILITIES=compute,utility
CUDA_VERSION=12.1.1
CUDA_VISIBLE_DEVICES=5
CUDA_VISIBLE_DEVICES=5
LD_LIBRARY_PATH=/usr/local/nvidia/lib:/usr/local/nvidia/lib64
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_root

</details>

🐛 Describe the bug

vllm serve /home/xiongjie/run_vlm/Qwen/Qwen3.5-4B --port 19995
(APIServer pid=19) INFO 03-25 01:54:36 [utils.py:302]
(APIServer pid=19) INFO 03-25 01:54:36 [utils.py:302] █ █ █▄ ▄█
(APIServer pid=19) INFO 03-25 01:54:36 [utils.py:302] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.17.1
(APIServer pid=19) INFO 03-25 01:54:36 [utils.py:302] █▄█▀ █ █ █ █ model /home/xiongjie/run_vlm/Qwen/Qwen3.5-4B
(APIServer pid=19) INFO 03-25 01:54:36 [utils.py:302] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=19) INFO 03-25 01:54:36 [utils.py:302]
(APIServer pid=19) INFO 03-25 01:54:36 [utils.py:238] non-default args: {'model_tag': '/home/xiongjie/run_vlm/Qwen/Qwen3.5-4B', 'port': 19995, 'model': '/home/xiongjie/run_vlm/Qwen/Qwen3.5-4B'}
(APIServer pid=19) Unrecognized keys in rope_parameters for 'rope_type'='default': {'mrope_section', 'mrope_interleaved'}
(APIServer pid=19) Unrecognized keys in rope_parameters for 'rope_type'='default': {'mrope_section', 'mrope_interleaved'}
(APIServer pid=19) INFO 03-25 01:54:43 [model.py:531] Resolved architecture: Qwen3_5ForConditionalGeneration
(APIServer pid=19) INFO 03-25 01:54:43 [model.py:1554] Using max model len 262144
(APIServer pid=19) INFO 03-25 01:54:43 [scheduler.py:231] Chunked prefill is enabled with max_num_batched_tokens=2048.
(APIServer pid=19) INFO 03-25 01:54:43 [config.py:544] Setting attention block size to 528 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=19) INFO 03-25 01:54:43 [config.py:575] Padding mamba page size by 0.76% to ensure that mamba page size and attention page size are exactly equal.
(APIServer pid=19) INFO 03-25 01:54:43 [vllm.py:747] Asynchronous scheduling is enabled.
(APIServer pid=19) Unrecognized keys in rope_parameters for 'rope_type'='default': {'mrope_section', 'mrope_interleaved'} (APIServer pid=19) Unrecognized keys in rope_parameters for 'rope_type'='default': {'mrope_section', 'mrope_interleaved'} (EngineCore_DP0 pid=298) INFO 03-25 01:55:00 [core.py:101] Initializing a V1 LLM engine (v0.17.1) with config: model='/home/xiongjie/run_vlm/Qwen/Qwen3.5-4B', speculative_config=None, tokenizer='/home/xiongjie/run_vlm/Qwen/Qwen3.5-4B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=/home/xiongjie/run_vlm/Qwen/Qwen3.5-4B, enable_prefix_caching=False, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 
'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []} (EngineCore_DP0 pid=298) INFO 03-25 01:55:02 [parallel_state.py:1393] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.120.84.11:36731 backend=nccl (EngineCore_DP0 pid=298) INFO 03-25 01:55:02 [parallel_state.py:1715] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A (EngineCore_DP0 pid=298) INFO 03-25 01:55:10 [base.py:106] Offloader set to NoopOffloader (EngineCore_DP0 pid=298) INFO 03-25 01:55:10 [gpu_model_runner.py:4281] Starting to load model /home/xiongjie/run_vlm/Qwen/Qwen3.5-4B... 
(EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] EngineCore failed to start. (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] Traceback (most recent call last): (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1090, in run_engine_core (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs) (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] return func(*args, **kwargs) (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 834, in init (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] super().init( (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 110, in init (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] self.model_executor = executor_class(vllm_config) (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] return func(*args, **kwargs) (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 103, in init (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] self._init_executor() (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 49, in _init_executor (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] 
self.driver_worker.load_model() (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 337, in load_model (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] self.model_runner.load_model(load_dummy_weights=dummy_weights) (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] return func(*args, **kwargs) (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4297, in load_model (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] self.model = model_loader.load_model( (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] return func(*args, **kwargs) (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/model_loader/base_loader.py", line 54, in load_model (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] model = initialize_model( (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] return func(*args, **kwargs) (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/model_loader/utils.py", line 56, in initialize_model (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] model = model_class(vllm_config=vllm_config, prefix=prefix) (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] File 
"/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/models/qwen3_5.py", line 653, in init (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] self.visual = Qwen3_VisionTransformer( (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/models/qwen3_vl.py", line 366, in init (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] self.rotary_pos_emb = get_rope( (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/rotary_embedding/init.py", line 129, in get_rope (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] rotary_emb = RotaryEmbedding( (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/rotary_embedding/base.py", line 129, in init (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] super().init( (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/rotary_embedding/base.py", line 65, in init (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] self.apply_rotary_emb = ApplyRotaryEmb( (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/rotary_embedding/common.py", line 138, in init (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] from flash_attn.ops.triton.rotary import apply_rotary (EngineCore_DP0 pid=298) ERROR 03-25 01:55:11 [core.py:1100] ModuleNotFoundError: No module named 'flash_attn.ops.triton' (EngineCore_DP0 pid=298) Process EngineCore_DP0: (EngineCore_DP0 pid=298) Traceback (most recent call last): (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap (EngineCore_DP0 pid=298) self.run() (EngineCore_DP0 
pid=298) File "/root/miniconda3/lib/python3.10/multiprocessing/process.py", line 108, in run (EngineCore_DP0 pid=298) self._target(*self._args, **self._kwargs) (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1104, in run_engine_core (EngineCore_DP0 pid=298) raise e (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 1090, in run_engine_core (EngineCore_DP0 pid=298) engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs) (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore_DP0 pid=298) return func(*args, **kwargs) (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 834, in init (EngineCore_DP0 pid=298) super().init( (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core.py", line 110, in init (EngineCore_DP0 pid=298) self.model_executor = executor_class(vllm_config) (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore_DP0 pid=298) return func(*args, **kwargs) (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/executor/abstract.py", line 103, in init (EngineCore_DP0 pid=298) self._init_executor() (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/executor/uniproc_executor.py", line 49, in _init_executor (EngineCore_DP0 pid=298) self.driver_worker.load_model() (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/worker/gpu_worker.py", line 337, in load_model (EngineCore_DP0 pid=298) self.model_runner.load_model(load_dummy_weights=dummy_weights) (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore_DP0 
pid=298) return func(*args, **kwargs) (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/worker/gpu_model_runner.py", line 4297, in load_model (EngineCore_DP0 pid=298) self.model = model_loader.load_model( (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore_DP0 pid=298) return func(*args, **kwargs) (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/model_loader/base_loader.py", line 54, in load_model (EngineCore_DP0 pid=298) model = initialize_model( (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (EngineCore_DP0 pid=298) return func(*args, **kwargs) (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/model_loader/utils.py", line 56, in initialize_model (EngineCore_DP0 pid=298) model = model_class(vllm_config=vllm_config, prefix=prefix) (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/models/qwen3_5.py", line 653, in init (EngineCore_DP0 pid=298) self.visual = Qwen3_VisionTransformer( (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/models/qwen3_vl.py", line 366, in init (EngineCore_DP0 pid=298) self.rotary_pos_emb = get_rope( (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/rotary_embedding/init.py", line 129, in get_rope (EngineCore_DP0 pid=298) rotary_emb = RotaryEmbedding( (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/rotary_embedding/base.py", line 129, in init (EngineCore_DP0 pid=298) super().init( (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/rotary_embedding/base.py", line 65, in init (EngineCore_DP0 pid=298) 
self.apply_rotary_emb = ApplyRotaryEmb( (EngineCore_DP0 pid=298) File "/root/miniconda3/lib/python3.10/site-packages/vllm/model_executor/layers/rotary_embedding/common.py", line 138, in init (EngineCore_DP0 pid=298) from flash_attn.ops.triton.rotary import apply_rotary (EngineCore_DP0 pid=298) ModuleNotFoundError: No module named 'flash_attn.ops.triton' [rank0]:[W325 01:55:12.578465682 ProcessGroupNCCL.cpp:1553] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) (APIServer pid=19) Traceback (most recent call last): (APIServer pid=19) File "/root/miniconda3/bin/vllm", line 8, in <module> (APIServer pid=19) sys.exit(main()) (APIServer pid=19) File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/cli/main.py", line 73, in main (APIServer pid=19) args.dispatch_function(args) (APIServer pid=19) File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/cli/serve.py", line 112, in cmd (APIServer pid=19) uvloop.run(run_server(args)) (APIServer pid=19) File "/root/miniconda3/lib/python3.10/site-packages/uvloop/init.py", line 82, in run (APIServer pid=19) return loop.run_until_complete(wrapper()) (APIServer pid=19) File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete (APIServer pid=19) File "/root/miniconda3/lib/python3.10/site-packages/uvloop/init.py", line 61, in wrapper (APIServer pid=19) return await main (APIServer pid=19) File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 471, in run_server (APIServer pid=19) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs) (APIServer pid=19) File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 490, in run_server_worker (APIServer pid=19) async with build_async_engine_client( (APIServer pid=19) File 
"/root/miniconda3/lib/python3.10/contextlib.py", line 199, in aenter (APIServer pid=19) return await anext(self.gen) (APIServer pid=19) File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 96, in build_async_engine_client (APIServer pid=19) async with build_async_engine_client_from_engine_args( (APIServer pid=19) File "/root/miniconda3/lib/python3.10/contextlib.py", line 199, in aenter (APIServer pid=19) return await anext(self.gen) (APIServer pid=19) File "/root/miniconda3/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 137, in build_async_engine_client_from_engine_args (APIServer pid=19) async_llm = AsyncLLM.from_vllm_config( (APIServer pid=19) File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config (APIServer pid=19) return cls( (APIServer pid=19) File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/async_llm.py", line 154, in init (APIServer pid=19) self.engine_core = EngineCoreClient.make_async_mp_client( (APIServer pid=19) File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (APIServer pid=19) return func(*args, **kwargs) (APIServer pid=19) File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 127, in make_async_mp_client (APIServer pid=19) return AsyncMPClient(*client_args) (APIServer pid=19) File "/root/miniconda3/lib/python3.10/site-packages/vllm/tracing/otel.py", line 178, in sync_wrapper (APIServer pid=19) return func(*args, **kwargs) (APIServer pid=19) File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 911, in init (APIServer pid=19) super().init( (APIServer pid=19) File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/core_client.py", line 569, in init (APIServer pid=19) with launch_core_engines( (APIServer pid=19) File "/root/miniconda3/lib/python3.10/contextlib.py", line 142, in 
__exit__ (APIServer pid=19) next(self.gen) (APIServer pid=19) File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 951, in launch_core_engines (APIServer pid=19) wait_for_engine_startup( (APIServer pid=19) File "/root/miniconda3/lib/python3.10/site-packages/vllm/v1/engine/utils.py", line 1010, in wait_for_engine_startup (APIServer pid=19) raise RuntimeError( (APIServer pid=19) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}


extent analysis

Fix Plan

The root error is ModuleNotFoundError: No module named 'flash_attn.ops.triton', raised when vLLM's rotary_embedding/common.py executes from flash_attn.ops.triton.rotary import apply_rotary. The missing name is the flash_attn.ops.triton subpackage, not flash_attn itself, which suggests that flash-attn is installed but is too old to ship the Triton rotary kernel. According to the issue title, this affects flash-attn versions older than v2.1.2.

To fix this, upgrade flash-attn to v2.1.2 or later:

pip install --upgrade "flash-attn>=2.1.2"

If you are using a conda environment, you can install it from conda-forge instead:

conda install -c conda-forge "flash-attn>=2.1.2"

After upgrading, run the same command again to see whether the issue is resolved.
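Before or after changing the installation, it can help to confirm which case applies: no flash-attn at all, or a flash-attn that lacks the Triton subpackage. The following is a minimal diagnostic sketch using only the standard library; the helper name probe is hypothetical and not part of vLLM or flash-attn.

```python
import importlib.util
from importlib import metadata


def probe(dist_name, module_name):
    """Report the installed version of a distribution and whether a dotted module resolves.

    `probe` is a hypothetical helper for illustration only.
    """
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None  # the distribution is not installed in this interpreter

    try:
        # find_spec imports the parent packages of a dotted name, so a missing
        # parent (e.g. flash_attn.ops.triton) surfaces as an ImportError here.
        module_found = importlib.util.find_spec(module_name) is not None
    except ImportError:
        module_found = False

    return {"version": version, "module_found": module_found}


if __name__ == "__main__":
    # The module name is the one vLLM imports in the traceback above.
    print(probe("flash-attn", "flash_attn.ops.triton.rotary"))
```

A result with a version string but module_found set to False would match the situation in this issue: the package is present, but the module vLLM needs is not.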

Verification

To verify the fix, rerun the vllm serve command with the same arguments. If the issue is resolved, the engine core should start and the log should no longer contain ModuleNotFoundError: No module named 'flash_attn.ops.triton'.
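The version requirement can also be checked programmatically before restarting the server. The sketch below compares a version string against v2.1.2, the threshold taken from the issue title. The function names are hypothetical, and the parsing is deliberately simplified: it reads only the leading numeric release segment and ignores pre-release and local suffixes.

```python
import re

# Threshold taken from the issue title ("older versions (< v2.1.2)").
MIN_FLASH_ATTN = (2, 1, 2)


def release_tuple(version):
    """Extract the leading numeric release segment, e.g. '2.5.8+cu122' -> (2, 5, 8)."""
    match = re.match(r"v?(\d+(?:\.\d+)*)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def meets_minimum(version, minimum=MIN_FLASH_ATTN):
    """Return True if `version` is at least `minimum` (simplified comparison)."""
    release = release_tuple(version)
    # Pad the shorter tuple with zeros so that '2.2' compares as (2, 2, 0).
    width = max(len(release), len(minimum))
    padded_release = release + (0,) * (width - len(release))
    padded_minimum = tuple(minimum) + (0,) * (width - len(minimum))
    return padded_release >= padded_minimum
```

The version string to pass in can be read with importlib.metadata.version("flash-attn") in the same interpreter that runs vLLM.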

Extra Tips

  • Make sure the installed flash-attn is at least v2.1.2, the threshold named in the issue title; pip show flash-attn prints the installed version.
  • If you are using a virtual environment, ensure that the library is installed in the environment vLLM actually runs from. In the traceback above, that is the interpreter under /root/miniconda3/lib/python3.10.
  • If you are still encountering issues, check the vLLM installation documentation for any specific flash-attn version or configuration requirements.
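To check which environment a given Python process is using, a short sketch like the following can be run with the same interpreter that launches vllm. The helper name environment_summary is hypothetical; note that the in_venv flag only detects venv/virtualenv environments, so the CONDA_PREFIX variable is reported separately for conda setups.

```python
import os
import sys


def environment_summary():
    """Summarize which interpreter and environment this process runs in.

    `environment_summary` is a hypothetical helper for illustration only.
    """
    return {
        # Path of the running interpreter; packages must be installed for this one.
        "executable": sys.executable,
        "prefix": sys.prefix,
        # True inside venv/virtualenv; conda environments are not detected this way.
        "in_venv": sys.prefix != sys.base_prefix,
        # Set by `conda activate`; None when no conda environment is active.
        "conda_prefix": os.environ.get("CONDA_PREFIX"),
    }


if __name__ == "__main__":
    for key, value in environment_summary().items():
        print(f"{key}: {value}")
```

If the executable printed here differs from the interpreter path in the traceback, the upgrade was probably applied to a different environment.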
