vllm - ✅(Solved) Fix [Bug][ROCm]: Aiter unified attention fails during compilation [2 pull requests, 4 comments, 3 participants]

GitHub stats
vllm-project/vllm#37548, fetched 2026-04-08 01:02:14
Comments: 4
Participants: 3
Timeline: 20
Reactions: 0
Timeline (top): commented ×4, mentioned ×4, subscribed ×4, project_v2_item_status_changed ×3

Error Message

/app/ir-ops/aiter/aiter/ops/triton/_triton_kernels/attention/unified_attention.py:54:0: error: Failures have been detected while processing an MLIR pass pipeline
/app/ir-ops/aiter/aiter/ops/triton/_triton_kernels/attention/unified_attention.py:54:0: note: Pipeline failed while executing [ConvertTritonAMDGPUToLLVM on 'builtin.module' operation]: reproducer generated at std::errs, please share the reproducer above with Triton project.
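
For triage, the failing pass and the kernel source location can be pulled out of the note line above. The following is a minimal sketch using only Python's standard `re` module. The function name `parse_pass_failure` and the regular expression are illustrative assumptions; they are not part of Triton, AITER, or vLLM.

```python
import re

# Illustrative pattern (an assumption, written against the message quoted above):
# "<path>.py:<line>:<col>: note: Pipeline failed while executing [<pass> on '<op>' operation]"
PASS_FAILURE = re.compile(
    r"(?P<path>\S+\.py):(?P<line>\d+):(?P<col>\d+): note: "
    r"Pipeline failed while executing "
    r"\[(?P<pass_name>\w+) on '(?P<op>[\w.]+)' operation\]"
)


def parse_pass_failure(text):
    """Return the failing pass, target op, and source location, or None if absent."""
    match = PASS_FAILURE.search(text)
    if match is None:
        return None
    return {
        "path": match["path"],
        "line": int(match["line"]),
        "column": int(match["col"]),
        "pass_name": match["pass_name"],
        "op": match["op"],
    }


if __name__ == "__main__":
    sample = (
        "/app/ir-ops/aiter/aiter/ops/triton/_triton_kernels/attention/"
        "unified_attention.py:54:0: note: Pipeline failed while executing "
        "[ConvertTritonAMDGPUToLLVM on 'builtin.module' operation]: "
        "reproducer generated at std::errs"
    )
    print(parse_pass_failure(sample))
```

The extracted pass name and source line are usually the first things a Triton maintainer asks for, alongside the reproducer that the message mentions.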

Capturing CUDA graphs (decode, FULL): 0%| | 0/51 [00:00<?, ?it/s]
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] EngineCore failed to start.
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] Traceback (most recent call last):
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/app/ir-ops/vllm/vllm/v1/engine/core.py", line 1073, in run_engine_core
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/app/ir-ops/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] return func(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/app/ir-ops/vllm/vllm/v1/engine/core.py", line 839, in __init__
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] super().__init__(
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/app/ir-ops/vllm/vllm/v1/engine/core.py", line 122, in __init__
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/app/ir-ops/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] return func(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/app/ir-ops/vllm/vllm/v1/engine/core.py", line 278, in _initialize_kv_caches
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] self.model_executor.initialize_from_config(kv_cache_configs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/app/ir-ops/vllm/vllm/v1/executor/abstract.py", line 118, in initialize_from_config
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] compilation_times: list[float] = self.collective_rpc("compile_or_warm_up_model")
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/app/ir-ops/vllm/vllm/v1/executor/uniproc_executor.py", line 78, in collective_rpc
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/app/ir-ops/vllm/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] return func(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/app/ir-ops/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] return func(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/app/ir-ops/vllm/vllm/v1/worker/gpu_worker.py", line 608, in compile_or_warm_up_model
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] cuda_graph_memory_bytes = self.model_runner.capture_model()
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/app/ir-ops/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] return func(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/app/ir-ops/vllm/vllm/v1/worker/gpu_model_runner.py", line 5744, in capture_model
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] self._capture_cudagraphs(
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/app/ir-ops/vllm/vllm/v1/worker/gpu_model_runner.py", line 5855, in _capture_cudagraphs
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] self._warmup_and_capture(
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/app/ir-ops/vllm/vllm/v1/worker/gpu_model_runner.py", line 5791, in _warmup_and_capture
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] self._dummy_run(
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] return func(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/app/ir-ops/vllm/vllm/v1/worker/gpu_model_runner.py", line 5235, in _dummy_run
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] outputs = self.model(
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/app/ir-ops/vllm/vllm/compilation/cuda_graph.py", line 251, in __call__
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] return self.runnable(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] return self._call_impl(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] return forward_call(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/app/ir-ops/vllm/vllm/model_executor/models/llama.py", line 577, in forward
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] model_output = self.model(
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/app/ir-ops/vllm/vllm/compilation/decorators.py", line 503, in __call__
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] return TorchCompileWithNoGuardsWrapper.__call__(self, *args, **kwargs) # type: ignore[arg-type]
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/app/ir-ops/vllm/vllm/compilation/wrapper.py", line 187, in __call__
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] return self._call_with_optional_nvtx_range(
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/app/ir-ops/vllm/vllm/compilation/wrapper.py", line 76, in _call_with_optional_nvtx_range
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] return callable_fn(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/app/ir-ops/vllm/vllm/model_executor/models/llama.py", line 400, in forward
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] def forward(
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] return fn(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/app/ir-ops/vllm/vllm/compilation/caching.py", line 206, in __call__
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] return self.optimized_call(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 837, in call_wrapped
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] return self._wrapped_call(self, *args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 413, in __call__
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] raise e
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 400, in __call__
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] return self._call_impl(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] return forward_call(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "<eval_with_key>.66", line 496, in forward
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] submod_1 = self.submod_1(getitem, s72, getitem_1, getitem_2, getitem_3); getitem = getitem_1 = getitem_2 = submod_1 = None
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 837, in call_wrapped
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] return self._wrapped_call(self, *args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 413, in __call__
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] raise e
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 400, in __call__
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] return self._call_impl(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] return forward_call(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "<eval_with_key>.2", line 6, in forward
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] unified_attention_with_output = torch.ops.vllm.unified_attention_with_output(query, key, value, output_3, 'model.layers.0.self_attn.attn', kv_cache_dummy_dep = unified_kv_cache_update); query = key = value = output_3 = unified_kv_cache_update = unified_attention_with_output = None
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] return self._op(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/app/ir-ops/vllm/vllm/model_executor/layers/attention/kv_transfer_utils.py", line 39, in wrapper
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] return func(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/app/ir-ops/vllm/vllm/model_executor/layers/attention/attention.py", line 699, in unified_attention_with_output
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] self.impl.forward(
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/app/ir-ops/vllm/vllm/v1/attention/backends/rocm_aiter_unified_attn.py", line 220, in forward
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] self.unified_attention(
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/app/ir-ops/aiter/aiter/ops/triton/attention/unified_attention.py", line 185, in unified_attention
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] kernel_unified_attention_2d[
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 393, in <lambda>
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 599, in run
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] kernel = self._do_compile(key, signature, device, constexprs, options, attrs, warmup)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 782, in _do_compile
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] kernel = self.compile(src, target=target, options=options.__dict__)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 322, in compile
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] next_module = compile_ir(module, metadata)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/triton/backends/amd/compiler.py", line 449, in <lambda>
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] stages["llir"] = lambda src, metadata: self.make_llir(src, metadata, options)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] File "/usr/local/lib/python3.12/dist-packages/triton/backends/amd/compiler.py", line 324, in make_llir
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] pm.run(mod)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] RuntimeError: PassManager::run failed
(EngineCore pid=270048) Process EngineCore:
(EngineCore pid=270048) Traceback (most recent call last):
(EngineCore pid=270048) File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=270048) self.run()
(EngineCore pid=270048) File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=270048) self._target(*self._args, **self._kwargs)
(EngineCore pid=270048) File "/app/ir-ops/vllm/vllm/v1/engine/core.py", line 1103, in run_engine_core
(EngineCore pid=270048) raise e
(EngineCore pid=270048) File "/app/ir-ops/vllm/vllm/v1/engine/core.py", line 1073, in run_engine_core
(EngineCore pid=270048) engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "/app/ir-ops/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=270048) return func(*args, **kwargs)
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "/app/ir-ops/vllm/vllm/v1/engine/core.py", line 839, in __init__
(EngineCore pid=270048) super().__init__(
(EngineCore pid=270048) File "/app/ir-ops/vllm/vllm/v1/engine/core.py", line 122, in __init__
(EngineCore pid=270048) kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "/app/ir-ops/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=270048) return func(*args, **kwargs)
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "/app/ir-ops/vllm/vllm/v1/engine/core.py", line 278, in _initialize_kv_caches
(EngineCore pid=270048) self.model_executor.initialize_from_config(kv_cache_configs)
(EngineCore pid=270048) File "/app/ir-ops/vllm/vllm/v1/executor/abstract.py", line 118, in initialize_from_config
(EngineCore pid=270048) compilation_times: list[float] = self.collective_rpc("compile_or_warm_up_model")
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "/app/ir-ops/vllm/vllm/v1/executor/uniproc_executor.py", line 78, in collective_rpc
(EngineCore pid=270048) result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "/app/ir-ops/vllm/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore pid=270048) return func(*args, **kwargs)
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "/app/ir-ops/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=270048) return func(*args, **kwargs)
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "/app/ir-ops/vllm/vllm/v1/worker/gpu_worker.py", line 608, in compile_or_warm_up_model
(EngineCore pid=270048) cuda_graph_memory_bytes = self.model_runner.capture_model()
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "/app/ir-ops/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=270048) return func(*args, **kwargs)
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "/app/ir-ops/vllm/vllm/v1/worker/gpu_model_runner.py", line 5744, in capture_model
(EngineCore pid=270048) self._capture_cudagraphs(
(EngineCore pid=270048) File "/app/ir-ops/vllm/vllm/v1/worker/gpu_model_runner.py", line 5855, in _capture_cudagraphs
(EngineCore pid=270048) self._warmup_and_capture(
(EngineCore pid=270048) File "/app/ir-ops/vllm/vllm/v1/worker/gpu_model_runner.py", line 5791, in _warmup_and_capture
(EngineCore pid=270048) self._dummy_run(
(EngineCore pid=270048) File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=270048) return func(*args, **kwargs)
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "/app/ir-ops/vllm/vllm/v1/worker/gpu_model_runner.py", line 5235, in _dummy_run
(EngineCore pid=270048) outputs = self.model(
(EngineCore pid=270048) ^^^^^^^^^^^
(EngineCore pid=270048) File "/app/ir-ops/vllm/vllm/compilation/cuda_graph.py", line 251, in __call__
(EngineCore pid=270048) return self.runnable(*args, **kwargs)
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=270048) return self._call_impl(*args, **kwargs)
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=270048) return forward_call(*args, **kwargs)
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "/app/ir-ops/vllm/vllm/model_executor/models/llama.py", line 577, in forward
(EngineCore pid=270048) model_output = self.model(
(EngineCore pid=270048) ^^^^^^^^^^^
(EngineCore pid=270048) File "/app/ir-ops/vllm/vllm/compilation/decorators.py", line 503, in __call__
(EngineCore pid=270048) return TorchCompileWithNoGuardsWrapper.__call__(self, *args, **kwargs) # type: ignore[arg-type]
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "/app/ir-ops/vllm/vllm/compilation/wrapper.py", line 187, in __call__
(EngineCore pid=270048) return self._call_with_optional_nvtx_range(
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "/app/ir-ops/vllm/vllm/compilation/wrapper.py", line 76, in _call_with_optional_nvtx_range
(EngineCore pid=270048) return callable_fn(*args, **kwargs)
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "/app/ir-ops/vllm/vllm/model_executor/models/llama.py", line 400, in forward
(EngineCore pid=270048) def forward(
(EngineCore pid=270048) File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
(EngineCore pid=270048) return fn(*args, **kwargs)
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "/app/ir-ops/vllm/vllm/compilation/caching.py", line 206, in __call__
(EngineCore pid=270048) return self.optimized_call(*args, **kwargs)
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 837, in call_wrapped
(EngineCore pid=270048) return self._wrapped_call(self, *args, **kwargs)
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 413, in __call__
(EngineCore pid=270048) raise e
(EngineCore pid=270048) File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 400, in __call__
(EngineCore pid=270048) return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=270048) return self._call_impl(*args, **kwargs)
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=270048) return forward_call(*args, **kwargs)
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "<eval_with_key>.66", line 496, in forward
(EngineCore pid=270048) submod_1 = self.submod_1(getitem, s72, getitem_1, getitem_2, getitem_3); getitem = getitem_1 = getitem_2 = submod_1 = None
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 837, in call_wrapped
(EngineCore pid=270048) return self._wrapped_call(self, *args, **kwargs)
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 413, in __call__
(EngineCore pid=270048) raise e
(EngineCore pid=270048) File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 400, in __call__
(EngineCore pid=270048) return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc]
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=270048) return self._call_impl(*args, **kwargs)
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=270048) return forward_call(*args, **kwargs)
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "<eval_with_key>.2", line 6, in forward
(EngineCore pid=270048) unified_attention_with_output = torch.ops.vllm.unified_attention_with_output(query, key, value, output_3, 'model.layers.0.self_attn.attn', kv_cache_dummy_dep = unified_kv_cache_update); query = key = value = output_3 = unified_kv_cache_update = unified_attention_with_output = None
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=270048) return self._op(*args, **kwargs)
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "/app/ir-ops/vllm/vllm/model_executor/layers/attention/kv_transfer_utils.py", line 39, in wrapper
(EngineCore pid=270048) return func(*args, **kwargs)
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "/app/ir-ops/vllm/vllm/model_executor/layers/attention/attention.py", line 699, in unified_attention_with_output
(EngineCore pid=270048) self.impl.forward(
(EngineCore pid=270048) File "/app/ir-ops/vllm/vllm/v1/attention/backends/rocm_aiter_unified_attn.py", line 220, in forward
(EngineCore pid=270048) self.unified_attention(
(EngineCore pid=270048) File "/app/ir-ops/aiter/aiter/ops/triton/attention/unified_attention.py", line 185, in unified_attention
(EngineCore pid=270048) kernel_unified_attention_2d[
(EngineCore pid=270048) File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 393, in <lambda>
(EngineCore pid=270048) return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 599, in run
(EngineCore pid=270048) kernel = self._do_compile(key, signature, device, constexprs, options, attrs, warmup)
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 782, in _do_compile
(EngineCore pid=270048) kernel = self.compile(src, target=target, options=options.__dict__)
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 322, in compile
(EngineCore pid=270048) next_module = compile_ir(module, metadata)
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "/usr/local/lib/python3.12/dist-packages/triton/backends/amd/compiler.py", line 449, in <lambda>
(EngineCore pid=270048) stages["llir"] = lambda src, metadata: self.make_llir(src, metadata, options)
(EngineCore pid=270048) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) File "/usr/local/lib/python3.12/dist-packages/triton/backends/amd/compiler.py", line 324, in make_llir
(EngineCore pid=270048) pm.run(mod)
(EngineCore pid=270048) RuntimeError: PassManager::run failed
[rank0]:[W319 10:11:40.372379106 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=269504) Traceback (most recent call last):
(APIServer pid=269504) File "/usr/local/bin/vllm", line 33, in <module>
(APIServer pid=269504) sys.exit(load_entry_point('vllm', 'console_scripts', 'vllm')())
(APIServer pid=269504) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504) File "/app/ir-ops/vllm/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=269504) args.dispatch_function(args)
(APIServer pid=269504) File "/app/ir-ops/vllm/vllm/entrypoints/cli/serve.py", line 118, in cmd
(APIServer pid=269504) uvloop.run(run_server(args))
(APIServer pid=269504) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=269504) return __asyncio.run(
(APIServer pid=269504) ^^^^^^^^^^^^^^
(APIServer pid=269504) File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=269504) return runner.run(main)
(APIServer pid=269504) ^^^^^^^^^^^^^^^^
(APIServer pid=269504) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=269504) return self._loop.run_until_complete(task)
(APIServer pid=269504) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=269504) File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=269504) return await main
(APIServer pid=269504) ^^^^^^^^^^
(APIServer pid=269504) File "/app/ir-ops/vllm/vllm/entrypoints/openai/api_server.py", line 675, in run_server
(APIServer pid=269504) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=269504) File "/app/ir-ops/vllm/vllm/entrypoints/openai/api_server.py", line 689, in run_server_worker
(APIServer pid=269504) async with build_async_engine_client(
(APIServer pid=269504) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=269504) return await anext(self.gen)
(APIServer pid=269504) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504) File "/app/ir-ops/vllm/vllm/entrypoints/openai/api_server.py", line 104, in build_async_engine_client
(APIServer pid=269504) async with build_async_engine_client_from_engine_args(
(APIServer pid=269504) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504) File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=269504) return await anext(self.gen)
(APIServer pid=269504) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504) File "/app/ir-ops/vllm/vllm/entrypoints/openai/api_server.py", line 145, in build_async_engine_client_from_engine_args
(APIServer pid=269504) async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=269504) ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504) File "/app/ir-ops/vllm/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=269504) return cls(
(APIServer pid=269504) ^^^^
(APIServer pid=269504) File "/app/ir-ops/vllm/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=269504) self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=269504) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504) File "/app/ir-ops/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=269504) return func(*args, **kwargs)
(APIServer pid=269504) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504) File "/app/ir-ops/vllm/vllm/v1/engine/core_client.py", line 128, in make_async_mp_client
(APIServer pid=269504) return AsyncMPClient(*client_args)
(APIServer pid=269504) ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504) File "/app/ir-ops/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=269504) return func(*args, **kwargs)
(APIServer pid=269504) ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504) File "/app/ir-ops/vllm/vllm/v1/engine/core_client.py", line 924, in __init__
(APIServer pid=269504) super().__init__(
(APIServer pid=269504) File "/app/ir-ops/vllm/vllm/v1/engine/core_client.py", line 583, in __init__
(APIServer pid=269504) with launch_core_engines(
(APIServer pid=269504) ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504) File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=269504) next(self.gen)
(APIServer pid=269504) File "/app/ir-ops/vllm/vllm/v1/engine/utils.py", line 972, in launch_core_engines
(APIServer pid=269504) wait_for_engine_startup(
(APIServer pid=269504) File "/app/ir-ops/vllm/vllm/v1/engine/utils.py", line 1031, in wait_for_engine_startup
(APIServer pid=269504) raise RuntimeError(
(APIServer pid=269504) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

</details>

Additionally, all test_attention_quant_pattern tests are failing for AttentionBackendEnum.ROCM_AITER_UNIFIED_ATTN.

Root Cause

super().init( (APIServer pid=269504) File "/app/ir-ops/vllm/vllm/v1/engine/core_client.py", line 583, in init (APIServer pid=269504) with launch_core_engines( (APIServer pid=269504) ^^^^^^^^^^^^^^^^^^^^ (APIServer pid=269504) File "/usr/lib/python3.12/contextlib.py", line 144, in exit (APIServer pid=269504) next(self.gen) (APIServer pid=269504) File "/app/ir-ops/vllm/vllm/v1/engine/utils.py", line 972, in launch_core_engines (APIServer pid=269504) wait_for_engine_startup( (APIServer pid=269504) File "/app/ir-ops/vllm/vllm/v1/engine/utils.py", line 1031, in wait_for_engine_startup (APIServer pid=269504) raise RuntimeError( (APIServer pid=269504) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

</details>

Fix Action

Fix / Workaround

============================== CPU Info

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 52 bits physical, 57 bits virtual
Byte Order: Little Endian
CPU(s): 192
On-line CPU(s) list: 0-191
Vendor ID: AuthenticAMD
Model name: AMD EPYC 9654 96-Core Processor
CPU family: 25
Model: 17
Thread(s) per core: 1
Core(s) per socket: 96
Socket(s): 2
Stepping: 1
Frequency boost: enabled
CPU max MHz: 3707.8120
CPU min MHz: 1500.0000
BogoMIPS: 4793.23
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization: AMD-V
L1d cache: 6 MiB (192 instances)
L1i cache: 6 MiB (192 instances)
L2 cache: 192 MiB (192 instances)
L3 cache: 768 MiB (24 instances)
NUMA node(s): 2
NUMA node0 CPU(s): 0-95
NUMA node1 CPU(s): 96-191
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected

For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator()) (APIServer pid=269504) Traceback (most recent call last): (APIServer pid=269504) File "/usr/local/bin/vllm", line 33, in <module> (APIServer pid=269504) sys.exit(load_entry_point('vllm', 'console_scripts', 'vllm')()) (APIServer pid=269504) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (APIServer pid=269504) File "/app/ir-ops/vllm/vllm/entrypoints/cli/main.py", line 75, in main (APIServer pid=269504) args.dispatch_function(args) (APIServer pid=269504) File "/app/ir-ops/vllm/vllm/entrypoints/cli/serve.py", line 118, in cmd (APIServer pid=269504) uvloop.run(run_server(args)) (APIServer pid=269504) File "/usr/local/lib/python3.12/dist-packages/uvloop/init.py", line 96, in run (APIServer pid=269504) return __asyncio.run( (APIServer pid=269504) ^^^^^^^^^^^^^^ (APIServer pid=269504) File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run (APIServer pid=269504) return runner.run(main) (APIServer pid=269504) ^^^^^^^^^^^^^^^^ (APIServer pid=269504) File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run (APIServer pid=269504) return self._loop.run_until_complete(task) (APIServer pid=269504) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (APIServer pid=269504) File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete (APIServer pid=269504) File "/usr/local/lib/python3.12/dist-packages/uvloop/init.py", line 48, in wrapper (APIServer pid=269504) return await main (APIServer pid=269504) ^^^^^^^^^^ (APIServer pid=269504) File "/app/ir-ops/vllm/vllm/entrypoints/openai/api_server.py", line 675, in run_server (APIServer pid=269504) await run_server_worker(listen_address, sock, args, **uvicorn_kwargs) (APIServer pid=269504) File "/app/ir-ops/vllm/vllm/entrypoints/openai/api_server.py", line 689, in run_server_worker (APIServer pid=269504) async with build_async_engine_client( (APIServer pid=269504) ^^^^^^^^^^^^^^^^^^^^^^^^^^ (APIServer pid=269504) 
File "/usr/lib/python3.12/contextlib.py", line 210, in aenter (APIServer pid=269504) return await anext(self.gen) (APIServer pid=269504) ^^^^^^^^^^^^^^^^^^^^^ (APIServer pid=269504) File "/app/ir-ops/vllm/vllm/entrypoints/openai/api_server.py", line 104, in build_async_engine_client (APIServer pid=269504) async with build_async_engine_client_from_engine_args( (APIServer pid=269504) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (APIServer pid=269504) File "/usr/lib/python3.12/contextlib.py", line 210, in aenter (APIServer pid=269504) return await anext(self.gen) (APIServer pid=269504) ^^^^^^^^^^^^^^^^^^^^^ (APIServer pid=269504) File "/app/ir-ops/vllm/vllm/entrypoints/openai/api_server.py", line 145, in build_async_engine_client_from_engine_args (APIServer pid=269504) async_llm = AsyncLLM.from_vllm_config( (APIServer pid=269504) ^^^^^^^^^^^^^^^^^^^^^^^^^^ (APIServer pid=269504) File "/app/ir-ops/vllm/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config (APIServer pid=269504) return cls( (APIServer pid=269504) ^^^^ (APIServer pid=269504) File "/app/ir-ops/vllm/vllm/v1/engine/async_llm.py", line 154, in init (APIServer pid=269504) self.engine_core = EngineCoreClient.make_async_mp_client( (APIServer pid=269504) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ (APIServer pid=269504) File "/app/ir-ops/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper (APIServer pid=269504) return func(*args, **kwargs) (APIServer pid=269504) ^^^^^^^^^^^^^^^^^^^^^ (APIServer pid=269504) File "/app/ir-ops/vllm/vllm/v1/engine/core_client.py", line 128, in make_async_mp_client (APIServer pid=269504) return AsyncMPClient(*client_args) (APIServer pid=269504) ^^^^^^^^^^^^^^^^^^^^^^^^^^^ (APIServer pid=269504) File "/app/ir-ops/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper (APIServer pid=269504) return func(*args, **kwargs) (APIServer pid=269504) ^^^^^^^^^^^^^^^^^^^^^ (APIServer pid=269504) File "/app/ir-ops/vllm/vllm/v1/engine/core_client.py", line 924, in init (APIServer pid=269504) 
super().init( (APIServer pid=269504) File "/app/ir-ops/vllm/vllm/v1/engine/core_client.py", line 583, in init (APIServer pid=269504) with launch_core_engines( (APIServer pid=269504) ^^^^^^^^^^^^^^^^^^^^ (APIServer pid=269504) File "/usr/lib/python3.12/contextlib.py", line 144, in exit (APIServer pid=269504) next(self.gen) (APIServer pid=269504) File "/app/ir-ops/vllm/vllm/v1/engine/utils.py", line 972, in launch_core_engines (APIServer pid=269504) wait_for_engine_startup( (APIServer pid=269504) File "/app/ir-ops/vllm/vllm/v1/engine/utils.py", line 1031, in wait_for_engine_startup (APIServer pid=269504) raise RuntimeError( (APIServer pid=269504) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}

</details>

PR fix notes

PR #37606: [ROCm][Bugfix] fix cache block size mismatch for aiter unified attention

Description (problem / solution / changelog)

This PR fixes the following issue (Resolves https://github.com/vllm-project/vllm/issues/37548):

  • We want to use a cache block size of 64 for AITER_UNIFIED_ATTENTION. The current logic works as intended when the env variable VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION=1 is set.
  • However, the cache block size still resolves to 16 when --attention-config '{"backend": "ROCM_AITER_UNIFIED_ATTN"}' is used.

This PR resolves the mismatch regardless of how AITER_UNIFIED_ATTN is selected (via env variable or as an argument), in line with the recent refactoring of check_and_update_config (https://github.com/vllm-project/vllm/pull/35122).

The remaining backends (ROCM_AITER_FA, ROCM_ATTN, TRITON_ATTN) continue to default to a cache block size of 16.
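
The intended behaviour described above can be sketched in a few lines of Python. This is only an illustration, not the code from the PR: the helper names `selected_backend` and `resolve_block_size` and the lookup table are hypothetical, and the only details taken from the description are the backend names, the environment variable, and the block sizes 64 and 16.

```python
from typing import Mapping, Optional

# Hypothetical table for illustration; values come from the PR description.
DEFAULT_BLOCK_SIZE = 16
PREFERRED_BLOCK_SIZE = {"ROCM_AITER_UNIFIED_ATTN": 64}


def selected_backend(
    env: Mapping[str, str],
    attention_config: Optional[Mapping[str, str]] = None,
) -> Optional[str]:
    """Return the backend name, whichever way it was requested (hypothetical helper)."""
    # Route 1: --attention-config '{"backend": "..."}'
    if attention_config and attention_config.get("backend"):
        return attention_config["backend"]
    # Route 2: environment variable
    if env.get("VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION") == "1":
        return "ROCM_AITER_UNIFIED_ATTN"
    return None


def resolve_block_size(backend: Optional[str]) -> int:
    """Pick the cache block size from the resolved backend, not from the selection route."""
    return PREFERRED_BLOCK_SIZE.get(backend, DEFAULT_BLOCK_SIZE)


via_env = selected_backend({"VLLM_ROCM_USE_AITER_UNIFIED_ATTENTION": "1"})
via_arg = selected_backend({}, {"backend": "ROCM_AITER_UNIFIED_ATTN"})
print(resolve_block_size(via_env), resolve_block_size(via_arg))
```

The point of the sketch is that the block size is keyed on the resolved backend name, so both selection routes end up with the same value.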

Changed files

  • vllm/platforms/rocm.py (modified, +0/-24)
  • vllm/v1/attention/backends/rocm_aiter_unified_attn.py (modified, +7/-0)

Code Example

Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0
Clang version                : 20.0.0git (https://github.com/RadeonOpenCompute/llvm-project roc-7.0.0 25314 f4087f6b428f0e6f575ebac8a8a724dab123d06e)
CMake version                : version 3.31.10
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.9.1+git8907517
Is debug build               : False
CUDA used to build PyTorch   : N/A
ROCM used to build PyTorch   : 7.0.51831-a3e329ad8

==============================
      Python Environment
==============================
Python version               : 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0] (64-bit runtime)
Python platform              : Linux-5.15.0-116-generic-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : Could not collect
CUDA_MODULE_LOADING set to   : 
GPU models and configuration :  (gfx942:sramecc+:xnack-)
Nvidia driver version        : Could not collect
cuDNN version                : Could not collect
HIP runtime version          : 7.0.51831
MIOpen runtime version       : 3.5.0
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        52 bits physical, 57 bits virtual
Byte Order:                           Little Endian
CPU(s):                               192
On-line CPU(s) list:                  0-191
Vendor ID:                            AuthenticAMD
Model name:                           AMD EPYC 9654 96-Core Processor
CPU family:                           25
Model:                                17
Thread(s) per core:                   1
Core(s) per socket:                   96
Socket(s):                            2
Stepping:                             1
Frequency boost:                      enabled
CPU max MHz:                          3707.8120
CPU min MHz:                          1500.0000
BogoMIPS:                             4793.23
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization:                       AMD-V
L1d cache:                            6 MiB (192 instances)
L1i cache:                            6 MiB (192 instances)
L2 cache:                             192 MiB (192 instances)
L3 cache:                             768 MiB (24 instances)
NUMA node(s):                         2
NUMA node0 CPU(s):                    0-95
NUMA node1 CPU(s):                    96-191
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Mitigation; safe RET
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

==============================
Versions of relevant libraries
==============================
[pip3] conch-triton-kernels==1.2.1
[pip3] numpy==2.1.3
[pip3] onnx==1.19.0
[pip3] onnx-ir==0.2.0
[pip3] onnxscript==0.6.2
[pip3] onnxslim==0.1.86
[pip3] pyzmq==27.1.0
[pip3] torch==2.9.1+git8907517
[pip3] torchaudio==2.9.0+eaa9e4e
[pip3] torchvision==0.24.1+d801a34
[pip3] transformers==4.57.6
[pip3] triton==3.4.0
[pip3] triton_kernels==1.0.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : 7.0.51831-a3e329ad8
vLLM Version                 : 0.1.dev14958+g8591b1ba6.d20260318 (git sha: 8591b1ba6, date: 20260318)
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
  ============================ ROCm System Management Interface ============================
================================ Weight between two GPUs =================================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            15           15           15           15           15           15           15           
GPU1   15           0            15           15           15           15           15           15           
GPU2   15           15           0            15           15           15           15           15           
GPU3   15           15           15           0            15           15           15           15           
GPU4   15           15           15           15           0            15           15           15           
GPU5   15           15           15           15           15           0            15           15           
GPU6   15           15           15           15           15           15           0            15           
GPU7   15           15           15           15           15           15           15           0            

================================= Hops between two GPUs ==================================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            1            1            1            1            1            1            1            
GPU1   1            0            1            1            1            1            1            1            
GPU2   1            1            0            1            1            1            1            1            
GPU3   1            1            1            0            1            1            1            1            
GPU4   1            1            1            1            0            1            1            1            
GPU5   1            1            1            1            1            0            1            1            
GPU6   1            1            1            1            1            1            0            1            
GPU7   1            1            1            1            1            1            1            0            

=============================== Link Type between two GPUs ===============================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         
GPU1   XGMI         0            XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         
GPU2   XGMI         XGMI         0            XGMI         XGMI         XGMI         XGMI         XGMI         
GPU3   XGMI         XGMI         XGMI         0            XGMI         XGMI         XGMI         XGMI         
GPU4   XGMI         XGMI         XGMI         XGMI         0            XGMI         XGMI         XGMI         
GPU5   XGMI         XGMI         XGMI         XGMI         XGMI         0            XGMI         XGMI         
GPU6   XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         0            XGMI         
GPU7   XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         0            

======================================= Numa Nodes =======================================
GPU[0]		: (Topology) Numa Node: 0
GPU[0]		: (Topology) Numa Affinity: 0
GPU[1]		: (Topology) Numa Node: 0
GPU[1]		: (Topology) Numa Affinity: 0
GPU[2]		: (Topology) Numa Node: 0
GPU[2]		: (Topology) Numa Affinity: 0
GPU[3]		: (Topology) Numa Node: 0
GPU[3]		: (Topology) Numa Affinity: 0
GPU[4]		: (Topology) Numa Node: 1
GPU[4]		: (Topology) Numa Affinity: 1
GPU[5]		: (Topology) Numa Node: 1
GPU[5]		: (Topology) Numa Affinity: 1
GPU[6]		: (Topology) Numa Node: 1
GPU[6]		: (Topology) Numa Affinity: 1
GPU[7]		: (Topology) Numa Node: 1
GPU[7]		: (Topology) Numa Affinity: 1
================================== End of ROCm SMI Log ===================================

==============================
     Environment Variables
==============================
PYTORCH_TUNABLEOP_TUNING=0
PYTORCH_TUNABLEOP_ENABLED=1
PYTORCH_ROCM_ARCH=gfx90a;gfx942;gfx950;gfx1100;gfx1101;gfx1200;gfx1201;gfx1150;gfx1151
LD_LIBRARY_PATH=/opt/rocm/lib:/usr/local/lib:
PYTORCH_TUNABLEOP_FILENAME=/app/afo_tune_device_%d_full.csv
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1

---

VLLM_ROCM_USE_AITER=1 \
vllm serve RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8 \
--trust-remote-code \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.9 \
--no-enable-prefix-caching \
--port 9095 \
--kv-cache-dtype fp8 \
--attention-config '{"backend": "ROCM_AITER_UNIFIED_ATTN"}' \
> logs/server.log 2>&1

---

/app/ir-ops/aiter/aiter/ops/triton/_triton_kernels/attention/unified_attention.py:54:0: error: Failures have been detected while processing an MLIR pass pipeline
/app/ir-ops/aiter/aiter/ops/triton/_triton_kernels/attention/unified_attention.py:54:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs`, please share the reproducer above with Triton project.

Capturing CUDA graphs (decode, FULL):   0%|          | 0/51 [00:00<?, ?it/s]
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] EngineCore failed to start.
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] Traceback (most recent call last):
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/v1/engine/core.py", line 1073, in run_engine_core
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/v1/engine/core.py", line 839, in __init__
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     super().__init__(
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/v1/engine/core.py", line 122, in __init__
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/v1/engine/core.py", line 278, in _initialize_kv_caches
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     self.model_executor.initialize_from_config(kv_cache_configs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/v1/executor/abstract.py", line 118, in initialize_from_config
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     compilation_times: list[float] = self.collective_rpc("compile_or_warm_up_model")
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/v1/executor/uniproc_executor.py", line 78, in collective_rpc
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/v1/worker/gpu_worker.py", line 608, in compile_or_warm_up_model
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     cuda_graph_memory_bytes = self.model_runner.capture_model()
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/v1/worker/gpu_model_runner.py", line 5744, in capture_model
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     self._capture_cudagraphs(
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/v1/worker/gpu_model_runner.py", line 5855, in _capture_cudagraphs
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     self._warmup_and_capture(
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/v1/worker/gpu_model_runner.py", line 5791, in _warmup_and_capture
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     self._dummy_run(
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/v1/worker/gpu_model_runner.py", line 5235, in _dummy_run
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     outputs = self.model(
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]               ^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/compilation/cuda_graph.py", line 251, in __call__
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return self.runnable(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return self._call_impl(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return forward_call(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/model_executor/models/llama.py", line 577, in forward
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     model_output = self.model(
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]                    ^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/compilation/decorators.py", line 503, in __call__
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return TorchCompileWithNoGuardsWrapper.__call__(self, *args, **kwargs)  # type: ignore[arg-type]
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/compilation/wrapper.py", line 187, in __call__
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return self._call_with_optional_nvtx_range(
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/compilation/wrapper.py", line 76, in _call_with_optional_nvtx_range
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return callable_fn(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/model_executor/models/llama.py", line 400, in forward
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     def forward(
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return fn(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/compilation/caching.py", line 206, in __call__
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return self.optimized_call(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 837, in call_wrapped
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return self._wrapped_call(self, *args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 413, in __call__
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     raise e
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 400, in __call__
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return self._call_impl(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return forward_call(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "<eval_with_key>.66", line 496, in forward
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     submod_1 = self.submod_1(getitem, s72, getitem_1, getitem_2, getitem_3);  getitem = getitem_1 = getitem_2 = submod_1 = None
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 837, in call_wrapped
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return self._wrapped_call(self, *args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 413, in __call__
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     raise e
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 400, in __call__
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return self._call_impl(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return forward_call(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "<eval_with_key>.2", line 6, in forward
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     unified_attention_with_output = torch.ops.vllm.unified_attention_with_output(query, key, value, output_3, 'model.layers.0.self_attn.attn', kv_cache_dummy_dep = unified_kv_cache_update);  query = key = value = output_3 = unified_kv_cache_update = unified_attention_with_output = None
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return self._op(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/model_executor/layers/attention/kv_transfer_utils.py", line 39, in wrapper
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/model_executor/layers/attention/attention.py", line 699, in unified_attention_with_output
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     self.impl.forward(
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/v1/attention/backends/rocm_aiter_unified_attn.py", line 220, in forward
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     self.unified_attention(
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/aiter/aiter/ops/triton/attention/unified_attention.py", line 185, in unified_attention
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     kernel_unified_attention_2d[
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 393, in <lambda>
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 599, in run
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     kernel = self._do_compile(key, signature, device, constexprs, options, attrs, warmup)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 782, in _do_compile
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     kernel = self.compile(src, target=target, options=options.__dict__)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 322, in compile
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     next_module = compile_ir(module, metadata)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/triton/backends/amd/compiler.py", line 449, in <lambda>
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     stages["llir"] = lambda src, metadata: self.make_llir(src, metadata, options)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/triton/backends/amd/compiler.py", line 324, in make_llir
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     pm.run(mod)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] RuntimeError: PassManager::run failed
(EngineCore pid=270048) Process EngineCore:
(EngineCore pid=270048) Traceback (most recent call last):
(EngineCore pid=270048)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=270048)     self.run()
(EngineCore pid=270048)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=270048)     self._target(*self._args, **self._kwargs)
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/v1/engine/core.py", line 1103, in run_engine_core
(EngineCore pid=270048)     raise e
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/v1/engine/core.py", line 1073, in run_engine_core
(EngineCore pid=270048)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=270048)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=270048)     return func(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/v1/engine/core.py", line 839, in __init__
(EngineCore pid=270048)     super().__init__(
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/v1/engine/core.py", line 122, in __init__
(EngineCore pid=270048)     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=270048)                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=270048)     return func(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/v1/engine/core.py", line 278, in _initialize_kv_caches
(EngineCore pid=270048)     self.model_executor.initialize_from_config(kv_cache_configs)
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/v1/executor/abstract.py", line 118, in initialize_from_config
(EngineCore pid=270048)     compilation_times: list[float] = self.collective_rpc("compile_or_warm_up_model")
(EngineCore pid=270048)                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/v1/executor/uniproc_executor.py", line 78, in collective_rpc
(EngineCore pid=270048)     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=270048)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore pid=270048)     return func(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=270048)     return func(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/v1/worker/gpu_worker.py", line 608, in compile_or_warm_up_model
(EngineCore pid=270048)     cuda_graph_memory_bytes = self.model_runner.capture_model()
(EngineCore pid=270048)                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=270048)     return func(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/v1/worker/gpu_model_runner.py", line 5744, in capture_model
(EngineCore pid=270048)     self._capture_cudagraphs(
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/v1/worker/gpu_model_runner.py", line 5855, in _capture_cudagraphs
(EngineCore pid=270048)     self._warmup_and_capture(
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/v1/worker/gpu_model_runner.py", line 5791, in _warmup_and_capture
(EngineCore pid=270048)     self._dummy_run(
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=270048)     return func(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/v1/worker/gpu_model_runner.py", line 5235, in _dummy_run
(EngineCore pid=270048)     outputs = self.model(
(EngineCore pid=270048)               ^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/compilation/cuda_graph.py", line 251, in __call__
(EngineCore pid=270048)     return self.runnable(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=270048)     return self._call_impl(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=270048)     return forward_call(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/model_executor/models/llama.py", line 577, in forward
(EngineCore pid=270048)     model_output = self.model(
(EngineCore pid=270048)                    ^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/compilation/decorators.py", line 503, in __call__
(EngineCore pid=270048)     return TorchCompileWithNoGuardsWrapper.__call__(self, *args, **kwargs)  # type: ignore[arg-type]
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/compilation/wrapper.py", line 187, in __call__
(EngineCore pid=270048)     return self._call_with_optional_nvtx_range(
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/compilation/wrapper.py", line 76, in _call_with_optional_nvtx_range
(EngineCore pid=270048)     return callable_fn(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/model_executor/models/llama.py", line 400, in forward
(EngineCore pid=270048)     def forward(
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
(EngineCore pid=270048)     return fn(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/compilation/caching.py", line 206, in __call__
(EngineCore pid=270048)     return self.optimized_call(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 837, in call_wrapped
(EngineCore pid=270048)     return self._wrapped_call(self, *args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 413, in __call__
(EngineCore pid=270048)     raise e
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 400, in __call__
(EngineCore pid=270048)     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=270048)     return self._call_impl(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=270048)     return forward_call(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "<eval_with_key>.66", line 496, in forward
(EngineCore pid=270048)     submod_1 = self.submod_1(getitem, s72, getitem_1, getitem_2, getitem_3);  getitem = getitem_1 = getitem_2 = submod_1 = None
(EngineCore pid=270048)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 837, in call_wrapped
(EngineCore pid=270048)     return self._wrapped_call(self, *args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 413, in __call__
(EngineCore pid=270048)     raise e
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 400, in __call__
(EngineCore pid=270048)     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=270048)     return self._call_impl(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=270048)     return forward_call(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "<eval_with_key>.2", line 6, in forward
(EngineCore pid=270048)     unified_attention_with_output = torch.ops.vllm.unified_attention_with_output(query, key, value, output_3, 'model.layers.0.self_attn.attn', kv_cache_dummy_dep = unified_kv_cache_update);  query = key = value = output_3 = unified_kv_cache_update = unified_attention_with_output = None
(EngineCore pid=270048)                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=270048)     return self._op(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/model_executor/layers/attention/kv_transfer_utils.py", line 39, in wrapper
(EngineCore pid=270048)     return func(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/model_executor/layers/attention/attention.py", line 699, in unified_attention_with_output
(EngineCore pid=270048)     self.impl.forward(
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/v1/attention/backends/rocm_aiter_unified_attn.py", line 220, in forward
(EngineCore pid=270048)     self.unified_attention(
(EngineCore pid=270048)   File "/app/ir-ops/aiter/aiter/ops/triton/attention/unified_attention.py", line 185, in unified_attention
(EngineCore pid=270048)     kernel_unified_attention_2d[
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 393, in <lambda>
(EngineCore pid=270048)     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(EngineCore pid=270048)                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 599, in run
(EngineCore pid=270048)     kernel = self._do_compile(key, signature, device, constexprs, options, attrs, warmup)
(EngineCore pid=270048)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 782, in _do_compile
(EngineCore pid=270048)     kernel = self.compile(src, target=target, options=options.__dict__)
(EngineCore pid=270048)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 322, in compile
(EngineCore pid=270048)     next_module = compile_ir(module, metadata)
(EngineCore pid=270048)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/triton/backends/amd/compiler.py", line 449, in <lambda>
(EngineCore pid=270048)     stages["llir"] = lambda src, metadata: self.make_llir(src, metadata, options)
(EngineCore pid=270048)                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/triton/backends/amd/compiler.py", line 324, in make_llir
(EngineCore pid=270048)     pm.run(mod)
(EngineCore pid=270048) RuntimeError: PassManager::run failed
[rank0]:[W319 10:11:40.372379106 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=269504) Traceback (most recent call last):
(APIServer pid=269504)   File "/usr/local/bin/vllm", line 33, in <module>
(APIServer pid=269504)     sys.exit(load_entry_point('vllm', 'console_scripts', 'vllm')())
(APIServer pid=269504)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504)   File "/app/ir-ops/vllm/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=269504)     args.dispatch_function(args)
(APIServer pid=269504)   File "/app/ir-ops/vllm/vllm/entrypoints/cli/serve.py", line 118, in cmd
(APIServer pid=269504)     uvloop.run(run_server(args))
(APIServer pid=269504)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=269504)     return __asyncio.run(
(APIServer pid=269504)            ^^^^^^^^^^^^^^
(APIServer pid=269504)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=269504)     return runner.run(main)
(APIServer pid=269504)            ^^^^^^^^^^^^^^^^
(APIServer pid=269504)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=269504)     return self._loop.run_until_complete(task)
(APIServer pid=269504)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=269504)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=269504)     return await main
(APIServer pid=269504)            ^^^^^^^^^^
(APIServer pid=269504)   File "/app/ir-ops/vllm/vllm/entrypoints/openai/api_server.py", line 675, in run_server
(APIServer pid=269504)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=269504)   File "/app/ir-ops/vllm/vllm/entrypoints/openai/api_server.py", line 689, in run_server_worker
(APIServer pid=269504)     async with build_async_engine_client(
(APIServer pid=269504)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=269504)     return await anext(self.gen)
(APIServer pid=269504)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504)   File "/app/ir-ops/vllm/vllm/entrypoints/openai/api_server.py", line 104, in build_async_engine_client
(APIServer pid=269504)     async with build_async_engine_client_from_engine_args(
(APIServer pid=269504)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=269504)     return await anext(self.gen)
(APIServer pid=269504)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504)   File "/app/ir-ops/vllm/vllm/entrypoints/openai/api_server.py", line 145, in build_async_engine_client_from_engine_args
(APIServer pid=269504)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=269504)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504)   File "/app/ir-ops/vllm/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=269504)     return cls(
(APIServer pid=269504)            ^^^^
(APIServer pid=269504)   File "/app/ir-ops/vllm/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=269504)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=269504)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504)   File "/app/ir-ops/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=269504)     return func(*args, **kwargs)
(APIServer pid=269504)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504)   File "/app/ir-ops/vllm/vllm/v1/engine/core_client.py", line 128, in make_async_mp_client
(APIServer pid=269504)     return AsyncMPClient(*client_args)
(APIServer pid=269504)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504)   File "/app/ir-ops/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=269504)     return func(*args, **kwargs)
(APIServer pid=269504)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504)   File "/app/ir-ops/vllm/vllm/v1/engine/core_client.py", line 924, in __init__
(APIServer pid=269504)     super().__init__(
(APIServer pid=269504)   File "/app/ir-ops/vllm/vllm/v1/engine/core_client.py", line 583, in __init__
(APIServer pid=269504)     with launch_core_engines(
(APIServer pid=269504)          ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=269504)     next(self.gen)
(APIServer pid=269504)   File "/app/ir-ops/vllm/vllm/v1/engine/utils.py", line 972, in launch_core_engines
(APIServer pid=269504)     wait_for_engine_startup(
(APIServer pid=269504)   File "/app/ir-ops/vllm/vllm/v1/engine/utils.py", line 1031, in wait_for_engine_startup
(APIServer pid=269504)     raise RuntimeError(
(APIServer pid=269504) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
</details>
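
The frame that matters in the trace above is the last one before `RuntimeError: PassManager::run failed`, i.e. `make_llir` in Triton's AMD backend. Everything after it is the same failure being re-raised by the engine and API server processes. As a rough sketch for pulling that frame out of a log this long, the snippet below strips the per-line process/logger prefix and returns the innermost frame of the first traceback. The helper names (`strip_prefix`, `innermost_frame`) are hypothetical and the prefix pattern is only inferred from the lines shown in this log, so treat it as a triage aid rather than a parser for vLLM logs in general.

```python
import re

# Prefix shape inferred from the log above (an assumption, not a vLLM guarantee), e.g.
# "(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] " or "(APIServer pid=269504) ".
PREFIX = re.compile(
    r"^\((?:EngineCore|APIServer) pid=\d+\)\s?"
    r"(?:(?:ERROR|WARNING|INFO|DEBUG) \d{2}-\d{2} \d{2}:\d{2}:\d{2} \[[^\]]+\]\s?)?"
)
# Standard CPython traceback frame line.
FRAME = re.compile(r'^\s*File "(?P<path>[^"]+)", line (?P<line>\d+), in (?P<func>.+)$')
# Final "SomeError: message" line of a traceback (no leading indentation).
ERROR = re.compile(r"^[A-Za-z_][\w.]*(?:Error|Exception): ")

# Three lines copied from the log above, used as a small self-check.
SAMPLE = "\n".join([
    '(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File '
    '"/usr/local/lib/python3.12/dist-packages/triton/backends/amd/compiler.py", '
    'line 324, in make_llir',
    "(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     pm.run(mod)",
    "(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] "
    "RuntimeError: PassManager::run failed",
])


def strip_prefix(line: str) -> str:
    """Remove the process/logger prefix, keeping the traceback indentation."""
    return PREFIX.sub("", line, count=1)


def innermost_frame(log_text: str):
    """Return (path, line, func, error) for the first traceback in the log, or None."""
    last = None
    for raw in log_text.splitlines():
        text = strip_prefix(raw)
        match = FRAME.match(text)
        if match:
            # Remember the most recent frame; the last one seen is the innermost.
            last = (match["path"], int(match["line"]), match["func"])
        elif last and ERROR.match(text):
            return (*last, text.strip())
    return None


print(innermost_frame(SAMPLE))
```

Run against the full log, this should point at the same `make_llir` frame, which matches the `ConvertTritonAMDGPUToLLVM` pass failure reported in the error message.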

Additionally, all `test_attention_quant_pattern` tests are failing for `AttentionBackendEnum.ROCM_AITER_UNIFIED_ATTN`.

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0
Clang version                : 20.0.0git (https://github.com/RadeonOpenCompute/llvm-project roc-7.0.0 25314 f4087f6b428f0e6f575ebac8a8a724dab123d06e)
CMake version                : version 3.31.10
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.9.1+git8907517
Is debug build               : False
CUDA used to build PyTorch   : N/A
ROCM used to build PyTorch   : 7.0.51831-a3e329ad8

==============================
      Python Environment
==============================
Python version               : 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0] (64-bit runtime)
Python platform              : Linux-5.15.0-116-generic-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : True
CUDA runtime version         : Could not collect
CUDA_MODULE_LOADING set to   : 
GPU models and configuration :  (gfx942:sramecc+:xnack-)
Nvidia driver version        : Could not collect
cuDNN version                : Could not collect
HIP runtime version          : 7.0.51831
MIOpen runtime version       : 3.5.0
Is XNNPACK available         : True

==============================
          CPU Info
==============================
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        52 bits physical, 57 bits virtual
Byte Order:                           Little Endian
CPU(s):                               192
On-line CPU(s) list:                  0-191
Vendor ID:                            AuthenticAMD
Model name:                           AMD EPYC 9654 96-Core Processor
CPU family:                           25
Model:                                17
Thread(s) per core:                   1
Core(s) per socket:                   96
Socket(s):                            2
Stepping:                             1
Frequency boost:                      enabled
CPU max MHz:                          3707.8120
CPU min MHz:                          1500.0000
BogoMIPS:                             4793.23
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization:                       AMD-V
L1d cache:                            6 MiB (192 instances)
L1i cache:                            6 MiB (192 instances)
L2 cache:                             192 MiB (192 instances)
L3 cache:                             768 MiB (24 instances)
NUMA node(s):                         2
NUMA node0 CPU(s):                    0-95
NUMA node1 CPU(s):                    96-191
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Mitigation; safe RET
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

==============================
Versions of relevant libraries
==============================
[pip3] conch-triton-kernels==1.2.1
[pip3] numpy==2.1.3
[pip3] onnx==1.19.0
[pip3] onnx-ir==0.2.0
[pip3] onnxscript==0.6.2
[pip3] onnxslim==0.1.86
[pip3] pyzmq==27.1.0
[pip3] torch==2.9.1+git8907517
[pip3] torchaudio==2.9.0+eaa9e4e
[pip3] torchvision==0.24.1+d801a34
[pip3] transformers==4.57.6
[pip3] triton==3.4.0
[pip3] triton_kernels==1.0.0
[conda] Could not collect

==============================
         vLLM Info
==============================
ROCM Version                 : 7.0.51831-a3e329ad8
vLLM Version                 : 0.1.dev14958+g8591b1ba6.d20260318 (git sha: 8591b1ba6, date: 20260318)
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled
GPU Topology:
  ============================ ROCm System Management Interface ============================
================================ Weight between two GPUs =================================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            15           15           15           15           15           15           15           
GPU1   15           0            15           15           15           15           15           15           
GPU2   15           15           0            15           15           15           15           15           
GPU3   15           15           15           0            15           15           15           15           
GPU4   15           15           15           15           0            15           15           15           
GPU5   15           15           15           15           15           0            15           15           
GPU6   15           15           15           15           15           15           0            15           
GPU7   15           15           15           15           15           15           15           0            

================================= Hops between two GPUs ==================================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            1            1            1            1            1            1            1            
GPU1   1            0            1            1            1            1            1            1            
GPU2   1            1            0            1            1            1            1            1            
GPU3   1            1            1            0            1            1            1            1            
GPU4   1            1            1            1            0            1            1            1            
GPU5   1            1            1            1            1            0            1            1            
GPU6   1            1            1            1            1            1            0            1            
GPU7   1            1            1            1            1            1            1            0            

=============================== Link Type between two GPUs ===============================
       GPU0         GPU1         GPU2         GPU3         GPU4         GPU5         GPU6         GPU7         
GPU0   0            XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         
GPU1   XGMI         0            XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         
GPU2   XGMI         XGMI         0            XGMI         XGMI         XGMI         XGMI         XGMI         
GPU3   XGMI         XGMI         XGMI         0            XGMI         XGMI         XGMI         XGMI         
GPU4   XGMI         XGMI         XGMI         XGMI         0            XGMI         XGMI         XGMI         
GPU5   XGMI         XGMI         XGMI         XGMI         XGMI         0            XGMI         XGMI         
GPU6   XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         0            XGMI         
GPU7   XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         XGMI         0            

======================================= Numa Nodes =======================================
GPU[0]		: (Topology) Numa Node: 0
GPU[0]		: (Topology) Numa Affinity: 0
GPU[1]		: (Topology) Numa Node: 0
GPU[1]		: (Topology) Numa Affinity: 0
GPU[2]		: (Topology) Numa Node: 0
GPU[2]		: (Topology) Numa Affinity: 0
GPU[3]		: (Topology) Numa Node: 0
GPU[3]		: (Topology) Numa Affinity: 0
GPU[4]		: (Topology) Numa Node: 1
GPU[4]		: (Topology) Numa Affinity: 1
GPU[5]		: (Topology) Numa Node: 1
GPU[5]		: (Topology) Numa Affinity: 1
GPU[6]		: (Topology) Numa Node: 1
GPU[6]		: (Topology) Numa Affinity: 1
GPU[7]		: (Topology) Numa Node: 1
GPU[7]		: (Topology) Numa Affinity: 1
================================== End of ROCm SMI Log ===================================

==============================
     Environment Variables
==============================
PYTORCH_TUNABLEOP_TUNING=0
PYTORCH_TUNABLEOP_ENABLED=1
PYTORCH_ROCM_ARCH=gfx90a;gfx942;gfx950;gfx1100;gfx1101;gfx1200;gfx1201;gfx1150;gfx1151
LD_LIBRARY_PATH=/opt/rocm/lib:/usr/local/lib:
PYTORCH_TUNABLEOP_FILENAME=/app/afo_tune_device_%d_full.csv
PYTORCH_NVML_BASED_CUDA_CHECK=1
TORCHINDUCTOR_COMPILE_THREADS=1
</details>

🐛 Describe the bug

When running the following serving command

VLLM_ROCM_USE_AITER=1 \
vllm serve RedHatAI/Meta-Llama-3.1-8B-Instruct-FP8 \
--trust-remote-code \
--tensor-parallel-size 1 \
--gpu-memory-utilization 0.9 \
--no-enable-prefix-caching \
--port 9095 \
--kv-cache-dtype fp8 \
--attention-config '{"backend": "ROCM_AITER_UNIFIED_ATTN"}' \
> logs/server.log 2>&1

vLLM fails at startup: Triton reports failures while processing an MLIR pass pipeline as it compiles the AITER unified attention kernel. The full log is shown below.
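Because the traceback is long and every line carries the same `(EngineCore pid=...)` prefix, a small triage script can help pull out the essentials. The following is a minimal sketch, not part of vLLM or Triton: the helper names (`strip_prefix`, `summarize`) and the regular expressions are assumptions based only on the log format shown in this issue.

```python
import re

# Prefix that appears on engine-core log lines in this issue, e.g.
# "(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] ".
# The level/timestamp part is optional because the second traceback omits it.
PREFIX = re.compile(
    r"^\(EngineCore pid=\d+\)\s?"
    r"(?:[A-Z]+ \d{2}-\d{2} \d{2}:\d{2}:\d{2} \[[^\]]+\]\s?)?"
)
# MLIR diagnostic that names the failing pass.
FAILED_PASS = re.compile(r"Pipeline failed while executing \[`?(\w+)`?")
# Standard Python traceback frame line.
FRAME = re.compile(r'^\s*File "([^"]+)", line (\d+), in (.+)$')
# Final exception line such as "RuntimeError: PassManager::run failed".
EXCEPTION = re.compile(r"^[A-Za-z_][\w.]*(?:Error|Exception): .*$")
# Heuristic (assumption): paths containing these are third-party or stdlib.
LIBRARY_HINTS = ("dist-packages", "site-packages", "/usr/lib/python")


def strip_prefix(line: str) -> str:
    """Remove the engine-core log prefix from one line, if present."""
    return PREFIX.sub("", line, count=1)


def summarize(log_text: str) -> dict:
    """Return the failing MLIR pass, the first exception line, and the
    last non-library frame seen before that exception."""
    failed_pass = None
    exception = None
    last_project_frame = None
    for raw in log_text.splitlines():
        line = strip_prefix(raw)
        if failed_pass is None:
            m = FAILED_PASS.search(line)
            if m:
                failed_pass = m.group(1)
        if exception is None:
            frame = FRAME.match(line)
            if frame and not any(h in frame.group(1) for h in LIBRARY_HINTS):
                last_project_frame = (
                    frame.group(1), int(frame.group(2)), frame.group(3)
                )
            if EXCEPTION.match(line):
                exception = line
    return {
        "failed_pass": failed_pass,
        "exception": exception,
        "last_project_frame": last_project_frame,
    }
```

With the serving command above, the log is redirected to `logs/server.log`, so one way to use the sketch would be `summarize(open("logs/server.log").read())`.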

<details>
/app/ir-ops/aiter/aiter/ops/triton/_triton_kernels/attention/unified_attention.py:54:0: error: Failures have been detected while processing an MLIR pass pipeline
/app/ir-ops/aiter/aiter/ops/triton/_triton_kernels/attention/unified_attention.py:54:0: note: Pipeline failed while executing [`ConvertTritonAMDGPUToLLVM` on 'builtin.module' operation]: reproducer generated at `std::errs`, please share the reproducer above with Triton project.

Capturing CUDA graphs (decode, FULL):   0%|          | 0/51 [00:00<?, ?it/s]
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] EngineCore failed to start.
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] Traceback (most recent call last):
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/v1/engine/core.py", line 1073, in run_engine_core
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/v1/engine/core.py", line 839, in __init__
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     super().__init__(
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/v1/engine/core.py", line 122, in __init__
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/v1/engine/core.py", line 278, in _initialize_kv_caches
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     self.model_executor.initialize_from_config(kv_cache_configs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/v1/executor/abstract.py", line 118, in initialize_from_config
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     compilation_times: list[float] = self.collective_rpc("compile_or_warm_up_model")
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/v1/executor/uniproc_executor.py", line 78, in collective_rpc
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/v1/worker/gpu_worker.py", line 608, in compile_or_warm_up_model
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     cuda_graph_memory_bytes = self.model_runner.capture_model()
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/v1/worker/gpu_model_runner.py", line 5744, in capture_model
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     self._capture_cudagraphs(
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/v1/worker/gpu_model_runner.py", line 5855, in _capture_cudagraphs
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     self._warmup_and_capture(
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/v1/worker/gpu_model_runner.py", line 5791, in _warmup_and_capture
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     self._dummy_run(
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/v1/worker/gpu_model_runner.py", line 5235, in _dummy_run
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     outputs = self.model(
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]               ^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/compilation/cuda_graph.py", line 251, in __call__
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return self.runnable(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return self._call_impl(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return forward_call(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/model_executor/models/llama.py", line 577, in forward
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     model_output = self.model(
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]                    ^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/compilation/decorators.py", line 503, in __call__
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return TorchCompileWithNoGuardsWrapper.__call__(self, *args, **kwargs)  # type: ignore[arg-type]
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/compilation/wrapper.py", line 187, in __call__
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return self._call_with_optional_nvtx_range(
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/compilation/wrapper.py", line 76, in _call_with_optional_nvtx_range
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return callable_fn(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/model_executor/models/llama.py", line 400, in forward
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     def forward(
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return fn(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/compilation/caching.py", line 206, in __call__
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return self.optimized_call(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 837, in call_wrapped
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return self._wrapped_call(self, *args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 413, in __call__
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     raise e
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 400, in __call__
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return self._call_impl(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return forward_call(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "<eval_with_key>.66", line 496, in forward
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     submod_1 = self.submod_1(getitem, s72, getitem_1, getitem_2, getitem_3);  getitem = getitem_1 = getitem_2 = submod_1 = None
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 837, in call_wrapped
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return self._wrapped_call(self, *args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 413, in __call__
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     raise e
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 400, in __call__
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return self._call_impl(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return forward_call(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "<eval_with_key>.2", line 6, in forward
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     unified_attention_with_output = torch.ops.vllm.unified_attention_with_output(query, key, value, output_3, 'model.layers.0.self_attn.attn', kv_cache_dummy_dep = unified_kv_cache_update);  query = key = value = output_3 = unified_kv_cache_update = unified_attention_with_output = None
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return self._op(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/model_executor/layers/attention/kv_transfer_utils.py", line 39, in wrapper
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return func(*args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/model_executor/layers/attention/attention.py", line 699, in unified_attention_with_output
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     self.impl.forward(
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/vllm/vllm/v1/attention/backends/rocm_aiter_unified_attn.py", line 220, in forward
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     self.unified_attention(
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/app/ir-ops/aiter/aiter/ops/triton/attention/unified_attention.py", line 185, in unified_attention
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     kernel_unified_attention_2d[
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 393, in <lambda>
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 599, in run
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     kernel = self._do_compile(key, signature, device, constexprs, options, attrs, warmup)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 782, in _do_compile
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     kernel = self.compile(src, target=target, options=options.__dict__)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 322, in compile
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     next_module = compile_ir(module, metadata)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/triton/backends/amd/compiler.py", line 449, in <lambda>
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     stages["llir"] = lambda src, metadata: self.make_llir(src, metadata, options)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]   File "/usr/local/lib/python3.12/dist-packages/triton/backends/amd/compiler.py", line 324, in make_llir
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099]     pm.run(mod)
(EngineCore pid=270048) ERROR 03-19 10:11:39 [core.py:1099] RuntimeError: PassManager::run failed
(EngineCore pid=270048) Process EngineCore:
(EngineCore pid=270048) Traceback (most recent call last):
(EngineCore pid=270048)   File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
(EngineCore pid=270048)     self.run()
(EngineCore pid=270048)   File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
(EngineCore pid=270048)     self._target(*self._args, **self._kwargs)
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/v1/engine/core.py", line 1103, in run_engine_core
(EngineCore pid=270048)     raise e
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/v1/engine/core.py", line 1073, in run_engine_core
(EngineCore pid=270048)     engine_core = EngineCoreProc(*args, engine_index=dp_rank, **kwargs)
(EngineCore pid=270048)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=270048)     return func(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/v1/engine/core.py", line 839, in __init__
(EngineCore pid=270048)     super().__init__(
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/v1/engine/core.py", line 122, in __init__
(EngineCore pid=270048)     kv_cache_config = self._initialize_kv_caches(vllm_config)
(EngineCore pid=270048)                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=270048)     return func(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/v1/engine/core.py", line 278, in _initialize_kv_caches
(EngineCore pid=270048)     self.model_executor.initialize_from_config(kv_cache_configs)
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/v1/executor/abstract.py", line 118, in initialize_from_config
(EngineCore pid=270048)     compilation_times: list[float] = self.collective_rpc("compile_or_warm_up_model")
(EngineCore pid=270048)                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/v1/executor/uniproc_executor.py", line 78, in collective_rpc
(EngineCore pid=270048)     result = run_method(self.driver_worker, method, args, kwargs)
(EngineCore pid=270048)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/v1/serial_utils.py", line 459, in run_method
(EngineCore pid=270048)     return func(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=270048)     return func(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/v1/worker/gpu_worker.py", line 608, in compile_or_warm_up_model
(EngineCore pid=270048)     cuda_graph_memory_bytes = self.model_runner.capture_model()
(EngineCore pid=270048)                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(EngineCore pid=270048)     return func(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/v1/worker/gpu_model_runner.py", line 5744, in capture_model
(EngineCore pid=270048)     self._capture_cudagraphs(
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/v1/worker/gpu_model_runner.py", line 5855, in _capture_cudagraphs
(EngineCore pid=270048)     self._warmup_and_capture(
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/v1/worker/gpu_model_runner.py", line 5791, in _warmup_and_capture
(EngineCore pid=270048)     self._dummy_run(
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
(EngineCore pid=270048)     return func(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/v1/worker/gpu_model_runner.py", line 5235, in _dummy_run
(EngineCore pid=270048)     outputs = self.model(
(EngineCore pid=270048)               ^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/compilation/cuda_graph.py", line 251, in __call__
(EngineCore pid=270048)     return self.runnable(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=270048)     return self._call_impl(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=270048)     return forward_call(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/model_executor/models/llama.py", line 577, in forward
(EngineCore pid=270048)     model_output = self.model(
(EngineCore pid=270048)                    ^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/compilation/decorators.py", line 503, in __call__
(EngineCore pid=270048)     return TorchCompileWithNoGuardsWrapper.__call__(self, *args, **kwargs)  # type: ignore[arg-type]
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/compilation/wrapper.py", line 187, in __call__
(EngineCore pid=270048)     return self._call_with_optional_nvtx_range(
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/compilation/wrapper.py", line 76, in _call_with_optional_nvtx_range
(EngineCore pid=270048)     return callable_fn(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/model_executor/models/llama.py", line 400, in forward
(EngineCore pid=270048)     def forward(
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/eval_frame.py", line 1044, in _fn
(EngineCore pid=270048)     return fn(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/compilation/caching.py", line 206, in __call__
(EngineCore pid=270048)     return self.optimized_call(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 837, in call_wrapped
(EngineCore pid=270048)     return self._wrapped_call(self, *args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 413, in __call__
(EngineCore pid=270048)     raise e
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 400, in __call__
(EngineCore pid=270048)     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=270048)     return self._call_impl(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=270048)     return forward_call(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "<eval_with_key>.66", line 496, in forward
(EngineCore pid=270048)     submod_1 = self.submod_1(getitem, s72, getitem_1, getitem_2, getitem_3);  getitem = getitem_1 = getitem_2 = submod_1 = None
(EngineCore pid=270048)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 837, in call_wrapped
(EngineCore pid=270048)     return self._wrapped_call(self, *args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 413, in __call__
(EngineCore pid=270048)     raise e
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/torch/fx/graph_module.py", line 400, in __call__
(EngineCore pid=270048)     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
(EngineCore pid=270048)     return self._call_impl(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1786, in _call_impl
(EngineCore pid=270048)     return forward_call(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "<eval_with_key>.2", line 6, in forward
(EngineCore pid=270048)     unified_attention_with_output = torch.ops.vllm.unified_attention_with_output(query, key, value, output_3, 'model.layers.0.self_attn.attn', kv_cache_dummy_dep = unified_kv_cache_update);  query = key = value = output_3 = unified_kv_cache_update = unified_attention_with_output = None
(EngineCore pid=270048)                                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1255, in __call__
(EngineCore pid=270048)     return self._op(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/model_executor/layers/attention/kv_transfer_utils.py", line 39, in wrapper
(EngineCore pid=270048)     return func(*args, **kwargs)
(EngineCore pid=270048)            ^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/model_executor/layers/attention/attention.py", line 699, in unified_attention_with_output
(EngineCore pid=270048)     self.impl.forward(
(EngineCore pid=270048)   File "/app/ir-ops/vllm/vllm/v1/attention/backends/rocm_aiter_unified_attn.py", line 220, in forward
(EngineCore pid=270048)     self.unified_attention(
(EngineCore pid=270048)   File "/app/ir-ops/aiter/aiter/ops/triton/attention/unified_attention.py", line 185, in unified_attention
(EngineCore pid=270048)     kernel_unified_attention_2d[
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 393, in <lambda>
(EngineCore pid=270048)     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(EngineCore pid=270048)                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 599, in run
(EngineCore pid=270048)     kernel = self._do_compile(key, signature, device, constexprs, options, attrs, warmup)
(EngineCore pid=270048)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 782, in _do_compile
(EngineCore pid=270048)     kernel = self.compile(src, target=target, options=options.__dict__)
(EngineCore pid=270048)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 322, in compile
(EngineCore pid=270048)     next_module = compile_ir(module, metadata)
(EngineCore pid=270048)                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/triton/backends/amd/compiler.py", line 449, in <lambda>
(EngineCore pid=270048)     stages["llir"] = lambda src, metadata: self.make_llir(src, metadata, options)
(EngineCore pid=270048)                                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore pid=270048)   File "/usr/local/lib/python3.12/dist-packages/triton/backends/amd/compiler.py", line 324, in make_llir
(EngineCore pid=270048)     pm.run(mod)
(EngineCore pid=270048) RuntimeError: PassManager::run failed
[rank0]:[W319 10:11:40.372379106 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
(APIServer pid=269504) Traceback (most recent call last):
(APIServer pid=269504)   File "/usr/local/bin/vllm", line 33, in <module>
(APIServer pid=269504)     sys.exit(load_entry_point('vllm', 'console_scripts', 'vllm')())
(APIServer pid=269504)              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504)   File "/app/ir-ops/vllm/vllm/entrypoints/cli/main.py", line 75, in main
(APIServer pid=269504)     args.dispatch_function(args)
(APIServer pid=269504)   File "/app/ir-ops/vllm/vllm/entrypoints/cli/serve.py", line 118, in cmd
(APIServer pid=269504)     uvloop.run(run_server(args))
(APIServer pid=269504)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 96, in run
(APIServer pid=269504)     return __asyncio.run(
(APIServer pid=269504)            ^^^^^^^^^^^^^^
(APIServer pid=269504)   File "/usr/lib/python3.12/asyncio/runners.py", line 195, in run
(APIServer pid=269504)     return runner.run(main)
(APIServer pid=269504)            ^^^^^^^^^^^^^^^^
(APIServer pid=269504)   File "/usr/lib/python3.12/asyncio/runners.py", line 118, in run
(APIServer pid=269504)     return self._loop.run_until_complete(task)
(APIServer pid=269504)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504)   File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
(APIServer pid=269504)   File "/usr/local/lib/python3.12/dist-packages/uvloop/__init__.py", line 48, in wrapper
(APIServer pid=269504)     return await main
(APIServer pid=269504)            ^^^^^^^^^^
(APIServer pid=269504)   File "/app/ir-ops/vllm/vllm/entrypoints/openai/api_server.py", line 675, in run_server
(APIServer pid=269504)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)
(APIServer pid=269504)   File "/app/ir-ops/vllm/vllm/entrypoints/openai/api_server.py", line 689, in run_server_worker
(APIServer pid=269504)     async with build_async_engine_client(
(APIServer pid=269504)                ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=269504)     return await anext(self.gen)
(APIServer pid=269504)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504)   File "/app/ir-ops/vllm/vllm/entrypoints/openai/api_server.py", line 104, in build_async_engine_client
(APIServer pid=269504)     async with build_async_engine_client_from_engine_args(
(APIServer pid=269504)                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504)   File "/usr/lib/python3.12/contextlib.py", line 210, in __aenter__
(APIServer pid=269504)     return await anext(self.gen)
(APIServer pid=269504)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504)   File "/app/ir-ops/vllm/vllm/entrypoints/openai/api_server.py", line 145, in build_async_engine_client_from_engine_args
(APIServer pid=269504)     async_llm = AsyncLLM.from_vllm_config(
(APIServer pid=269504)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504)   File "/app/ir-ops/vllm/vllm/v1/engine/async_llm.py", line 225, in from_vllm_config
(APIServer pid=269504)     return cls(
(APIServer pid=269504)            ^^^^
(APIServer pid=269504)   File "/app/ir-ops/vllm/vllm/v1/engine/async_llm.py", line 154, in __init__
(APIServer pid=269504)     self.engine_core = EngineCoreClient.make_async_mp_client(
(APIServer pid=269504)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504)   File "/app/ir-ops/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=269504)     return func(*args, **kwargs)
(APIServer pid=269504)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504)   File "/app/ir-ops/vllm/vllm/v1/engine/core_client.py", line 128, in make_async_mp_client
(APIServer pid=269504)     return AsyncMPClient(*client_args)
(APIServer pid=269504)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504)   File "/app/ir-ops/vllm/vllm/tracing/otel.py", line 178, in sync_wrapper
(APIServer pid=269504)     return func(*args, **kwargs)
(APIServer pid=269504)            ^^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504)   File "/app/ir-ops/vllm/vllm/v1/engine/core_client.py", line 924, in __init__
(APIServer pid=269504)     super().__init__(
(APIServer pid=269504)   File "/app/ir-ops/vllm/vllm/v1/engine/core_client.py", line 583, in __init__
(APIServer pid=269504)     with launch_core_engines(
(APIServer pid=269504)          ^^^^^^^^^^^^^^^^^^^^
(APIServer pid=269504)   File "/usr/lib/python3.12/contextlib.py", line 144, in __exit__
(APIServer pid=269504)     next(self.gen)
(APIServer pid=269504)   File "/app/ir-ops/vllm/vllm/v1/engine/utils.py", line 972, in launch_core_engines
(APIServer pid=269504)     wait_for_engine_startup(
(APIServer pid=269504)   File "/app/ir-ops/vllm/vllm/v1/engine/utils.py", line 1031, in wait_for_engine_startup
(APIServer pid=269504)     raise RuntimeError(
(APIServer pid=269504) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}
</details>

Additionally, all `test_attention_quant_pattern` tests are failing for `AttentionBackendEnum.ROCM_AITER_UNIFIED_ATTN`:
```log
=========================== short test summary info ============================
FAILED tests/compile/passes/test_fusion_attn.py::test_attention_quant_pattern[AttentionBackendEnum.ROCM_AITER_UNIFIED_ATTN-amd/Llama-3.1-8B-Instruct-FP8-KV-TestAttentionFp8StaticQuantPatternModel-+quant_fp8-dtype0-8-128-32-8]
FAILED tests/compile/passes/test_fusion_attn.py::test_attention_quant_pattern[AttentionBackendEnum.ROCM_AITER_UNIFIED_ATTN-amd/Llama-3.1-8B-Instruct-FP8-KV-TestAttentionFp8StaticQuantPatternModel-+quant_fp8-dtype0-8-128-40-8]
FAILED tests/compile/passes/test_fusion_attn.py::test_attention_quant_pattern[AttentionBackendEnum.ROCM_AITER_UNIFIED_ATTN-amd/Llama-3.1-8B-Instruct-FP8-KV-TestAttentionFp8StaticQuantPatternModel-+quant_fp8-dtype1-8-128-32-8]
FAILED tests/compile/passes/test_fusion_attn.py::test_attention_quant_pattern[AttentionBackendEnum.ROCM_AITER_UNIFIED_ATTN-amd/Llama-3.1-8B-Instruct-FP8-KV-TestAttentionFp8StaticQuantPatternModel-+quant_fp8-dtype1-8-128-40-8]
FAILED tests/compile/passes/test_fusion_attn.py::test_attention_quant_pattern[AttentionBackendEnum.ROCM_AITER_UNIFIED_ATTN-amd/Llama-3.1-8B-Instruct-FP8-KV-TestAttentionFp8StaticQuantPatternModel--quant_fp8-dtype0-8-128-32-8]
FAILED tests/compile/passes/test_fusion_attn.py::test_attention_quant_pattern[AttentionBackendEnum.ROCM_AITER_UNIFIED_ATTN-amd/Llama-3.1-8B-Instruct-FP8-KV-TestAttentionFp8StaticQuantPatternModel--quant_fp8-dtype0-8-128-40-8]
FAILED tests/compile/passes/test_fusion_attn.py::test_attention_quant_pattern[AttentionBackendEnum.ROCM_AITER_UNIFIED_ATTN-amd/Llama-3.1-8B-Instruct-FP8-KV-TestAttentionFp8StaticQuantPatternModel--quant_fp8-dtype1-8-128-32-8]
FAILED tests/compile/passes/test_fusion_attn.py::test_attention_quant_pattern[AttentionBackendEnum.ROCM_AITER_UNIFIED_ATTN-amd/Llama-3.1-8B-Instruct-FP8-KV-TestAttentionFp8StaticQuantPatternModel--quant_fp8-dtype1-8-128-40-8]
============= 8 failed, 16 passed, 7 warnings in 116.00s (0:01:56) =============
```

The issue arises from a recent PR (#36927) that attempted to quantize the query activations to fp8 in ROCm aiter unified attention. Reverting the changes in this PR solves the issue.
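The claim that the failures are confined to one backend can be checked mechanically from a pytest short summary like the one above. The following is a minimal sketch, assuming the summary text has been saved to a string; `failures_by_backend` and `FAILED_RE` are hypothetical names, not part of vLLM or pytest.

```python
import re
from collections import Counter

# Matches short-summary lines of the form
#   FAILED <file>::<test_name>[AttentionBackendEnum.<BACKEND>-...]
# The pattern is an assumption based on the test IDs shown in this issue.
FAILED_RE = re.compile(r"^FAILED\s+\S+::(\w+)\[AttentionBackendEnum\.([A-Z0-9_]+)-")


def failures_by_backend(summary: str) -> Counter:
    """Count FAILED entries per (test name, backend) pair."""
    counts = Counter()
    for line in summary.splitlines():
        match = FAILED_RE.match(line.strip())
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts
```

If every key in the returned counter names `ROCM_AITER_UNIFIED_ATTN`, the failures are specific to that backend rather than to the fusion pass in general.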

extent analysis

Fix Plan

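To fix the issue, revert the changes made in PR #36927, which attempted to quantize the query activations to fp8 in ROCm aiter unified attention. The issue author reports that reverting this PR resolves the failure.

Here are the steps to follow:

  • Revert the changes made in PR #36927.
  • Confirm that the query activations are no longer quantized to fp8 before the unified attention kernel is called. The PR diff is not shown here; the vLLM-side file in the traceback is vllm/v1/attention/backends/rocm_aiter_unified_attn.py, while unified_attention.py belongs to aiter.
  • Recompile the code after reverting the changes.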

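Example code changes (an illustrative sketch only, not the actual diff from PR #36927; `quantize_to_fp8` is a hypothetical placeholder for whatever call the PR introduced):

# Before (with quantization)
query = quantize_to_fp8(query)

# After (without quantization)
# no quantization call; `query` is passed to the attention kernel in its original dtype

In the file changed by the PR, remove the line that quantizes the query activations to fp8.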

Verification

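To verify that the fix worked, follow these steps:

  • Run the vllm serve command with the same arguments as before.
  • Check the logs to confirm that the `ConvertTritonAMDGPUToLLVM` pipeline failure and the `RuntimeError: PassManager::run failed` message no longer appear and that the engine core starts.
  • Run the test_attention_quant_pattern tests in tests/compile/passes/test_fusion_attn.py and confirm that the ROCM_AITER_UNIFIED_ATTN cases pass.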
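The verification steps above can be sketched as two small helpers. This is a minimal sketch under stated assumptions: the failure strings are copied from the log in this issue, the test path comes from the failing test IDs, and the names `find_failure_signatures` and `build_pytest_command` as well as the `-k` expression are hypothetical choices, not part of vLLM.

```python
import sys

# Failure signatures copied from the log in this issue.
FAILURE_SIGNATURES = (
    "ConvertTritonAMDGPUToLLVM",
    "PassManager::run failed",
    "Engine core initialization failed",
)


def find_failure_signatures(log_text: str) -> list[str]:
    """Return the known failure signatures that still appear in a server log."""
    return [sig for sig in FAILURE_SIGNATURES if sig in log_text]


def build_pytest_command(
    test_file: str = "tests/compile/passes/test_fusion_attn.py",
    keyword: str = "test_attention_quant_pattern and ROCM_AITER_UNIFIED_ATTN",
) -> list[str]:
    """Build a pytest invocation that selects only the affected test cases.

    The -k expression is an assumption: pytest matches it against test names,
    which include the parametrization ID containing the backend name.
    """
    return [sys.executable, "-m", "pytest", test_file, "-k", keyword, "-v"]
```

An empty list from `find_failure_signatures` on a fresh server log suggests the compilation error is gone. The list returned by `build_pytest_command` can be passed to `subprocess.run` from the vLLM repository root; a zero exit code means the selected tests passed.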

Extra Tips

  • Make sure to test the code thoroughly after reverting the changes to ensure that the fix did not introduce any new issues.
  • If the issue persists, capture the MLIR reproducer that Triton prints to stderr when the pass pipeline fails; the error message asks for it to be shared with the Triton project.
  • Consider adding more tests to cover the quantization of query activations to fp8 in ROCm aiter unified attention.
