vllm: [Bug]: Triton MLA decode kernel shape mismatch for Mistral-Small on ROCm when TP > 1

GitHub stats
vllm-project/vllm#40966, fetched 2026-04-28 06:26:10
Comments: 2
Participants: 2
Timeline: 14
Reactions: 0
Timeline (top): mentioned ×4, subscribed ×4, commented ×2, labeled ×2

Error Message

(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971] Traceback (most recent call last):
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 966, in worker_busy_loop
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     output = func(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]              ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 337, in execute_model
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return self.worker.execute_model(scheduler_output)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return func(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 811, in execute_model
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     output = self.model_runner.execute_model(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return func(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4026, in execute_model
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     model_output = self._model_forward(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                    ^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3507, in _model_forward
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return self.model(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return self.runnable(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return self._call_impl(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return forward_call(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/pixtral.py", line 431, in forward
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     hidden_states = self.language_model.model(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 480, in __call__
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return self.aot_compiled_fn(self, *args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/aot_compile.py", line 124, in __call__
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return self.fn(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 1244, in forward
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     def forward(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/caching.py", line 215, in __call__
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return self.optimized_call(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "<string>", line 14, in execution_fn
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "<eval_with_key>.116", line 5, in forward
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     unified_mla_attention_with_output = torch.ops.vllm.unified_mla_attention_with_output(q_1, kv_c_normed, key_rot_1, output_2, 'language_model.model.layers.0.self_attn.attn', kv_cache_dummy_dep = kv_cache_dummy_dep);  q_1 = kv_c_normed = key_rot_1 = output_2 = kv_cache_dummy_dep = unified_mla_attention_with_output = None
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1209, in __call__
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return self._op(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/kv_transfer_utils.py", line 40, in wrapper
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return func(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mla_attention.py", line 983, in unified_mla_attention_with_output
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     layer.forward_impl(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mla_attention.py", line 698, in forward_impl
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     attn_out, lse = self.impl.forward_mqa(mqa_q, kv_cache, attn_metadata, self)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/mla/triton_mla.py", line 196, in forward_mqa
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     decode_attention_fwd(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/triton_decode_attention.py", line 762, in decode_attention_fwd
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     decode_attention_fwd_grouped(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/triton_decode_attention.py", line 696, in decode_attention_fwd_grouped
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     _decode_grouped_att_m_fwd(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/triton_decode_attention.py", line 500, in _decode_grouped_att_m_fwd
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     _fwd_grouped_kernel_stage1[grid](
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 370, in <lambda>
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 720, in run
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     kernel = self._do_compile(key, signature, device, constexprs, options, attrs, warmup)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 849, in _do_compile
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     kernel = self.compile(src, target=target, options=options.__dict__)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 304, in compile
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     module = src.make_ir(target, options, codegen_fns, module_map, context)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 80, in make_ir
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971] triton.compiler.errors.CompilationError: at 152:12:
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                     v = (v.to(tl.float32) * vs).to(q.dtype)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]             else:
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                 # MLA uses a single c_kv.
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                 # loading the same c_kv to interpret it as v is not necessary.
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                 # transpose the existing c_kv (aka k) for the dot product.
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                 v = tl.trans(k)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]             n_e_max = tl.maximum(tl.max(qk, 1), e_max)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]             re_scale = tl.exp(e_max - n_e_max)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]             p = tl.exp(qk - n_e_max[:, None])
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]             acc *= re_scale[:, None]
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]             acc += tl.dot(p.to(v.dtype), v)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]             ^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971] ValueError('Cannot make_shape_compatible: incompatible dimensions at index 1: 256 and 512')
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]
(Worker_TP1 pid=4504) INFO 04-22 11:16:00 [multiproc_executor.py:881] WorkerProc shutting down.
(APIServer pid=3952) INFO: Waiting for application shutdown.
(APIServer pid=3952) INFO: Application shutdown complete.
(APIServer pid=3952) INFO: Finished server process [3952]
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
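
The log ends with `Cannot make_shape_compatible: incompatible dimensions at index 1: 256 and 512`, raised at `acc += tl.dot(p.to(v.dtype), v)` where `v = tl.trans(k)`. The pure-Python sketch below only illustrates the shape rule behind that message; it is not Triton code. The helper names, the tile sizes `BLOCK_H` and `BLOCK_N`, and the assignment of 256 to the accumulator and 512 to the transposed `k` block are all assumptions made for illustration. The log alone does not say which tensor carries which width.

```python
def transpose(shape):
    """Shape of a 2-D block after transposition."""
    rows, cols = shape
    return (cols, rows)


def dot_shape(a, b):
    """Shape of the 2-D matrix product a @ b; the inner dimensions must match."""
    (m, k_a), (k_b, n) = a, b
    if k_a != k_b:
        raise ValueError(f"inner dimensions differ: {k_a} and {k_b}")
    return (m, n)


def check_accumulate(acc, update):
    """Simplified rule for `acc += update`: each dimension must be equal or 1."""
    for index, (x, y) in enumerate(zip(acc, update)):
        if x != y and 1 not in (x, y):
            raise ValueError(f"incompatible dimensions at index {index}: {x} and {y}")


# Hypothetical tile sizes; only 256 and 512 come from the error message.
BLOCK_H, BLOCK_N = 16, 32
ACC_WIDTH, K_WIDTH = 256, 512  # assumption: accumulator is 256 wide, k block is 512 wide

acc = (BLOCK_H, ACC_WIDTH)   # running output accumulator
p = (BLOCK_H, BLOCK_N)       # softmax weights for one key tile
k = (K_WIDTH, BLOCK_N)       # latent block, laid out for the q @ k product
v = transpose(k)             # mirrors `v = tl.trans(k)` in the MLA branch

update = dot_shape(p, v)     # width of the product follows k, not acc
try:
    check_accumulate(acc, update)
    message = None
except ValueError as exc:
    message = str(exc)

print(update)
print(message)
```

Under these assumptions the product inherits its width from the transposed `k` block, so the accumulation can only type-check when the accumulator is allocated with the same width.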

Code Example

# Select the V1 engine.
export VLLM_USE_V1=1
# Serve the local Mistral-Small snapshot with tensor parallelism of 2
# and copy stdout/stderr into a log file.
vllm serve /app/model/models--mistralai--Mistral-Small-4-119B-2603/snapshots/8563dea9670952202c9b76635b3f444a2fb40973 \
   --tensor-parallel-size 2 \
   --max-model-len 32768 \
   --gpu-memory-utilization 0.90 \
   --port 8800 \
   --trust-remote-code \
   --enable-prefix-caching \
   --enable-chunked-prefill \
   --max-num-seqs 128 \
   --max-num-batched-tokens 8192 \
   --enable-auto-tool-choice \
   --tool-call-parser mistral \
   --reasoning-parser mistral \
   2>&1 | tee vllm_debug_Mistral.log

---

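Every line of the worker log in this report carries the same prefix (process name, pid, level, timestamp, source location), which makes the Python traceback hard to scan. A minimal sketch for stripping it follows, assuming the prefix layout seen in this excerpt; the regular expression and the helper name `strip_log_prefix` are illustrative and not part of vLLM.

```python
import re

# Assumed layout: "(<proc> pid=<n>) <LEVEL> MM-DD HH:MM:SS [<file>:<line>] "
LOG_PREFIX = re.compile(
    r"^\(\w+ pid=\d+\) [A-Z]+ \d{2}-\d{2} \d{2}:\d{2}:\d{2} \[[^\]]+\] ?"
)


def strip_log_prefix(line: str) -> str:
    """Drop the worker log prefix; lines that do not match pass through unchanged."""
    return LOG_PREFIX.sub("", line, count=1)


sample = [
    '(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   '
    'File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1209, in __call__',
    "(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     "
    "return self._op(*args, **kwargs)",
]

cleaned = [strip_log_prefix(line) for line in sample]
print("\n".join(cleaned))
```

Only one separator space after the closing bracket is consumed, so the traceback's own indentation survives.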
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return self._call_impl(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return forward_call(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/pixtral.py", line 431, in forward
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     hidden_states = self.language_model.model(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 480, in __call__
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return self.aot_compiled_fn(self, *args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/aot_compile.py", line 124, in __call__
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return self.fn(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 1244, in forward
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     def forward(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/caching.py", line 215, in __call__
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return self.optimized_call(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "<string>", line 14, in execution_fn
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "<eval_with_key>.116", line 5, in forward
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     unified_mla_attention_with_output = torch.ops.vllm.unified_mla_attention_with_output(q_1, kv_c_normed, key_rot_1, output_2, 'language_model.model.layers.0.self_attn.attn', kv_cache_dummy_dep = kv_cache_dummy_dep);  q_1 = kv_c_normed = key_rot_1 = output_2 = kv_cache_dummy_dep = unified_mla_attention_with_output = None
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1209, in __call__
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return self._op(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/kv_transfer_utils.py", line 40, in wrapper
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return func(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mla_attention.py", line 983, in unified_mla_attention_with_output
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     layer.forward_impl(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mla_attention.py", line 698, in forward_impl
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     attn_out, lse = self.impl.forward_mqa(mqa_q, kv_cache, attn_metadata, self)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/mla/triton_mla.py", line 196, in forward_mqa
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     decode_attention_fwd(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/triton_decode_attention.py", line 762, in decode_attention_fwd
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     decode_attention_fwd_grouped(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/triton_decode_attention.py", line 696, in decode_attention_fwd_grouped
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     _decode_grouped_att_m_fwd(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/triton_decode_attention.py", line 500, in _decode_grouped_att_m_fwd
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     _fwd_grouped_kernel_stage1[grid](
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 370, in <lambda>
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 720, in run
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     kernel = self._do_compile(key, signature, device, constexprs, options, attrs, warmup)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 849, in _do_compile
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     kernel = self.compile(src, target=target, options=options.__dict__)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 304, in compile
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     module = src.make_ir(target, options, codegen_fns, module_map, context)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 80, in make_ir
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971] triton.compiler.errors.CompilationError: at 152:12:
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                     v = (v.to(tl.float32) * vs).to(q.dtype)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]             else:
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                 # MLA uses a single c_kv.
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                 # loading the same c_kv to interpret it as v is not necessary.
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                 # transpose the existing c_kv (aka k) for the dot product.
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                 v = tl.trans(k)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971] 
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]             n_e_max = tl.maximum(tl.max(qk, 1), e_max)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]             re_scale = tl.exp(e_max - n_e_max)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]             p = tl.exp(qk - n_e_max[:, None])
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]             acc *= re_scale[:, None]
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]             acc += tl.dot(p.to(v.dtype), v)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]             ^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971] ValueError('Cannot make_shape_compatible: incompatible dimensions at index 1: 256 and 512')
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971] Traceback (most recent call last):
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 966, in worker_busy_loop
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     output = func(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]              ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 337, in execute_model
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return self.worker.execute_model(scheduler_output)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return func(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 811, in execute_model
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     output = self.model_runner.execute_model(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return func(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4026, in execute_model
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     model_output = self._model_forward(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                    ^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3507, in _model_forward
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return self.model(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return self.runnable(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return self._call_impl(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return forward_call(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/pixtral.py", line 431, in forward
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     hidden_states = self.language_model.model(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 480, in __call__
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return self.aot_compiled_fn(self, *args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/aot_compile.py", line 124, in __call__
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return self.fn(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 1244, in forward
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     def forward(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/caching.py", line 215, in __call__
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return self.optimized_call(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "<string>", line 14, in execution_fn
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "<eval_with_key>.116", line 5, in forward
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     unified_mla_attention_with_output = torch.ops.vllm.unified_mla_attention_with_output(q_1, kv_c_normed, key_rot_1, output_2, 'language_model.model.layers.0.self_attn.attn', kv_cache_dummy_dep = kv_cache_dummy_dep);  q_1 = kv_c_normed = key_rot_1 = output_2 = kv_cache_dummy_dep = unified_mla_attention_with_output = None
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1209, in __call__
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return self._op(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/kv_transfer_utils.py", line 40, in wrapper
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return func(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mla_attention.py", line 983, in unified_mla_attention_with_output
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     layer.forward_impl(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mla_attention.py", line 698, in forward_impl
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     attn_out, lse = self.impl.forward_mqa(mqa_q, kv_cache, attn_metadata, self)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/mla/triton_mla.py", line 196, in forward_mqa
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     decode_attention_fwd(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/triton_decode_attention.py", line 762, in decode_attention_fwd
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     decode_attention_fwd_grouped(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/triton_decode_attention.py", line 696, in decode_attention_fwd_grouped
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     _decode_grouped_att_m_fwd(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/triton_decode_attention.py", line 500, in _decode_grouped_att_m_fwd
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     _fwd_grouped_kernel_stage1[grid](
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 370, in <lambda>
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 720, in run
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     kernel = self._do_compile(key, signature, device, constexprs, options, attrs, warmup)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 849, in _do_compile
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     kernel = self.compile(src, target=target, options=options.__dict__)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 304, in compile
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     module = src.make_ir(target, options, codegen_fns, module_map, context)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 80, in make_ir
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971] triton.compiler.errors.CompilationError: at 152:12:
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                     v = (v.to(tl.float32) * vs).to(q.dtype)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]             else:
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                 # MLA uses a single c_kv.
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                 # loading the same c_kv to interpret it as v is not necessary.
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                 # transpose the existing c_kv (aka k) for the dot product.
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                 v = tl.trans(k)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971] 
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]             n_e_max = tl.maximum(tl.max(qk, 1), e_max)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]             re_scale = tl.exp(e_max - n_e_max)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]             p = tl.exp(qk - n_e_max[:, None])
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]             acc *= re_scale[:, None]
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]             acc += tl.dot(p.to(v.dtype), v)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]             ^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971] ValueError('Cannot make_shape_compatible: incompatible dimensions at index 1: 256 and 512')
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971] 
(Worker_TP1 pid=4504) INFO 04-22 11:16:00 [multiproc_executor.py:881] WorkerProc shutting down.
(APIServer pid=3952) INFO:     Waiting for application shutdown.
(APIServer pid=3952) INFO:     Application shutdown complete.
(APIServer pid=3952) INFO:     Finished server process [3952]
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '
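
The compile error above is raised at `acc += tl.dot(p.to(v.dtype), v)` and reports sizes 256 and 512 at index 1. Below is a minimal, library-free sketch of that kind of mismatch, tracking shapes as plain tuples. The helpers `dot_shape` and `check_accumulate`, the tile sizes `BLOCK_H` and `BLOCK_N`, and the guess that the accumulator is 256 wide while the transposed `k` tile is 512 wide are all assumptions for illustration; none of them are taken from the vLLM or Triton source.

```python
def dot_shape(a, b):
    """Shape of a 2-D matrix product a @ b; raise if the inner dims differ."""
    (m, k1), (k2, n) = a, b
    if k1 != k2:
        raise ValueError(f"inner dimensions differ: {k1} and {k2}")
    return (m, n)


def check_accumulate(acc, update):
    """Simplified stand-in for `acc += update`: every dim must be equal."""
    for i, (x, y) in enumerate(zip(acc, update)):
        if x != y:
            raise ValueError(f"incompatible dimensions at index {i}: {x} and {y}")
    return acc


# Hypothetical sizes: 256 and 512 come from the error text, the rest are guesses.
BLOCK_H, BLOCK_N = 16, 32
acc = (BLOCK_H, 256)   # assumed accumulator tile
p = (BLOCK_H, BLOCK_N)  # assumed softmax-weights tile
k = (512, BLOCK_N)      # assumed c_kv tile as used for the q-k product
v = (k[1], k[0])        # v = transpose(k), i.e. (BLOCK_N, 512)

try:
    check_accumulate(acc, dot_shape(p, v))
except ValueError as err:
    print(err)  # incompatible dimensions at index 1: 256 and 512
```

If this reading is right, the thing to check is why the accumulator width and the width of the tile reused as `v` disagree for this model and TP setting. The sketch itself does not establish that.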

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Your output of `python collect_env.py` here
</details>

🐛 Describe the bug

Reproduce command

export VLLM_USE_V1=1
vllm serve /app/model/models--mistralai--Mistral-Small-4-119B-2603/snapshots/8563dea9670952202c9b76635b3f444a2fb40973 \
   --tensor-parallel-size 2 \
   --max-model-len 32768 \
   --gpu-memory-utilization 0.90 \
   --port 8800 \
   --trust-remote-code \
   --enable-prefix-caching \
   --enable-chunked-prefill \
   --max-num-seqs 128 \
   --max-num-batched-tokens 8192 \
   --enable-auto-tool-choice \
   --tool-call-parser mistral \
   --reasoning-parser mistral \
2>&1 | tee vllm_debug_Mistral.log
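
Since the command above tees everything into `vllm_debug_Mistral.log`, the traceback is easier to read with the per-line worker prefix removed. Here is a small sketch for that. The prefix layout is inferred only from the log lines in this report, and `strip_prefix` and `worker_lines` are hypothetical helper names, not part of vLLM.

```python
import re
import sys

# Assumed layout, inferred from the lines in this report:
# "(<process> pid=<pid>) <LEVEL> <MM-DD> <HH:MM:SS> [<file>:<line>] <message>"
PREFIX = re.compile(
    r"^\((?P<proc>\S+) pid=(?P<pid>\d+)\) "
    r"(?P<level>[A-Z]+) \d{2}-\d{2} \d{2}:\d{2}:\d{2} "
    r"\[(?P<src>[^\]]+)\] ?(?P<msg>.*)$"
)


def strip_prefix(line):
    """Return the message part of a worker log line, or the line unchanged."""
    text = line.rstrip("\n")
    m = PREFIX.match(text)
    return m.group("msg") if m else text


def worker_lines(lines, proc="Worker_TP1"):
    """Yield de-prefixed messages that were emitted by one worker process."""
    for line in lines:
        m = PREFIX.match(line.rstrip("\n"))
        if m and m.group("proc") == proc:
            yield m.group("msg")


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python strip_log.py vllm_debug_Mistral.log
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        for msg in worker_lines(fh):
            print(msg)
```

Lines that do not match the assumed layout (for example the `APIServer ... INFO:` lines) are left untouched by `strip_prefix` and skipped by `worker_lines`.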

Specific error message

(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return self._call_impl(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return forward_call(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/pixtral.py", line 431, in forward
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     hidden_states = self.language_model.model(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 480, in __call__
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return self.aot_compiled_fn(self, *args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/aot_compile.py", line 124, in __call__
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return self.fn(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 1244, in forward
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     def forward(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/caching.py", line 215, in __call__
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return self.optimized_call(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "<string>", line 14, in execution_fn
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "<eval_with_key>.116", line 5, in forward
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     unified_mla_attention_with_output = torch.ops.vllm.unified_mla_attention_with_output(q_1, kv_c_normed, key_rot_1, output_2, 'language_model.model.layers.0.self_attn.attn', kv_cache_dummy_dep = kv_cache_dummy_dep);  q_1 = kv_c_normed = key_rot_1 = output_2 = kv_cache_dummy_dep = unified_mla_attention_with_output = None
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1209, in __call__
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return self._op(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/kv_transfer_utils.py", line 40, in wrapper
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return func(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mla_attention.py", line 983, in unified_mla_attention_with_output
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     layer.forward_impl(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mla_attention.py", line 698, in forward_impl
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     attn_out, lse = self.impl.forward_mqa(mqa_q, kv_cache, attn_metadata, self)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/mla/triton_mla.py", line 196, in forward_mqa
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     decode_attention_fwd(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/triton_decode_attention.py", line 762, in decode_attention_fwd
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     decode_attention_fwd_grouped(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/triton_decode_attention.py", line 696, in decode_attention_fwd_grouped
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     _decode_grouped_att_m_fwd(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/triton_decode_attention.py", line 500, in _decode_grouped_att_m_fwd
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     _fwd_grouped_kernel_stage1[grid](
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 370, in <lambda>
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 720, in run
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     kernel = self._do_compile(key, signature, device, constexprs, options, attrs, warmup)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 849, in _do_compile
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     kernel = self.compile(src, target=target, options=options.__dict__)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 304, in compile
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     module = src.make_ir(target, options, codegen_fns, module_map, context)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 80, in make_ir
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971] triton.compiler.errors.CompilationError: at 152:12:
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                     v = (v.to(tl.float32) * vs).to(q.dtype)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]             else:
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                 # MLA uses a single c_kv.
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                 # loading the same c_kv to interpret it as v is not necessary.
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                 # transpose the existing c_kv (aka k) for the dot product.
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                 v = tl.trans(k)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971] 
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]             n_e_max = tl.maximum(tl.max(qk, 1), e_max)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]             re_scale = tl.exp(e_max - n_e_max)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]             p = tl.exp(qk - n_e_max[:, None])
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]             acc *= re_scale[:, None]
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]             acc += tl.dot(p.to(v.dtype), v)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]             ^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971] ValueError('Cannot make_shape_compatible: incompatible dimensions at index 1: 256 and 512')
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971] Traceback (most recent call last):
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/executor/multiproc_executor.py", line 966, in worker_busy_loop
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     output = func(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]              ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/worker_base.py", line 337, in execute_model
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return self.worker.execute_model(scheduler_output)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return func(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_worker.py", line 811, in execute_model
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     output = self.model_runner.execute_model(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 124, in decorate_context
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return func(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 4026, in execute_model
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     model_output = self._model_forward(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                    ^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/worker/gpu_model_runner.py", line 3507, in _model_forward
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return self.model(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/cuda_graph.py", line 254, in __call__
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return self.runnable(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1776, in _wrapped_call_impl
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return self._call_impl(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/nn/modules/module.py", line 1787, in _call_impl
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return forward_call(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/pixtral.py", line 431, in forward
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     hidden_states = self.language_model.model(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/decorators.py", line 480, in __call__
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return self.aot_compiled_fn(self, *args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/_dynamo/aot_compile.py", line 124, in __call__
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return self.fn(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/deepseek_v2.py", line 1244, in forward
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     def forward(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/compilation/caching.py", line 215, in __call__
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return self.optimized_call(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "<string>", line 14, in execution_fn
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "<eval_with_key>.116", line 5, in forward
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     unified_mla_attention_with_output = torch.ops.vllm.unified_mla_attention_with_output(q_1, kv_c_normed, key_rot_1, output_2, 'language_model.model.layers.0.self_attn.attn', kv_cache_dummy_dep = kv_cache_dummy_dep);  q_1 = kv_c_normed = key_rot_1 = output_2 = kv_cache_dummy_dep = unified_mla_attention_with_output = None
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 1209, in __call__
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return self._op(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/kv_transfer_utils.py", line 40, in wrapper
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return func(*args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mla_attention.py", line 983, in unified_mla_attention_with_output
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     layer.forward_impl(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/model_executor/layers/attention/mla_attention.py", line 698, in forward_impl
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     attn_out, lse = self.impl.forward_mqa(mqa_q, kv_cache, attn_metadata, self)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/backends/mla/triton_mla.py", line 196, in forward_mqa
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     decode_attention_fwd(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/triton_decode_attention.py", line 762, in decode_attention_fwd
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     decode_attention_fwd_grouped(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/triton_decode_attention.py", line 696, in decode_attention_fwd_grouped
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     _decode_grouped_att_m_fwd(
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/attention/ops/triton_decode_attention.py", line 500, in _decode_grouped_att_m_fwd
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     _fwd_grouped_kernel_stage1[grid](
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 370, in <lambda>
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 720, in run
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     kernel = self._do_compile(key, signature, device, constexprs, options, attrs, warmup)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/triton/runtime/jit.py", line 849, in _do_compile
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     kernel = self.compile(src, target=target, options=options.__dict__)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 304, in compile
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     module = src.make_ir(target, options, codegen_fns, module_map, context)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]   File "/usr/local/lib/python3.12/dist-packages/triton/compiler/compiler.py", line 80, in make_ir
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]     return ast_to_ttir(self.fn, self, context=context, options=options, codegen_fns=codegen_fns,
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971] triton.compiler.errors.CompilationError: at 152:12:
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                     v = (v.to(tl.float32) * vs).to(q.dtype)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]             else:
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                 # MLA uses a single c_kv.
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                 # loading the same c_kv to interpret it as v is not necessary.
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                 # transpose the existing c_kv (aka k) for the dot product.
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]                 v = tl.trans(k)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971] 
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]             n_e_max = tl.maximum(tl.max(qk, 1), e_max)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]             re_scale = tl.exp(e_max - n_e_max)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]             p = tl.exp(qk - n_e_max[:, None])
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]             acc *= re_scale[:, None]
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]             acc += tl.dot(p.to(v.dtype), v)
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971]             ^
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971] ValueError('Cannot make_shape_compatible: incompatible dimensions at index 1: 256 and 512')
(Worker_TP1 pid=4504) ERROR 04-22 11:16:00 [multiproc_executor.py:971] 
(Worker_TP1 pid=4504) INFO 04-22 11:16:00 [multiproc_executor.py:881] WorkerProc shutting down.
(APIServer pid=3952) INFO:     Waiting for application shutdown.
(APIServer pid=3952) INFO:     Application shutdown complete.
(APIServer pid=3952) INFO:     Finished server process [3952]
/usr/lib/python3.12/multiprocessing/resource_tracker.py:279: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '


extent analysis

TL;DR

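Triton fails to compile the MLA grouped decode kernel (_fwd_grouped_kernel_stage1 in vllm/v1/attention/ops/triton_decode_attention.py). The statement acc += tl.dot(p.to(v.dtype), v) raises ValueError('Cannot make_shape_compatible: incompatible dimensions at index 1: 256 and 512'), which means the accumulator tile and the dot-product result disagree in their second dimension. Because the error is raised while the kernel is being compiled, it most likely reflects a mismatch between the kernel's block sizes and the model's MLA head dimensions, not a problem with the request data.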
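The shape arithmetic behind that message can be sketched in plain Python. This is only a model of the failing statement, not vLLM or Triton code: the helper names and tile sizes are hypothetical, and which operand is 256 wide and which is 512 wide is an assumption chosen to reproduce the numbers in the error.

```python
def dot_shape(a, b):
    """Shape of a 2-D matrix product a @ b; inner dimensions must agree."""
    if a[1] != b[0]:
        raise ValueError(f"inner dimensions differ: {a[1]} and {b[0]}")
    return (a[0], b[1])


def check_compatible(lhs, rhs):
    """Model of an element-wise shape check: dims must match or be 1."""
    for i, (left, right) in enumerate(zip(lhs, rhs)):
        if left != right and left != 1 and right != 1:
            raise ValueError(
                f"incompatible dimensions at index {i}: {left} and {right}"
            )
    return tuple(max(left, right) for left, right in zip(lhs, rhs))


# Hypothetical tile sizes, picked only to match the numbers in the log.
BLOCK_H, BLOCK_N = 16, 32
ACC_WIDTH, K_WIDTH = 256, 512  # assumption: accumulator is the narrower side

acc = (BLOCK_H, ACC_WIDTH)     # running output accumulator
p = (BLOCK_H, BLOCK_N)         # softmax weights for one key block
k = (K_WIDTH, BLOCK_N)         # key tile laid out for q @ k
v = (k[1], k[0])               # v = tl.trans(k): reuse k as the value tile

update = dot_shape(p, v)       # the dot itself is valid
try:
    check_compatible(acc, update)  # models `acc += ...`
    message = None
except ValueError as exc:
    message = str(exc)

print(update, message)
```

The dot product is well formed on its own; the failure appears only when its result is added to an accumulator of a different width. That is why the problem is tied to how the two tile widths are derived and not to batch size or sequence length.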

Guidance

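  • Confirm the failure site. The traceback runs from triton_mla.py (forward_mqa) through decode_attention_fwd and decode_attention_fwd_grouped to _fwd_grouped_kernel_stage1, and stops in the branch marked "MLA uses a single c_kv", where v = tl.trans(k) is fed into tl.dot.
  • Compare the model's MLA head dimensions in its configuration with the dimensions the Triton decode kernel assumes. The 256 and 512 in the error are tile widths fixed at kernel compile time.
  • Do not expect --max-model-len or --max-num-batched-tokens to change the outcome. The kernel fails to compile before any request data is processed, so reducing the batch size or sequence length is unlikely to help.
  • The issue title reports the failure for tensor parallelism greater than 1 on ROCm. Running once with --tensor-parallel-size 1 shows whether your setup follows the same pattern.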

Notes

The stack trace places the failure inside vLLM's Triton decode attention kernel, not in user-supplied input. It does not show how the 256 and 512 tile widths are derived for this model, so the exact root cause cannot be determined from the log alone. The model configuration and the launch command are needed to go further.

Recommendation

No confirmed workaround is available from the log alone. Check whether the failure disappears with tensor parallel size 1, record the model's MLA dimensions alongside the traceback, and follow vllm-project/vllm#40966 for a fix in the kernel or its launch parameters. Adjusting input shapes is not expected to resolve a compile-time tile mismatch.
