[Bug]: vllm 0.18 kimi k2.5 way worse than h200 single node [3 comments, 2 participants]

GitHub stats
vllm-project/vllm#38406 · Fetched 2026-04-08 01:41:34
Comments: 3 · Participants: 2 · Timeline events: 21 · Reactions: 0
Timeline (top): mentioned ×7, subscribed ×7, commented ×3, labeled ×2


Your current environment

vllm/vllm-openai-rocm:v0.18.0

🐛 Describe the bug

hi @powderluv @chunfangamd @andyluo7 @hongxiayang

Even with export VLLM_ROCM_USE_AITER=1 and the patch from https://github.com/vllm-project/vllm/issues/35641 that enabled AITER MLA for Kimi TP8, performance on MI325 is unfortunately still far slower than on H200. On MI355, 0.18 with AITER on is significantly better than 0.16, but on MI325 this is not the case.

Although I haven't run the disaggregated setting on Kimi yet, single-node aggregated performance is a reasonably good proxy for disaggregated performance.

<img width="1151" height="675" alt="Image" src="https://github.com/user-attachments/assets/be4045c5-e7b3-48b7-a125-453fb67bd966" />
set -x
export VLLM_ROCM_USE_AITER=1
vllm serve $MODEL --port $PORT \
--tensor-parallel-size=$TP \
--gpu-memory-utilization 0.95 \
--max-model-len $MAX_MODEL_LEN \
--block-size=64 \
--trust-remote-code \
--max-num-seqs 256 \
--mm-encoder-tp-mode data > $SERVER_LOG 2>&1 &

https://github.com/SemiAnalysisAI/InferenceX/blob/8f51204428b21d6639c1ef7fe4b04005e65f70a5/benchmarks/single_node/kimik2.5_int4_mi325x.sh

logs

https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23637554623/job/68911904882
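Before comparing numbers, a quick sanity check is to confirm that the server log mentions AITER at all. The helper below is a minimal sketch: `count_aiter_lines` is a hypothetical name, and the assumption that the log contains the string "aiter" when the backend is active has not been verified against vLLM 0.18 output, so adjust the pattern to whatever your log prints.

```shell
# Hypothetical helper: count log lines that mention AITER (case-insensitive).
# Assumption: the server log names the backend; the exact wording is not confirmed.
count_aiter_lines() {
    local log_file="$1"
    # grep -c prints the number of matching lines. It exits with status 1 when
    # that number is 0, so "|| true" keeps the helper usable under "set -e".
    grep -i -c 'aiter' "$log_file" || true
}
```

Typical use would be `count_aiter_lines "$SERVER_LOG"` after the server has finished starting; a result of 0 suggests the environment variable did not reach the server process.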

H200 logs & repro

https://github.com/SemiAnalysisAI/InferenceX/commit/749096c4c5ae4dfc37a878cd9081561483282fb8

https://github.com/SemiAnalysisAI/InferenceX/actions/runs/22600499132/job/65492428203


extent analysis

Fix Plan

The issue does not identify a root cause for the MI325 gap, so the plan below is a set of tuning experiments on the vllm serve command and environment variables rather than a confirmed fix.

Step-by-Step Solution

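  1. Confirm Environment Variables: VLLM_ROCM_USE_AITER=1 is what enables AITER MLA for Kimi TP8. The reported run already sets it, so on MI325 treat this as a prerequisite to confirm, not as the fix.
  2. Tune the vllm serve Command (experiments, not confirmed fixes):
    • Try a lower --gpu-memory-utilization (e.g., 0.8; the reported run uses 0.95) to check whether memory pressure is a factor. A lower value also leaves less room for KV cache, so it can reduce throughput.
    • Try different --block-size values (e.g., 32 or 128; the reported run uses 64) and keep whichever measures best on your model and hardware.
    • Try a higher --max-num-seqs (the reported run uses 256) to see whether throughput improves at high concurrency.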

Example code snippet:

export VLLM_ROCM_USE_AITER=1
vllm serve $MODEL --port $PORT \
--tensor-parallel-size=$TP \
--gpu-memory-utilization 0.8 \
--max-model-len $MAX_MODEL_LEN \
--block-size=32 \
--trust-remote-code \
--max-num-seqs 512 \
--mm-encoder-tp-mode data > $SERVER_LOG 2>&1 &
  3. Verify Hardware Configuration: Check that the MI325 node is set up as expected for the workload, for example that every GPU needed for TP8 is visible and that the driver and ROCm versions match the container.
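The tuning experiments above can be organized as a small sweep. The sketch below only prints the commands (a dry run) so each one can be reviewed before launching; `print_sweep_commands` is a hypothetical name, the candidate values are the ones mentioned in this plan plus the reported settings, and whether every block size is supported by the ROCm attention backend is not confirmed here.

```shell
# Dry-run sketch: print one "vllm serve" command per candidate setting.
# Assumption: MODEL, PORT, TP and MAX_MODEL_LEN are set as in the original script.
print_sweep_commands() {
    local block_size mem_util
    for block_size in 32 64 128; do
        for mem_util in 0.8 0.95; do
            # echo joins its arguments with single spaces, giving one command per line.
            echo "vllm serve $MODEL --port $PORT" \
                "--tensor-parallel-size=$TP" \
                "--gpu-memory-utilization $mem_util" \
                "--max-model-len $MAX_MODEL_LEN" \
                "--block-size=$block_size" \
                "--trust-remote-code" \
                "--max-num-seqs 256" \
                "--mm-encoder-tp-mode data"
        done
    done
}
```

Running one configuration at a time and benchmarking each with the same workload keeps the comparison fair; changing several flags at once makes it hard to tell which one mattered.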

Verification

To verify whether a change helped, rerun the same benchmark and compare throughput and latency against the previous results. On the MI325 (ROCm) node, rocm-smi or amd-smi report GPU utilization and memory usage; nvidia-smi applies only to the H200 baseline.
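When comparing two runs, a relative change is easier to read than raw numbers. The helper below is a minimal sketch: `pct_change` is a hypothetical name, and the inputs are whatever throughput or latency figures your benchmark reports.

```shell
# Hypothetical helper: percent change from a baseline value to a candidate value.
# $1 = baseline (e.g. tokens/s before the change), $2 = candidate (after the change)
pct_change() {
    awk -v base="$1" -v cand="$2" 'BEGIN {
        # A zero baseline has no meaningful relative change.
        if (base == 0) { print "n/a"; exit }
        printf "%.1f\n", (cand - base) / base * 100
    }'
}
```

For example, `pct_change 100 125` prints `25.0`. For throughput a positive result is an improvement; for latency a negative one is.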

Extra Tips

  • Regularly update your vllm version to ensure you have the latest optimizations and bug fixes.
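  • Experiment with the --mm-encoder-tp-mode values (weights or data in recent vLLM releases) to find the better setting for your use case.
  • Profile the server to locate bottlenecks: perf covers the host side, and ROCm profilers such as rocprof cover the MI325 GPUs. NVIDIA profilers such as Nsight Systems (nvprof is deprecated) apply only to the H200 baseline.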
