vllm - 💡(How to fix) Fix [Bug]: vllm 0.18 kimi k2.5 way worse than h200 single node [3 comments, 2 participants]

vllm • 2026-03-27 23:34:30

GitHub stats

vllm-project/vllm#38406 • Fetched 2026-04-08 01:41:34

Author

functionstackx

Participants

functionstackx

github-actions[bot]

Timeline (top)

mentioned ×7, subscribed ×7, commented ×3, labeled ×2

Error Message

https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23637554623/job/68911904882

H200 logs & repro

https://github.com/SemiAnalysisAI/InferenceX/commit/749096c4c5ae4dfc37a878cd9081561483282fb8

https://github.com/SemiAnalysisAI/InferenceX/actions/runs/22600499132/job/65492428203

Fix Action

Fix / Workaround

Even with export VLLM_ROCM_USE_AITER=1 and the patch from https://github.com/vllm-project/vllm/issues/35641 that enabled AITER MLA for Kimi TP8, performance on MI325 is unfortunately still far slower than on H200. On MI355, 0.18 with AITER on is significantly better than 0.16, but on MI325 this is not the case.

Code Example

set -x
export VLLM_ROCM_USE_AITER=1
vllm serve $MODEL --port $PORT \
--tensor-parallel-size=$TP \
--gpu-memory-utilization 0.95 \
--max-model-len $MAX_MODEL_LEN \
--block-size=64 \
--trust-remote-code \
--max-num-seqs 256 \
--mm-encoder-tp-mode data > $SERVER_LOG 2>&1 &
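
The launch command above expands several variables (MODEL, PORT, TP, MAX_MODEL_LEN, SERVER_LOG) that are defined elsewhere in the benchmark script. When reproducing it by hand, a small guard can catch an unset variable before the server starts with a malformed command line. The sketch below is an illustration only; the helper name require_vars is hypothetical and not part of vLLM or the InferenceX scripts.

```shell
#!/usr/bin/env bash
# Fail fast if any variable the launch command expands is unset or empty.
# require_vars is a hypothetical helper written for this example.
set -uo pipefail

require_vars() {
  local name missing=0
  for name in "$@"; do
    # ${!name-} is bash indirect expansion with an empty default,
    # so an unset target does not trip `set -u`.
    if [ -z "${!name-}" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage before launching the server:
#   require_vars MODEL PORT TP MAX_MODEL_LEN SERVER_LOG || exit 1
```

The function reports every missing name rather than stopping at the first, which makes a half-configured environment quicker to fix.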

Your current environment

vllm/vllm-openai-rocm:v0.18.0

🐛 Describe the bug

hi @powderluv @chunfangamd @andyluo7 @hongxiayang

Although I haven't run the disagg setting on Kimi yet, single-node aggregated performance is a reasonably good proxy for disagg performance.

set -x
export VLLM_ROCM_USE_AITER=1
vllm serve $MODEL --port $PORT \
--tensor-parallel-size=$TP \
--gpu-memory-utilization 0.95 \
--max-model-len $MAX_MODEL_LEN \
--block-size=64 \
--trust-remote-code \
--max-num-seqs 256 \
--mm-encoder-tp-mode data > $SERVER_LOG 2>&1 &

https://github.com/SemiAnalysisAI/InferenceX/blob/8f51204428b21d6639c1ef7fe4b04005e65f70a5/benchmarks/single_node/kimik2.5_int4_mi325x.sh

logs

https://github.com/SemiAnalysisAI/InferenceX/actions/runs/23637554623/job/68911904882

H200 logs & repro

https://github.com/SemiAnalysisAI/InferenceX/commit/749096c4c5ae4dfc37a878cd9081561483282fb8

https://github.com/SemiAnalysisAI/InferenceX/actions/runs/22600499132/job/65492428203

extent analysis

Fix Plan

The issue reports no confirmed fix for the MI325 gap. The steps below are tuning experiments on the vllm serve command and environment variables, not a verified remedy.

Step-by-Step Solution

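Update Environment Variables: Set VLLM_ROCM_USE_AITER=1 to enable AITER MLA for Kimi TP8. The issue's benchmark already sets it and the MI325 gap remains, so treat this as a prerequisite rather than the fix.
Optimize vllm serve Command:
- Try a lower --gpu-memory-utilization (e.g., 0.8) if you see out-of-memory errors or instability. A lower value leaves less memory for the KV cache, so it can reduce throughput at high concurrency.
- Experiment with different --block-size values (e.g., 32 or 128) to find the best setting for your model and hardware. Not every value is necessarily supported by every attention backend.
- Consider increasing --max-num-seqs to improve throughput, provided the KV cache can hold the larger batch.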

Example code snippet:

# Experimental settings from the plan above; revert any change that does not help.
export VLLM_ROCM_USE_AITER=1
vllm serve "$MODEL" --port "$PORT" \
--tensor-parallel-size="$TP" \
--gpu-memory-utilization 0.8 \
--max-model-len "$MAX_MODEL_LEN" \
--block-size=32 \
--trust-remote-code \
--max-num-seqs 512 \
--mm-encoder-tp-mode data > "$SERVER_LOG" 2>&1 &
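
Because the plan changes several knobs at once, it can help to list the candidate configurations first and benchmark them one at a time. The following is a minimal dry-run sketch: it only prints commands and launches nothing. The default values for MODEL, PORT, and MAX_MODEL_LEN are placeholders assumed for illustration, and the function name print_sweep is hypothetical. The swept values (block sizes 32, 64, 128 and memory utilization 0.80, 0.95) are the ones mentioned on this page, not tested recommendations.

```shell
#!/usr/bin/env bash
# Dry-run sweep: prints one candidate `vllm serve` command per setting
# instead of launching anything.
set -euo pipefail

# Placeholder defaults (assumptions for this example); override from the environment.
MODEL="${MODEL:-your-model-id}"
PORT="${PORT:-8000}"
TP="${TP:-8}"
MAX_MODEL_LEN="${MAX_MODEL_LEN:-8192}"

print_sweep() {
  local block_size mem_util
  for block_size in 32 64 128; do
    for mem_util in 0.80 0.95; do
      # Everything except the two swept flags matches the issue's launch command.
      printf 'VLLM_ROCM_USE_AITER=1 vllm serve %s --port %s --tensor-parallel-size=%s --gpu-memory-utilization %s --max-model-len %s --block-size=%s --trust-remote-code --max-num-seqs 256 --mm-encoder-tp-mode data\n' \
        "$MODEL" "$PORT" "$TP" "$mem_util" "$MAX_MODEL_LEN" "$block_size"
    done
  done
}

print_sweep
```

Changing one or two flags per run, and keeping --max-num-seqs fixed while doing so, makes it easier to attribute any difference in the benchmark numbers to a specific setting.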

Verify Hardware Configuration: Confirm that every MI325 GPU in the tensor-parallel group is visible and healthy (for example with rocm-smi) and that the host ROCm driver is compatible with the container image.

Verification

To verify whether a change helped, record the performance metrics (e.g., throughput, latency) and compare them with the previous results under the same workload. On the MI325 node, use rocm-smi or amd-smi to monitor GPU utilization and memory usage; nvidia-smi applies only to the H200 node.
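
A before/after comparison is easier to read as a relative change. The helper below is a small sketch that only does the arithmetic on two numbers you supply, for example output tokens per second from two benchmark runs. The function name pct_change is hypothetical, and the sample inputs are placeholders, not measurements from the issue.

```shell
#!/usr/bin/env bash
# Relative change between a baseline and a candidate throughput figure.
set -euo pipefail

pct_change() {
  local baseline="$1" candidate="$2"
  awk -v b="$baseline" -v c="$candidate" 'BEGIN {
    # A non-positive baseline would make the ratio meaningless.
    if (b <= 0) { print "baseline must be > 0" > "/dev/stderr"; exit 1 }
    printf "%+.1f%%\n", (c - b) / b * 100
  }'
}

# Placeholder numbers for illustration only.
pct_change 1000 1250   # prints +25.0%
```

For latency metrics a negative value is the improvement, so note which direction is "better" next to each number you record.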

Extra Tips

Regularly update your vllm version to ensure you have the latest optimizations and bug fixes.
Experiment with both --mm-encoder-tp-mode values (weights and data) to find the better setting for your use case.
Consider using tools like perf or rocprof to profile the server on AMD hardware and identify performance bottlenecks; nvprof works only on NVIDIA GPUs.
