vllm - 💡(How to fix) Fix Gibberish with flashinfer_nvlink_two_sided on GB200/arm64 [1 participants]

GitHub stats
vllm-project/vllm#39722 (fetched 2026-04-15 06:20:46)
Comments: 0
Participants: 1
Timeline: 1 (renamed ×1)
Reactions: 0

Summary

On GB200/arm64, serving Qwen/Qwen3-30B-A3B-Instruct-2507 with expert parallelism enabled and --all2all-backend flashinfer_nvlink_two_sided produces gibberish output. This isn't specific to Qwen; I'm seeing the same with GLM and other models.

Environment

Hardware:

  • GB200
  • arm64
  • 4 GPUs in a single pod / node

Software versions:

  • vllm==0.19.0
  • torch==2.10.0+cu128
  • flashinfer-python==0.6.7.post3
  • flashinfer-cubin==0.6.7.post3
  • flashinfer-jit-cache==0.6.7.post3+cu128
  • deep-ep==1.1.0+814e508
  • deep-gemm==2.3.0+477618c
  • nixl==0.10.1
  • nixl-cu12==1.0.0

Repro command

export HF_HUB_CACHE=/scratch/model-cache
export HF_HUB_OFFLINE=1
export VLLM_ENGINE_READY_TIMEOUT_S=4200
export VLLM_HOST_IP=$POD_IP
export UCX_NET_DEVICES=mlx5_2:1
export UCX_IB_GID_INDEX=3
export NVSHMEM_CUMEM_HANDLE_TYPE=FABRIC
export NVSHMEM_HCA_PE_MAPPING=mlx5_2:1:4
export NVSHMEM_REMOTE_TRANSPORT=ibdevx
export NVSHMEM_IB_ENABLE_IBGDA=1
export VLLM_USE_FLASHINFER_MOE_FP16=1

vllm serve \
  /scratch/model-cache/models--Qwen--Qwen3-30B-A3B-Instruct-2507/snapshots/0d7cf23991f47feeb3a57ecb4c9cee8ea4a17bfe \
  --served-model-name Qwen/Qwen3-30B-A3B-Instruct-2507 \
  --host 0.0.0.0 \
  --port 8000 \
  --trust-remote-code \
  --gpu-memory-utilization 0.85 \
  --data-parallel-size 4 \
  --data-parallel-size-local 4 \
  --api-server-count 4 \
  --enable-expert-parallel \
  --all2all-backend flashinfer_nvlink_two_sided \
  --enable-prefix-caching \
  --max-model-len 32768

Observed behavior

The server starts, but responses are gibberish when using flashinfer_nvlink_two_sided for this setup.

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "messages": [
      {"role": "user", "content": "Explain mixture of experts briefly."}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'
{"id":"chatcmpl-ba00b9385bc3b62b","object":"chat.completion","created":1776100330,"model":"Qwen/Qwen3-30B-A3B-Instruct-2507","choices":[{"index":0,"message":{"role":"assistant","content":"1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":15,"total_tokens":271,"completion_tokens":256,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}

Expected behavior

Responses should be normal model outputs rather than gibberish.
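For a quick smoke test, the difference between normal and gibberish output of the kind shown above (one short unit such as "1." repeated until max_tokens) can be detected mechanically. The sketch below is only a heuristic, and looks_degenerate is a hypothetical helper name, not part of vLLM. It reads text on stdin and succeeds when a unit of 1 to 4 characters repeats at least 16 times in a row.

```shell
# Hypothetical helper: exit status 0 if stdin contains a short unit
# (1-4 characters) repeated 16 or more times back to back.
looks_degenerate() {
  # POSIX BRE: capture the unit, then require 15+ further copies of it.
  grep -q '\(.\{1,4\}\)\1\{15,\}'
}

# Usage sketch (assumes the server from the repro command is running):
#   curl -s http://localhost:8000/v1/chat/completions ... | looks_degenerate \
#     && echo "output looks degenerate"
```

The check also fires on long runs of spaces or dashes, so it is best applied to short chat answers rather than arbitrary text.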

Notes

This seems specific to the FlashInfer NVLink two-sided MoE path on this setup.

extent analysis

TL;DR

A likely workaround is to drop the --all2all-backend flashinfer_nvlink_two_sided option or switch to a different all2all backend. The issue had no comments when fetched, so this is not a confirmed fix.

Guidance

  • Investigate the compatibility of the flashinfer_nvlink_two_sided backend with the current hardware and software setup, as it seems to be the root cause of the issue.
  • Try disabling the --enable-expert-parallel option to see if it affects the output, as expert parallelism might be interacting with the all2all backend in an unexpected way.
  • Consider testing with a different all2all backend, such as the default or another available option, to isolate the issue.
  • Verify that the flashinfer packages (flashinfer-python, flashinfer-cubin, and flashinfer-jit-cache) are correctly installed and compatible with the current torch and vllm versions.
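The package check in the last bullet can be sketched as below. It assumes the three flashinfer packages are meant to share one base version, as they do in the report (0.6.7.post3, with a +cu128 local suffix on the JIT cache). Both same_base_version and pkg_version are hypothetical helper names introduced for illustration.

```shell
# Hypothetical helper: succeeds only if every version string passed in
# has the same base version once any "+local" suffix is stripped.
same_base_version() {
  first=""
  for v in "$@"; do
    if [ -z "$v" ]; then
      echo "missing version (package not installed?)" >&2
      return 1
    fi
    base="${v%%+*}"   # "0.6.7.post3+cu128" -> "0.6.7.post3"
    if [ -z "$first" ]; then
      first="$base"
    elif [ "$base" != "$first" ]; then
      echo "version mismatch: $first vs $base" >&2
      return 1
    fi
  done
  return 0
}

# Hypothetical helper: prints the installed version of one package via pip.
pkg_version() {
  python -m pip show "$1" 2>/dev/null | awk '/^Version:/ {print $2}'
}

# Usage sketch:
#   same_base_version "$(pkg_version flashinfer-python)" \
#                     "$(pkg_version flashinfer-cubin)" \
#                     "$(pkg_version flashinfer-jit-cache)"
```

Matching versions do not rule out a backend bug; the check only removes one variable before testing other backends.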

Example

The issue includes no fix snippet; the problem appears to lie in configuration and compatibility rather than in application code.
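The suggested workaround amounts to a change in the launch command. A minimal sketch follows, using only flags that appear in the repro command. build_serve_args is a hypothetical helper, and the assumption that vLLM falls back to a working default backend when --all2all-backend is omitted has not been verified in the issue.

```shell
# Hypothetical helper: prints one vllm serve argument per line.
# With no argument, --all2all-backend is left out so that vLLM
# chooses its own default (assumed, not confirmed, to avoid the bug).
build_serve_args() {
  backend="${1:-}"
  set -- \
    --served-model-name Qwen/Qwen3-30B-A3B-Instruct-2507 \
    --host 0.0.0.0 \
    --port 8000 \
    --trust-remote-code \
    --gpu-memory-utilization 0.85 \
    --data-parallel-size 4 \
    --data-parallel-size-local 4 \
    --api-server-count 4 \
    --enable-expert-parallel \
    --enable-prefix-caching \
    --max-model-len 32768
  if [ -n "$backend" ]; then
    set -- "$@" --all2all-backend "$backend"
  fi
  printf '%s\n' "$@"
}

# Usage sketch (MODEL_PATH is the snapshot path from the repro command):
#   vllm serve "$MODEL_PATH" $(build_serve_args)                              # default backend
#   vllm serve "$MODEL_PATH" $(build_serve_args flashinfer_nvlink_two_sided)  # failing setup
```

Keeping every other flag identical between the two launches makes the all2all backend the only variable in the comparison.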

Notes

The issue appears to be specific to the FlashInfer NVLink two-sided MoE path on the current setup, and resolving it may require further investigation into the compatibility of the all2all backend with the hardware and software configuration.

Recommendation

As a workaround, remove --all2all-backend flashinfer_nvlink_two_sided or try an alternative all2all backend. The report points to this backend as the failing path on GB200/arm64, though no fix has been confirmed.
