vllm - 💡(How to fix) Fix Gibberish with flashinfer_nvlink_two_sided on GB200/arm64 [1 participants]

vllm • 2026-04-13 17:14:40

vllm-project/vllm#39722•Fetched 2026-04-15 06:20:46

Author

S1ro1

On GB200/arm64, Qwen/Qwen3-30B-A3B-Instruct-2507 with expert parallel enabled and --all2all-backend flashinfer_nvlink_two_sided produces gibberish outputs. This is not specific to Qwen: the same behavior shows up with GLM and other models.

Error Message

The server starts, but responses are gibberish when using flashinfer_nvlink_two_sided for this setup.

Summary

Environment

Hardware:

GB200
arm64
4 GPUs in a single pod / node

Software versions:

vllm==0.19.0
torch==2.10.0+cu128
flashinfer-python==0.6.7.post3
flashinfer-cubin==0.6.7.post3
flashinfer-jit-cache==0.6.7.post3+cu128
deep-ep==1.1.0+814e508
deep-gemm==2.3.0+477618c
nixl==0.10.1
nixl-cu12==1.0.0
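
Since three separate flashinfer packages are listed above, one quick sanity check is whether they all share the same base version. The following is a minimal sketch, not something from the report: check_flashinfer_versions is a hypothetical helper that reads `pip freeze`-style lines on stdin and ignores local suffixes such as +cu128. Matching versions do not prove the packages are compatible with a given torch or vllm build.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of vLLM or FlashInfer): reads "name==version"
# lines on stdin and reports whether all flashinfer-* packages share one
# base version. Local version suffixes such as "+cu128" are ignored.
check_flashinfer_versions() {
  local versions
  # Keep flashinfer-* lines, strip the package name, strip any "+local" suffix.
  versions=$(grep -E '^flashinfer[-_][A-Za-z_-]+==' \
    | sed -E 's/^[^=]+==//; s/\+.*$//' \
    | sort -u || true)
  if [ -z "$versions" ]; then
    echo "no flashinfer packages found"
    return 2
  fi
  if [ "$(printf '%s\n' "$versions" | wc -l | tr -d ' ')" -eq 1 ]; then
    echo "consistent: $versions"
    return 0
  fi
  echo "mismatch: $(printf '%s' "$versions" | tr '\n' ' ')"
  return 1
}
```

On a live environment this would be fed from pip, for example `pip freeze | check_flashinfer_versions`.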

Repro command

export HF_HUB_CACHE=/scratch/model-cache
export HF_HUB_OFFLINE=1
export VLLM_ENGINE_READY_TIMEOUT_S=4200
export VLLM_HOST_IP=$POD_IP
export UCX_NET_DEVICES=mlx5_2:1
export UCX_IB_GID_INDEX=3
export NVSHMEM_CUMEM_HANDLE_TYPE=FABRIC
export NVSHMEM_HCA_PE_MAPPING=mlx5_2:1:4
export NVSHMEM_REMOTE_TRANSPORT=ibdevx
export NVSHMEM_IB_ENABLE_IBGDA=1
export VLLM_USE_FLASHINFER_MOE_FP16=1

vllm serve \
  /scratch/model-cache/models--Qwen--Qwen3-30B-A3B-Instruct-2507/snapshots/0d7cf23991f47feeb3a57ecb4c9cee8ea4a17bfe \
  --served-model-name Qwen/Qwen3-30B-A3B-Instruct-2507 \
  --host 0.0.0.0 \
  --port 8000 \
  --trust-remote-code \
  --gpu-memory-utilization 0.85 \
  --data-parallel-size 4 \
  --data-parallel-size-local 4 \
  --api-server-count 4 \
  --enable-expert-parallel \
  --all2all-backend flashinfer_nvlink_two_sided \
  --enable-prefix-caching \
  --max-model-len 32768

Observed behavior

The server starts, but responses are gibberish when using flashinfer_nvlink_two_sided for this setup.

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-30B-A3B-Instruct-2507",
    "messages": [
      {"role": "user", "content": "Explain mixture of experts briefly."}
    ],
    "max_tokens": 256,
    "temperature": 0.7
  }'
{"id":"chatcmpl-ba00b9385bc3b62b","object":"chat.completion","created":1776100330,"model":"Qwen/Qwen3-30B-A3B-Instruct-2507","choices":[{"index":0,"message":{"role":"assistant","content":"1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.1.","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":15,"total_tokens":271,"completion_tokens":256,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}⏎

Expected behavior

Responses should be normal model outputs rather than gibberish.
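
When comparing configurations it can help to flag this kind of output automatically instead of reading every response. The sketch below is only a heuristic under stated assumptions: is_degenerate is a hypothetical helper, and the thresholds (a unit of 1 to 4 characters repeated at least 16 times in a row) are arbitrary choices modeled on the "1.1.1.1..." response in this report.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds (exit 0) when the text contains a short unit
# of 1-4 characters repeated at least 16 times back to back, as in the
# "1.1.1.1..." response shown in this report. Thresholds are assumptions.
is_degenerate() {
  # POSIX basic regex: capture a 1-4 character unit, then require at least
  # 15 further consecutive copies of exactly that unit.
  printf '%s' "$1" | grep -q '\(.\{1,4\}\)\1\{15,\}'
}

# Example wiring (assumes jq is installed and the server from the repro is up):
#   content=$(curl -s http://localhost:8000/v1/chat/completions \
#     -H "Content-Type: application/json" \
#     -d @request.json | jq -r '.choices[0].message.content')
#   if is_degenerate "$content"; then echo "degenerate output"; fi
```

The check can produce false positives on legitimate long runs (for example a line of dashes), so it is better suited to a quick A/B comparison between backends than to monitoring.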

Notes

This seems specific to the FlashInfer NVLink two-sided MoE path on this setup.

Extended analysis

TL;DR

As a workaround, remove --all2all-backend flashinfer_nvlink_two_sided or try a different all2all backend; the report itself does not confirm a root cause or a fix.

Guidance

Check whether the flashinfer_nvlink_two_sided backend is supported on this hardware and software stack; the report points to it as the likely trigger, but no root cause is confirmed.
Disable --enable-expert-parallel and compare the output, in case expert parallelism interacts with the all2all backend in an unexpected way.
Test a different all2all backend, such as the default or another available option, to isolate the issue.
Verify that the flashinfer packages (flashinfer-python, flashinfer-cubin, and flashinfer-jit-cache) are correctly installed and compatible with the installed torch and vllm versions.
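
The backend comparison suggested above can be sketched as a small dry-run script that prints one candidate command per backend without starting a server. This is an illustration, not a confirmed fix: build_serve_cmd is a hypothetical helper, /path/to/model is a placeholder, the command is abbreviated from the repro, and allgather_reducescatter is assumed to be an accepted --all2all-backend value in this vLLM version (check vllm serve --help for the values your build accepts).

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints (does not run) an abbreviated version of the
# serve command from the report with the all2all backend swapped out.
#   $1 = backend name, or "" to omit the flag and use vLLM's default
#   $2 = model path (placeholder by default)
build_serve_cmd() {
  local backend="$1"
  local model_path="${2:-/path/to/model}"
  local cmd="vllm serve $model_path --data-parallel-size 4 --data-parallel-size-local 4 --enable-expert-parallel --max-model-len 32768"
  if [ -n "$backend" ]; then
    cmd="$cmd --all2all-backend $backend"
  fi
  printf '%s\n' "$cmd"
}

# One line per configuration to try. "allgather_reducescatter" is an assumed
# backend name; "flashinfer_nvlink_two_sided" is the failing one from the report.
for backend in "" allgather_reducescatter flashinfer_nvlink_two_sided; do
  build_serve_cmd "$backend"
done
```

Running each printed command in turn with the same prompt and comparing the responses would show whether the gibberish follows the backend.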

Example

The report includes no code-level fix; the issue appears to concern configuration and compatibility rather than application code.

Notes

The issue appears to be specific to the FlashInfer NVLink two-sided MoE path on the current setup, and resolving it may require further investigation into the compatibility of the all2all backend with the hardware and software configuration.

Recommendation

As a workaround, drop --all2all-backend flashinfer_nvlink_two_sided or switch to another all2all backend; this setup appears to be incompatible with that backend.
