vllm - 💡(How to fix) Fix [Performance]: Qwen3-VL-235B-A22B-Instruct NVFP4 Performance dropped at upstream main

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

Error Message

  1. flashinfer_autotune crashes on GB300 with CUDA error: an illegal memory access during kernel_warmup. Stack trace points to flashinfer.fused_moe.coreflashinfer.autotuner._prepare_input_tensorstorch.rand on the routing-logits tensor. Workaround: --kernel-config '{"enable_flashinfer_autotune": false}'.

Fix Action

Fix / Workaround

Plus three smaller papercuts that are worth filing alongside (see "Workarounds needed").

Workarounds needed to make upstream nightly serve at all

  1. flashinfer_autotune crashes on GB300 with CUDA error: an illegal memory access during kernel_warmup. Stack trace points to flashinfer.fused_moe.coreflashinfer.autotuner._prepare_input_tensorstorch.rand on the routing-logits tensor. Workaround: --kernel-config '{"enable_flashinfer_autotune": false}'.

Code Example

# Container: vllm/vllm-openai:nightly  (multi-arch; pulls cu130+torch2.11 on aarch64)

# 1) one-time fp8 ViT scale calibration
vllm serve nvidia/Qwen3-VL-235B-A22B-Instruct-NVFP4-MLPerf-Inference-Closed-V6.0 \
  --tensor-parallel-size=4 --async-scheduling \
  --max-model-len=32768 --max-num-seqs=1024 --max-num-batched-tokens=13824 \
  --mm-encoder-attn-backend=FLASHINFER \
  --mm-encoder-attn-dtype=fp8 \
  --mm-encoder-fp8-scale-save-path=/tmp/q3vl_fp8_scales.json \
  --no-enable-prefix-caching \
  --kernel-config='{"enable_flashinfer_autotune": false}' \
  --compilation-config='{"cudagraph_mm_encoder":"true","encoder_cudagraph_token_budgets":[1024,2048,3072,4096,5120,6144,7168,8192],"encoder_cudagraph_max_vision_items_per_batch":16,"encoder_cudagraph_max_frames_per_batch":16,"max_cudagraph_capture_size":13824}' \
  &
# fire >=16 multimodal requests so the amax history wraps and scales are dumped to /tmp/q3vl_fp8_scales.json,
# then kill the server.

# 2) actual run with static scales
vllm serve nvidia/Qwen3-VL-235B-A22B-Instruct-NVFP4-MLPerf-Inference-Closed-V6.0 \
  --tensor-parallel-size=4 --async-scheduling \
  --max-model-len=32768 --max-num-seqs=1024 --max-num-batched-tokens=13824 \
  --mm-encoder-attn-backend=FLASHINFER \
  --mm-encoder-attn-dtype=fp8 \
  --mm-encoder-fp8-scale-path=/tmp/q3vl_fp8_scales.json \
  --no-enable-prefix-caching \
  --kernel-config='{"enable_flashinfer_autotune": false}' \
  --compilation-config='{...same as above...}'

---

The output of `python collect_env.py`
RAW_BUFFERClick to expand / collapse

Proposal to improve performance

Performance regression on Qwen3-VL-235B-A22B-Instruct (NVFP4, GB300 / CUDA 13) — TTFT P99 +55-67%, plus several rough edges in current main

TL;DR

Running Qwen3-VL-235B-A22B-Instruct (NVFP4 via compressed-tensors) on GB300 with TP=4 against the latest upstream nightly (vllm/vllm-openai:nightly, vllm 0.20.2rc1.dev93+g51f22dcfd, May 7 2026, cu130, torch 2.11.0+cu130) shows a significant prefill-path regression vs a fork branched from main near 1cbbcfe8a334bab004c43a60be201c8ab528e0d2 (Mar 23 2026):

  • TTFT P99: +55% (fp8 ViT static scale) to +67% (fp8 ViT OFF)
  • LAT P99: +28% to +36%
  • TPOT P99: ~0% (slightly faster, −7%) — decode itself is fine in upstream

Plus three smaller papercuts that are worth filing alongside (see "Workarounds needed").

Setup

  • Hardware: GB300 (compute capability 10.0), 4 GPUs, NVLink, NVIDIA driver 580.159.04
  • Model: nvidia/Qwen3-VL-235B-A22B-Instruct-NVFP4-MLPerf-Inference-Closed-V6.0 (compressed-tensors NVFP4)
  • Workload: online benchmark, target QPS=5 (Poisson), 1000 samples, num_workers=5, mixed multimodal Shopify product-catalogue dataset
  • Engine config: TP=4, max_model_len=32768, max_num_seqs=1024
  • Each cell: 1 cold + 3 warm runs against the same warmed server; mean is over the 3 warm runs (CV < 2%)

Workarounds needed to make upstream nightly serve at all

These came up while preparing the comparison and should probably be tracked as separate bugs / docs gaps:

  1. flashinfer_autotune crashes on GB300 with CUDA error: an illegal memory access during kernel_warmup. Stack trace points to flashinfer.fused_moe.coreflashinfer.autotuner._prepare_input_tensorstorch.rand on the routing-logits tensor. Workaround: --kernel-config '{"enable_flashinfer_autotune": false}'.

Reproducer

# Container: vllm/vllm-openai:nightly  (multi-arch; pulls cu130+torch2.11 on aarch64)

# 1) one-time fp8 ViT scale calibration
vllm serve nvidia/Qwen3-VL-235B-A22B-Instruct-NVFP4-MLPerf-Inference-Closed-V6.0 \
  --tensor-parallel-size=4 --async-scheduling \
  --max-model-len=32768 --max-num-seqs=1024 --max-num-batched-tokens=13824 \
  --mm-encoder-attn-backend=FLASHINFER \
  --mm-encoder-attn-dtype=fp8 \
  --mm-encoder-fp8-scale-save-path=/tmp/q3vl_fp8_scales.json \
  --no-enable-prefix-caching \
  --kernel-config='{"enable_flashinfer_autotune": false}' \
  --compilation-config='{"cudagraph_mm_encoder":"true","encoder_cudagraph_token_budgets":[1024,2048,3072,4096,5120,6144,7168,8192],"encoder_cudagraph_max_vision_items_per_batch":16,"encoder_cudagraph_max_frames_per_batch":16,"max_cudagraph_capture_size":13824}' \
  &
# fire >=16 multimodal requests so the amax history wraps and scales are dumped to /tmp/q3vl_fp8_scales.json,
# then kill the server.

# 2) actual run with static scales
vllm serve nvidia/Qwen3-VL-235B-A22B-Instruct-NVFP4-MLPerf-Inference-Closed-V6.0 \
  --tensor-parallel-size=4 --async-scheduling \
  --max-model-len=32768 --max-num-seqs=1024 --max-num-batched-tokens=13824 \
  --mm-encoder-attn-backend=FLASHINFER \
  --mm-encoder-attn-dtype=fp8 \
  --mm-encoder-fp8-scale-path=/tmp/q3vl_fp8_scales.json \
  --no-enable-prefix-caching \
  --kernel-config='{"enable_flashinfer_autotune": false}' \
  --compilation-config='{...same as above...}'

Then run an online Poisson benchmark at QPS=5 for 1000 samples; report latency.percentiles["99"], ttft.percentiles["99"], tpot.percentiles["99"] from the warm-state samples (we discard the first cold bench).

Numbers (warm-mean of 3 reps; ms unless noted; CV < 2%)

cellLAT P50LAT P99TTFT P50TTFT P99TPOT P50TPOT P99
Reference fork @ main 1cbbcfe8 (Mar 23)608.71621.3103.7744.211.627.7
Upstream nightly 51f22dcfd (May 7) fp8 stat585.22067.6140.41151.710.325.7
Upstream nightly 51f22dcfd (May 7) fp8 OFF596.62205.7154.91244.910.226.5

Regression vs reference fork:

metricupstream fp8 staticupstream fp8 OFF
LAT P50−3.9 %−2.0 %
LAT P99+27.5 %+36.0 %
TTFT P50+35.4 %+49.4 %
TTFT P99+54.8 %+67.3 %
TPOT P50−11.2 %−12.1 %
TPOT P99−7.2 %−4.3 %

Reading: TPOT (decode) is unchanged-to-slightly-faster on upstream. The LAT P99 widening is driven entirely by TTFT — the prefill / encoder-side path is markedly slower in current upstream main than in the Mar 23 fork point.

Internal bisection signal (for reviewers)

We did extensive ablations on a separate fork (B-fork ≈ main as of Apr 16) that also shows a regression vs the same Mar 23 base. The decomposition (warm steady-state, same node, B vs A on the fork):

factorTTFT P99 closureLAT P99 closureTPOT P99 closure
Encoder cudagraph (vllm #38061)~0pp~0pp~0pp
FlashInfer 0.6.6 ⇄ 0.6.7 (whole dist-packages tar)3.2pp2.9pp6.1pp
compressed-tensors 0.14 ⇄ 0.150.7pp1.5pp8.4pp
FlashInfer FP4 MoE: VLLM_USE_FLASHINFER_MOE_FP4=10 (B side only)1.3pp8.6pp11.9pp
FlashInfer sampler: VLLM_USE_FLASHINFER_SAMPLER=10 (B side only)5.7pp1.4pp~0pp
Async scheduling~0pp~0pp~0pp
transformers 4.57 ⇄ 5.6 + tokenizers + huggingface_hub~0pp~0pp~0pp
FP8 ViT attn ON vs OFF (intra-container)(perf ~+17%)(perf ~+17%)(perf ~+10%)
Residual (vllm code, torch 2.10→2.11, cublas/cudnn libs)~14pp~9pp~1pp

Notes on the two FlashInfer env-var ablations:

  • VLLM_USE_FLASHINFER_MOE_FP4=0 makes the MoE layer of Qwen3-VL-235B-A22B-Instruct-NVFP4 skip FlashInfer's NVFP4 fused-MoE GEMM and fall back to the default vllm MoE path (CUTLASS / triton, depending on layer). It only flips on the B (regressed) side, so the closure value is the difference between B-with-FP4-MoE-env=1 and B-with-FP4-MoE-env=0, both compared against the same A baseline (which already runs with =1). Closure means "how much of the B vs A gap goes away when we disable the new FlashInfer FP4 MoE on B". VLLM_FLASHINFER_MOE_BACKEND=latency was kept the default in both cases (we also tried =throughput; it hung A and made B worse, so it's irrelevant).

  • VLLM_USE_FLASHINFER_SAMPLER=0 disables the FlashInfer Top-K/Top-P / sample-fused kernel and falls back to the default vllm sampler (vllm/v1/sample/sampler.py). Same protocol — flipped only on B. Closure = (B/A with sampler env=1) − (B/A with sampler env=0).

(Numbers caveated by ~3pp inter-node hardware variance on GB300 — same-node measurements are tight, < 1.5% σ at LAT P99.)

The "residual" bucket — what's left after we ablate every available flag/env/library swap — is what the +55-67% TTFT P99 widening on upstream most likely lives in.

Likely-affected files (based on the affected window):

  • vllm/v1/attention/ — encoder attention dispatch
  • vllm/model_executor/models/qwen3_vl.py and qwen3_omni_moe_thinker.py
  • vllm/v1/worker/gpu_model_runner.py
  • vllm/compilation/ — encoder cudagraph capture path

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

vllm - 💡(How to fix) Fix [Performance]: Qwen3-VL-235B-A22B-Instruct NVFP4 Performance dropped at upstream main