vllm - 💡(How to fix) Fix [Performance]: Qwen3-VL-235B-A22B-Instruct NVFP4 Performance dropped at upstream main

StepCodex · 2026-05-08T16:56:14Z

[vllm] Proposal to improve performance Performance regression on Qwen3-VL-235B-A22B-Instruct NVFP4, GB300 / CUDA 13 — TTFT P99 +55-67%, plus several rough edge… ## Fix / Workaround Plus three smaller papercuts that are worth filing alongside (see "Workarounds needed"). ## Workarounds needed to make upstream nightly serve at all 1. **`flashinfer_autotune` crashes on GB300 with `CUDA error: an illegal memory access`** during `kernel_warmup`. Stack trace points to `flashinfer.fused_moe.core` → `flashinfer.autotuner._prepare_input_tensors` → `torch.rand` on the routing-logits tensor. Workaround: `--kernel-config '{"enable_flashinfer_autotune": false}'`. ### Proposal to improve performance # Performance regression on Qwen3-VL-235B-A22B-Instruct (NVFP4, GB300 / CUDA 13) — TTFT P99 +55-67%, plus several rough edges in current main ## TL;DR Running Qwen3-VL-235B-A22B-Instruct (NVFP4 via compressed-tensors) on GB300 with TP=4 against the latest upstream nightly (`vllm/vllm-openai:nightly`, vllm `0.20.2rc1.dev93+g51f22dcfd`, May 7 2026, cu130, torch 2.11.0+cu130) shows a significant prefill-path regression vs a fork branched from main near `1cbbcfe8a334bab004c43a60be201c8ab528e0d2` (Mar 23 2026): - **TTFT P99: +55%** (fp8 ViT static scale) to **+67%** (fp8 ViT OFF) - **LAT P99: +28% to +36%** - **TPOT P99: ~0% (slightly faster, −7%)** — decode itself is fine in upstream Plus three smaller papercuts that are worth filing alongside (see "Workarounds needed"). ## Setup - Hardware: GB300 (compute capability 10.0), 4 GPUs, NVLink, NVIDIA driver 580.159.04 - Model: `nvidia/Qwen3-VL-235B-A22B-Instruct-NVFP4-MLPerf-Inference-Closed-V6.0` (compressed-tensors NVFP4) - Workload: online benchmark, target QPS=5 (Poisson), 1000 samples, num_workers=5, mixed multimodal Shopify product-catalogue dataset - Engine config: TP=4, `max_model_len=32768`, `max_num_seqs=1024` - Each cell: 1 cold + 3 warm runs against the same warmed server; mean is over the 3 warm runs (CV < 2%) ## Workarounds needed to make upstream nightly serve at all These came up while preparing the comparison and should probably be tracked as separate bugs / docs gaps: 1. **`flashinfer_autotune` crashes on GB300 with `CUDA error: an illegal memory access`** during `kernel_warmup`. Stack trace points to `flashinfer.fused_moe.core` → `flashinfer.autotuner._prepare_input_tensors` → `torch.rand` on the routing-logits tensor. Workaround: `--kernel-config '{"enable_flashinfer_autotune": false}'`. ## Reproducer ```bash # Container: vllm/vllm-openai:nightly (multi-arch; pulls cu130+torch2.11 on aarch64) # 1) one-time fp8 ViT scale calibration vllm serve nvidia/Qwen3-VL-235B-A22B-Instruct-NVFP4-MLPerf-Inference-Closed-V6.0 \ --tensor-parallel-size=4 --async-scheduling \ --max-model-len=32768 --max-num-seqs=1024 --max-num-batched-tokens=13824 \ --mm-encoder-attn-backend=FLASHINFER \ --mm-encoder-attn-dtype=fp8 \ --mm-encoder-fp8-scale-save-path=/tmp/q3vl_fp8_scales.json \ --no-enable-prefix-caching \ --kernel-config='{"enable_flashinfer_autotune": false}' \ --compilation-config='{"cudagraph_mm_encoder":"true","encoder_cudagraph_token_budgets":[1024,2048,3072,4096,5120,6144,7168,8192],"encoder_cudagraph_max_vision_items_per_batch":16,"encoder_cudagraph_max_frames_per_batch":16,"max_cudagraph_capture_size":13824}' \ & # fire >=16 multimodal requests so the amax history wraps and scales are dumped to /tmp/q3vl_fp8_scales.json, # then kill the server. # 2) actual run with static scales vllm serve nvidia/Qwen3-VL-235B-A22B-Instruct-NVFP4-MLPerf-Inference-Closed-V6.0 \ --tensor-parallel-size=4 --async-scheduling \ --max-model-len=32768 --max-num-seqs=1024 --max-num-batched-tokens=13824 \ --mm-encoder-attn-backend=FLASHINFER \ --mm-encoder-attn-dtype=fp8 \ --mm-encoder-fp8-scale-path=/tmp/q3vl_fp8_scales.json \ --no-enable-prefix-caching \ --kernel-config='{"enable_flashinfer_autotune": false}' \ --compilation-config='{...same as above...}' ``` Then run an online Poisson benchmark at QPS=5 for 1000 samples; report `latency.percentiles["99"]`, `ttft.percentiles["99"]`, `tpot.percentiles["99"]` from the warm-state samples (we discard the first cold bench). ## Numbers (warm-mean of 3 reps; ms unless noted; CV < 2%) | cell | LAT P50 | **LAT P99** | TTFT P50 | **TTFT P99** | TPOT P50 | **TPOT P99** | |-----------------------------------------------|---------|-------------|----------|--------------|----------|--------------| | Reference fork @ main `1cbbcfe8` (Mar 23) | 608.7 | **1621.3** | 103.7 | **744.2** | 11.6 | **27.7** | | Upstream nightly `51f22dcfd` (May 7) fp8 stat | 585.2 | **2067.6** | 140.4 | **1151.7** | 10.3 | **25.7** | | Upstream nightly `51f22dcfd` (May 7) fp8 OFF | 596.6 | **2205.7** | 154.9 | **1244.9** | 10.2 | **26.5** | Regression vs reference fork: | metric | upstream fp8 static | upstream fp8 OFF | |-----------|--------------------:|-----------------:| | LAT P50 | −3.9 % | −2.0 % | | **LAT P99** | **+27.5 %** | **+36.0 %** | | TTFT P50 | +35.4 % | +49.4 % | | **TTFT

vllm2026-05-08 16:56:14

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

Error Message

flashinfer_autotune crashes on GB300 with CUDA error: an illegal memory access during kernel_warmup. Stack trace points to flashinfer.fused_moe.core → flashinfer.autotuner._prepare_input_tensors → torch.rand on the routing-logits tensor. Workaround: --kernel-config '{"enable_flashinfer_autotune": false}'.

Fix Action

Fix / Workaround

Plus three smaller papercuts that are worth filing alongside (see "Workarounds needed").

Workarounds needed to make upstream nightly serve at all

flashinfer_autotune crashes on GB300 with CUDA error: an illegal memory access during kernel_warmup. Stack trace points to flashinfer.fused_moe.core → flashinfer.autotuner._prepare_input_tensors → torch.rand on the routing-logits tensor. Workaround: --kernel-config '{"enable_flashinfer_autotune": false}'.

Code Example

# Container: vllm/vllm-openai:nightly  (multi-arch; pulls cu130+torch2.11 on aarch64)

# 1) one-time fp8 ViT scale calibration
vllm serve nvidia/Qwen3-VL-235B-A22B-Instruct-NVFP4-MLPerf-Inference-Closed-V6.0 \
  --tensor-parallel-size=4 --async-scheduling \
  --max-model-len=32768 --max-num-seqs=1024 --max-num-batched-tokens=13824 \
  --mm-encoder-attn-backend=FLASHINFER \
  --mm-encoder-attn-dtype=fp8 \
  --mm-encoder-fp8-scale-save-path=/tmp/q3vl_fp8_scales.json \
  --no-enable-prefix-caching \
  --kernel-config='{"enable_flashinfer_autotune": false}' \
  --compilation-config='{"cudagraph_mm_encoder":"true","encoder_cudagraph_token_budgets":[1024,2048,3072,4096,5120,6144,7168,8192],"encoder_cudagraph_max_vision_items_per_batch":16,"encoder_cudagraph_max_frames_per_batch":16,"max_cudagraph_capture_size":13824}' \
  &
# fire >=16 multimodal requests so the amax history wraps and scales are dumped to /tmp/q3vl_fp8_scales.json,
# then kill the server.

# 2) actual run with static scales
vllm serve nvidia/Qwen3-VL-235B-A22B-Instruct-NVFP4-MLPerf-Inference-Closed-V6.0 \
  --tensor-parallel-size=4 --async-scheduling \
  --max-model-len=32768 --max-num-seqs=1024 --max-num-batched-tokens=13824 \
  --mm-encoder-attn-backend=FLASHINFER \
  --mm-encoder-attn-dtype=fp8 \
  --mm-encoder-fp8-scale-path=/tmp/q3vl_fp8_scales.json \
  --no-enable-prefix-caching \
  --kernel-config='{"enable_flashinfer_autotune": false}' \
  --compilation-config='{...same as above...}'

---

The output of `python collect_env.py`

RAW_BUFFERClick to expand / collapse

Proposal to improve performance

Performance regression on Qwen3-VL-235B-A22B-Instruct (NVFP4, GB300 / CUDA 13) — TTFT P99 +55-67%, plus several rough edges in current main

TL;DR

Running Qwen3-VL-235B-A22B-Instruct (NVFP4 via compressed-tensors) on GB300 with TP=4 against the latest upstream nightly (vllm/vllm-openai:nightly, vllm 0.20.2rc1.dev93+g51f22dcfd, May 7 2026, cu130, torch 2.11.0+cu130) shows a significant prefill-path regression vs a fork branched from main near 1cbbcfe8a334bab004c43a60be201c8ab528e0d2 (Mar 23 2026):

TTFT P99: +55% (fp8 ViT static scale) to +67% (fp8 ViT OFF)
LAT P99: +28% to +36%
TPOT P99: ~0% (slightly faster, −7%) — decode itself is fine in upstream

Plus three smaller papercuts that are worth filing alongside (see "Workarounds needed").

Setup

Hardware: GB300 (compute capability 10.0), 4 GPUs, NVLink, NVIDIA driver 580.159.04
Model: nvidia/Qwen3-VL-235B-A22B-Instruct-NVFP4-MLPerf-Inference-Closed-V6.0 (compressed-tensors NVFP4)
Workload: online benchmark, target QPS=5 (Poisson), 1000 samples, num_workers=5, mixed multimodal Shopify product-catalogue dataset
Engine config: TP=4, max_model_len=32768, max_num_seqs=1024
Each cell: 1 cold + 3 warm runs against the same warmed server; mean is over the 3 warm runs (CV < 2%)

Workarounds needed to make upstream nightly serve at all

These came up while preparing the comparison and should probably be tracked as separate bugs / docs gaps:

flashinfer_autotune crashes on GB300 with CUDA error: an illegal memory access during kernel_warmup. Stack trace points to flashinfer.fused_moe.core → flashinfer.autotuner._prepare_input_tensors → torch.rand on the routing-logits tensor. Workaround: --kernel-config '{"enable_flashinfer_autotune": false}'.

Reproducer

# Container: vllm/vllm-openai:nightly  (multi-arch; pulls cu130+torch2.11 on aarch64)

# 1) one-time fp8 ViT scale calibration
vllm serve nvidia/Qwen3-VL-235B-A22B-Instruct-NVFP4-MLPerf-Inference-Closed-V6.0 \
  --tensor-parallel-size=4 --async-scheduling \
  --max-model-len=32768 --max-num-seqs=1024 --max-num-batched-tokens=13824 \
  --mm-encoder-attn-backend=FLASHINFER \
  --mm-encoder-attn-dtype=fp8 \
  --mm-encoder-fp8-scale-save-path=/tmp/q3vl_fp8_scales.json \
  --no-enable-prefix-caching \
  --kernel-config='{"enable_flashinfer_autotune": false}' \
  --compilation-config='{"cudagraph_mm_encoder":"true","encoder_cudagraph_token_budgets":[1024,2048,3072,4096,5120,6144,7168,8192],"encoder_cudagraph_max_vision_items_per_batch":16,"encoder_cudagraph_max_frames_per_batch":16,"max_cudagraph_capture_size":13824}' \
  &
# fire >=16 multimodal requests so the amax history wraps and scales are dumped to /tmp/q3vl_fp8_scales.json,
# then kill the server.

# 2) actual run with static scales
vllm serve nvidia/Qwen3-VL-235B-A22B-Instruct-NVFP4-MLPerf-Inference-Closed-V6.0 \
  --tensor-parallel-size=4 --async-scheduling \
  --max-model-len=32768 --max-num-seqs=1024 --max-num-batched-tokens=13824 \
  --mm-encoder-attn-backend=FLASHINFER \
  --mm-encoder-attn-dtype=fp8 \
  --mm-encoder-fp8-scale-path=/tmp/q3vl_fp8_scales.json \
  --no-enable-prefix-caching \
  --kernel-config='{"enable_flashinfer_autotune": false}' \
  --compilation-config='{...same as above...}'

Then run an online Poisson benchmark at QPS=5 for 1000 samples; report latency.percentiles["99"], ttft.percentiles["99"], tpot.percentiles["99"] from the warm-state samples (we discard the first cold bench).

Numbers (warm-mean of 3 reps; ms unless noted; CV < 2%)

cell	LAT P50	LAT P99	TTFT P50	TTFT P99	TPOT P50	TPOT P99
Reference fork @ main `1cbbcfe8` (Mar 23)	608.7	1621.3	103.7	744.2	11.6	27.7
Upstream nightly `51f22dcfd` (May 7) fp8 stat	585.2	2067.6	140.4	1151.7	10.3	25.7
Upstream nightly `51f22dcfd` (May 7) fp8 OFF	596.6	2205.7	154.9	1244.9	10.2	26.5

Regression vs reference fork:

metric	upstream fp8 static	upstream fp8 OFF
LAT P50	−3.9 %	−2.0 %
LAT P99	+27.5 %	+36.0 %
TTFT P50	+35.4 %	+49.4 %
TTFT P99	+54.8 %	+67.3 %
TPOT P50	−11.2 %	−12.1 %
TPOT P99	−7.2 %	−4.3 %

Reading: TPOT (decode) is unchanged-to-slightly-faster on upstream. The LAT P99 widening is driven entirely by TTFT — the prefill / encoder-side path is markedly slower in current upstream main than in the Mar 23 fork point.

Internal bisection signal (for reviewers)

We did extensive ablations on a separate fork (B-fork ≈ main as of Apr 16) that also shows a regression vs the same Mar 23 base. The decomposition (warm steady-state, same node, B vs A on the fork):

factor	TTFT P99 closure	LAT P99 closure	TPOT P99 closure
Encoder cudagraph (vllm #38061)	~0pp	~0pp	~0pp
FlashInfer 0.6.6 ⇄ 0.6.7 (whole dist-packages tar)	3.2pp	2.9pp	6.1pp
compressed-tensors 0.14 ⇄ 0.15	0.7pp	1.5pp	8.4pp
FlashInfer FP4 MoE: `VLLM_USE_FLASHINFER_MOE_FP4=1`→`0` (B side only)	1.3pp	8.6pp	11.9pp
FlashInfer sampler: `VLLM_USE_FLASHINFER_SAMPLER=1`→`0` (B side only)	5.7pp	1.4pp	~0pp
Async scheduling	~0pp	~0pp	~0pp
transformers 4.57 ⇄ 5.6 + tokenizers + huggingface_hub	~0pp	~0pp	~0pp
FP8 ViT attn ON vs OFF (intra-container)	(perf ~+17%)	(perf ~+17%)	(perf ~+10%)
Residual (vllm code, torch 2.10→2.11, cublas/cudnn libs)	~14pp	~9pp	~1pp

Notes on the two FlashInfer env-var ablations:

VLLM_USE_FLASHINFER_MOE_FP4=0 makes the MoE layer of Qwen3-VL-235B-A22B-Instruct-NVFP4 skip FlashInfer's NVFP4 fused-MoE GEMM and fall back to the default vllm MoE path (CUTLASS / triton, depending on layer). It only flips on the B (regressed) side, so the closure value is the difference between B-with-FP4-MoE-env=1 and B-with-FP4-MoE-env=0, both compared against the same A baseline (which already runs with =1). Closure means "how much of the B vs A gap goes away when we disable the new FlashInfer FP4 MoE on B". VLLM_FLASHINFER_MOE_BACKEND=latency was kept the default in both cases (we also tried =throughput; it hung A and made B worse, so it's irrelevant).
VLLM_USE_FLASHINFER_SAMPLER=0 disables the FlashInfer Top-K/Top-P / sample-fused kernel and falls back to the default vllm sampler (vllm/v1/sample/sampler.py). Same protocol — flipped only on B. Closure = (B/A with sampler env=1) − (B/A with sampler env=0).

(Numbers caveated by ~3pp inter-node hardware variance on GB300 — same-node measurements are tight, < 1.5% σ at LAT P99.)

The "residual" bucket — what's left after we ablate every available flag/env/library swap — is what the +55-67% TTFT P99 widening on upstream most likely lives in.

Likely-affected files (based on the affected window):

vllm/v1/attention/ — encoder attention dispatch
vllm/model_executor/models/qwen3_vl.py and qwen3_omni_moe_thinker.py
vllm/v1/worker/gpu_model_runner.py
vllm/compilation/ — encoder cudagraph capture path

Report of performance regression

No response

Misc discussion on performance

No response

Your current environment (if you think it is necessary)

The output of `python collect_env.py`

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#API middleware #SSR setup #ISR setup #authentication setup #request error

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

vllm - 💡(How to fix) Fix [Performance]: Qwen3-VL-235B-A22B-Instruct NVFP4 Performance dropped at upstream main

Recommended Tools

GitHub issue graph ai analysis

Error Message

Fix Action

Fix / Workaround

Workarounds needed to make upstream nightly serve at all

Code Example

Proposal to improve performance

Performance regression on Qwen3-VL-235B-A22B-Instruct (NVFP4, GB300 / CUDA 13) — TTFT P99 +55-67%, plus several rough edges in current main

TL;DR

Setup

Workarounds needed to make upstream nightly serve at all

Reproducer

Numbers (warm-mean of 3 reps; ms unless noted; CV < 2%)

Internal bisection signal (for reviewers)

Report of performance regression

Misc discussion on performance

Your current environment (if you think it is necessary)

Before submitting a new issue...

Still need to ship something?

TRENDING

vllm - 💡(How to fix) Fix [Performance]: Qwen3-VL-235B-A22B-Instruct NVFP4 Performance dropped at upstream main

Recommended Tools

GitHub issue graph ai analysis

Error Message

Fix Action

Fix / Workaround

Workarounds needed to make upstream nightly serve at all

Code Example

Proposal to improve performance

Performance regression on Qwen3-VL-235B-A22B-Instruct (NVFP4, GB300 / CUDA 13) — TTFT P99 +55-67%, plus several rough edges in current main

TL;DR

Setup

Workarounds needed to make upstream nightly serve at all

Reproducer

Numbers (warm-mean of 3 reps; ms unless noted; CV < 2%)

Internal bisection signal (for reviewers)

Report of performance regression

Misc discussion on performance

Your current environment (if you think it is necessary)

Before submitting a new issue...

Still need to ship something?

RELATED_DISCOVERY

TRENDING