vllm - 💡(How to fix) Fix [Performance]: DeepSeek-V4-Pro 128K+ timeout on deepseekv4-cu130; nightly aa2b56f completes 1M real-prose checks on 8x B200

StepCodex · 2026-05-27T06:15:36Z


Fix / Workaround

Public release cross-check, refreshed immediately before posting: the vLLM releases page still lists v0.21.0 as the latest release and v0.20.2 as a DeepSeek-V4 patch release. This draft reports the measured nightly aa2b56f result only; it does not claim the same behavior on v0.20.2 or v0.21.0.

The B200 driver did not expose direct tensor_active_pct, so the tensor/SM sample used the local nvidia-smi dmon -s u fallback (default 1 Hz). Single 10-sample capture at 1 Hz during a 4096-token decode-heavy request: p50 SM 99%, p50 memory 27%. This is sufficient to confirm kernels are dispatching and the device is busy during decode; it is not a tensor-core utilization measurement.
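A minimal sketch of how the p50 figures above could be derived from saved `nvidia-smi dmon -s u` text. It assumes the usual dmon layout, in which a `#`-prefixed header row names the columns (`gpu`, `sm`, `mem`, ...) and each data row is whitespace-separated; the function name `dmon_p50` is hypothetical and not part of any tool.

```python
import statistics


def dmon_p50(dmon_text: str, gpu_index: int = 0) -> dict:
    """Return the median sm% and mem% for one GPU from `nvidia-smi dmon -s u` text."""
    columns = None
    samples = {"sm": [], "mem": []}
    for line in dmon_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):
            # The first header row names the columns; the second row only holds units.
            names = stripped.lstrip("#").split()
            if columns is None and "sm" in names:
                columns = names
            continue
        fields = stripped.split()
        if columns is None or len(fields) < len(columns):
            continue
        row = dict(zip(columns, fields))
        if not row["gpu"].isdigit() or int(row["gpu"]) != gpu_index:
            continue
        for key in samples:
            # dmon prints "-" when a counter is unavailable; skip those cells.
            if row[key].isdigit():
                samples[key].append(int(row[key]))
    return {key: statistics.median(vals) for key, vals in samples.items() if vals}
```

Called on a 10-sample capture for one GPU, this returns a dict with `sm` and `mem` medians. As noted above, these are SM and memory activity percentages, not tensor-core utilization.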


Summary

On a single Nebius 8x NVIDIA B200 SXM6 node, vllm/vllm-openai:deepseekv4-cu130 timed out at the 3600 s client deadline for coherent real-prose, completion-style DeepSeek-V4-Pro requests at 128K, 512K, and 1M input tokens under the published DeepSeek-V4 launch recipe.

As a positive control, vllm/vllm-openai:nightly build 0.21.1rc1.dev281+gaa2b56ffb / commit aa2b56f completed the same coherent real-prose checks through 1,042,080 input tokens on the same 8x B200 host. The long-context cells are limited-sample checks: n=1 per context length, plus one nsys repeat for the nightly 1M path. This report documents that image-specific delta and provides enough environment/kernel evidence for maintainers to decide whether the cu130 path is stale, missing a fix, or should be replaced in docs or recipes.

This is not a claim that V4-Pro is generally production-ready on B200. The passing path is a nightly image plus a one-line local derived layer, and the long-context cells are limited-sample validation.

Maintainer asks

Is vllm/vllm-openai:deepseekv4-cu130 expected to remain supported for DeepSeek-V4, or should users move to a newer nightly or stable image?
Is there a known commit or PR between deepseekv4-cu130 and aa2b56f that explains the 128K+ timeout delta?
Should the import-time dependency of cupy.testing on pytest be filed as a separate issue?
If PR #27114 is expected to apply to V4-Pro, what recipe should be used for a PIECEWISE retest?

Environment

Component	Value
Cloud / instance	Nebius AI Cloud, single 8-GPU node `b200-bench-8gpu-1`
GPUs	8x NVIDIA B200 SXM6
GPU architecture	SM100
VRAM	192 GB HBM3e per GPU
Driver	580.126.09
Model	`deepseek-ai/DeepSeek-V4-Pro`
Model on disk	806 GB, 64 safetensors shards
Serving image	`vllm/vllm-openai:nightly` plus one derived layer installing `pytest`, local tag `vllm-nightly-fix:local`
Local derived image ID	`sha256:386e94637c58ba34efec901a678b9f3ff74cec3037513f10a1d7b58c21311b2a`, created 2026-05-26T17:04:55Z
vLLM version	`0.21.1rc1.dev281+gaa2b56ffb`
vLLM commit	`aa2b56f`
Nightly image digest	`sha256:35f29a91c33e632aed24ccec68bb5cf872e1b5066eccc49ff2e406b641964e11`
Python	3.12.13
CUDA / PyTorch	CUDA 13.0, PyTorch `2.11.0+cu130`
Triton	3.6.0
FlashInfer	0.6.11.post2
Transformers / Accelerate	Transformers 5.9.0, Accelerate 1.13.0
NCCL	2.28.9 via `torch.cuda.nccl.version()`
OS / kernel	Ubuntu 24.04.4 LTS, Linux `6.11.0-1016-nvidia`

Provenance: image, version, and commit rows are pulled from the nightly image labels and vllm Python package metadata inside the running container; the Python / CUDA / PyTorch / Triton / FlashInfer / Transformers / Accelerate / NCCL / OS / driver rows are captured from pip freeze, torch.cuda.nccl.version(), nvidia-smi, and uname -a on the test host. Raw capture is available with the proof artifacts on request.

The derived layer is only to satisfy a startup import path in this nightly: vllm.ir.ops.layernorm registration walks through cupy.testing, which imports pytest. No vLLM source change was made. Maintainers may want to look at moving this pytest-touching path out of import time; the derived layer would be unnecessary if the registration did not eagerly walk through cupy.testing.

This report does not bisect which upstream PR landed the V4-Pro long-context fix between the published deepseekv4-cu130 image and nightly aa2b56f. Commit aa2b56f (build 2026-05-26) is the earliest nightly we tested where the 128K+ real-prose timeout resolves.

A related counter-observation for maintainers tracking PR #27114 (the documented Blackwell FULL_AND_PIECEWISE/PIECEWISE plan/runtime mismatch at max_model_len > 131072): in this V4-Pro setup, forcing cudagraph_mode=PIECEWISE did not improve the synthetic random-token-ID probe and made corruption appear at shorter lengths. Because that probe uses synthetic random IDs, I am treating this as a counter-observation only, not a real-prose correctness failure. Real-prose workloads on FULL_AND_PIECEWISE were unaffected.

Reporter context: independent benchmarking work on Nebius AI Cloud 8x B200 hardware. Posting as an individual contributor; not an official Nebius statement.

Closest prior art reviewed before posting: the vLLM 2026-04-24 "DeepSeek-V4 on vLLM" blog (which is the source of the published vllm/vllm-openai:deepseekv4-cu130 image), vLLM PR #27114 (Blackwell FULL_AND_PIECEWISE/PIECEWISE plan/runtime mismatch at max_model_len > 131072), and the DeepSeek-V4-Pro model card on Hugging Face. A search of the vLLM Issues and Discussions inboxes did not surface an existing report of the cu130 128K+ real-prose timeout under the published launch recipe. vLLM GitHub Discussions are pinned as no longer used, and current benchmark/performance reports are present in GitHub Issues; filing here as a performance issue matches the active GitHub venue. If a related report exists, I can cross-link it.

Launch command

docker run --gpus all --ipc=host --network host \
  --shm-size=32g \
  --ulimit nofile=1048576:1048576 \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -e HF_HUB_OFFLINE=1 \
  -e NCCL_DEBUG=WARN \
  -e VLLM_ENGINE_READY_TIMEOUT_S=1800 \
  -v /data/models/v4-pro:/models/v4-pro:ro \
  vllm-nightly-fix:local \
  --model /models/v4-pro \
  --served-model-name deepseek-ai/DeepSeek-V4-Pro \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --block-size 256 \
  --enable-expert-parallel \
  --data-parallel-size 8 \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_v4 \
  --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
  --attention_config.use_fp4_indexer_cache=True \
  --host 127.0.0.1 \
  --port 8000
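Model load on this launch shape can take a long time, so a reproduction script may want to wait for the server before sending requests. The sketch below polls the server's `/health` endpoint on the host and port from the command above. The function name, the poll interval, and reusing 1800 s as the default deadline are assumptions for illustration.

```python
import time
import urllib.error
import urllib.request


def wait_until_healthy(base_url="http://127.0.0.1:8000", timeout_s=1800.0,
                       poll_s=10.0, opener=urllib.request.urlopen,
                       sleep=time.sleep) -> bool:
    """Poll GET /health until it answers HTTP 200 or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with opener(f"{base_url}/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, ConnectionError, TimeoutError):
            pass  # server is not accepting connections yet
        sleep(poll_s)
    return False
```

The `opener` and `sleep` parameters exist only so the loop can be exercised without a live server.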

Request-mode caveat for chat:

{
  "chat_template_kwargs": {
    "thinking": true,
    "reasoning_effort": "high"
  },
  "max_tokens": 4096
}

The output budget matters. In a 1M-token chat spot check, max_tokens=512 was consumed entirely by reasoning and returned empty final content; max_tokens=4096 returned clean final content.
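A hedged sketch of the request shape and of a guard for the empty-final-content case described above. It assumes an OpenAI-compatible chat completions response (`choices[0].message.content`, `finish_reason`); the helper names are hypothetical, and whether the failing run reported `finish_reason == "length"` was not recorded in this report.

```python
def build_chat_request(messages, max_tokens=4096):
    """Build the Think High chat payload with an explicit output budget."""
    return {
        "model": "deepseek-ai/DeepSeek-V4-Pro",
        "messages": messages,
        "max_tokens": max_tokens,
        "chat_template_kwargs": {"thinking": True, "reasoning_effort": "high"},
    }


def budget_exhausted(response: dict) -> bool:
    """True when the completion stopped on max_tokens and left no final content."""
    choice = response["choices"][0]
    content = (choice["message"].get("content") or "").strip()
    return choice.get("finish_reason") == "length" and not content
```

If `budget_exhausted` returns true, retrying with a larger `max_tokens` is the first thing to try before treating the result as a correctness failure.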

Compatibility matrix

Path	Result	Notes
`vllm/vllm-openai:deepseekv4-cu130`, DeepSeek-V4 launch recipe	Timed out at 128K, 512K, and 1M	Each cell hit the 3600 s client deadline; n=1 per context length
`vllm/vllm-openai:nightly` commit `aa2b56f`, same launch recipe	Observed clean through 1,042,080 coherent real-prose input tokens	Same 8x B200 host; v1.4 correctness suite passed under Think High chat request shape; kernel evidence captured
Nightly `aa2b56f`, default chat request shape	Negative control	Arithmetic distractor probe was 23/50; included to avoid reproduction ambiguity, not asserted as a vLLM bug
Nightly `aa2b56f`, explicit non-thinking chat	Negative control	Arithmetic distractor probe was 19/50; included to document request-shape sensitivity

Correctness and request-shape caveats

The current canonical suite is local card-format-standards v1.4: seven prompts covering factual, long-context retrieval, arithmetic, code, multi-turn recall, decode-heavy ASCII output, and JSON/tool calling.

These checks are included to prevent false reproduction failures. The default-chat and explicit non-thinking rows above are not asserted as vLLM bugs.

For chat-shaped correctness checks, the validated request shape used DeepSeek-V4 thinking mode with reasoning_effort: high:

{"chat_template_kwargs":{"thinking":true,"reasoning_effort":"high"}}

The long-context real-prose cells below are completion-style token-ID requests, not the same chat request shape.

V4-Pro nightly aa2b56f, Think High, --min-max-tokens 512:

Prompt	Result	Notes
1 factual	pass	Expected `35`, non-ASCII ratio 0
2 long-context retrieval	pass	Recovered `lampshade-47` from Federalist haystack
3 arithmetic	pass	Expected `43`, non-ASCII ratio 0
4 code	pass	Python `fib` function compiled, no forbidden imports
5 multi-turn	pass	Recovered `Walter` and `marine biologist`
6 decode-heavy	pass	11256 chars, `non_ascii_ratio=0.0`
7 JSON / tool call	pass	Guided JSON pass, free JSON pass, real tool-call path pass

Suite wall time was 86.8 s. The tool-call path emitted a tool_calls array with function name record_person and arguments:

{"name":"Ada Rivers","age":34,"hobby":"tidepool photography"}
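For anyone reproducing the tool-call check, the following sketch validates a response message of that form. It assumes the OpenAI-style layout in which `tool_calls[i].function.arguments` is a JSON-encoded string; `check_tool_call` is a hypothetical helper, not part of the v1.4 suite.

```python
import json


def check_tool_call(message: dict, expected_name: str, required_keys: set) -> dict:
    """Validate a single tool call and return its decoded arguments."""
    calls = message.get("tool_calls") or []
    if len(calls) != 1:
        raise ValueError(f"expected exactly one tool call, got {len(calls)}")
    function = calls[0]["function"]
    if function["name"] != expected_name:
        raise ValueError(f"unexpected function name: {function['name']!r}")
    args = json.loads(function["arguments"])  # arguments arrive as a JSON string
    if not isinstance(args, dict):
        raise ValueError("tool-call arguments did not decode to an object")
    missing = required_keys - set(args)
    if missing:
        raise ValueError(f"missing argument keys: {sorted(missing)}")
    return args
```

For the response above, the call would be `check_tool_call(message, "record_person", {"name", "age", "hobby"})`.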

Prompt 3 was separately stress-tested because it exposed the default-chat failure mode. Under Think High it was 50/50 clean.

Long-context real-prose envelope

The headline long-context measurements use coherent Federalist text, not synthetic random token IDs. Synthetic random-token-ID prompts produced degenerate output on this model at long context and are treated as a benchmark-protocol artifact for correctness. Random-ID throughput measurements, where reported, should be read as synthetic load-generation throughput only; they are not content-equivalent to real-prose or chat throughput at the same request shape.

Pass criteria for the cells below:

Input: a single list of token IDs constructed from coherent Federalist Papers prose tokenized for V4-Pro (no random IDs).
Sampling: temperature: 0, max_tokens set so the model can return 64 tokens of continuation, max-concurrency 1, sequential.
Timing: wall time measured around the synchronous HTTP request.
"Clean" = non_ascii_ratio of the returned content is 0 and no Chinese de_count repetition is observed; the returned text is coherent Federalist continuation.
Sample size: each cell is n=1. The nsys row is a separate single-request capture under Nsight Systems on the same launch shape. The deepseekv4-cu130 row ran each cell to the 3600 s client deadline.

Input lengths: 128K = 128,256 tokens, 512K = 521,040 tokens, 1M = 1,042,080 tokens (the actual tokenization of the Federalist haystack at each cut).
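The automated part of the "Clean" criterion can be sketched as follows. The exact definitions are assumptions made for illustration: `non_ascii_ratio` counts code points above 127, `cjk_count` counts characters in the CJK Unified Ideographs block, and `de_count` counts the character 的 (U+7684). Coherence of the continuation is still judged by reading the output.

```python
def non_ascii_ratio(text: str) -> float:
    """Fraction of characters outside 7-bit ASCII (assumed definition)."""
    if not text:
        return 0.0
    return sum(ord(ch) > 127 for ch in text) / len(text)


def cjk_count(text: str) -> int:
    """Characters in the CJK Unified Ideographs block U+4E00..U+9FFF."""
    return sum("\u4e00" <= ch <= "\u9fff" for ch in text)


def de_count(text: str) -> int:
    """Occurrences of U+7684, assumed to be the repeated 'de' character."""
    return text.count("\u7684")


def is_clean(text: str) -> bool:
    """Automated check only; an empty completion is treated as not clean (assumption)."""
    return bool(text) and non_ascii_ratio(text) == 0 and de_count(text) == 0
```

A run would apply `is_clean` to the returned completion text for each cell and log the three counters next to the wall time.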

Image / path	128K wall (s)	512K wall (s)	1M wall (s)	1M result
vLLM nightly `aa2b56f`, FULL_AND_PIECEWISE	12.9	41.6	148.1	clean at 1,042,080 input tokens
vLLM nightly `aa2b56f`, `nsys` repeat	n/a	n/a	152.0	clean at 1,042,080 input tokens
vLLM `deepseekv4-cu130`	TIMEOUT 3600	TIMEOUT 3600	TIMEOUT 3600	no completion
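To compare the nightly cells on one scale, the wall times can be converted to input tokens per wall-clock second. This is plain arithmetic on the table values and token counts stated above, not an additional measurement: each wall time also covers the 64-token continuation and each cell is n=1, so the result is an end-to-end rate rather than a prefill throughput figure.

```python
# (input tokens, wall seconds) copied from the nightly aa2b56f FULL_AND_PIECEWISE row.
NIGHTLY_CELLS = {
    "128K": (128_256, 12.9),
    "512K": (521_040, 41.6),
    "1M": (1_042_080, 148.1),
}


def input_tokens_per_wall_second(input_tokens: int, wall_s: float) -> float:
    """End-to-end rate: input tokens divided by total request wall time."""
    if wall_s <= 0:
        raise ValueError("wall time must be positive")
    return input_tokens / wall_s


if __name__ == "__main__":
    for label, (tokens, wall_s) in NIGHTLY_CELLS.items():
        rate = input_tokens_per_wall_second(tokens, wall_s)
        print(f"{label}: {rate:,.0f} input tokens per wall-clock second")
```

The cu130 row has no equivalent figure because no request completed within the 3600 s deadline.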

One chat-shaped 1M spot check under Think High with max_tokens=4096 returned clean final content:

Prompt tokens	Completion tokens	Wall time	Final content	Cleanliness
1,032,189	493	142.85 s	1301 chars	`cjk_count_content=0`, `de_count_content=0`, `clean=true`

Short-context serving checks

Short-context random-ID serving sweeps were also run and were successful, but they are not load-bearing for this issue. The relevant point is that the reported delta appears at coherent 128K+ real-prose requests rather than at basic server bring-up or short synthetic serving.

Kernel and hardware evidence

The 1M real-prose run was repeated under Nsight Systems on the same launch shape.

Startup/backend evidence includes:

DP/EP launch shape reached warmup across all eight ranks: logs show Worker_DP0_EP0 through Worker_DP7_EP7 and EngineCore_DP0 through EngineCore_DP7.
fp8 KV cache.
FlashInfer top-k/top-p sampling.
PYNCCL data-parallel / expert-parallel communication.
AgRsAll2AllManager.
DeepGEMM warmup.
FlashInfer trtllm_fp4_block_scale_moe autotune.
TileLang MHC kernels.
CUDA graph capture for PIECEWISE prefill/decode and FULL decode.
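The rank check in the first item can be automated with a simple scan of the startup log. This sketch assumes the log contains the literal names quoted above (`Worker_DP<i>_EP<i>` and `EngineCore_DP<i>`); the function name is hypothetical.

```python
def missing_rank_markers(log_text: str, ranks: int = 8) -> list:
    """Return the expected worker/engine-core names that never appear in the log."""
    expected = [f"Worker_DP{i}_EP{i}" for i in range(ranks)]
    expected += [f"EngineCore_DP{i}" for i in range(ranks)]
    return [name for name in expected if name not in log_text]
```

An empty list means all eight ranks were seen. Substring matching is sufficient for eight ranks, but it would need word boundaries at ten or more, where `DP1` would also match `DP10`.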

nsys stats --report cuda_gpu_kern_sum over the captured window includes:

Sparse attention.
deep_gemm::sm100_fp8_fp4_gemm_1d1d_impl.
deep_gemm::sm100_fp4_mqa_logits.
kernel_cutlass...IndexerQMxFp4Kernel.
_fused_kv_compress_norm_rope_insert_indexer_mxfp4_attn.
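A reproduction can confirm the same kernels with a substring scan over the saved `cuda_gpu_kern_sum` report text. The needles below are taken from the names listed above. The sparse-attention entry is omitted because this report does not give its kernel symbol, and the helper name is hypothetical.

```python
EXPECTED_KERNEL_SUBSTRINGS = (
    "sm100_fp8_fp4_gemm_1d1d_impl",
    "sm100_fp4_mqa_logits",
    "IndexerQMxFp4Kernel",
    "_fused_kv_compress_norm_rope_insert_indexer_mxfp4_attn",
)


def kernels_found(report_text: str, needles=EXPECTED_KERNEL_SUBSTRINGS) -> dict:
    """Map each expected kernel-name substring to whether the report mentions it."""
    return {needle: needle in report_text for needle in needles}
```

Any `False` entry suggests the captured window did not include that kernel, which may reflect the capture window rather than a dispatch problem.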

What this does not claim

Default and explicit non-thinking chat failures are not asserted as vLLM bugs. Those modes failed the arithmetic distractor probe at high rates and are included to prevent request-shape ambiguity.
Based on this host and workload, I would not recommend deepseekv4-cu130 for 128K+ V4-Pro real-prose requests.
Tool-parser coverage is limited to a basic OpenAI-style function tool call. Complex DSML or tool-parser edge cases are not validated by the v1.4 suite.
Stable-tag production posture is not established. The passing path is a nightly build plus a one-line local derived image.
Synthetic random-ID long-context output is not meaningful correctness evidence for this model.
Admission-control and production quota guidance are out of scope. That needs a dedicated KV/concurrency envelope.
Other hardware tiers are out of scope. H200, GB200, and B300 are not tested in this report; B200 8x SXM6 is sufficient for the observed 1M real-prose path.

Reproduction checklist

Use the same B200-class hardware or disclose differences.
Use vllm/vllm-openai:nightly at commit aa2b56f or a later nightly; this report has not bisected which specific PR landed the V4-Pro long-context fix.
Install pytest in the derived image if this nightly import path still requires it.
Launch with the full DeepSeek-V4 launch recipe above.
For chat correctness, pass chat_template_kwargs={"thinking":true,"reasoning_effort":"high"}.
Allocate enough max_tokens for reasoning plus final content.
Run the v1.4 7-prompt suite.
Run at least one coherent real-prose long-context check. Do not use random token IDs for correctness.
If claiming kernel dispatch, capture startup log evidence plus an nsys window or equivalent GPU kernel evidence.

Follow-up checks I can run

Bisect which upstream vLLM PR landed the V4-Pro 128K+ real-prose fix between the cu130 image and nightly aa2b56f, so the fix can be attributed to a specific commit and tracked into a future stable tag.
Look at moving the import-time cupy.testing path out of vllm.ir.ops.layernorm registration so the nightly does not require an external pytest install. I can file a separate, narrowly scoped issue if maintainers want it tracked there.
Re-test on a later nightly (and on v0.21.0 / v0.20.2 stable tags) to confirm the 1M real-prose path holds across image revisions.
If PR #27114 is expected to also cover V4-Pro, the synthetic random-token-ID corruption under PIECEWISE noted above (observed at 32K) is a counter-data point worth investigating; I can rerun against any specific recipe maintainers suggest.

Filing as a single performance benchmark/compatibility report because the findings share the same hardware, image, and recipe. I can split into separate issues (cu130 cliff, pytest import path, PR #27114 / PIECEWISE on V4-Pro) if triage prefers narrower surfaces.

Proof artifacts

Available on request from the reporter (raw JSONL, logs, configs):

v1.4 correctness suite run: summary.json with per-prompt verdicts, results.md, per-prompt response captures, wall time 86.82 s.
Request-mode closure run: phase-1 matrix of default-chat vs. explicit-non-thinking vs. Think High vs. Think Max on the prompt 3 distractor task, plus the 50x prompt 3 stress under Think High.
1M chat spot under Think High with max_tokens=4096: request payload, response, and usage object (prompt_tokens=1032189, completion_tokens=493, wall_sec=142.85).
Kernel evidence: nsys stats --report cuda_gpu_kern_sum table over a 30 s decode window, startup-backend log evidence, nvidia-smi dmon -s u decode-load sample.
cu130 real-prose failure: per-context-length request logs showing the 3600 s timeout at 128K, 512K, and 1M.

I can attach selected JSONL excerpts inline as code blocks in a follow-up comment, post the staged compact GitHub gist bundle for the v1.4 summary.json plus the 1M chat-spot usage object (single gist, public, no expiry), or share a tarball of the raw run directories on request (GitHub Releases asset attached to this draft account, retained 90 days). Internal canonical paths are kept off this post; if you want to see the on-box capture process, ask and I can publish a sanitized walkthrough.
