vllm - [Performance]: DeepSeek-V4-Pro 128K+ timeout on deepseekv4-cu130; nightly aa2b56f completes 1M real-prose checks on 8x B200



Summary

On a single Nebius 8x NVIDIA B200 SXM6 node, vllm/vllm-openai:deepseekv4-cu130 timed out at the 3600 s client deadline for coherent real-prose, completion-style DeepSeek-V4-Pro requests at 128K, 512K, and 1M input tokens under the published DeepSeek-V4 launch recipe.

As a positive control, vllm/vllm-openai:nightly build 0.21.1rc1.dev281+gaa2b56ffb / commit aa2b56f completed the same coherent real-prose checks through 1,042,080 input tokens on the same 8x B200 host. The long-context cells are limited-sample checks: n=1 per context length, plus one nsys repeat for the nightly 1M path. This report documents that image-specific delta and provides enough environment/kernel evidence for maintainers to decide whether the cu130 path is stale, missing a fix, or should be replaced in docs or recipes.

This is not a claim that V4-Pro is generally production-ready on B200. The passing path is a nightly image plus a one-line local derived layer, and the long-context cells are limited-sample validation.

Maintainer asks

  1. Is vllm/vllm-openai:deepseekv4-cu130 expected to remain supported for DeepSeek-V4, or should users move to a newer nightly or stable image?
  2. Is there a known commit or PR between deepseekv4-cu130 and aa2b56f that explains the 128K+ timeout delta?
  3. Should the import-time dependency of cupy.testing on pytest be filed as a separate issue?
  4. If PR #27114 is expected to apply to V4-Pro, what recipe should be used for a PIECEWISE retest?

Environment

| Component | Value |
| --- | --- |
| Cloud / instance | Nebius AI Cloud, single 8-GPU node b200-bench-8gpu-1 |
| GPUs | 8x NVIDIA B200 SXM6 |
| GPU architecture | SM100 |
| VRAM | 192 GB HBM3e per GPU |
| Driver | 580.126.09 |
| Model | deepseek-ai/DeepSeek-V4-Pro |
| Model on disk | 806 GB, 64 safetensors shards |
| Serving image | vllm/vllm-openai:nightly plus one derived layer installing pytest, local tag vllm-nightly-fix:local |
| Local derived image ID | sha256:386e94637c58ba34efec901a678b9f3ff74cec3037513f10a1d7b58c21311b2a, created 2026-05-26T17:04:55Z |
| vLLM version | 0.21.1rc1.dev281+gaa2b56ffb |
| vLLM commit | aa2b56f |
| Nightly image digest | sha256:35f29a91c33e632aed24ccec68bb5cf872e1b5066eccc49ff2e406b641964e11 |
| Python | 3.12.13 |
| CUDA / PyTorch | CUDA 13.0, PyTorch 2.11.0+cu130 |
| Triton | 3.6.0 |
| FlashInfer | 0.6.11.post2 |
| Transformers / Accelerate | Transformers 5.9.0, Accelerate 1.13.0 |
| NCCL | 2.28.9 via torch.cuda.nccl.version() |
| OS / kernel | Ubuntu 24.04.4 LTS, Linux 6.11.0-1016-nvidia |

Provenance: image, version, and commit rows are pulled from the nightly image labels and vllm Python package metadata inside the running container; the Python / CUDA / PyTorch / Triton / FlashInfer / Transformers / Accelerate / NCCL / OS / driver rows are captured from pip freeze, torch.cuda.nccl.version(), nvidia-smi, and uname -a on the test host. Raw capture is available with the proof artifacts on request.
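The package-version rows could be gathered with a short script along these lines. This is a minimal sketch, not the capture tooling used for this report: the `collect_versions` helper and the distribution names passed to it are assumptions, and it reads installed-package metadata only (it does not query the driver, NCCL, or the kernel).

```python
from importlib import metadata


def collect_versions(packages):
    """Return {distribution name: version string} for the given names.

    Packages that are not installed are reported as "not installed"
    instead of raising, so one missing package does not abort the capture.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return versions


if __name__ == "__main__":
    # Distribution names are assumptions; adjust them to match `pip freeze`.
    wanted = ["vllm", "torch", "triton", "transformers", "accelerate"]
    for name, version in collect_versions(wanted).items():
        print(f"{name}=={version}")
```

Run inside the serving container, this prints one `name==version` line per package, which can be pasted next to the `nvidia-smi` and `uname -a` output.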

The derived layer is only to satisfy a startup import path in this nightly: vllm.ir.ops.layernorm registration walks through cupy.testing, which imports pytest. No vLLM source change was made. Maintainers may want to look at moving this pytest-touching path out of import time; the derived layer would be unnecessary if the registration did not eagerly walk through cupy.testing.

This report does not bisect which upstream PR landed the V4-Pro long-context fix between the published deepseekv4-cu130 image and nightly aa2b56f. Commit aa2b56f (build 2026-05-26) is the earliest nightly I tested where the 128K+ real-prose timeout resolves.

A related counter-observation for maintainers tracking PR #27114 (the documented Blackwell FULL_AND_PIECEWISE/PIECEWISE plan/runtime mismatch at max_model_len > 131072): in this V4-Pro setup, forcing cudagraph_mode=PIECEWISE did not improve the synthetic random-token-ID probe and made corruption appear at shorter lengths. Because that probe uses synthetic random IDs, I am treating this as a counter-observation only, not a real-prose correctness failure. Real-prose workloads on FULL_AND_PIECEWISE were unaffected.
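For anyone repeating the PIECEWISE comparison, the `--compilation-config` value can be generated rather than hand-edited, so that only the graph mode differs between runs. The sketch below is illustrative: `compilation_config_arg` is a hypothetical helper, and it deliberately accepts only the two modes discussed in this report.

```python
import json

# Only the two modes compared in this report; other modes are out of scope here.
TESTED_MODES = ("FULL_AND_PIECEWISE", "PIECEWISE")


def compilation_config_arg(cudagraph_mode):
    """Return the JSON string to pass to --compilation-config."""
    if cudagraph_mode not in TESTED_MODES:
        raise ValueError(f"mode not covered by this report: {cudagraph_mode}")
    config = {"cudagraph_mode": cudagraph_mode, "custom_ops": ["all"]}
    # Compact separators reproduce the spelling used in the launch command.
    return json.dumps(config, separators=(",", ":"))
```

Keeping `custom_ops` fixed while switching `cudagraph_mode` makes the two launches differ in exactly one setting.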

Public release cross-check, refreshed immediately before posting: the vLLM releases page still lists v0.21.0 as the latest release and v0.20.2 as a DeepSeek-V4 patch release. This draft reports the measured nightly aa2b56f result only; it does not claim the same behavior on v0.20.2 or v0.21.0.

Reporter context: independent benchmarking work on Nebius AI Cloud 8x B200 hardware. Posting as an individual contributor; not an official Nebius statement.

Closest prior art reviewed before posting: the vLLM 2026-04-24 "DeepSeek-V4 on vLLM" blog (which is the source of the published vllm/vllm-openai:deepseekv4-cu130 image), vLLM PR #27114 (Blackwell FULL_AND_PIECEWISE/PIECEWISE plan/runtime mismatch at max_model_len > 131072), and the DeepSeek-V4-Pro model card on Hugging Face. A search of the vLLM Issues and Discussions inboxes did not surface an existing report of the cu130 128K+ real-prose timeout under the published launch recipe. vLLM GitHub Discussions are pinned as no longer used, and current benchmark/performance reports are present in GitHub Issues; filing here as a performance issue matches the active GitHub venue. If a related report exists, I can cross-link it.

Launch command

docker run --gpus all --ipc=host --network host \
  --shm-size=32g \
  --ulimit nofile=1048576:1048576 \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -e HF_HUB_OFFLINE=1 \
  -e NCCL_DEBUG=WARN \
  -e VLLM_ENGINE_READY_TIMEOUT_S=1800 \
  -v /data/models/v4-pro:/models/v4-pro:ro \
  vllm-nightly-fix:local \
  --model /models/v4-pro \
  --served-model-name deepseek-ai/DeepSeek-V4-Pro \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --block-size 256 \
  --enable-expert-parallel \
  --data-parallel-size 8 \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_v4 \
  --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
  --attention_config.use_fp4_indexer_cache=True \
  --host 127.0.0.1 \
  --port 8000

Request-mode caveat for chat:

{
  "chat_template_kwargs": {
    "thinking": true,
    "reasoning_effort": "high"
  },
  "max_tokens": 4096
}

The output budget matters. In a 1M-token chat spot check, max_tokens=512 spent the entire completion budget on reasoning and returned empty final content; max_tokens=4096 returned clean final content.
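The budget caveat can be checked programmatically. The following is a minimal sketch, assuming an OpenAI-style chat completion response (`choices[0].message.content`, `choices[0].finish_reason`); the helper names are hypothetical, and the model name is the `--served-model-name` from the launch command.

```python
def build_chat_payload(messages, max_tokens=4096):
    """Build a /v1/chat/completions body in the validated Think High shape."""
    return {
        "model": "deepseek-ai/DeepSeek-V4-Pro",
        "messages": messages,
        "max_tokens": max_tokens,
        "chat_template_kwargs": {"thinking": True, "reasoning_effort": "high"},
    }


def final_content_is_empty(response):
    """True when the first choice carries no final answer text."""
    message = response["choices"][0]["message"]
    return not (message.get("content") or "").strip()


def budget_exhausted(response):
    """True when generation stopped because max_tokens was reached."""
    return response["choices"][0].get("finish_reason") == "length"
```

An empty final content together with `finish_reason == "length"` suggests the budget was consumed by reasoning, which matches the max_tokens=512 observation; raising `max_tokens` is the first thing to try before treating it as a failure.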

Compatibility matrix

| Path | Result | Notes |
| --- | --- | --- |
| vllm/vllm-openai:deepseekv4-cu130, DeepSeek-V4 launch recipe | Timed out at 128K, 512K, and 1M | Each cell hit the 3600 s client deadline; n=1 per context length |
| vllm/vllm-openai:nightly commit aa2b56f, same launch recipe | Observed clean through 1,042,080 coherent real-prose input tokens | Same 8x B200 host; v1.4 correctness suite passed under Think High chat request shape; kernel evidence captured |
| Nightly aa2b56f, default chat request shape | Negative control | Arithmetic distractor probe was 23/50; included to avoid reproduction ambiguity, not asserted as a vLLM bug |
| Nightly aa2b56f, explicit non-thinking chat | Negative control | Arithmetic distractor probe was 19/50; included to document request-shape sensitivity |

Correctness and request-shape caveats

The current canonical suite is local card-format-standards v1.4: seven prompts covering factual, long-context retrieval, arithmetic, code, multi-turn recall, decode-heavy ASCII output, and JSON/tool calling.

These checks are included to prevent false reproduction failures. The default-chat and explicit non-thinking rows above are not asserted as vLLM bugs.

For chat-shaped correctness checks, the validated request shape used DeepSeek-V4 thinking mode with reasoning_effort: high:

{"chat_template_kwargs":{"thinking":true,"reasoning_effort":"high"}}

The long-context real-prose cells below are completion-style token-ID requests, not the same chat request shape.

V4-Pro nightly aa2b56f, Think High, --min-max-tokens 512:

| Prompt | Result | Notes |
| --- | --- | --- |
| 1 factual | pass | Expected 35, non-ASCII ratio 0 |
| 2 long-context retrieval | pass | Recovered lampshade-47 from Federalist haystack |
| 3 arithmetic | pass | Expected 43, non-ASCII ratio 0 |
| 4 code | pass | Python fib function compiled, no forbidden imports |
| 5 multi-turn | pass | Recovered Walter and marine biologist |
| 6 decode-heavy | pass | 11256 chars, non_ascii_ratio=0.0 |
| 7 JSON / tool call | pass | Guided JSON pass, free JSON pass, real tool-call path pass |

Suite wall time was 86.8 s. The tool-call path emitted a tool_calls array with function name record_person and arguments:

{"name":"Ada Rivers","age":34,"hobby":"tidepool photography"}
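A reproduction can verify the tool-call path with a small check such as the one below. It is a sketch that assumes the OpenAI-style response layout (`choices[0].message.tool_calls[i].function` with a JSON-encoded `arguments` string); `extract_tool_call` is a hypothetical helper, not part of the v1.4 suite.

```python
import json


def extract_tool_call(response, expected_name):
    """Return the parsed arguments of the first tool call named expected_name.

    Raises ValueError when the response carries no matching tool call.
    """
    message = response["choices"][0]["message"]
    for call in message.get("tool_calls") or []:
        function = call["function"]
        if function["name"] == expected_name:
            # `arguments` arrives as a JSON string, not as a dict.
            return json.loads(function["arguments"])
    raise ValueError(f"no tool call named {expected_name!r}")
```

Calling `extract_tool_call(response, "record_person")` on a passing response should return a dict with the `name`, `age`, and `hobby` keys shown above.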

Prompt 3 was separately stress-tested because it exposed the default-chat failure mode. Under Think High it was 50/50 clean.

Long-context real-prose envelope

The headline long-context measurements use coherent Federalist text, not synthetic random token IDs. Synthetic random-token-ID prompts produced degenerate output on this model at long context and are treated as a benchmark-protocol artifact for correctness. Random-ID throughput measurements, where reported, should be read as synthetic load-generation throughput only; they are not content-equivalent to real-prose or chat throughput at the same request shape.

Pass criteria for the cells below:

  • Input: a single list of token IDs constructed from coherent Federalist Papers prose tokenized for V4-Pro (no random IDs).
  • Sampling: temperature: 0, max_tokens set so the model can return 64 tokens of continuation, max-concurrency 1, sequential.
  • Timing: wall time measured around the synchronous HTTP request.
  • "Clean" = non_ascii_ratio of the returned content is 0 and no Chinese de_count repetition is observed; the returned text is coherent Federalist continuation.
  • Sample size: each cell is n=1. The nsys row is a separate single-request capture under Nsight Systems on the same launch shape. The deepseekv4-cu130 row ran each cell to the 3600 s client deadline.
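The cleanliness metrics named in the criteria are not defined in this report, so the sketch below shows one plausible reading. The definitions are assumptions: `non_ascii_ratio` is taken as the share of characters above code point 127, and `cjk_count` counts characters in the CJK Unified Ideographs block. Coherence of the continuation still needs a human read and is not automated here.

```python
def non_ascii_ratio(text):
    """Fraction of characters outside the ASCII range (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(1 for ch in text if ord(ch) > 127) / len(text)


def cjk_count(text):
    """Count characters in the CJK Unified Ideographs block (U+4E00 to U+9FFF)."""
    return sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")


def is_clean(text):
    """Assumed automated part of "clean": non-empty and entirely ASCII."""
    return bool(text) and non_ascii_ratio(text) == 0.0
```

`cjk_count` is redundant for the pass/fail decision (an all-ASCII string has none) but is useful as a diagnostic when a run fails.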

Input lengths: 128K = 128,256 tokens, 512K = 521,040 tokens, 1M = 1,042,080 tokens (the actual tokenization of the Federalist haystack at each cut).
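A request in the shape described by the pass criteria can be sketched as follows. This assumes the OpenAI-compatible completions endpoint accepts a list of token IDs as `prompt`; `build_completion_payload` and `timed_call` are hypothetical helpers, and `send` stands for whatever function the caller uses to POST the payload.

```python
import time


def build_completion_payload(token_ids, continuation_tokens=64):
    """Build a /v1/completions body whose prompt is a list of token IDs."""
    return {
        "model": "deepseek-ai/DeepSeek-V4-Pro",
        "prompt": list(token_ids),
        "temperature": 0,
        "max_tokens": continuation_tokens,
    }


def timed_call(send, payload):
    """Call send(payload) once and return (result, wall_seconds)."""
    start = time.perf_counter()
    result = send(payload)
    return result, time.perf_counter() - start
```

In a real run, `send` would wrap a synchronous HTTP POST to the server with a 3600 s client timeout, for example through `urllib.request.urlopen(request, timeout=3600)`, so the measured wall time covers the whole request.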

| Image / path | 128K wall (s) | 512K wall (s) | 1M wall (s) | 1M result |
| --- | --- | --- | --- | --- |
| vLLM nightly aa2b56f, FULL_AND_PIECEWISE | 12.9 | 41.6 | 148.1 | clean at 1,042,080 input tokens |
| vLLM nightly aa2b56f, nsys repeat | n/a | n/a | 152.0 | clean at 1,042,080 input tokens |
| vLLM deepseekv4-cu130 | TIMEOUT 3600 | TIMEOUT 3600 | TIMEOUT 3600 | no completion |

One chat-shaped 1M spot check under Think High with max_tokens=4096 returned clean final content:

| Prompt tokens | Completion tokens | Wall time | Final content | Cleanliness |
| --- | --- | --- | --- | --- |
| 1,032,189 | 493 | 142.85 s | 1301 chars | cjk_count_content=0, de_count_content=0, clean=true |

Short-context serving checks

Short-context random-ID serving sweeps were also run and were successful, but they are not load-bearing for this issue. The relevant point is that the reported delta appears at coherent 128K+ real-prose requests rather than at basic server bring-up or short synthetic serving.

Kernel and hardware evidence

The 1M real-prose run was repeated under Nsight Systems on the same launch shape.

Startup/backend evidence includes:

  • DP/EP launch shape reached warmup across all eight ranks: logs show Worker_DP0_EP0 through Worker_DP7_EP7 and EngineCore_DP0 through EngineCore_DP7.
  • fp8 KV cache.
  • FlashInfer top-k/top-p sampling.
  • PYNCCL data-parallel / expert-parallel communication.
  • AgRsAll2AllManager.
  • DeepGEMM warmup.
  • FlashInfer trtllm_fp4_block_scale_moe autotune.
  • TileLang MHC kernels.
  • CUDA graph capture for PIECEWISE prefill/decode and FULL decode.

nsys stats --report cuda_gpu_kern_sum over the captured window includes:

  • Sparse attention.
  • deep_gemm::sm100_fp8_fp4_gemm_1d1d_impl.
  • deep_gemm::sm100_fp4_mqa_logits.
  • kernel_cutlass...IndexerQMxFp4Kernel.
  • _fused_kv_compress_norm_rope_insert_indexer_mxfp4_attn.

The B200 driver did not expose direct tensor_active_pct, so the tensor/SM sample used the local nvidia-smi dmon -s u fallback (default 1 Hz). Single 10-sample capture at 1 Hz during a 4096-token decode-heavy request: p50 SM 99%, p50 memory 27%. This is sufficient to confirm kernels are dispatching and the device is busy during decode; it is not a tensor-core utilization measurement.
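The p50 figures can be derived from saved `nvidia-smi dmon -s u` output with a parser like this one. It is a sketch under assumptions: column positions are read from the header row because the column set varies between driver versions, rows from all GPUs are pooled into one median, and `dmon_medians` is a hypothetical helper rather than the script used for this report.

```python
import statistics


def dmon_medians(dmon_text, columns=("sm", "mem")):
    """Return {column: median} from `nvidia-smi dmon -s u` text output."""
    names = None
    samples = {column: [] for column in columns}
    for line in dmon_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):
            header = stripped.lstrip("#").split()
            # The first header row names the columns; the next row holds units.
            if names is None and header and header[0] == "gpu":
                names = header
            continue
        if names is None:
            continue
        row = dict(zip(names, stripped.split()))
        for column in columns:
            value = row.get(column, "-")
            if value.isdigit():  # skips "-" placeholders for unsupported counters
                samples[column].append(int(value))
    return {c: statistics.median(v) for c, v in samples.items() if v}
```

As with the measurement itself, the result only shows that the SMs and memory were busy during the sample window; it says nothing about tensor-core utilization.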

What this does not claim

  • Default and explicit non-thinking chat failures are not asserted as vLLM bugs. Those modes failed the arithmetic distractor probe at high rates and are included to prevent request-shape ambiguity.
  • No general verdict on deepseekv4-cu130 is asserted. Based on this host and workload alone, I would not recommend it for 128K+ V4-Pro real-prose requests.
  • Tool-parser coverage is limited to a basic OpenAI-style function tool call. Complex DSML or tool-parser edge cases are not validated by the v1.4 suite.
  • Stable-tag production posture is not established. The passing path is a nightly build plus a one-line local derived image.
  • Synthetic random-ID long-context output is not meaningful correctness evidence for this model.
  • Admission-control and production quota guidance are out of scope. That needs a dedicated KV/concurrency envelope.
  • Other hardware tiers are out of scope. H200, GB200, and B300 are not tested in this report; B200 8x SXM6 is sufficient for the observed 1M real-prose path.

Reproduction checklist

  • Use the same B200-class hardware or disclose differences.
  • Use vllm/vllm-openai:nightly at commit aa2b56f or a later nightly; this report has not bisected which specific PR landed the V4-Pro long-context fix.
  • Install pytest in the derived image if this nightly import path still requires it.
  • Launch with the full DeepSeek-V4 launch recipe above.
  • For chat correctness, pass chat_template_kwargs={"thinking":true,"reasoning_effort":"high"}.
  • Allocate enough max_tokens for reasoning plus final content.
  • Run the v1.4 7-prompt suite.
  • Run at least one coherent real-prose long-context check. Do not use random token IDs for correctness.
  • If claiming kernel dispatch, capture startup log evidence plus an nsys window or equivalent GPU kernel evidence.

Follow-up checks I can run

  • Bisect which upstream vLLM PR landed the V4-Pro 128K+ real-prose fix between the cu130 image and nightly aa2b56f, so the fix can be attributed to a specific commit and tracked into a future stable tag.
  • Look at moving the import-time cupy.testing path out of vllm.ir.ops.layernorm registration so the nightly does not require an external pytest install. I can file a separate, narrowly scoped issue if maintainers want it tracked there.
  • Re-test on a later nightly (and on v0.21.0 / v0.20.2 stable tags) to confirm the 1M real-prose path holds across image revisions.
  • If PR #27114 is expected to also cover V4-Pro, the synthetic random-token-ID corruption under PIECEWISE noted above, which appeared at 32K, is a counter-data point worth investigating; I can rerun against any specific recipe maintainers suggest.

Filing as a single performance benchmark/compatibility report because the findings share the same hardware, image, and recipe. I can split into separate issues (cu130 cliff, pytest import path, PR #27114 / PIECEWISE on V4-Pro) if triage prefers narrower surfaces.

Proof artifacts

Available on request from the reporter (raw JSONL, logs, configs):

  • v1.4 correctness suite run: summary.json with per-prompt verdicts, results.md, per-prompt response captures, wall time 86.82 s.
  • Request-mode closure run: phase-1 matrix of default-chat vs. explicit-non-thinking vs. Think High vs. Think Max on the prompt 3 distractor task, plus the 50x prompt 3 stress under Think High.
  • 1M chat spot check under Think High with max_tokens=4096: request payload, response, and usage object (prompt_tokens=1032189, completion_tokens=493, wall_sec=142.85).
  • Kernel evidence: nsys stats --report cuda_gpu_kern_sum table over a 30 s decode window, startup-backend log evidence, nvidia-smi dmon -s u decode-load sample.
  • cu130 real-prose failure: per-context-length request logs showing the 3600 s timeout at 128K, 512K, and 1M.

I can attach selected JSONL excerpts inline as code blocks in a follow-up comment, post the staged compact GitHub gist bundle for the v1.4 summary.json plus the 1M chat-spot usage object (single gist, public, no expiry), or share a tarball of the raw run directories on request (GitHub Releases asset attached to this draft account, retained 90 days). Internal canonical paths are kept off this post; if you want to see the on-box capture process, ask and I can publish a sanitized walkthrough.
