vllm - ✅(Solved) Fix [Bug]: workspace.py rejects post-lock growth in deepseek_v4_attention._forward_prefill (DSV4 #40991) — patch attached [2 pull requests, 1 comments, 1 participants]

pasta-paul · 2026-05-05T06:35:00Z



vllm-project/vllm#41700•Fetched 2026-05-06 06:15:23


`vllm/v1/worker/workspace.py:_ensure_workspace_size` locks the per-rank attention workspace at the post-profile size and refuses growth. DeepSeek-V4-Flash's `_forward_prefill` (added in #40991) then fails with an `AssertionError` on real prompts that need slightly more workspace than the dummy profile run sized for. The locked size is structural: it cannot be raised with `--max-num-batched-tokens`, `--max-num-seqs`, or `--gpu-memory-utilization`. The currently published workaround is `--enforce-eager`, which costs ~4× decode throughput.

The smoking gun is in the source: `deepseek_v4_attention.py:170-172` carries this comment:

# Prefill is processed in fixed-size chunks; this bounds the bf16 kv-gather
# workspace allocated at _forward_prefill (and the matching profile-time
# reservation in attention_impl's dummy-run branch).
PREFILL_CHUNK_SIZE = 4

The "matching profile-time reservation in attention_impl's dummy-run branch" implies a pre-allocation hook was always intended. It just isn't there: the dummy-run path (`attention_impl`, `if not isinstance(attn_metadata, dict)`) returns early without ever calling through to `_forward_prefill`. Warmup therefore never sees the prefill workspace requirements, and `lock_workspace()` locks the workspace at the size needed by decode alone. The first real prefill request then crashes.
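The failure mode described above can be illustrated with a toy model. This is a minimal sketch, not vLLM's actual workspace code: `ToyWorkspace`, `ensure`, `lock`, and `mb` are hypothetical names, and treating 1 MB as 2**20 bytes is an assumption. Only the two sizes (21.62 MB and 21.80 MB) come from the report.

```python
class ToyWorkspace:
    """Toy stand-in for a lockable scratch workspace (not vLLM's real class)."""

    def __init__(self) -> None:
        self.size_bytes = 0
        self.locked = False

    def ensure(self, required_bytes: int, caller: str) -> None:
        if required_bytes <= self.size_bytes:
            return
        # Growth after lock() is refused, mirroring the reported assertion.
        assert not self.locked, (
            f"Workspace is locked but allocation from '{caller}' requires "
            f"{required_bytes / 2**20:.2f} MB, current size is "
            f"{self.size_bytes / 2**20:.2f} MB."
        )
        self.size_bytes = required_bytes

    def lock(self) -> None:
        self.locked = True


def mb(x: float) -> int:
    """Convert megabytes to bytes (assumes 1 MB = 2**20 bytes)."""
    return int(x * 2**20)


ws = ToyWorkspace()
ws.ensure(mb(21.62), "decode_warmup")  # the profile run only exercises decode
ws.lock()                              # the lock fires before any prefill ran

try:
    ws.ensure(mb(21.80), "_forward_prefill")  # first real prefill request
    crashed = False
except AssertionError as exc:
    crashed = True
    message = str(exc)
```

In this toy, the request fails even though it needs less than 1% more memory than the locked size, because nothing declared the prefill requirement before `lock()` ran.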

Error Message

AssertionError: Workspace is locked but allocation from 'deepseek_v4_attention.py:1457:_forward_prefill' requires 21.80 MB, current size is 21.62 MB. Workspace growth is not allowed after locking.

PR fix notes

PR #40991: [DSv4][Nvidia] SM12x DeepSeek V4 support

Repository: vllm-project/vllm
Author: jasl
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/40991

Description (problem / solution / changelog)

This PR incorporates https://github.com/vllm-project/vllm/pull/40929 and is now DeepGEMM-free, thanks to @bbbearxyz!

UPDATE: To better align with the official DeepSeek API and the B200 code path, I built a harness to measure correctness, performance, and quality: https://github.com/jasl/vllm-ds4-sm120-harness. I will post the latest report for people to review.

Summary

This PR enables DeepSeek V4 Flash to serve on NVIDIA SM12x GPUs, tested on a 2x RTX PRO 6000 Blackwell Workstation Edition host.

The important change from the earlier prototype is that this PR no longer pins or rewrites the DeepGEMM dependency. The branch keeps vLLM's upstream DeepGEMM installer and CMake metadata intact, and implements the required SM12x runtime fallbacks in vLLM:

DeepSeek V4 tokenizer / parser / model integration.
Portable Triton sparse MLA path for SM12x.
fp8_ds_mla sparse MLA cache handling.
Sink-aware SWA + compressed sparse attention.
vLLM-side SM12x fallbacks for DeepSeek V4-specific DeepGEMM calls.
SM12x sparse indexer and paged MQA fallback kernels.
Guardrails so existing SM90 / SM100 optimized paths remain unchanged.

Motivation

DeepSeek V4 currently relies on kernels that are available on Hopper and datacenter Blackwell paths, but not on SM120 / SM121 workstation and consumer Blackwell GPUs. In particular, SM12x cannot directly reuse SM90 WGMMA kernels or SM100 tcgen05 kernels.

This PR adds correctness-first portable kernels for the missing SM12x pieces, then optimizes the hot sparse MLA paths enough for real serving. The result is a reviewable vLLM-side compatibility layer that does not require maintainers to accept a temporary DeepGEMM fork pin.

Scope

Included:

SM12x Triton sparse MLA decode and prefill paths.
fp8_ds_mla packed cache decode for SWA and compressed sparse candidates.
Sink-aware sparse attention denominator semantics.
SM12x local fallbacks for DeepSeek V4-specific DeepGEMM call sites.
Sparse indexer memory bound fixes for long prefill.
DeepSeek V4 tokenizer handling and tool-call parser fixes needed by the new model path.
Targeted correctness tests and an HTTP logprobs oracle comparator.

Not included:

Replacing FlashMLA on SM90 / SM100.
A final Tensor Core implementation for every SM12x kernel.
MTP speculative decoding fixes. Those are kept in a separate branch / PR.
Community performance experiments that are useful for evaluation but too broad for this PR.
Any DeepGEMM fork pin or DeepGEMM CMake / install-script rewrite.

Runtime controls

The SM12x sparse MLA path registers its environment variables in vllm.envs, so users should not see unknown-variable warnings for these knobs.

| Variable | Default | Meaning |
|---|---:|---|
| `VLLM_TRITON_MLA_SPARSE` | auto | `1` forces the Triton sparse MLA path, `0` disables it. When unset, vLLM enables it on SM12x where FlashMLA sparse is unavailable. |
| `VLLM_TRITON_MLA_SPARSE_TOPK_CHUNK_SIZE` | `512` | Top-k candidate chunk size for sparse MLA accumulation. Lower values reduce transient workspace at the cost of more kernel work. |
| `VLLM_TRITON_MLA_SPARSE_QUERY_CHUNK_SIZE` | `256` | Query chunk size used by prefill sparse MLA fallback. |
| `VLLM_TRITON_MLA_SPARSE_HEAD_BLOCK_SIZE` | auto | Optional decode head block override. Supported values are `1`, `2`, and `4`; benchmarks used `4`. |
| `VLLM_TRITON_MLA_SPARSE_MATMUL_DECODE` | auto | Optional matmul-based sparse MLA decode toggle. When unset it auto-enables on SM12x. |
| `VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPH` | context dependent | Allows compile / CUDA graphs for the sparse MLA path. In the formal PR branch, unset keeps graphs for normal decode and disables them for speculative decoding; `1` forces allow, `0` disables. |

Operational warning: do not set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True with the TP=2 CUDA graph configuration used below. In local testing it made custom all-reduce fail during CUDA graph address registration. Leaving it unset avoids that failure.

Branches

Formal PR branch:

jasl/vllm@ds4-sm120
HEAD: 7a34ed538

Preview / evaluation branch with extra community performance work and MTP fixes:

jasl/vllm@ds4-sm120-full
HEAD: ab7336f21

The preview branch is not intended as the review target. It exists so users can try the broader optimization stack while this PR stays focused.

Test environment

Hardware:

Host: jasl-workstation
GPU: 2x NVIDIA RTX PRO 6000 Blackwell Workstation Edition
Compute capability: SM120
GPU memory: 95 GiB class per GPU

Software:

OS: Ubuntu, Linux 7.0.0-14-generic
CUDA toolkit: /usr/local/cuda
Python: 3.13.13
PyTorch: 2.11.0+cu130
vLLM package metadata: 0.20.1rc1.dev12+g363ffa145

Benchmark environment:

export PATH="/usr/local/cuda/bin:$PATH"
export CUDA_HOME="/usr/local/cuda"
export TRITON_PTXAS_PATH="/usr/local/cuda/bin/ptxas"
export CUDA_ARCH_LIST="120a"
export TORCH_CUDA_ARCH_LIST="12.0a"
export VLLM_TRITON_MLA_SPARSE=1
export VLLM_TRITON_MLA_SPARSE_HEAD_BLOCK_SIZE=4
export VLLM_RPC_TIMEOUT=100000
unset PYTORCH_CUDA_ALLOC_CONF

Note: DGX Spark uses 121a and 12.1a for CUDA_ARCH_LIST and TORCH_CUDA_ARCH_LIST respectively.
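The two arch-list formats above follow directly from the compute capability. A small sketch of the mapping, assuming a hypothetical helper named `arch_strings` (the `a` suffix is taken from the values quoted above):

```python
def arch_strings(major: int, minor: int) -> dict[str, str]:
    """Build both arch-list spellings from a compute capability pair."""
    return {
        "CUDA_ARCH_LIST": f"{major}{minor}a",          # e.g. 120a
        "TORCH_CUDA_ARCH_LIST": f"{major}.{minor}a",   # e.g. 12.0a
    }


rtx_pro_6000 = arch_strings(12, 0)  # SM120 workstation Blackwell
dgx_spark = arch_strings(12, 1)     # SM121, as noted for DGX Spark
```

On a machine with PyTorch and a GPU, the `(major, minor)` pair could be read from `torch.cuda.get_device_capability()` rather than hard-coded.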

Validation

Formal PR branch checks:

python -m ruff check \
  vllm/envs.py \
  vllm/utils/deep_gemm.py \
  vllm/tokenizers/deepseek_v4_encoding.py \
  vllm/model_executor/layers/deepseek_v4_attention.py \
  vllm/v1/attention/backends/mla/sparse_mla_env.py \
  vllm/v1/attention/backends/mla/sparse_swa.py \
  tests/tokenizers_/test_deepseek_v4.py \
  tests/v1/attention/test_sparse_mla_env.py \
  tests/v1/attention/test_deepseek_v4_sparse_mla_reference.py \
  tests/v1/attention/test_sm120_deepgemm_fallbacks.py

Result:

All checks passed!

Compile check:

python -m py_compile \
  vllm/envs.py \
  vllm/utils/deep_gemm.py \
  vllm/tokenizers/deepseek_v4_encoding.py \
  vllm/v1/attention/backends/mla/sparse_mla_kernels.py \
  vllm/model_executor/layers/deepseek_v4_attention.py \
  vllm/v1/attention/backends/mla/sparse_swa.py

Targeted tests:

python -m pytest -q \
  tests/tokenizers_/test_deepseek_v4.py \
  tests/v1/attention/test_sparse_mla_env.py \
  tests/v1/attention/test_sparse_mla_backends.py \
  tests/v1/attention/test_sm120_deepgemm_fallbacks.py \
  tests/v1/attention/test_sparse_attn_indexer.py \
  tests/v1/attention/test_deepseek_v4_sparse_mla_reference.py

Result:

151 passed, 504 skipped, 16 warnings in 356.93s

Diff hygiene:

git diff --check origin/main...HEAD

Result: clean.

Preview branch focused checks:

python -m ruff check \
  vllm/v1/attention/backends/mla/sparse_mla_env.py \
  vllm/model_executor/layers/deepseek_v4_attention.py \
  tests/v1/spec_decode/test_mtp.py \
  tests/v1/attention/test_sparse_mla_env.py \
  tests/v1/attention/test_deepseek_v4_sparse_mla_reference.py

python -m pytest -q \
  tests/v1/spec_decode/test_mtp.py \
  tests/v1/attention/test_sparse_mla_env.py \
  tests/v1/attention/test_deepseek_v4_sparse_mla_reference.py

Result:

95 passed, 16 warnings in 48.35s

Serving command

Formal PR branch, no MTP:

PYTHONPATH=~/tmp/vllm-bench-ds4-sm120 \
~/tmp/vllm/.venv/bin/vllm serve deepseek-ai/DeepSeek-V4-Flash \
  --host 127.0.0.1 \
  --port 8017 \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --block-size 256 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.94 \
  --tensor-parallel-size 2 \
  --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_v4

Preview branch, MTP:

PYTHONPATH=~/tmp/vllm-bench-ds4-sm120-full \
~/tmp/vllm/.venv/bin/vllm serve deepseek-ai/DeepSeek-V4-Flash \
  --host 127.0.0.1 \
  --port 8018 \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --block-size 256 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.985 \
  --tensor-parallel-size 2 \
  --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_v4 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}'

Benchmark command

The short-context benchmark uses 128 input tokens -> 512 output tokens; the long-context benchmark uses 8192 input tokens -> 512 output tokens. Each row uses 48 prompts and temperature=0.

~/tmp/vllm/.venv/bin/vllm bench serve \
  --model deepseek-ai/DeepSeek-V4-Flash \
  --host 127.0.0.1 \
  --port <port> \
  --dataset-name random \
  --random-input-len <128-or-8192> \
  --random-output-len 512 \
  --num-prompts 48 \
  --max-concurrency <C> \
  --ignore-eos \
  --temperature 0 \
  --save-result \
  --result-dir <result-dir> \
  --result-filename <name>.json

Formal PR branch benchmark

Branch:

jasl/vllm@ds4-sm120
HEAD: 7a34ed538

Server memory setting:

--gpu-memory-utilization 0.94

MTP is not included in this branch. Starting the formal branch with --speculative-config '{"method":"mtp","num_speculative_tokens":2}' fails because the MTP fix stack is intentionally kept separate.

| Context | Concurrency | Output tok/s | Requests/s | Mean TPOT | Mean TTFT |
|---|---:|---:|---:|---:|---:|
| 128 -> 512 | 1 | 100.38 | 0.196 | 9.76 ms | 113.4 ms |
| 128 -> 512 | 4 | 296.84 | 0.580 | 13.16 ms | 171.9 ms |
| 128 -> 512 | 8 | 478.34 | 0.934 | 16.18 ms | 291.6 ms |
| 8192 -> 512 | 1 | 58.61 | 0.114 | 10.94 ms | 3143.0 ms |
| 8192 -> 512 | 2 | 81.35 | 0.159 | 15.37 ms | 4732.0 ms |

Result directory:

/home/jasl/tmp/ds4_sm120_bench_20260429_032651

Preview branch benchmark

Branch:

jasl/vllm@ds4-sm120-full
HEAD: ab7336f21

Server memory setting:

--gpu-memory-utilization 0.985

This branch includes the separate MTP fixes and community performance patches. It is for evaluation only, not the formal PR review target.

Startup notes:

no-MTP CUDA graph reserve: 3.67 GiB
no-MTP available KV cache: 10.6 GiB
MTP CUDA graph reserve: 4.38 GiB
MTP available KV cache: 6.2 GiB

| Context | Concurrency | no-MTP tok/s | MTP tok/s | MTP delta | no-MTP TPOT | MTP TPOT | no-MTP TTFT | MTP TTFT | MTP acceptance |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 128 -> 512 | 1 | 103.03 | 161.14 | +56.4% | 9.60 ms | 5.95 ms | 62.3 ms | 138.7 ms | 78.61% |
| 128 -> 512 | 4 | 303.20 | 326.51 | +7.7% | 12.93 ms | 11.47 ms | 145.6 ms | 346.0 ms | 80.14% |
| 128 -> 512 | 8 | 473.53 | 525.08 | +10.9% | 16.46 ms | 14.07 ms | 236.3 ms | 402.2 ms | 77.17% |
| 8192 -> 512 | 1 | 58.54 | 79.17 | +35.2% | 10.81 ms | 6.23 ms | 3223.4 ms | 3283.6 ms | 81.48% |
| 8192 -> 512 | 2 | 80.77 | 98.33 | +21.7% | 15.33 ms | 13.46 ms | 4843.8 ms | 3486.3 ms | 79.02% |
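The "MTP delta" column appears to be the relative change in output throughput, (MTP - no-MTP) / no-MTP. That reading is an assumption, but the sketch below shows it reproduces every reported value to within rounding.

```python
rows = [  # (no-MTP tok/s, MTP tok/s, reported delta in percent) from the table
    (103.03, 161.14, 56.4),
    (303.20, 326.51, 7.7),
    (473.53, 525.08, 10.9),
    (58.54, 79.17, 35.2),
    (80.77, 98.33, 21.7),
]

deltas = []
for no_mtp, mtp, reported in rows:
    delta = (mtp - no_mtp) / no_mtp * 100.0  # relative throughput gain
    deltas.append((delta, reported))
    print(f"computed {delta:+.1f}% vs reported {reported:+.1f}%")

max_gap = max(abs(delta - reported) for delta, reported in deltas)
```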

Result directory:

/home/jasl/tmp/ds4_sm120_full_bench_20260429_041151

Review notes

Changes made before this update:

Removed the temporary DeepGEMM fork pin and related env bridge.
Removed sparse MLA diagnostic dump hooks and tests.
Kept runtime-facing names production-oriented; test oracle helpers remain clearly separated from serving kernels.
Verified there are no stale prototype DeepGEMM refs.
Re-signed the branch with DCO trailers.
Re-ran targeted tests and benchmarks after the cleanup.

Known follow-ups

MTP speculative decoding should be reviewed as an independent PR.
ds4-sm120-full can continue to carry community performance patches for public evaluation.
Further SM12x optimization should focus on full decode profiling across indexer, MoE, collectives, sampling, and sparse MLA rather than broadening this PR.

Changed files

tests/kernels/attention/test_deepgemm_attention.py (modified, +69/-20)
tests/kernels/moe/test_moe.py (modified, +60/-0)
tests/models/test_deepseek_v4_mega_moe.py (modified, +53/-2)
tests/models/test_deepseek_v4_pp.py (added, +9/-0)
tests/quantization/test_fp8_scale_parameter.py (added, +33/-0)
tests/quantization/test_mxfp4.py (added, +38/-0)
tests/reasoning/test_deepseekv3_reasoning_parser.py (modified, +36/-2)
tests/tokenizers_/test_deepseek_v4.py (modified, +323/-0)
tests/tools/test_compare_vllm_http_logprobs_oracle.py (added, +115/-0)
tests/v1/attention/test_deepseek_v4_sparse_mla_reference.py (added, +3162/-0)
tests/v1/attention/test_sm120_deepgemm_fallbacks.py (added, +245/-0)
tests/v1/attention/test_sparse_attn_indexer.py (added, +40/-0)
tests/v1/attention/test_sparse_mla_backends.py (modified, +611/-3)
tests/v1/attention/test_sparse_mla_env.py (added, +96/-0)
tests/v1/core/test_prefix_caching.py (modified, +211/-0)
tests/v1/executor/test_ray_utils.py (modified, +46/-0)
tools/compare_vllm_http_logprobs_oracle.py (added, +431/-0)
vllm/config/compilation.py (modified, +1/-0)
vllm/entrypoints/chat_utils.py (modified, +11/-0)
vllm/entrypoints/openai/chat_completion/batch_serving.py (modified, +5/-1)
vllm/entrypoints/openai/chat_completion/protocol.py (modified, +109/-11)
vllm/entrypoints/openai/chat_completion/serving.py (modified, +7/-1)
vllm/entrypoints/openai/engine/protocol.py (modified, +9/-0)
vllm/entrypoints/serve/render/serving.py (modified, +27/-1)
vllm/envs.py (modified, +41/-0)
vllm/model_executor/kernels/linear/scaled_mm/cutlass.py (modified, +45/-0)
vllm/model_executor/layers/deepseek_v4_attention.py (modified, +694/-45)
vllm/model_executor/layers/deepseek_v4_triton_kernels.py (added, +1282/-0)
vllm/model_executor/layers/fused_moe/fused_marlin_moe.py (modified, +1/-0)
vllm/model_executor/layers/fused_moe/layer.py (modified, +24/-5)
vllm/model_executor/layers/quantization/utils/fp8_utils.py (modified, +38/-11)
vllm/model_executor/layers/sparse_attn_indexer.py (modified, +126/-38)
vllm/model_executor/models/deepseek_v4.py (modified, +93/-42)
vllm/reasoning/__init__.py (modified, +1/-1)
vllm/tokenizers/deepseek_v4_encoding.py (modified, +8/-3)
vllm/utils/deep_gemm.py (modified, +518/-3)
vllm/v1/attention/backends/mla/flashmla_sparse.py (modified, +18/-0)
vllm/v1/attention/backends/mla/indexer.py (modified, +29/-9)
vllm/v1/attention/backends/mla/sparse_mla_env.py (added, +150/-0)
vllm/v1/attention/backends/mla/sparse_mla_kernels.py (added, +2694/-0)
vllm/v1/attention/backends/mla/sparse_mla_reference.py (added, +242/-0)
vllm/v1/attention/backends/mla/sparse_swa.py (modified, +47/-0)
vllm/v1/attention/ops/deepseek_v4_ops/__init__.py (modified, +6/-0)
vllm/v1/attention/ops/deepseek_v4_ops/cache_utils.py (modified, +203/-17)
vllm/v1/attention/ops/deepseek_v4_ops/fp8_einsum.py (added, +175/-0)
vllm/v1/core/kv_cache_coordinator.py (modified, +13/-0)
vllm/v1/core/kv_cache_manager.py (modified, +9/-2)
vllm/v1/core/single_type_kv_cache_manager.py (modified, +159/-3)
vllm/v1/executor/ray_utils.py (modified, +19/-26)

Original issue text (vllm-project/vllm#41700)

Repro

vLLM: jasl/vllm@77bbc16 (current ds4-sm120 tip; same bug on 428e08e)
Cherry-pick: neuralmagic/vllm@kylesayrs/deepseek-ct@f910a73a93 (PR #41276 work)
Patch: packed_modules_mapping injected on DeepseekV4ForCausalLM (see pasta-paul/dsv4-flash-w4a16-fp8/scripts/patch_v4_packed_mapping.py)
transformers: 5.8.0.dev0 (HF main; PR #45643 add-deepseek-v4 was merged 2026-05-02)
compressed-tensors: 0.15.1.a20260428
Hardware: 2× NVIDIA DGX Spark GB10 (SM 12.1a, 121 GiB UMA), TP=2 over QSFP RDMA
Model: pastapaul/DeepSeek-V4-Flash-W4A16-FP8 (W4A16 INT4 GPTQ experts + FP8_BLOCK attention + BF16 shared experts)

vllm serve pastapaul/DeepSeek-V4-Flash-W4A16-FP8 \
  --served-model-name deepseek-v4-flash --trust-remote-code \
  --kv-cache-dtype fp8 --block-size 256 \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 --enable-auto-tool-choice \
  --reasoning-parser deepseek_v4 \
  --max-model-len 16384 \
  --max-num-seqs 4 --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.92 \
  -tp 2 --nnodes 2 \
  --master-addr <HEAD_IP> --master-port 29501 --node-rank 0
# rank 1 launches with --headless added

Then send a chat completion with a ~1.3K-token prompt and max_tokens=4000. The engine crashes during the first prefill:

AssertionError: Workspace is locked but allocation from
'deepseek_v4_attention.py:1457:_forward_prefill' requires 21.80 MB,
current size is 21.62 MB. Workspace growth is not allowed after locking.

The 21.62 MB locked size is identical across two builds 28 vLLM commits apart (only the file line number moved, 1454→1457). Real prompts required between 21.80 MB and 24.89 MB. Raising --max-num-batched-tokens from the default to 8192 did not grow the locked workspace. Lowering --max-num-seqs to 2 made the lock tighter (a smaller profile worst case), not larger.
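To put those sizes in proportion, the shortfall relative to the locked size can be computed directly from the numbers above. This is plain arithmetic on the reported values; the variable names are illustrative.

```python
LOCKED_MB = 21.62             # locked workspace size from the error message
observed_mb = [21.80, 24.89]  # smallest and largest real-prompt requirements

shortfalls = []
for need in observed_mb:
    shortfall = need - LOCKED_MB
    shortfalls.append(shortfall)
    print(f"need {need:.2f} MB: short by {shortfall:.2f} MB "
          f"({shortfall / LOCKED_MB:.1%} over the locked size)")
```

The gap ranges from well under 1% to roughly 15% of the locked size, so the reported crash is triggered by a very small overshoot at the low end.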

Working patch

The patch implements what the comment describes: a `_warmup_reserve_prefill_workspace()` method on `DeepseekV4MLAAttention` that calls `get_simultaneous()` with worst-case shapes computed from `max_model_len`, `max_num_batched_tokens`, and config constants. The wrapper's dummy-run early-return calls it:

# In attention_impl, dummy-run early-return:
if not isinstance(attn_metadata, dict):
    out.zero_()
    self.mla_attn._warmup_reserve_prefill_workspace()  # ← the hook
    return

Full patch (~30 lines, single-file, idempotent string-replacement script): pasta-paul/dsv4-flash-w4a16-fp8/scripts/patch_workspace_prereserve.py.
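The idea of the hook, declaring a worst-case size before the lock, can be sketched in isolation. This is not the linked patch: `ToyWorkspace`, `worst_case_prefill_bytes`, `dummy_run`, the `kv_dim=512` value, and the shape formula are all hypothetical stand-ins chosen for illustration. Only `PREFILL_CHUNK_SIZE = 4`, bf16 storage, and `--max-model-len 16384` come from the report.

```python
from dataclasses import dataclass

BF16_BYTES = 2
PREFILL_CHUNK_SIZE = 4  # value quoted from deepseek_v4_attention.py


@dataclass
class ToyWorkspace:
    size_bytes: int = 0
    locked: bool = False

    def reserve(self, required_bytes: int) -> None:
        if required_bytes > self.size_bytes:
            assert not self.locked, "growth after lock"
            self.size_bytes = required_bytes

    def lock(self) -> None:
        self.locked = True


def worst_case_prefill_bytes(max_model_len: int, kv_dim: int) -> int:
    # Hypothetical bound: one bf16 row of kv_dim per gathered token, for
    # PREFILL_CHUNK_SIZE sequences of at most max_model_len tokens each.
    return PREFILL_CHUNK_SIZE * max_model_len * kv_dim * BF16_BYTES


def dummy_run(ws: ToyWorkspace, max_model_len: int, kv_dim: int) -> None:
    # Mirrors the early-return branch: no attention is computed, but the
    # worst-case prefill workspace is declared before the lock.
    ws.reserve(worst_case_prefill_bytes(max_model_len, kv_dim))


ws = ToyWorkspace()
dummy_run(ws, max_model_len=16384, kv_dim=512)
ws.lock()

# A real prefill at or below the declared bound now fits without growth.
real_request = worst_case_prefill_bytes(max_model_len=1304, kv_dim=512)
ws.reserve(real_request)
```

The design trade-off is that the reservation must be a true upper bound: if the declared worst case is too small, the same assertion fires later, and if it is too generous, memory is held that the KV cache could have used.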

Validation results

Same en2zh_bus_001 1,304-token prompt that crashes without the patch:

| | unpatched (graphs ON) | unpatched (`--enforce-eager`) | patched (graphs ON) |
|---|---|---|---|
| HTTP status | 500 (workspace lock crash) | 200 | 200 |
| Decode throughput | crash | ~3.9 tok/s | ~14–17 tok/s |
| Workspace lock errors | 1, engine dies | 0 | 0 (across full harness) |
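The decode figures in the table can be cross-checked against the "~4×" cost of `--enforce-eager` quoted in the issue summary. A short calculation using only the table's approximate values:

```python
EAGER_TOK_S = 3.9                    # unpatched, --enforce-eager
PATCHED_TOK_S_RANGE = (14.0, 17.0)   # patched, CUDA graphs on

low_ratio = PATCHED_TOK_S_RANGE[0] / EAGER_TOK_S
high_ratio = PATCHED_TOK_S_RANGE[1] / EAGER_TOK_S
print(f"graphs-on speedup over eager: {low_ratio:.1f}x to {high_ratio:.1f}x")
```

The resulting range brackets 4×, which is consistent with the summary's estimate.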

Full harness run on :warmup (patched image), TP=2 on dual DGX Spark:

chat-smoke quick: 4/4 PASS
generation-matrix (18 prompts × 3 thinking modes): 50/54 (the 4 fails are a separate GENERATION_MAX_CASE_TOKENS ceiling issue unrelated to workspace)
toolcall15: 41/45 (92%) — slight improvement over --enforce-eager baseline (89%)
oracle_compare vs B200 TP=2: 5/5 ran, alignment numbers comparable
gsm8k 8-shot: 95.37% (vs 92.87% on H200 reference rig)
humaneval_instruct pass@1: 80.49% (vs 54.27% on H200 reference rig)
6+ hours continuous uptime, 0 workspace-lock errors

What an upstream fix could look like

The patch above works but is DSV4-specific. A more general fix in vllm/v1/worker/workspace.py would be one of:

1. Opt-in growth post-lock: keep the _locked assertion but expose _ensure_workspace_size(required_bytes, allow_growth=False). Layers that allocate variable-size scratch in non-graph-captured paths (like prefill) pass allow_growth=True. The allocator keeps the old tensor referenced (so captured graphs aren't invalidated) and allocates a new larger one alongside.
2. Documented warmup pre-allocation hook: add a method like WorkspaceManager.warmup_reserve(...) that layers can call from a known warmup phase before lock_workspace() runs. Layers that can't run their full forward in the dummy path (because metadata is required) get an explicit place to declare their max workspace.
3. Make attention_impl's dummy-run actually exercise prefill with synthetic but realistic attn_metadata. Heavier change but eliminates the entire class of "this layer's workspace was never sized" bugs.
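The opt-in growth option could look roughly like the sketch below. It is a hypothetical allocator, not a proposal for the exact `workspace.py` API: `GrowableWorkspace` and `ensure` are illustrative names, and a `bytearray` stands in for a device tensor so the example runs without a GPU.

```python
class GrowableWorkspace:
    """Hypothetical allocator sketch; bytearray stands in for a device tensor."""

    def __init__(self) -> None:
        self._buffers: list[bytearray] = [bytearray(0)]
        self._locked = False

    @property
    def current(self) -> bytearray:
        return self._buffers[-1]

    def lock(self) -> None:
        self._locked = True

    def ensure(self, required_bytes: int, allow_growth: bool = False) -> bytearray:
        if required_bytes <= len(self.current):
            return self.current
        if self._locked and not allow_growth:
            raise AssertionError("Workspace growth is not allowed after locking.")
        if self._locked:
            # Keep the old buffer referenced so anything captured against it
            # stays valid, and allocate a larger buffer alongside it.
            self._buffers.append(bytearray(required_bytes))
        else:
            # Before the lock nothing has been captured, so replace in place.
            self._buffers[-1] = bytearray(required_bytes)
        return self.current


ws = GrowableWorkspace()
captured = ws.ensure(1024)                  # sized during warmup
ws.lock()
grown = ws.ensure(2048, allow_growth=True)  # prefill path opts in to growth

try:
    ws.ensure(4096)                         # default callers are still refused
    refused = False
except AssertionError:
    refused = True
```

One cost of this approach is that the old buffer stays allocated alongside the new one, so peak scratch memory after growth is the sum of both sizes.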

Cross-references

DSV4 main vLLM PR: #40991 (cc @jasl)
compressed-tensors V4 attention PR: #41276 (cc @kylesayrs)
Related workspace allocation failure (different scenario, same workspace.py): #40791
Full Spark TP=2 validation report (build provenance, harness PASS rates, B200 oracle alignment, decode timings): https://github.com/pasta-paul/dsv4-flash-w4a16-fp8/blob/main/findings/spark_tp2_deployment.md
The patch script: https://github.com/pasta-paul/dsv4-flash-w4a16-fp8/blob/main/scripts/patch_workspace_prereserve.py

Happy to refine the patch (the worst-case constants are heuristic — top_k=8192 upper bound, etc.) and submit a PR if maintainers prefer that path.

extent analysis

TL;DR

Implement a _warmup_reserve_prefill_workspace() method on DeepseekV4MLAAttention to pre-allocate workspace before locking.

Guidance

Identify layers that allocate variable-size scratch in non-graph-captured paths (like prefill) and modify them to pre-allocate workspace during warmup.
Consider adding a WorkspaceManager.warmup_reserve(...) method for layers to declare their max workspace before lock_workspace() runs.
Review the attention_impl dummy-run path to ensure it exercises prefill with realistic attn_metadata to eliminate workspace sizing bugs.

Example

# In attention_impl, dummy-run early-return:
if not isinstance(attn_metadata, dict):
    out.zero_()
    self.mla_attn._warmup_reserve_prefill_workspace()  # ← the hook
    return

Notes

The provided patch is DSV4-specific, and a more general fix in vllm/v1/worker/workspace.py may be necessary for broader compatibility.

Recommendation

Apply the provided patch or implement a similar pre-allocation mechanism to prevent workspace lock crashes, as it has been validated to improve decode throughput and eliminate workspace lock errors.
