vllm - ✅(Solved) Fix [Bug]: workspace.py rejects post-lock growth in deepseek_v4_attention._forward_prefill (DSV4 #40991) — patch attached [2 pull requests, 1 comments, 1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
vllm-project/vllm#41700Fetched 2026-05-06 06:15:23
View on GitHub
Comments
1
Participants
1
Timeline
9
Reactions
0
Participants
Timeline (top)
mentioned ×3subscribed ×3closed ×1commented ×1

vllm/v1/worker/workspace.py:_ensure_workspace_size locks the per-rank attention workspace at the post-profile size and refuses growth. DeepSeek-V4-Flash's _forward_prefill (added in #40991) then fails with an AssertionError on real prompts that need slightly more workspace than the dummy profile run sized for. The locked size is structural and not influenced by --max-num-batched-tokens, --max-num-seqs, or --gpu-memory-utilization. The currently published workaround is --enforce-eager, which costs ~4× decode throughput.

The smoking gun is in the source: deepseek_v4_attention.py:170-172 carries this comment:

# Prefill is processed in fixed-size chunks; this bounds the bf16 kv-gather
# workspace allocated at _forward_prefill (and the matching profile-time
# reservation in attention_impl's dummy-run branch).
PREFILL_CHUNK_SIZE = 4

The "matching profile-time reservation in attention_impl's dummy-run branch" implies a pre-allocation hook was always intended. It just isn't there: the dummy-run path (attention_impl, if not isinstance(attn_metadata, dict)) returns early without ever calling through to _forward_prefill, so warmup never sees prefill workspace requirements, and lock_workspace() fires at the post-decode-only size. The first real prefill request then crashes.

Error Message

AssertionError: Workspace is locked but allocation from 'deepseek_v4_attention.py:1457:_forward_prefill' requires 21.80 MB, current size is 21.62 MB. Workspace growth is not allowed after locking.

Root Cause

  1. Opt-in growth post-lock: keep the _locked assertion but expose _ensure_workspace_size(required_bytes, allow_growth=False). Layers that allocate variable-size scratch in non-graph-captured paths (like prefill) pass allow_growth=True. The allocator keeps the old tensor referenced (so captured graphs aren't invalidated) and allocates a new larger one alongside.
  2. Documented warmup pre-allocation hook: add a method like WorkspaceManager.warmup_reserve(...) that layers can call from a known warmup phase before lock_workspace() runs. Layers that can't run their full forward in the dummy path (because metadata is required) get an explicit place to declare their max workspace.
  3. Make attention_impl's dummy-run actually exercise prefill with synthetic but realistic attn_metadata. Heavier change but eliminates the entire class of "this layer's workspace was never sized" bugs.

Fix Action

Fix / Workaround

vllm/v1/worker/workspace.py:_ensure_workspace_size locks the per-rank attention workspace at the post-profile size and refuses growth. DeepSeek-V4-Flash's _forward_prefill (added in #40991) then fails with an AssertionError on real prompts that need slightly more workspace than the dummy profile run sized for. The locked size is structural and not influenced by --max-num-batched-tokens, --max-num-seqs, or --gpu-memory-utilization. The currently published workaround is --enforce-eager, which costs ~4× decode throughput.

Working patch

PR fix notes

PR #40991: [DSv4][Nvidia] SM12x DeepSeek V4 support

Description (problem / solution / changelog)

The PR combines https://github.com/vllm-project/vllm/pull/40929, now it's DeepGEMM free, thanks to @bbbearxyz !

UPDATE: To better aligh with Deepseek official API and the B200 code path, I made a harness to help to measure correctness, performance, and quality https://github.com/jasl/vllm-ds4-sm120-harness And I will put the latest report for people to review

Summary

This PR enables DeepSeek V4 Flash to serve on NVIDIA SM12x GPUs, tested on a 2x RTX PRO 6000 Blackwell Workstation Edition host.

The important change from the earlier prototype is that this PR no longer pins or rewrites the DeepGEMM dependency. The branch keeps vLLM's upstream DeepGEMM installer and CMake metadata intact, and implements the required SM12x runtime fallbacks in vLLM:

  • DeepSeek V4 tokenizer / parser / model integration.
  • Portable Triton sparse MLA path for SM12x.
  • fp8_ds_mla sparse MLA cache handling.
  • Sink-aware SWA + compressed sparse attention.
  • vLLM-side SM12x fallbacks for DeepSeek V4-specific DeepGEMM calls.
  • SM12x sparse indexer and paged MQA fallback kernels.
  • Guardrails so existing SM90 / SM100 optimized paths remain unchanged.

Motivation

DeepSeek V4 currently relies on kernels that are available on Hopper and datacenter Blackwell paths, but not on SM120 / SM121 workstation and consumer Blackwell GPUs. In particular, SM12x cannot directly reuse SM90 WGMMA kernels or SM100 tcgen05 kernels.

This PR adds correctness-first portable kernels for the missing SM12x pieces, then optimizes the hot sparse MLA paths enough for real serving. The result is a reviewable vLLM-side compatibility layer that does not require maintainers to accept a temporary DeepGEMM fork pin.

Scope

Included:

  • SM12x Triton sparse MLA decode and prefill paths.
  • fp8_ds_mla packed cache decode for SWA and compressed sparse candidates.
  • Sink-aware sparse attention denominator semantics.
  • SM12x local fallbacks for DeepSeek V4-specific DeepGEMM call sites.
  • Sparse indexer memory bound fixes for long prefill.
  • DeepSeek V4 tokenizer handling and tool-call parser fixes needed by the new model path.
  • Targeted correctness tests and an HTTP logprobs oracle comparator.

Not included:

  • Replacing FlashMLA on SM90 / SM100.
  • A final Tensor Core implementation for every SM12x kernel.
  • MTP speculative decoding fixes. Those are kept in a separate branch / PR.
  • Community performance experiments that are useful for evaluation but too broad for this PR.
  • Any DeepGEMM fork pin or DeepGEMM CMake / install-script rewrite.

Runtime controls

The SM12x sparse MLA path registers its environment variables in vllm.envs, so users should not see unknown-variable warnings for these knobs.

VariableDefaultMeaning
VLLM_TRITON_MLA_SPARSEauto1 forces the Triton sparse MLA path, 0 disables it. When unset, vLLM enables it on SM12x where FlashMLA sparse is unavailable.
VLLM_TRITON_MLA_SPARSE_TOPK_CHUNK_SIZE512Top-k candidate chunk size for sparse MLA accumulation. Lower values reduce transient workspace at the cost of more kernel work.
VLLM_TRITON_MLA_SPARSE_QUERY_CHUNK_SIZE256Query chunk size used by prefill sparse MLA fallback.
VLLM_TRITON_MLA_SPARSE_HEAD_BLOCK_SIZEautoOptional decode head block override. Supported values are 1, 2, and 4; benchmarks used 4.
VLLM_TRITON_MLA_SPARSE_MATMUL_DECODEautoOptional matmul-based sparse MLA decode toggle. When unset it auto-enables on SM12x.
VLLM_TRITON_MLA_SPARSE_ALLOW_CUDAGRAPHcontext dependentAllows compile / CUDA graphs for the sparse MLA path. In the formal PR branch, unset keeps graphs for normal decode and disables them for speculative decoding; 1 forces allow, 0 disables.

Operational warning: do not set PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True with the TP=2 CUDA graph configuration used below. In local testing it made custom all-reduce fail during CUDA graph address registration. Leaving it unset avoids that failure.

Branches

Formal PR branch:

jasl/vllm@ds4-sm120
HEAD: 7a34ed538

Preview / evaluation branch with extra community performance work and MTP fixes:

jasl/vllm@ds4-sm120-full
HEAD: ab7336f21

The preview branch is not intended as the review target. It exists so users can try the broader optimization stack while this PR stays focused.

Test environment

Hardware:

Host: jasl-workstation
GPU: 2x NVIDIA RTX PRO 6000 Blackwell Workstation Edition
Compute capability: SM120
GPU memory: 95 GiB class per GPU

Software:

OS: Ubuntu, Linux 7.0.0-14-generic
CUDA toolkit: /usr/local/cuda
Python: 3.13.13
PyTorch: 2.11.0+cu130
vLLM package metadata: 0.20.1rc1.dev12+g363ffa145

Benchmark environment:

export PATH="/usr/local/cuda/bin:$PATH"
export CUDA_HOME="/usr/local/cuda"
export TRITON_PTXAS_PATH="/usr/local/cuda/bin/ptxas"
export CUDA_ARCH_LIST="120a"
export TORCH_CUDA_ARCH_LIST="12.0a"
export VLLM_TRITON_MLA_SPARSE=1
export VLLM_TRITON_MLA_SPARSE_HEAD_BLOCK_SIZE=4
export VLLM_RPC_TIMEOUT=100000
unset PYTORCH_CUDA_ALLOC_CONF

Note: DGX Spark use 121a and 12.1a

Validation

Formal PR branch checks:

python -m ruff check \
  vllm/envs.py \
  vllm/utils/deep_gemm.py \
  vllm/tokenizers/deepseek_v4_encoding.py \
  vllm/model_executor/layers/deepseek_v4_attention.py \
  vllm/v1/attention/backends/mla/sparse_mla_env.py \
  vllm/v1/attention/backends/mla/sparse_swa.py \
  tests/tokenizers_/test_deepseek_v4.py \
  tests/v1/attention/test_sparse_mla_env.py \
  tests/v1/attention/test_deepseek_v4_sparse_mla_reference.py \
  tests/v1/attention/test_sm120_deepgemm_fallbacks.py

Result:

All checks passed!

Compile check:

python -m py_compile \
  vllm/envs.py \
  vllm/utils/deep_gemm.py \
  vllm/tokenizers/deepseek_v4_encoding.py \
  vllm/v1/attention/backends/mla/sparse_mla_kernels.py \
  vllm/model_executor/layers/deepseek_v4_attention.py \
  vllm/v1/attention/backends/mla/sparse_swa.py

Targeted tests:

python -m pytest -q \
  tests/tokenizers_/test_deepseek_v4.py \
  tests/v1/attention/test_sparse_mla_env.py \
  tests/v1/attention/test_sparse_mla_backends.py \
  tests/v1/attention/test_sm120_deepgemm_fallbacks.py \
  tests/v1/attention/test_sparse_attn_indexer.py \
  tests/v1/attention/test_deepseek_v4_sparse_mla_reference.py

Result:

151 passed, 504 skipped, 16 warnings in 356.93s

Diff hygiene:

git diff --check origin/main...HEAD

Result: clean.

Preview branch focused checks:

python -m ruff check \
  vllm/v1/attention/backends/mla/sparse_mla_env.py \
  vllm/model_executor/layers/deepseek_v4_attention.py \
  tests/v1/spec_decode/test_mtp.py \
  tests/v1/attention/test_sparse_mla_env.py \
  tests/v1/attention/test_deepseek_v4_sparse_mla_reference.py

python -m pytest -q \
  tests/v1/spec_decode/test_mtp.py \
  tests/v1/attention/test_sparse_mla_env.py \
  tests/v1/attention/test_deepseek_v4_sparse_mla_reference.py

Result:

95 passed, 16 warnings in 48.35s

Serving command

Formal PR branch, no MTP:

PYTHONPATH=~/tmp/vllm-bench-ds4-sm120 \
~/tmp/vllm/.venv/bin/vllm serve deepseek-ai/DeepSeek-V4-Flash \
  --host 127.0.0.1 \
  --port 8017 \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --block-size 256 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.94 \
  --tensor-parallel-size 2 \
  --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_v4

Preview branch, MTP:

PYTHONPATH=~/tmp/vllm-bench-ds4-sm120-full \
~/tmp/vllm/.venv/bin/vllm serve deepseek-ai/DeepSeek-V4-Flash \
  --host 127.0.0.1 \
  --port 8018 \
  --trust-remote-code \
  --kv-cache-dtype fp8 \
  --block-size 256 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.985 \
  --tensor-parallel-size 2 \
  --compilation-config '{"cudagraph_mode":"FULL_AND_PIECEWISE","custom_ops":["all"]}' \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 \
  --enable-auto-tool-choice \
  --reasoning-parser deepseek_v4 \
  --speculative-config '{"method":"mtp","num_speculative_tokens":2}'

Benchmark command

The short-context benchmark uses 128 -> 512; the long-context benchmark uses 8192 -> 512. Each row uses 48 prompts and temperature=0.

~/tmp/vllm/.venv/bin/vllm bench serve \
  --model deepseek-ai/DeepSeek-V4-Flash \
  --host 127.0.0.1 \
  --port <port> \
  --dataset-name random \
  --random-input-len <128-or-8192> \
  --random-output-len 512 \
  --num-prompts 48 \
  --max-concurrency <C> \
  --ignore-eos \
  --temperature 0 \
  --save-result \
  --result-dir <result-dir> \
  --result-filename <name>.json

Formal PR branch benchmark

Branch:

jasl/vllm@ds4-sm120
HEAD: 7a34ed538

Server memory setting:

--gpu-memory-utilization 0.94

MTP is not included in this branch. Starting the formal branch with --speculative-config '{"method":"mtp","num_speculative_tokens":2}' fails because the MTP fix stack is intentionally kept separate.

ContextConcurrencyOutput tok/sRequests/sMean TPOTMean TTFT
128 -> 5121100.380.1969.76 ms113.4 ms
128 -> 5124296.840.58013.16 ms171.9 ms
128 -> 5128478.340.93416.18 ms291.6 ms
8192 -> 512158.610.11410.94 ms3143.0 ms
8192 -> 512281.350.15915.37 ms4732.0 ms

Result directory:

/home/jasl/tmp/ds4_sm120_bench_20260429_032651

Preview branch benchmark

Branch:

jasl/vllm@ds4-sm120-full
HEAD: ab7336f21

Server memory setting:

--gpu-memory-utilization 0.985

This branch includes the separate MTP fixes and community performance patches. It is for evaluation only, not the formal PR review target.

Startup notes:

  • no-MTP CUDA graph reserve: 3.67 GiB
  • no-MTP available KV cache: 10.6 GiB
  • MTP CUDA graph reserve: 4.38 GiB
  • MTP available KV cache: 6.2 GiB
ContextConcurrencyno-MTP tok/sMTP tok/sMTP deltano-MTP TPOTMTP TPOTno-MTP TTFTMTP TTFTMTP acceptance
128 -> 5121103.03161.14+56.4%9.60 ms5.95 ms62.3 ms138.7 ms78.61%
128 -> 5124303.20326.51+7.7%12.93 ms11.47 ms145.6 ms346.0 ms80.14%
128 -> 5128473.53525.08+10.9%16.46 ms14.07 ms236.3 ms402.2 ms77.17%
8192 -> 512158.5479.17+35.2%10.81 ms6.23 ms3223.4 ms3283.6 ms81.48%
8192 -> 512280.7798.33+21.7%15.33 ms13.46 ms4843.8 ms3486.3 ms79.02%

Result directory:

/home/jasl/tmp/ds4_sm120_full_bench_20260429_041151

Review notes

Changes made before this update:

  • Removed the temporary DeepGEMM fork pin and related env bridge.
  • Removed sparse MLA diagnostic dump hooks and tests.
  • Kept runtime-facing names production-oriented; test oracle helpers remain clearly separated from serving kernels.
  • Verified there are no stale prototype DeepGEMM refs.
  • Re-signed the branch with DCO trailers.
  • Re-ran targeted tests and benchmarks after the cleanup.

Known follow-ups

  • MTP speculative decoding should be reviewed as an independent PR.
  • ds4-sm120-full can continue to carry community performance patches for public evaluation.
  • Further SM12x optimization should focus on full decode profiling across indexer, MoE, collectives, sampling, and sparse MLA rather than broadening this PR.

Changed files

  • tests/kernels/attention/test_deepgemm_attention.py (modified, +69/-20)
  • tests/kernels/moe/test_moe.py (modified, +60/-0)
  • tests/models/test_deepseek_v4_mega_moe.py (modified, +53/-2)
  • tests/models/test_deepseek_v4_pp.py (added, +9/-0)
  • tests/quantization/test_fp8_scale_parameter.py (added, +33/-0)
  • tests/quantization/test_mxfp4.py (added, +38/-0)
  • tests/reasoning/test_deepseekv3_reasoning_parser.py (modified, +36/-2)
  • tests/tokenizers_/test_deepseek_v4.py (modified, +323/-0)
  • tests/tools/test_compare_vllm_http_logprobs_oracle.py (added, +115/-0)
  • tests/v1/attention/test_deepseek_v4_sparse_mla_reference.py (added, +3162/-0)
  • tests/v1/attention/test_sm120_deepgemm_fallbacks.py (added, +245/-0)
  • tests/v1/attention/test_sparse_attn_indexer.py (added, +40/-0)
  • tests/v1/attention/test_sparse_mla_backends.py (modified, +611/-3)
  • tests/v1/attention/test_sparse_mla_env.py (added, +96/-0)
  • tests/v1/core/test_prefix_caching.py (modified, +211/-0)
  • tests/v1/executor/test_ray_utils.py (modified, +46/-0)
  • tools/compare_vllm_http_logprobs_oracle.py (added, +431/-0)
  • vllm/config/compilation.py (modified, +1/-0)
  • vllm/entrypoints/chat_utils.py (modified, +11/-0)
  • vllm/entrypoints/openai/chat_completion/batch_serving.py (modified, +5/-1)
  • vllm/entrypoints/openai/chat_completion/protocol.py (modified, +109/-11)
  • vllm/entrypoints/openai/chat_completion/serving.py (modified, +7/-1)
  • vllm/entrypoints/openai/engine/protocol.py (modified, +9/-0)
  • vllm/entrypoints/serve/render/serving.py (modified, +27/-1)
  • vllm/envs.py (modified, +41/-0)
  • vllm/model_executor/kernels/linear/scaled_mm/cutlass.py (modified, +45/-0)
  • vllm/model_executor/layers/deepseek_v4_attention.py (modified, +694/-45)
  • vllm/model_executor/layers/deepseek_v4_triton_kernels.py (added, +1282/-0)
  • vllm/model_executor/layers/fused_moe/fused_marlin_moe.py (modified, +1/-0)
  • vllm/model_executor/layers/fused_moe/layer.py (modified, +24/-5)
  • vllm/model_executor/layers/quantization/utils/fp8_utils.py (modified, +38/-11)
  • vllm/model_executor/layers/sparse_attn_indexer.py (modified, +126/-38)
  • vllm/model_executor/models/deepseek_v4.py (modified, +93/-42)
  • vllm/reasoning/__init__.py (modified, +1/-1)
  • vllm/tokenizers/deepseek_v4_encoding.py (modified, +8/-3)
  • vllm/utils/deep_gemm.py (modified, +518/-3)
  • vllm/v1/attention/backends/mla/flashmla_sparse.py (modified, +18/-0)
  • vllm/v1/attention/backends/mla/indexer.py (modified, +29/-9)
  • vllm/v1/attention/backends/mla/sparse_mla_env.py (added, +150/-0)
  • vllm/v1/attention/backends/mla/sparse_mla_kernels.py (added, +2694/-0)
  • vllm/v1/attention/backends/mla/sparse_mla_reference.py (added, +242/-0)
  • vllm/v1/attention/backends/mla/sparse_swa.py (modified, +47/-0)
  • vllm/v1/attention/ops/deepseek_v4_ops/__init__.py (modified, +6/-0)
  • vllm/v1/attention/ops/deepseek_v4_ops/cache_utils.py (modified, +203/-17)
  • vllm/v1/attention/ops/deepseek_v4_ops/fp8_einsum.py (added, +175/-0)
  • vllm/v1/core/kv_cache_coordinator.py (modified, +13/-0)
  • vllm/v1/core/kv_cache_manager.py (modified, +9/-2)
  • vllm/v1/core/single_type_kv_cache_manager.py (modified, +159/-3)
  • vllm/v1/executor/ray_utils.py (modified, +19/-26)

Code Example

# Prefill is processed in fixed-size chunks; this bounds the bf16 kv-gather
# workspace allocated at _forward_prefill (and the matching profile-time
# reservation in attention_impl's dummy-run branch).
PREFILL_CHUNK_SIZE = 4

---

vllm serve pastapaul/DeepSeek-V4-Flash-W4A16-FP8 \
  --served-model-name deepseek-v4-flash --trust-remote-code \
  --kv-cache-dtype fp8 --block-size 256 \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 --enable-auto-tool-choice \
  --reasoning-parser deepseek_v4 \
  --max-model-len 16384 \
  --max-num-seqs 4 --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.92 \
  -tp 2 --nnodes 2 \
  --master-addr <HEAD_IP> --master-port 29501 --node-rank 0
# rank 1 launches with --headless added

---

AssertionError: Workspace is locked but allocation from
'deepseek_v4_attention.py:1457:_forward_prefill' requires 21.80 MB,
current size is 21.62 MB. Workspace growth is not allowed after locking.

---

# In attention_impl, dummy-run early-return:
if not isinstance(attn_metadata, dict):
    out.zero_()
    self.mla_attn._warmup_reserve_prefill_workspace()  # ← the hook
    return
RAW_BUFFERClick to expand / collapse

Summary

vllm/v1/worker/workspace.py:_ensure_workspace_size locks the per-rank attention workspace at the post-profile size and refuses growth. DeepSeek-V4-Flash's _forward_prefill (added in #40991) then fails with an AssertionError on real prompts that need slightly more workspace than the dummy profile run sized for. The locked size is structural and not influenced by --max-num-batched-tokens, --max-num-seqs, or --gpu-memory-utilization. The currently published workaround is --enforce-eager, which costs ~4× decode throughput.

The smoking gun is in the source: deepseek_v4_attention.py:170-172 carries this comment:

# Prefill is processed in fixed-size chunks; this bounds the bf16 kv-gather
# workspace allocated at _forward_prefill (and the matching profile-time
# reservation in attention_impl's dummy-run branch).
PREFILL_CHUNK_SIZE = 4

The "matching profile-time reservation in attention_impl's dummy-run branch" implies a pre-allocation hook was always intended. It just isn't there: the dummy-run path (attention_impl, if not isinstance(attn_metadata, dict)) returns early without ever calling through to _forward_prefill, so warmup never sees prefill workspace requirements, and lock_workspace() fires at the post-decode-only size. The first real prefill request then crashes.

Repro

vllm serve pastapaul/DeepSeek-V4-Flash-W4A16-FP8 \
  --served-model-name deepseek-v4-flash --trust-remote-code \
  --kv-cache-dtype fp8 --block-size 256 \
  --tokenizer-mode deepseek_v4 \
  --tool-call-parser deepseek_v4 --enable-auto-tool-choice \
  --reasoning-parser deepseek_v4 \
  --max-model-len 16384 \
  --max-num-seqs 4 --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.92 \
  -tp 2 --nnodes 2 \
  --master-addr <HEAD_IP> --master-port 29501 --node-rank 0
# rank 1 launches with --headless added

Then send a chat completion with a ~1.3K-token prompt and max_tokens=4000. Crashes during the first prefill:

AssertionError: Workspace is locked but allocation from
'deepseek_v4_attention.py:1457:_forward_prefill' requires 21.80 MB,
current size is 21.62 MB. Workspace growth is not allowed after locking.

The 21.62 MB locked size is identical across two builds 28 vLLM commits apart (only the file line number moved 1454→1457). Real-prompt sizes range 21.80 MB → 24.89 MB. Even when --max-num-batched-tokens was raised from default → 8192, the locked workspace did not grow accordingly. Lowering --max-num-seqs to 2 made the lock TIGHTER (smaller profile worst-case), not larger.

Working patch

Implementing what the comment describes — a _warmup_reserve_prefill_workspace() method on DeepseekV4MLAAttention that calls get_simultaneous() with worst-case shapes computed from max_model_len, max_num_batched_tokens, and config constants, called from the wrapper's dummy-run early-return:

# In attention_impl, dummy-run early-return:
if not isinstance(attn_metadata, dict):
    out.zero_()
    self.mla_attn._warmup_reserve_prefill_workspace()  # ← the hook
    return

Full patch (~30 lines, single-file, idempotent string-replacement script): pasta-paul/dsv4-flash-w4a16-fp8/scripts/patch_workspace_prereserve.py.

Validation results

Same en2zh_bus_001 1,304-token prompt that crashes without the patch:

unpatched (graphs ON)unpatched (--enforce-eager)patched (graphs ON)
HTTP status500 (workspace lock crash)200200
Decode throughputcrash~3.9 tok/s~14–17 tok/s
Workspace lock errors1, engine dies00 (across full harness)

Full harness run on :warmup (patched image), TP=2 on dual DGX Spark:

  • chat-smoke quick: 4/4 PASS
  • generation-matrix (18 prompts × 3 thinking modes): 50/54 (the 4 fails are a separate GENERATION_MAX_CASE_TOKENS ceiling issue unrelated to workspace)
  • toolcall15: 41/45 (92%) — slight improvement over --enforce-eager baseline (89%)
  • oracle_compare vs B200 TP=2: 5/5 ran, alignment numbers comparable
  • gsm8k 8-shot: 95.37% (vs 92.87% on H200 reference rig)
  • humaneval_instruct pass@1: 80.49% (vs 54.27% on H200 reference rig)
  • 6+ hours continuous uptime, 0 workspace-lock errors

What an upstream fix could look like

The patch above works but is DSV4-specific. A more general fix in vllm/v1/worker/workspace.py would be one of:

  1. Opt-in growth post-lock: keep the _locked assertion but expose _ensure_workspace_size(required_bytes, allow_growth=False). Layers that allocate variable-size scratch in non-graph-captured paths (like prefill) pass allow_growth=True. The allocator keeps the old tensor referenced (so captured graphs aren't invalidated) and allocates a new larger one alongside.
  2. Documented warmup pre-allocation hook: add a method like WorkspaceManager.warmup_reserve(...) that layers can call from a known warmup phase before lock_workspace() runs. Layers that can't run their full forward in the dummy path (because metadata is required) get an explicit place to declare their max workspace.
  3. Make attention_impl's dummy-run actually exercise prefill with synthetic but realistic attn_metadata. Heavier change but eliminates the entire class of "this layer's workspace was never sized" bugs.

Cross-references

Happy to refine the patch (the worst-case constants are heuristic — top_k=8192 upper bound, etc.) and submit a PR if maintainers prefer that path.

extent analysis

TL;DR

Implement a _warmup_reserve_prefill_workspace() method on DeepseekV4MLAAttention to pre-allocate workspace before locking.

Guidance

  • Identify layers that allocate variable-size scratch in non-graph-captured paths (like prefill) and modify them to pre-allocate workspace during warmup.
  • Consider adding a WorkspaceManager.warmup_reserve(...) method for layers to declare their max workspace before lock_workspace() runs.
  • Review the attention_impl dummy-run path to ensure it exercises prefill with realistic attn_metadata to eliminate workspace sizing bugs.

Example

# In attention_impl, dummy-run early-return:
if not isinstance(attn_metadata, dict):
    out.zero_()
    self.mla_attn._warmup_reserve_prefill_workspace()  # ← the hook
    return

Notes

The provided patch is DSV4-specific, and a more general fix in vllm/v1/worker/workspace.py may be necessary for broader compatibility.

Recommendation

Apply the provided patch or implement a similar pre-allocation mechanism to prevent workspace lock crashes, as it has been validated to improve decode throughput and eliminate workspace lock errors.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING