vllm - ✅(Solved) Fix DSA module construction corrupts CUDA RNG state (Offset increment outside graph capture) [5 pull requests, 1 participants]

simone-chen · 2026-04-09T00:48:27Z




GitHub stats

vllm-project/vllm#39371•Fetched 2026-04-09 07:51:29


Constructing a DeepseekV2MLAAttention module for DSA (DeepSeek-V3.2 sparse MLA) leaves the CUDA graph RNG offset tracking in an active state. Any subsequent RNG operation (normal_(), uniform_(), torch.randn(), torch.randint(), etc.) on the same device fails with:

RuntimeError: Offset increment outside graph capture encountered unexpectedly.

This happens outside vLLM's model runner — we construct standalone modules for kernel benchmarking. Regular (non-DSA) MLA modules do not trigger this.

Error Message

RuntimeError: Offset increment outside graph capture encountered unexpectedly.

Root Cause

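The root cause is not confirmed. The FlashInfer sparse MLA backend (#33451) or DSA CUDA graph support (#34457) appears to activate CUDA graph RNG offset tracking during module construction, and that state persists after __init__ returns. Any later RNG operation on the same device then raises the error above.

This happens outside vLLM's model runner, where standalone modules are constructed for kernel benchmarking. Regular (non-DSA) MLA modules do not trigger this.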

Fix Action

Workaround

Avoid all RNG operations after DSA module construction. Use deterministic initialization (fill_(), torch.full(), torch.zeros()) instead.
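A minimal sketch of the workaround, assuming benchmark inputs only need the right shape and dtype. The helper names make_hidden_states and make_token_ids are hypothetical and not taken from the collector, and the device defaults to CPU so the snippet runs anywhere; a benchmark would pass "cuda:0".

```python
import torch


def make_hidden_states(num_tokens: int, hidden_size: int,
                       dtype: torch.dtype = torch.bfloat16,
                       device: str = "cpu") -> torch.Tensor:
    """Deterministic stand-in for torch.randn(num_tokens, hidden_size)."""
    # torch.full does not draw from the RNG, so it should not trigger
    # the offset-tracking error.
    return torch.full((num_tokens, hidden_size), 0.01, dtype=dtype, device=device)


def make_token_ids(num_tokens: int, vocab_size: int,
                   device: str = "cpu") -> torch.Tensor:
    """Deterministic stand-in for torch.randint(0, vocab_size, (num_tokens,))."""
    # arange + remainder gives varied but reproducible ids in [0, vocab_size).
    return torch.arange(num_tokens, device=device) % vocab_size


hidden = make_hidden_states(4, 8)
ids = make_token_ids(5, 3)
```

Constant inputs are a deliberate trade-off: they are only appropriate when the measured kernel's latency does not depend on the values, which is the assumption the PR notes state for these benchmarks.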

PR fix notes

PR #691: fix: vLLM 0.17.0 collector + data (DSA, MLA, MoE)

Repository: ai-dynamo/aiconfigurator
Author: simone-chen
State: closed | merged: False
Link: https://github.com/ai-dynamo/aiconfigurator/pull/691

Description (problem / solution / changelog)

Overview:

Fix vLLM 0.17.0 collector compatibility for DSA module, MLA kernel, and MoE MXFP4 benchmarks on B200.

Details:

DSA module collector (collect_mla_module.py):

- Deterministic weight/tensor init: vLLM 0.17.0's FlashInfer sparse MLA backend (vllm#33451) and DSA CUDA graph support (vllm#34457) leave CUDA graph RNG offset tracking active after DeepseekV2MLAAttention construction. Any subsequent RNG operation crashes with "Offset increment outside graph capture".
  - enforce_eager and manual_seed() do not clear the state; the corruption originates inside module construction
  - Replace all post-construction RNG (normal_, uniform_, randn, randint) with deterministic fill_()/torch.full()
  - Safe for benchmarking: kernel latency depends on shapes/dtypes, not values; dummy weights are overwritten by process_weights_after_loading() anyway
  - Filed upstream: vllm#39371
- KV cache scale buffers: vLLM registers k_scale/v_scale as buffers, not parameters. The init loop missed them, leaving sentinel values that fail process_weights_after_loading() (k_scale > 0.0 assertion).
- auto_map stripping: DeepSeek-V3's config.json has auto_map pointing to configuration_deepseek.py. HuggingFace's AutoConfig.from_pretrained() (called by vLLM's ModelConfig) unconditionally tries to import it from the temp directory where it doesn't exist. Strip it; vLLM natively supports the architecture.
- MLA backend selection: vLLM 0.17.0 calls get_current_vllm_config() during get_attn_backend_cls(). Wrap in set_current_vllm_config() context.
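The auto_map stripping step can be sketched with the standard library alone. This is an illustration under assumptions, not the collector's code: strip_auto_map is a hypothetical helper, and the demo config holds only two stand-in fields.

```python
import json
import tempfile
from pathlib import Path


def strip_auto_map(config_path: Path) -> bool:
    """Remove the `auto_map` key from a config.json in place.

    Returns True if the key was present and removed.
    """
    config = json.loads(config_path.read_text())
    removed = config.pop("auto_map", None) is not None
    if removed:
        config_path.write_text(json.dumps(config, indent=2))
    return removed


# Demo with a stand-in config; a real config.json has many more fields.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "config.json"
    path.write_text(json.dumps({
        "model_type": "deepseek_v3",
        "auto_map": {"AutoConfig": "configuration_deepseek.DeepseekV3Config"},
    }))
    was_removed = strip_auto_map(path)
    cleaned = json.loads(path.read_text())
```

Every other key is left untouched, so the rest of the configuration is read exactly as before.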

MLA kernel-level collector (collect_mla.py):

Backend selection — same get_current_vllm_config() issue as DSA module collector. Wrap backend selection in set_current_vllm_config() with a temporary config.

MoE MXFP4 collector (collect_moe.py):

Forward context — vLLM 0.17.0's MoERunner abstraction (vllm#32344) routes FusedMoE.forward() through get_forward_context() → get_layer_from_name(), requiring the module to be registered in static_forward_context. Share the same VllmConfig between FusedMoE.__init__ and the benchmark's set_forward_context() so the registration is visible.
pcp_size — vLLM 0.17.0 added prefill context parallel to FusedMoE (vllm#32344). Pass pcp_size=1 to avoid get_pcp_group() which requires distributed init.

Data — replace partial/dirty perf data with clean collection from job 294842392 (0 DSA/MLA module errors). Adds previously missing mla_context_module_perf.txt and mla_generation_module_perf.txt.

Known limitations:

42 MoE MXFP4 weight_scale_vec_size errors — FlashInfer TRTLLM FP4 kernel rejects the weight format; likely needs FlashInfer-side fix
6 MoE MXFP4 test cases with tp_size > 1 fail at FusedMoE.__init__ — requires distributed init not available in standalone collector
MLA kernel-level collector may have additional errors from builder get_per_layer_parameters() lookup after backend selection is fixed — to be investigated in follow-up if needed

Where should the reviewer start?

collector/vllm/collect_mla_module.py

Changed files

collector/vllm/collect_mla.py (modified, +40/-29)
collector/vllm/collect_mla_module.py (modified, +59/-18)
collector/vllm/collect_moe.py (modified, +43/-2)
src/aiconfigurator/systems/data/b200_sxm/vllm/0.17.0/dsa_context_module_perf.txt (modified, +2/-2)
src/aiconfigurator/systems/data/b200_sxm/vllm/0.17.0/dsa_generation_module_perf.txt (modified, +2/-2)
src/aiconfigurator/systems/data/b200_sxm/vllm/0.17.0/mla_context_module_perf.txt (added, +3/-0)
src/aiconfigurator/systems/data/b200_sxm/vllm/0.17.0/mla_generation_module_perf.txt (added, +3/-0)
src/aiconfigurator/systems/data/b200_sxm/vllm/0.17.0/moe_perf.txt (modified, +2/-2)

PR #718: fix: vLLM 0.17.0 collector compat (DSA, MLA module, MoE)

Repository: ai-dynamo/aiconfigurator
Author: simone-chen
State: closed | merged: True
Link: https://github.com/ai-dynamo/aiconfigurator/pull/718

Description (problem / solution / changelog)

Overview:

Fix vLLM 0.17.0 collector compatibility for DSA module, MLA module, and MoE MXFP4 benchmarks on B200. Uses version-routed v2 collector files to isolate 0.17.0 changes from existing collectors.

Details:

DSA module collector (collect_mla_module_v2.py):

- Deterministic weight/tensor init: vLLM 0.17.0's FlashInfer sparse MLA backend (vllm#33451) and DSA CUDA graph support (vllm#34457) leave CUDA graph RNG offset tracking active after DeepseekV2MLAAttention construction. Any subsequent RNG operation crashes with "Offset increment outside graph capture".
  - enforce_eager and manual_seed() do not clear the state; the corruption originates inside module construction
  - Replace all post-construction RNG (normal_, uniform_, randn, randint) with deterministic fill_()/torch.full()
  - Safe for benchmarking: kernel latency depends on shapes/dtypes, not values; dummy weights are overwritten by process_weights_after_loading() anyway
  - Filed upstream: vllm#39371
- KV cache scale buffers: vLLM registers k_scale/v_scale as buffers, not parameters. The init loop missed them, leaving sentinel values that fail process_weights_after_loading() (k_scale > 0.0 assertion).
- auto_map stripping: DeepSeek-V3's config.json has auto_map pointing to configuration_deepseek.py. HuggingFace's AutoConfig.from_pretrained() (called by vLLM's ModelConfig) unconditionally tries to import it from the temp directory where it doesn't exist. Strip it; vLLM natively supports the architecture.
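The buffer-versus-parameter point is easy to miss, so here is a small sketch of an init loop that covers both without any RNG call. ToyAttention and init_deterministic_ are hypothetical stand-ins, and the -1.0 sentinel is an assumed value chosen only to fail a positivity check.

```python
import torch
from torch import nn


class ToyAttention(nn.Module):
    """Hypothetical stand-in: one weight plus k_scale/v_scale buffers."""

    def __init__(self) -> None:
        super().__init__()
        self.weight = nn.Parameter(torch.empty(4, 4))
        # Assumed sentinel values that a later `> 0.0` check would reject.
        self.register_buffer("k_scale", torch.tensor(-1.0))
        self.register_buffer("v_scale", torch.tensor(-1.0))


def init_deterministic_(module: nn.Module, weight_value: float = 0.01,
                        scale_value: float = 1.0) -> None:
    """Fill parameters and scale buffers in place without using the RNG."""
    with torch.no_grad():
        for _, param in module.named_parameters():
            param.fill_(weight_value)
        # named_parameters() skips buffers, so visit them explicitly.
        for name, buf in module.named_buffers():
            if name.endswith(("k_scale", "v_scale")):
                buf.fill_(scale_value)


toy = ToyAttention()
init_deterministic_(toy)
```

Matching on the name suffix keeps the loop usable on nested modules, where named_buffers() yields dotted paths.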

MoE MXFP4 collector (collect_moe_v2.py):

Forward context — vLLM 0.17.0's MoERunner abstraction (vllm#32344) routes FusedMoE.forward() through get_forward_context() → get_layer_from_name(), requiring the module to be registered in static_forward_context. Share the same VllmConfig between FusedMoE.__init__ and the benchmark's set_forward_context() so the registration is visible.
pcp_size — vLLM 0.17.0 added prefill context parallel to FusedMoE (vllm#32344). Pass pcp_size=1 to avoid get_pcp_group() which requires distributed init.
is_gated_activation — pass is_gated_activation=True to prepare_static_weights_for_trtllm_fp4_moe() (GPT-OSS uses SwiGLU).

Version routing (registry.py):

moe, mla_*_module, dsa_*_module ops use VersionRoute to route to v2 files on vLLM >= 0.17.0, falling back to originals otherwise
Existing collector files are untouched — no backward compat risk
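The routing idea can be sketched as follows. The VersionRoute class in registry.py is not shown on this page, so the Route dataclass, parse_version and resolve_module below are hypothetical and only illustrate the "v2 on vLLM >= 0.17.0, original otherwise" rule.

```python
import re
from dataclasses import dataclass


def parse_version(version: str) -> tuple[int, ...]:
    """Turn '0.17.0' into (0, 17, 0), keeping each part's leading digits."""
    parts = []
    for piece in version.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


@dataclass(frozen=True)
class Route:
    min_version: str  # first version that should use `module`
    module: str       # module name to load for the op


def resolve_module(routes: list[Route], fallback: str, installed: str) -> str:
    """Pick the route with the highest min_version that `installed` satisfies."""
    current = parse_version(installed)
    eligible = [r for r in routes if parse_version(r.min_version) <= current]
    if not eligible:
        return fallback
    return max(eligible, key=lambda r: parse_version(r.min_version)).module


moe_routes = [Route(min_version="0.17.0", module="collect_moe_v2")]
new = resolve_module(moe_routes, "collect_moe_v1", "0.17.0")
old = resolve_module(moe_routes, "collect_moe_v1", "0.16.2")
```

Comparing integer tuples rather than raw strings avoids the ordering trap where "0.9.0" would sort after "0.17.0".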

Data — clean collection from job 295500035 (0 DSA/MLA module errors). Adds previously missing mla_context_module_perf.txt and mla_generation_module_perf.txt.

Known limitations:

42 MoE MXFP4 weight_scale_vec_size errors — FlashInfer TRTLLM FP4 kernel rejects the weight format; likely needs FlashInfer-side fix
6 MoE MXFP4 test cases with tp_size > 1 fail at FusedMoE.__init__ — requires distributed init not available in standalone collector
MLA kernel-level collector (collect_mla.py) fix deferred — vLLM 0.17.0 changed the FlashInferMLAImpl forward API

Where should the reviewer start?

collector/vllm/registry.py → collector/vllm/collect_mla_module_v2.py

Summary by CodeRabbit

New Features
- Added benchmarking support for vLLM 0.17.0 MLA/DSA attention modules with configurable test cases across sequence lengths, batch sizes, and quantization modes.
- Added Mixture-of-Experts (MoE) performance benchmarking with multiple quantization backend support.
Improvements
- Enabled runtime module selection based on vLLM version compatibility.
- Updated performance baseline data for B200 SXM systems.

Changed files

collector/vllm/collect_mla_module_v1.py (renamed, +0/-0)
collector/vllm/collect_mla_module_v2.py (added, +1042/-0)
collector/vllm/collect_moe_v1.py (renamed, +0/-0)
collector/vllm/collect_moe_v2.py (added, +576/-0)
collector/vllm/registry.py (modified, +25/-8)
src/aiconfigurator/systems/data/b200_sxm/vllm/0.17.0/dsa_context_module_perf.txt (modified, +2/-2)
src/aiconfigurator/systems/data/b200_sxm/vllm/0.17.0/dsa_generation_module_perf.txt (modified, +2/-2)
src/aiconfigurator/systems/data/b200_sxm/vllm/0.17.0/mla_context_module_perf.txt (added, +3/-0)
src/aiconfigurator/systems/data/b200_sxm/vllm/0.17.0/mla_generation_module_perf.txt (added, +3/-0)
src/aiconfigurator/systems/data/b200_sxm/vllm/0.17.0/moe_perf.txt (modified, +2/-2)
src/aiconfigurator/systems/data/b200_sxm/vllm/0.19.0/gemm_perf.txt (added, +3/-0)
src/aiconfigurator/systems/data/b200_sxm/vllm/0.19.0/moe_perf.txt (added, +3/-0)

Code Example

RuntimeError: Offset increment outside graph capture encountered unexpectedly.

---

import torch
from vllm.config import VllmConfig, set_current_vllm_config
from vllm.model_executor.models.deepseek_v2 import DeepseekV2MLAAttention

vllm_config = ...  # VllmConfig with DeepSeek-V3.2 model
hf_config = vllm_config.model_config.hf_config  # assumption: HF config read from the model config

with set_current_vllm_config(vllm_config):
    attn_module = DeepseekV2MLAAttention(
        vllm_config=vllm_config,
        config=hf_config,
        # ... DSA-specific config
        prefix="model.layers.0.self_attn",
    )

attn_module = attn_module.to("cuda:0")

# This crashes:
torch.randn(1, device="cuda:0")
# RuntimeError: Offset increment outside graph capture encountered unexpectedly.


Bug: DeepseekV2MLAAttention construction leaves CUDA graph RNG offset tracking active

Description

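Constructing a DeepseekV2MLAAttention module for DSA (DeepSeek-V3.2 sparse MLA) leaves the CUDA graph RNG offset tracking in an active state. Any subsequent RNG operation (normal_(), uniform_(), torch.randn(), torch.randint(), etc.) on the same device fails with:

RuntimeError: Offset increment outside graph capture encountered unexpectedly.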

This happens outside vLLM's model runner — we construct standalone modules for kernel benchmarking. Regular (non-DSA) MLA modules do not trigger this.

Reproduction

import torch
from vllm.config import VllmConfig, set_current_vllm_config
from vllm.model_executor.models.deepseek_v2 import DeepseekV2MLAAttention

vllm_config = ...  # VllmConfig with DeepSeek-V3.2 model
hf_config = vllm_config.model_config.hf_config  # assumption: HF config read from the model config

with set_current_vllm_config(vllm_config):
    attn_module = DeepseekV2MLAAttention(
        vllm_config=vllm_config,
        config=hf_config,
        # ... DSA-specific config
        prefix="model.layers.0.self_attn",
    )

attn_module = attn_module.to("cuda:0")

# This crashes:
torch.randn(1, device="cuda:0")
# RuntimeError: Offset increment outside graph capture encountered unexpectedly.

What we tried (none of these clear the state)

enforce_eager=True on ModelConfig
torch.cuda.manual_seed(42) after construction
torch.cuda.synchronize() after construction

Environment

vLLM v0.17.0
NVIDIA B200 (Blackwell, SM100)
CUDA 13.0
FlashInfer sparse MLA backend

Likely cause

The FlashInfer sparse MLA backend (#33451) or DSA CUDA graph support (#34457) appears to activate CUDA graph RNG offset tracking during module construction that persists after __init__ returns. #34552 also notes DSA "has issues with cudagraphs" on Blackwell.

Workaround

Avoid all RNG operations after DSA module construction. Use deterministic initialization (fill_(), torch.full(), torch.zeros()) instead.

extent analysis

TL;DR

Avoid RNG operations after constructing DeepseekV2MLAAttention and use deterministic initialization instead to prevent CUDA graph RNG offset tracking issues.

Guidance

Identify all RNG operations (e.g., torch.randn(), torch.randint()) that occur after constructing DeepseekV2MLAAttention and replace them with deterministic initialization methods (e.g., fill_(), torch.full(), torch.zeros()).
Verify that the workaround resolves the RuntimeError: Offset increment outside graph capture encountered unexpectedly issue by testing the modified code.
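Consider constructing DeepseekV2MLAAttention modules in a separate process to isolate the RNG state. A separate thread is unlikely to help, because PyTorch's default CUDA generator is per device and shared by all threads in the process.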
Be cautious when using the FlashInfer sparse MLA backend and DSA CUDA graph support, as they may have ongoing issues with cudagraphs on certain hardware (e.g., Blackwell).
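Process isolation can be sketched with a fresh interpreter per benchmark case. This is an assumption-heavy illustration: run_isolated is a hypothetical helper, and the child snippet here is a placeholder print rather than real module construction.

```python
import subprocess
import sys


def run_isolated(snippet: str, timeout_s: float = 60.0) -> str:
    """Run `snippet` in a fresh Python interpreter and return its stdout."""
    result = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=True,  # raise CalledProcessError if the child fails
    )
    return result.stdout.strip()


# Placeholder for "construct the DSA module and time the kernel".
output = run_isolated("print('done')")
```

Each child gets its own CUDA context and generators, so state left behind by module construction should not reach the parent. RNG calls made inside the same child after construction would still be affected, so this complements deterministic initialization rather than replacing it.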

Example

import torch

# Replace torch.randn(1, device="cuda:0") with a deterministic equivalent
tensor = torch.zeros(1, device="cuda:0")

Notes

The provided workaround may not be suitable for all use cases, especially those requiring true randomness. Further investigation into the FlashInfer sparse MLA backend and DSA CUDA graph support issues may be necessary to develop a more robust solution.

Recommendation

Apply the workaround by using deterministic initialization methods to avoid RNG operations after constructing DeepseekV2MLAAttention, as this is the most straightforward way to prevent the CUDA graph RNG offset tracking issue.
