vllm - ✅(Solved) Fix DSA module construction corrupts CUDA RNG state (Offset increment outside graph capture) [5 pull requests, 1 participant]

GitHub stats
vllm-project/vllm#39371 · Fetched 2026-04-09 07:51:29
Comments: 0 · Participants: 1 · Timeline events: 2 (cross-referenced ×1, referenced ×1) · Reactions: 0

Constructing a DeepseekV2MLAAttention module for DSA (DeepSeek-V3.2 sparse MLA) leaves the CUDA graph RNG offset tracking in an active state. Any subsequent RNG operation (normal_(), uniform_(), torch.randn(), torch.randint(), etc.) on the same device fails with:

RuntimeError: Offset increment outside graph capture encountered unexpectedly.

This happens outside vLLM's model runner — we construct standalone modules for kernel benchmarking. Regular (non-DSA) MLA modules do not trigger this.

Error Message

RuntimeError: Offset increment outside graph capture encountered unexpectedly.

Root Cause

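The root cause has not been confirmed. According to the issue, the corruption originates inside module construction: the FlashInfer sparse MLA backend (vllm#33451) or DSA CUDA graph support (vllm#34457) appears to activate CUDA graph RNG offset tracking while DeepseekV2MLAAttention is being constructed for DSA (DeepSeek-V3.2 sparse MLA), and that state persists after __init__ returns. From then on, any RNG operation on the same device (normal_(), uniform_(), torch.randn(), torch.randint(), etc.) raises the error above.

The failure occurs outside vLLM's model runner, in standalone modules constructed for kernel benchmarking. Regular (non-DSA) MLA modules do not trigger it. Setting enforce_eager=True, calling torch.cuda.manual_seed(), or calling torch.cuda.synchronize() after construction does not clear the state.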

Fix Action

Workaround

Avoid all RNG operations after DSA module construction. Use deterministic initialization (fill_(), torch.full(), torch.zeros()) instead.
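
A minimal sketch of that substitution, assuming a benchmark that only needs tensors of the right shape and dtype. The helper names dummy_activations and dummy_token_ids are hypothetical and do not come from the collector code.

```python
import torch


def dummy_activations(shape, dtype=torch.bfloat16, device="cpu", value=0.01):
    """Stand-in for torch.randn(): a constant tensor, no RNG call."""
    return torch.full(shape, value, dtype=dtype, device=device)


def dummy_token_ids(num_tokens, vocab_size, device="cpu"):
    """Stand-in for torch.randint(): cycles through valid ids, no RNG call."""
    return torch.arange(num_tokens, device=device) % vocab_size


# On the affected setup the device would be "cuda:0"; "cpu" keeps the sketch portable.
x = dummy_activations((4, 8))
ids = dummy_token_ids(10, vocab_size=4)
```

Neither helper touches a random generator, so they should be safe to call after module construction on the affected device.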

PR fix notes

PR #691: fix: vLLM 0.17.0 collector + data (DSA, MLA, MoE)

Description (problem / solution / changelog)

Overview:

Fix vLLM 0.17.0 collector compatibility for DSA module, MLA kernel, and MoE MXFP4 benchmarks on B200.

Details:

DSA module collector (collect_mla_module.py):

  • Deterministic weight/tensor init — vLLM 0.17.0's FlashInfer sparse MLA backend (vllm#33451) and DSA CUDA graph support (vllm#34457) leave CUDA graph RNG offset tracking active after DeepseekV2MLAAttention construction. Any subsequent RNG operation crashes with "Offset increment outside graph capture".
    • enforce_eager and manual_seed() do not clear the state — the corruption originates inside module construction
    • Replace all post-construction RNG (normal_, uniform_, randn, randint) with deterministic fill_()/torch.full()
    • Safe for benchmarking: kernel latency depends on shapes/dtypes, not values; dummy weights are overwritten by process_weights_after_loading() anyway
    • Filed upstream: vllm#39371
  • KV cache scale buffers — vLLM registers k_scale/v_scale as buffers, not parameters. The init loop missed them, leaving sentinel values that fail process_weights_after_loading() (k_scale > 0.0 assertion).
  • auto_map stripping — DeepSeek-V3's config.json has auto_map pointing to configuration_deepseek.py. HuggingFace's AutoConfig.from_pretrained() (called by vLLM's ModelConfig) unconditionally tries to import it from the temp directory where it doesn't exist. Strip it; vLLM natively supports the architecture.
  • MLA backend selection — vLLM 0.17.0 calls get_current_vllm_config() during get_attn_backend_cls(). Wrap in set_current_vllm_config() context.
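
The k_scale/v_scale point above can be illustrated with a toy module. This is a sketch under stated assumptions: ToyAttention and deterministic_init_ are hypothetical names, and the -1.0 sentinel is chosen for illustration rather than taken from vLLM.

```python
import torch
from torch import nn


class ToyAttention(nn.Module):
    """Hypothetical stand-in for a module that carries scale buffers."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 4, bias=False)
        # Assumed sentinel values; they must be replaced before post-processing.
        self.register_buffer("k_scale", torch.tensor(-1.0))
        self.register_buffer("v_scale", torch.tensor(-1.0))


def deterministic_init_(module, weight_value=0.01, scale_value=1.0):
    """Fill parameters and scale buffers in place without any RNG call."""
    with torch.no_grad():
        for _, param in module.named_parameters():
            param.fill_(weight_value)
        # named_parameters() does not yield buffers, so walk them separately.
        for name, buf in module.named_buffers():
            if name.endswith(("k_scale", "v_scale")):
                buf.fill_(scale_value)
    return module


toy = deterministic_init_(ToyAttention())
```

An init loop that only iterates named_parameters() would leave both buffers at their sentinel, which is the failure mode the bullet describes.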

MLA kernel-level collector (collect_mla.py):

  • Backend selection — same get_current_vllm_config() issue as DSA module collector. Wrap backend selection in set_current_vllm_config() with a temporary config.

MoE MXFP4 collector (collect_moe.py):

  • Forward context: vLLM 0.17.0's MoERunner abstraction (vllm#32344) routes FusedMoE.forward() through get_forward_context() and then get_layer_from_name(), requiring the module to be registered in static_forward_context. Share the same VllmConfig between FusedMoE.__init__ and the benchmark's set_forward_context() so the registration is visible.
  • pcp_size — vLLM 0.17.0 added prefill context parallel to FusedMoE (vllm#32344). Pass pcp_size=1 to avoid get_pcp_group() which requires distributed init.

Data — replace partial/dirty perf data with clean collection from job 294842392 (0 DSA/MLA module errors). Adds previously missing mla_context_module_perf.txt and mla_generation_module_perf.txt.

Known limitations:

  • 42 MoE MXFP4 weight_scale_vec_size errors — FlashInfer TRTLLM FP4 kernel rejects the weight format; likely needs FlashInfer-side fix
  • 6 MoE MXFP4 test cases with tp_size > 1 fail at FusedMoE.__init__ — requires distributed init not available in standalone collector
  • MLA kernel-level collector may have additional errors from builder get_per_layer_parameters() lookup after backend selection is fixed — to be investigated in follow-up if needed

Where should the reviewer start?

collector/vllm/collect_mla_module.py

Changed files

  • collector/vllm/collect_mla.py (modified, +40/-29)
  • collector/vllm/collect_mla_module.py (modified, +59/-18)
  • collector/vllm/collect_moe.py (modified, +43/-2)
  • src/aiconfigurator/systems/data/b200_sxm/vllm/0.17.0/dsa_context_module_perf.txt (modified, +2/-2)
  • src/aiconfigurator/systems/data/b200_sxm/vllm/0.17.0/dsa_generation_module_perf.txt (modified, +2/-2)
  • src/aiconfigurator/systems/data/b200_sxm/vllm/0.17.0/mla_context_module_perf.txt (added, +3/-0)
  • src/aiconfigurator/systems/data/b200_sxm/vllm/0.17.0/mla_generation_module_perf.txt (added, +3/-0)
  • src/aiconfigurator/systems/data/b200_sxm/vllm/0.17.0/moe_perf.txt (modified, +2/-2)

PR #718: fix: vLLM 0.17.0 collector compat (DSA, MLA module, MoE)

Description (problem / solution / changelog)

Overview:

Fix vLLM 0.17.0 collector compatibility for DSA module, MLA module, and MoE MXFP4 benchmarks on B200. Uses version-routed v2 collector files to isolate 0.17.0 changes from existing collectors.

Details:

DSA module collector (collect_mla_module_v2.py):

  • Deterministic weight/tensor init — vLLM 0.17.0's FlashInfer sparse MLA backend (vllm#33451) and DSA CUDA graph support (vllm#34457) leave CUDA graph RNG offset tracking active after DeepseekV2MLAAttention construction. Any subsequent RNG operation crashes with "Offset increment outside graph capture".
    • enforce_eager and manual_seed() do not clear the state — the corruption originates inside module construction
    • Replace all post-construction RNG (normal_, uniform_, randn, randint) with deterministic fill_()/torch.full()
    • Safe for benchmarking: kernel latency depends on shapes/dtypes, not values; dummy weights are overwritten by process_weights_after_loading() anyway
    • Filed upstream: vllm#39371
  • KV cache scale buffers — vLLM registers k_scale/v_scale as buffers, not parameters. The init loop missed them, leaving sentinel values that fail process_weights_after_loading() (k_scale > 0.0 assertion).
  • auto_map stripping — DeepSeek-V3's config.json has auto_map pointing to configuration_deepseek.py. HuggingFace's AutoConfig.from_pretrained() (called by vLLM's ModelConfig) unconditionally tries to import it from the temp directory where it doesn't exist. Strip it; vLLM natively supports the architecture.
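
The auto_map stripping described above amounts to rewriting config.json before it is loaded. A minimal sketch using only the standard library; the function name and the trimmed example config are illustrative assumptions, not the collector's code.

```python
import json
import tempfile
from pathlib import Path


def write_config_without_auto_map(config: dict, out_dir: Path) -> Path:
    """Write config.json into out_dir with the auto_map entry removed."""
    cleaned = {k: v for k, v in config.items() if k != "auto_map"}
    out_path = out_dir / "config.json"
    out_path.write_text(json.dumps(cleaned, indent=2))
    return out_path


# Hypothetical, heavily trimmed config; real files carry many more fields.
example_config = {
    "model_type": "deepseek_v3",
    "auto_map": {"AutoConfig": "configuration_deepseek.DeepseekV3Config"},
}

with tempfile.TemporaryDirectory() as tmp:
    path = write_config_without_auto_map(example_config, Path(tmp))
    reloaded = json.loads(path.read_text())
```

The input dict is left unmodified, so the original configuration stays available if another code path still needs auto_map.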

MoE MXFP4 collector (collect_moe_v2.py):

  • Forward context: vLLM 0.17.0's MoERunner abstraction (vllm#32344) routes FusedMoE.forward() through get_forward_context() and then get_layer_from_name(), requiring the module to be registered in static_forward_context. Share the same VllmConfig between FusedMoE.__init__ and the benchmark's set_forward_context() so the registration is visible.
  • pcp_size — vLLM 0.17.0 added prefill context parallel to FusedMoE (vllm#32344). Pass pcp_size=1 to avoid get_pcp_group() which requires distributed init.
  • is_gated_activation — pass is_gated_activation=True to prepare_static_weights_for_trtllm_fp4_moe() (GPT-OSS uses SwiGLU).

Version routing (registry.py):

  • moe, mla_*_module, dsa_*_module ops use VersionRoute to route to v2 files on vLLM >= 0.17.0, falling back to originals otherwise
  • Existing collector files are untouched — no backward compat risk
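
The routing idea can be sketched without the real registry. The code below is not the VersionRoute API, whose signature is not shown in the PR description; ROUTES, parse_version, and resolve_collector are hypothetical names, and only the module file names come from the changed-files list.

```python
import re


def parse_version(text):
    """'0.17.0' -> (0, 17, 0); only the leading numeric release part is used."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    if match is None:
        raise ValueError(f"unrecognized version: {text!r}")
    nums = tuple(int(p) for p in match.group(0).split("."))
    # Pad so that '0.17' compares equal to '0.17.0'.
    return nums + (0,) * (3 - len(nums))


# Hypothetical routing table: (minimum version, module name), newest first.
ROUTES = {
    "moe": [("0.17.0", "collect_moe_v2"), ("0.0.0", "collect_moe_v1")],
    "mla_module": [
        ("0.17.0", "collect_mla_module_v2"),
        ("0.0.0", "collect_mla_module_v1"),
    ],
}


def resolve_collector(op, vllm_version):
    """Return the first module whose minimum version is satisfied."""
    current = parse_version(vllm_version)
    for minimum, module_name in ROUTES[op]:
        if current >= parse_version(minimum):
            return module_name
    raise LookupError(f"no collector registered for {op} on {vllm_version}")
```

Keeping the older entry as a fallback mirrors the stated goal: collectors for earlier vLLM versions keep resolving to the untouched v1 files.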

Data — clean collection from job 295500035 (0 DSA/MLA module errors). Adds previously missing mla_context_module_perf.txt and mla_generation_module_perf.txt.

Known limitations:

  • 42 MoE MXFP4 weight_scale_vec_size errors — FlashInfer TRTLLM FP4 kernel rejects the weight format; likely needs FlashInfer-side fix
  • 6 MoE MXFP4 test cases with tp_size > 1 fail at FusedMoE.__init__ — requires distributed init not available in standalone collector
  • MLA kernel-level collector (collect_mla.py) fix deferred — vLLM 0.17.0 changed the FlashInferMLAImpl forward API

Where should the reviewer start?

collector/vllm/registry.py, then collector/vllm/collect_mla_module_v2.py

Summary by CodeRabbit

  • New Features

    • Added benchmarking support for vLLM 0.17.0 MLA/DSA attention modules with configurable test cases across sequence lengths, batch sizes, and quantization modes.
    • Added Mixture-of-Experts (MoE) performance benchmarking with multiple quantization backend support.
  • Improvements

    • Enabled runtime module selection based on vLLM version compatibility.
    • Updated performance baseline data for B200 SXM systems.

Changed files

  • collector/vllm/collect_mla_module_v1.py (renamed, +0/-0)
  • collector/vllm/collect_mla_module_v2.py (added, +1042/-0)
  • collector/vllm/collect_moe_v1.py (renamed, +0/-0)
  • collector/vllm/collect_moe_v2.py (added, +576/-0)
  • collector/vllm/registry.py (modified, +25/-8)
  • src/aiconfigurator/systems/data/b200_sxm/vllm/0.17.0/dsa_context_module_perf.txt (modified, +2/-2)
  • src/aiconfigurator/systems/data/b200_sxm/vllm/0.17.0/dsa_generation_module_perf.txt (modified, +2/-2)
  • src/aiconfigurator/systems/data/b200_sxm/vllm/0.17.0/mla_context_module_perf.txt (added, +3/-0)
  • src/aiconfigurator/systems/data/b200_sxm/vllm/0.17.0/mla_generation_module_perf.txt (added, +3/-0)
  • src/aiconfigurator/systems/data/b200_sxm/vllm/0.17.0/moe_perf.txt (modified, +2/-2)
  • src/aiconfigurator/systems/data/b200_sxm/vllm/0.19.0/gemm_perf.txt (added, +3/-0)
  • src/aiconfigurator/systems/data/b200_sxm/vllm/0.19.0/moe_perf.txt (added, +3/-0)

Code Example

import torch
from vllm.config import VllmConfig, set_current_vllm_config
from vllm.model_executor.models.deepseek_v2 import DeepseekV2MLAAttention

vllm_config = ...  # VllmConfig with DeepSeek-V3.2 model
hf_config = ...  # matching Hugging Face model config (placeholder)

with set_current_vllm_config(vllm_config):
    attn_module = DeepseekV2MLAAttention(
        vllm_config=vllm_config,
        config=hf_config,
        # ... DSA-specific config
        prefix="model.layers.0.self_attn",
    )

attn_module = attn_module.to("cuda:0")

# This crashes:
torch.randn(1, device="cuda:0")
# RuntimeError: Offset increment outside graph capture encountered unexpectedly.
Bug: DeepseekV2MLAAttention construction leaves CUDA graph RNG offset tracking active

What we tried (none of these clear the state)

  • enforce_eager=True on ModelConfig
  • torch.cuda.manual_seed(42) after construction
  • torch.cuda.synchronize() after construction
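
Whether a given attempt clears the state can be checked with a small probe. This is a sketch, not part of the issue: rng_usable is a hypothetical helper, and it matches on the error text quoted in this report.

```python
import torch

CAPTURE_ERROR = "Offset increment outside graph capture"


def rng_usable(device="cpu"):
    """Return True if a tiny random draw on `device` succeeds."""
    try:
        torch.randn(1, device=device)
        return True
    except RuntimeError as exc:
        if CAPTURE_ERROR in str(exc):
            return False
        raise  # unrelated failure, do not mask it


# Intended use: call with device="cuda:0" before construction, after
# construction, and again after each attempted reset.
ok = rng_usable("cpu")
```

Re-raising other RuntimeErrors keeps the probe from hiding unrelated problems such as an out-of-memory condition.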

Environment

  • vLLM v0.17.0
  • NVIDIA B200 (Blackwell, SM100)
  • CUDA 13.0
  • FlashInfer sparse MLA backend

Likely cause

The FlashInfer sparse MLA backend (#33451) or DSA CUDA graph support (#34457) appears to activate CUDA graph RNG offset tracking during module construction that persists after __init__ returns. #34552 also notes DSA "has issues with cudagraphs" on Blackwell.

extent analysis

TL;DR

Avoid RNG operations after constructing DeepseekV2MLAAttention and use deterministic initialization instead to prevent CUDA graph RNG offset tracking issues.

Guidance

  • Identify all RNG operations (e.g., torch.randn(), torch.randint()) that occur after constructing DeepseekV2MLAAttention and replace them with deterministic initialization methods (e.g., fill_(), torch.full(), torch.zeros()).
  • Verify that the workaround resolves the RuntimeError: Offset increment outside graph capture encountered unexpectedly issue by testing the modified code.
  • Consider constructing DeepseekV2MLAAttention modules in a separate process to isolate the RNG state. A separate thread is unlikely to help, because threads in one process share the same per-device CUDA generator.
  • Be cautious when using the FlashInfer sparse MLA backend and DSA CUDA graph support, as they may have ongoing issues with cudagraphs on certain hardware (e.g., Blackwell).
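
Process isolation, as suggested in the guidance above, could look like the following sketch. It assumes the benchmark can report its results as one JSON line on stdout; run_isolated and the placeholder child script are hypothetical. Isolation only protects the parent process: RNG calls made in the child after module construction would presumably still fail.

```python
import json
import subprocess
import sys


def run_isolated(source: str, timeout: float = 600.0) -> dict:
    """Run `source` in a fresh Python process and parse its last stdout line as JSON."""
    proc = subprocess.run(
        [sys.executable, "-c", source],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return json.loads(proc.stdout.strip().splitlines()[-1])


# Placeholder child: a real one would build the module, time the kernel,
# and print the measurements as JSON.
child = 'import json; print(json.dumps({"latency_ms": 1.25}))'
result = run_isolated(child)
```

Each child starts with a fresh CUDA context, so any state left behind by one module construction cannot leak into the next measurement.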

Example

# Replace torch.randn() with torch.zeros()
tensor = torch.zeros(1, device="cuda:0")

Notes

The provided workaround may not be suitable for all use cases, especially those requiring true randomness. Further investigation into the FlashInfer sparse MLA backend and DSA CUDA graph support issues may be necessary to develop a more robust solution.
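
For cases that do need random values, one untested option is to draw them on the CPU and copy them to the device. This rests on an assumption the issue does not verify: that the error is tied to the CUDA generator and leaves the CPU generator unaffected. The helper name is hypothetical.

```python
import torch


def random_on_device(shape, device="cpu", dtype=torch.float32, seed=0):
    """Draw values with a CPU generator, then copy them to `device`."""
    gen = torch.Generator(device="cpu")
    gen.manual_seed(seed)
    # The draw itself happens on the CPU, so no CUDA RNG offset is consumed.
    values = torch.randn(shape, generator=gen, dtype=torch.float32)
    return values.to(device=device, dtype=dtype)


a = random_on_device((2, 3))
b = random_on_device((2, 3))
```

The extra host-to-device copy is a setup cost only and should not affect the measured kernel latency, provided it happens before timing starts.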

Recommendation

Apply the workaround by using deterministic initialization methods to avoid RNG operations after constructing DeepseekV2MLAAttention, as this is the most straightforward way to prevent the CUDA graph RNG offset tracking issue.
