[RFC]: Introducing State Management in vLLM IR System

This RFC proposes IrState, a global registry for managing implementation-level state (dynamic quantization stats, workspace buffers, NCCL handles, etc.) in the vLLM IR system. IR ops remain semantically stateless; state is registered and accessed via ir_state.get().

Motivation.

RFC: Introducing State Management in vLLM IR System

Disclaimer: This document represents my personal thoughts and explorations on how state management could be integrated into the vLLM IR system. I share these ideas humbly for community discussion and am fully open to feedback, alternative approaches, and any decisions the community makes. The final design should reflect the collective wisdom of the vLLM community.

Summary

This RFC proposes IrState, a global registry for managing implementation-level state (dynamic quantization stats, workspace buffers, NCCL handles, etc.) in the vLLM IR system. IR ops remain semantically stateless; state is registered and accessed via ir_state.get().

Motivation

The vLLM IR is designed to be stateless, which enables deferred kernel selection and compiler analysis. However, many algorithms need state:

| Use Case | State Needed |
|---|---|
| Dynamic Quantization | amax statistics, scaling factors |
| Flash Attention | Internal buffers, workspace memory |
| NCCL Collectives | Communicator handles |
| CUDA Graph | Graph cache |

Without formal state management, implementations either pass state through model parameters (mixing concerns), use ad-hoc globals, or recreate state every time.

Goals: Express stateful algorithms cleanly, separate implementation state from model semantics, enable compiler analysis of state access patterns.

Context

This RFC is an attempt to resolve the blocking issue described in #41724. The requirements in question are:

  • A 128MB persistent workspace buffer passed to every call, with no IR mechanism for persistent allocations
  • FP8 quantization state including mutable scale buffers, amax circular history, and dynamic scale recomputation

The IrState design proposed here directly addresses these requirements by providing a registry for persistent implementation-level state that IR ops can access without breaking the stateless IR contract.

Proposed Design

State Classification

| Category | Management | Examples |
|---|---|---|
| Model State | Model parameters: represents model semantics, persists after serialization | Weights, static quantization scales |
| Implementation State | IrState registry: runtime details, doesn't affect semantics | Dynamic amax, workspace buffers, NCCL handles |

Rule: If it must persist after serialization → Model State. If it's a temporary runtime buffer → Implementation State.

IrState: Minimal Implementation

Note: This is the simplest working implementation. The exact coding style and API details can be adjusted based on review.

import contextvars
from typing import Any, Callable

_ir_context = contextvars.ContextVar("ir_context", default={})


class IrContext:
    """Manages scope information in IR execution context."""
    
    @staticmethod
    def set_scope(key: str, value: str) -> contextvars.Token:
        current = _ir_context.get()
        return _ir_context.set({**current, key: value})
    
    @staticmethod
    def get_scope(key: str) -> str | None:
        return _ir_context.get().get(key)
    
    @staticmethod
    def reset(token: contextvars.Token) -> None:
        _ir_context.reset(token)


class IrState:
    """Global registry for implementation-level state, scoped by context."""
    
    def __init__(self):
        self._registry: dict[str, Callable] = {}
        self._cache: dict[tuple[str, str], Any] = {}
    
    def register_type(self, name: str, create_fn: Callable) -> None:
        self._registry[name] = create_fn
    
    def get(self, name: str, scope: str | None = None) -> Any:
        scope = scope or "default"
        key = (scope, name)
        if key not in self._cache:
            if name not in self._registry:
                raise KeyError(f"State type not registered: {name}")
            self._cache[key] = self._registry[name]()
        return self._cache[key]
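
As a quick check of the lazy-creation and per-scope behavior described above, the following sketch exercises a copy of the registry without PyTorch. The `make_buffer` factory and the `creations` counter are hypothetical names added for illustration, and a `bytearray` stands in for a workspace tensor.

```python
from typing import Any, Callable, Dict, Optional, Tuple


class IrState:
    """Copy of the RFC's minimal registry, repeated so the sketch is self-contained."""

    def __init__(self) -> None:
        self._registry: Dict[str, Callable[[], Any]] = {}
        self._cache: Dict[Tuple[str, str], Any] = {}

    def register_type(self, name: str, create_fn: Callable[[], Any]) -> None:
        self._registry[name] = create_fn

    def get(self, name: str, scope: Optional[str] = None) -> Any:
        key = (scope or "default", name)
        if key not in self._cache:
            if name not in self._registry:
                raise KeyError(f"State type not registered: {name}")
            self._cache[key] = self._registry[name]()
        return self._cache[key]


ir_state = IrState()
creations = []  # hypothetical counter: one entry per factory call


def make_buffer() -> bytearray:
    creations.append(1)
    return bytearray(16)  # stand-in for a workspace tensor


ir_state.register_type("workspace", make_buffer)
assert creations == []  # registration alone creates nothing

first = ir_state.get("workspace", scope="layer_0")
again = ir_state.get("workspace", scope="layer_0")  # served from the cache
other = ir_state.get("workspace", scope="layer_1")  # separate scope, separate object
```

Under these assumptions the factory runs once per `(scope, name)` pair: `first` and `again` are the same object, while `other` is a distinct buffer.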

Usage

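# Assumed setup (not shown in the RFC): imports and a module-level registry instance.
import torch
from torch import nn

ir_state = IrState()

# === 1. Registration (at model init) ===
ir_state.register_type("workspace", lambda: torch.empty(128*1024*1024, dtype=torch.uint8, device="cuda"))
ir_state.register_type("quant_amax", lambda: create_amax_state())

# === 2. Model layer sets scope before calling IR op ===
class MyLayer(nn.Module):
    def __init__(self, layer_idx):
        super().__init__()  # nn.Module must be initialized before the layer is called
        self.scope = f"layer_{layer_idx}"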

    def forward(self, x):
        token = IrContext.set_scope("layer", self.scope)
        try:
            return ir.ops.quant_op(x)
        finally:
            IrContext.reset(token)

# === 3. In IR implementation ===
@ir.ops.quant_op.register_impl("impl_a")
def impl_a(x):
    scope = IrContext.get_scope("layer")
    workspace = ir_state.get("workspace", scope=scope)
    amax = ir_state.get("quant_amax", scope=scope)
    return process(x, workspace, amax)

State Division Principles

  1. State is a Python object — can be anything (tensor, dict, handle, class). IrState doesn't care about internal structure.
  2. State is independent, not bound to any impl — exists in global registry. Multiple impls can access the same state by name.
  3. Impl creates state via registration — impl knows what it needs and defines the create_fn. State is lazily created on first get().
  4. Clear separation: Model State vs Implementation State — Model State (parameters) represents model semantics; Implementation State (IrState) represents how to compute.

State Types and Scoping

  • Register once, scope per instance: A state type is registered once globally, but each layer/instance gets its own isolated state via scope.
  • Scope propagation: Uses Python contextvars to propagate scope through the call chain. Model code sets scope before calling IR ops; impls retrieve it via IrContext.get_scope().

| Role | Code |
|---|---|
| Register type | ir_state.register_type("quant_amax", create_fn) |
| Set scope | IrContext.set_scope("layer", f"layer_{idx}") |
| Get state in impl | ir_state.get("quant_amax", scope=IrContext.get_scope("layer")) |
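
To illustrate how the scope travels through the call chain, here is a minimal sketch using only the standard `contextvars` module. The helper names `run_layer`, `impl`, and `seen` are hypothetical; the nesting is there only to show that resetting the token restores the outer scope.

```python
import contextvars

_ir_context = contextvars.ContextVar("ir_context", default={})


def set_scope(key, value):
    # Store a new dict instead of mutating the current one, so reset() can restore it.
    return _ir_context.set({**_ir_context.get(), key: value})


def get_scope(key):
    return _ir_context.get().get(key)


seen = []  # hypothetical log of the scope each impl call observed


def impl():
    # An implementation reads the scope without receiving it as an argument.
    seen.append(get_scope("layer"))


def run_layer(name, inner=None):
    token = set_scope("layer", name)
    try:
        impl()
        if inner is not None:
            inner()  # a nested layer sets and later resets its own scope
        impl()
    finally:
        _ir_context.reset(token)


run_layer("layer_0", inner=lambda: run_layer("layer_1"))
scope_after = get_scope("layer")  # no scope is set outside any layer
```

Because each `set_scope` call returns a token, the `finally` block restores whatever scope was active before the layer ran, even if the op raises.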

Shared State Diagram

┌─────────────────────────────────────┐
│         IrState (Global)            │
│  ┌─────────────────────────────┐    │
│  │ "quant_state" → dict        │    │
│  │ "workspace" → Tensor        │    │
│  └─────────────────────────────┘    │
└──────────────┬──────────────────────┘
┌──────────────┼──────────────────────┐
│  Impl A ─────┘ → get("quant_state") │
│  Impl B ─────┘ → get("quant_state") │
│  Impl C ─────┘ → get("workspace")   │
└─────────────────────────────────────┘
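
The diagram can be mirrored in a small PyTorch-free sketch. The three `impl_*` functions and the dict and `bytearray` payloads are hypothetical stand-ins; the point is only that implementations reach the same object by name.

```python
from typing import Any, Callable, Dict, Optional, Tuple


class IrState:
    """Compact copy of the RFC's registry so the sketch runs on its own."""

    def __init__(self) -> None:
        self._registry: Dict[str, Callable[[], Any]] = {}
        self._cache: Dict[Tuple[str, str], Any] = {}

    def register_type(self, name: str, create_fn: Callable[[], Any]) -> None:
        self._registry[name] = create_fn

    def get(self, name: str, scope: Optional[str] = None) -> Any:
        key = (scope or "default", name)
        if key not in self._cache:
            if name not in self._registry:
                raise KeyError(f"State type not registered: {name}")
            self._cache[key] = self._registry[name]()
        return self._cache[key]


ir_state = IrState()
ir_state.register_type("quant_state", lambda: {"amax": 0.0})
ir_state.register_type("workspace", lambda: bytearray(8))


def impl_a(values):
    # Impl A updates the running amax stored in the shared state.
    state = ir_state.get("quant_state")
    state["amax"] = max(state["amax"], max(abs(v) for v in values))


def impl_b():
    # Impl B reads the same state object by name.
    return ir_state.get("quant_state")["amax"]


def impl_c():
    # Impl C uses a different state type from the same registry.
    return ir_state.get("workspace")


impl_a([0.5, -3.0])
seen_by_b = impl_b()
```

Here `impl_b` observes the update made by `impl_a` because both calls resolve to the same cached dict in the default scope.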

Lifecycle

| Phase | Description | Overhead |
|---|---|---|
| Registration | States registered alongside IR ops during model definition | One-time, at model init |
| Creation | Lazy creation on first get() call | Minimal (dict lookup + conditional) |
| Access | Direct pointer return from cache | Near-zero |

Design Advantages

  1. High Cohesion: Each IR op manages its own state through the centralized registry.
  2. Static Analysis: States can be analyzed and pre-allocated before compilation.
  3. Easy Mocking: The registry pattern makes it straightforward to substitute mock states for testing.
  4. Independence: Nearly orthogonal to existing IR op definitions.
  5. Statelessness Preserved: IR ops remain semantically stateless since they declare data dependencies through the registry.
  6. Flexible State Content: State can be any Python object. The IR system doesn't care what's inside.
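
The "Easy Mocking" point can be sketched as follows. Passing the registry into `quantize` as an argument is an assumption made to keep the example short (the RFC uses a global `ir_state`), and `quantize` itself is a hypothetical toy implementation rather than a vLLM op.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple


class IrState:
    """Compact copy of the RFC's registry so the sketch runs on its own."""

    def __init__(self) -> None:
        self._registry: Dict[str, Callable[[], Any]] = {}
        self._cache: Dict[Tuple[str, str], Any] = {}

    def register_type(self, name: str, create_fn: Callable[[], Any]) -> None:
        self._registry[name] = create_fn

    def get(self, name: str, scope: Optional[str] = None) -> Any:
        key = (scope or "default", name)
        if key not in self._cache:
            if name not in self._registry:
                raise KeyError(f"State type not registered: {name}")
            self._cache[key] = self._registry[name]()
        return self._cache[key]


def quantize(values: List[float], state: IrState, scope: str) -> List[int]:
    # Hypothetical impl: track a running amax, then scale into the int8 range.
    amax = state.get("quant_amax", scope=scope)
    amax["value"] = max(amax["value"], max(abs(v) for v in values))
    scale = amax["value"] / 127.0
    return [round(v / scale) for v in values]


# In a test, a fresh registry with a mock factory pins amax to a known value.
test_state = IrState()
test_state.register_type("quant_amax", lambda: {"value": 127.0})
result = quantize([1.0, -2.0, 100.0], test_state, scope="test")
```

With amax pinned to 127.0 the scale is exactly 1.0, so the output is predictable without any real calibration data.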

State Scoping

  • Per-process: All code in the same process shares one global IrState.
  • Cross-process isolation: TP/DP/PP workers each have independent state (separate processes).
  • No cleanup needed: State lives with the process. Python GC handles it.
  • Recompilation-safe: TorchDynamo recompilation doesn't affect _cache.

FX Graph Interaction

  • Tensor state is traced into the graph and subject to AOTAutograd's functionalization (which may insert clone on in-place mutations).
  • Non-tensor state (dicts, handles, etc.) executes during tracing as Python side effects — not captured in the graph.

Clone handling: A StateCloneEliminationPass removes clone operations on tensors obtained from ir_state.get(), preserving memory identity for in-place operations.

CUDA Graph Compatibility

Why Replay Doesn't Re-execute Creation

torch.empty() is a CPU-side function call, not a GPU kernel launch. During CUDA Graph recording:

  1. The CPU code executes once (including ir_state.get() and torch.empty())
  2. Only GPU kernel launches are captured in the graph
  3. During replay, the CPU code does NOT re-execute; the graph replays GPU commands using the original tensor addresses

Address Stability

During recording, tensor addresses are captured and remain valid during replay — as long as the tensor pointer itself is not modified. In-place content modification (.copy_(), [:] = ) is fine; pointer reassignment is not.
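
The distinction between in-place modification and pointer reassignment can be illustrated with a CPU-only analogy. This is not CUDA Graph code: a `bytearray` stands in for the state tensor, a `memoryview` stands in for the address captured during recording, and `replay` is a hypothetical stand-in for graph replay.

```python
buffer = bytearray(4)               # stand-in for a tensor returned by ir_state.get()
recorded_view = memoryview(buffer)  # stand-in for the address captured at record time


def replay() -> bytes:
    # "Replay" always reads from the storage that was captured during recording.
    return bytes(recorded_view)


# In-place update: same storage, so replay sees the new contents.
buffer[:] = b"\x01\x02\x03\x04"
after_inplace = replay()

# Reassignment: the name now points at new storage that replay never captured.
original = buffer
buffer = bytearray(b"\x09\x09\x09\x09")
after_rebind = replay()
```

After the reassignment, `replay` still reads the original storage, which is why the cached object returned by the registry should be updated in place rather than replaced.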

Compatibility Conditions

CUDA Graph recording will fail if the implementation contains:

  • CPU synchronization (.item(), torch.cuda.synchronize())
  • Dynamic control flow (if/for on tensor values)
  • Dynamic tensor creation that affects subsequent GPU operations

Each implementation should decide whether to provide a CUDA Graph-compatible variant. Not every implementation can be required to support it.

Alternatives Considered

| Alternative | Why Not |
|---|---|
| Pass state as model parameters | Mixes implementation concerns with model semantics |
| Each impl manages its own state | No centralized view, hard to analyze |
| Pre-create all states at init | Many states need runtime info (shapes); wasteful |

Decision: Lazy create + clone elimination pass balances flexibility and correctness with negligible overhead.

Implementation Plan

  1. Introduce IrState class with register/get interface
  2. Update a few stateful ops (e.g., dynamic quantization) to use it
  3. Implement StateCloneEliminationPass
  4. Migrate more ops and refine API

Acknowledgments

Thank you to the vLLM community for reviewing. Open to all feedback and iteration.

Proposed Change.

skip

Feedback Period.

No response

CC List.

@ProExpertProg @harshaljanjani

Any Other Things.

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
