[RFC]: Introducing State Management in vLLM IR System

vllm · 2026-05-09 11:31:45


This RFC proposes IrState, a global registry for managing implementation-level state (dynamic quantization stats, workspace buffers, NCCL handles, etc.) in the vLLM IR system. IR ops remain semantically stateless; state is registered and accessed via ir_state.get().


Motivation.

RFC: Introducing State Management in vLLM IR System

Disclaimer: This document represents my personal thoughts and explorations on how state management could be integrated into the vLLM IR system. I share these ideas humbly for community discussion and am fully open to feedback, alternative approaches, and any decisions the community makes. The final design should reflect the collective wisdom of the vLLM community.

Summary

Motivation

The vLLM IR is designed to be stateless, which enables deferred kernel selection and compiler analysis. However, many algorithms need state:

| Use Case | State Needed |
| --- | --- |
| Dynamic Quantization | `amax` statistics, scaling factors |
| Flash Attention | Internal buffers, workspace memory |
| NCCL Collectives | Communicator handles |
| CUDA Graph | Graph cache |

Without formal state management, implementations either pass state through model parameters (mixing concerns), use ad-hoc globals, or recreate state every time.

Goals: Express stateful algorithms cleanly, separate implementation state from model semantics, enable compiler analysis of state access patterns.

Context

This RFC is an attempt to resolve the blocking issue described in #41724. Two requirements from that issue drive the design:

A 128MB persistent workspace buffer passed to every call, with no IR mechanism for persistent allocations

FP8 quantization state including mutable scale buffers, amax circular history, and dynamic scale recomputation

The IrState design proposed here directly addresses these requirements by providing a registry for persistent implementation-level state that IR ops can access without breaking the stateless IR contract.

Proposed Design

State Classification

| Category | Management | Examples |
| --- | --- | --- |
| Model State | Model parameters: represents model semantics, persists after serialization | Weights, static quantization scales |
| Implementation State | `IrState` registry: runtime details, doesn't affect semantics | Dynamic `amax`, workspace buffers, NCCL handles |

Rule: If it must persist after serialization → Model State. If it's a temporary runtime buffer → Implementation State.

IrState: Minimal Implementation

Note: This is the simplest working implementation. The exact coding style and API details can be adjusted based on review.

import contextvars
from typing import Any, Callable

# The default dict is shared, so it is never mutated in place:
# set_scope() always installs a fresh copy.
_ir_context = contextvars.ContextVar("ir_context", default={})


class IrContext:
    """Manages scope information in IR execution context."""

    @staticmethod
    def set_scope(key: str, value: str) -> contextvars.Token:
        current = _ir_context.get()
        return _ir_context.set({**current, key: value})

    @staticmethod
    def get_scope(key: str) -> str | None:
        return _ir_context.get().get(key)

    @staticmethod
    def reset(token: contextvars.Token) -> None:
        # Restores the context to what it was before the matching set_scope().
        _ir_context.reset(token)


class IrState:
    """Global registry for implementation-level state, scoped by context."""

    def __init__(self):
        # name -> factory that builds a fresh state object
        self._registry: dict[str, Callable[[], Any]] = {}
        # (scope, name) -> lazily created state object
        self._cache: dict[tuple[str, str], Any] = {}

    def register_type(self, name: str, create_fn: Callable[[], Any]) -> None:
        self._registry[name] = create_fn

    def get(self, name: str, scope: str | None = None) -> Any:
        # A missing (or empty) scope falls back to one shared "default" scope.
        scope = scope or "default"
        key = (scope, name)
        if key not in self._cache:
            if name not in self._registry:
                raise KeyError(f"State type not registered: {name}")
            self._cache[key] = self._registry[name]()
        return self._cache[key]


# Process-wide instance used by the usage examples.
ir_state = IrState()

Usage

import torch
from torch import nn

# `ir`, `create_amax_state`, and `process` are placeholders for the IR
# system and helpers; `ir_state` is the process-wide IrState instance.

# === 1. Registration (at model init) ===
ir_state.register_type("workspace", lambda: torch.empty(128*1024*1024, dtype=torch.uint8, device="cuda"))
ir_state.register_type("quant_amax", lambda: create_amax_state())

# === 2. Model layer sets scope before calling IR op ===
class MyLayer(nn.Module):
    def __init__(self, layer_idx):
        super().__init__()  # required before setting attributes on an nn.Module
        self.scope = f"layer_{layer_idx}"

    def forward(self, x):
        token = IrContext.set_scope("layer", self.scope)
        try:
            return ir.ops.quant_op(x)
        finally:
            # Always restore the previous scope, even if the op raises.
            IrContext.reset(token)

# === 3. In IR implementation ===
@ir.ops.quant_op.register_impl("impl_a")
def impl_a(x):
    scope = IrContext.get_scope("layer")
    workspace = ir_state.get("workspace", scope=scope)
    amax = ir_state.get("quant_amax", scope=scope)
    return process(x, workspace, amax)
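The set/try/finally/reset pattern in step 2 could be wrapped in a context manager so that layers cannot forget the reset. The following is a minimal sketch, not part of the proposal: `ir_scope` and `current_scope` are hypothetical helper names, and `_ir_context` is re-declared here only so the snippet runs on its own.

```python
import contextvars
from contextlib import contextmanager

# Stand-in for the RFC's _ir_context, re-declared so this sketch is standalone.
_ir_context = contextvars.ContextVar("ir_context", default={})


@contextmanager
def ir_scope(key, value):
    """Hypothetical helper: set one scope entry, then restore the previous context."""
    token = _ir_context.set({**_ir_context.get(), key: value})
    try:
        yield
    finally:
        _ir_context.reset(token)


def current_scope(key):
    """Hypothetical helper mirroring IrContext.get_scope()."""
    return _ir_context.get().get(key)


# Nested scopes restore correctly because each reset() uses its own token.
with ir_scope("layer", "layer_0"):
    outer = current_scope("layer")
    with ir_scope("layer", "layer_1"):
        inner = current_scope("layer")
    restored = current_scope("layer")
after = current_scope("layer")
```

With such a helper, `forward` would reduce to a single `with ir_scope("layer", self.scope):` block around the IR op call.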

State Division Principles

State is a Python object — can be anything (tensor, dict, handle, class). IrState doesn't care about internal structure.
State is independent, not bound to any impl — exists in global registry. Multiple impls can access the same state by name.
Impl creates state via registration — impl knows what it needs and defines the create_fn. State is lazily created on first get().
Clear separation: Model State vs Implementation State — Model State (parameters) represents model semantics; Implementation State (IrState) represents how to compute.

State Types and Scoping

Register once, scope per instance: A state type is registered once globally, but each layer/instance gets its own isolated state via scope.
Scope propagation: Uses Python contextvars to propagate scope through the call chain. Model code sets scope before calling IR ops; impls retrieve it via IrContext.get_scope().

| Role | Code |
| --- | --- |
| Register type | `ir_state.register_type("quant_amax", create_fn)` |
| Set scope | `IrContext.set_scope("layer", f"layer_{idx}")` |
| Get state in impl | `ir_state.get("quant_amax", scope=IrContext.get_scope("layer"))` |
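To make "register once, scope per instance" concrete, here is a small standalone sketch. It reproduces a condensed copy of the `IrState` class from this RFC and uses a plain dict as a stand-in for real `amax` state; the `calls` list and `make_amax` factory are illustrative names only.

```python
from typing import Any, Callable


class IrState:
    """Condensed copy of the RFC's registry, reproduced so the sketch runs standalone."""

    def __init__(self) -> None:
        self._registry: dict[str, Callable[[], Any]] = {}
        self._cache: dict[tuple[str, str], Any] = {}

    def register_type(self, name, create_fn):
        self._registry[name] = create_fn

    def get(self, name, scope=None):
        key = (scope or "default", name)
        if key not in self._cache:
            if name not in self._registry:
                raise KeyError(f"State type not registered: {name}")
            self._cache[key] = self._registry[name]()
        return self._cache[key]


ir_state = IrState()
calls = []  # records each invocation of the factory


def make_amax():
    calls.append(1)
    return {"amax": 0.0}  # a dict stands in for real amax state


# Registration alone creates nothing: the factory has not run yet.
ir_state.register_type("quant_amax", make_amax)
created_at_registration = len(calls)

# First get() per scope runs the factory; later calls return the cached object.
a0 = ir_state.get("quant_amax", scope="layer_0")
a0_again = ir_state.get("quant_amax", scope="layer_0")
a1 = ir_state.get("quant_amax", scope="layer_1")

# Mutating one layer's state leaves the other layer's state untouched.
a0["amax"] = 3.5
```

The factory runs exactly once per `(scope, name)` pair, which is why each layer sees its own isolated object.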

Shared State Diagram

┌─────────────────────────────────────┐
│         IrState (Global)            │
│  ┌─────────────────────────────┐    │
│  │ "quant_state" → dict        │    │
│  │ "workspace" → Tensor        │    │
│  └─────────────────────────────┘    │
└──────────────┬──────────────────────┘
               │
┌──────────────┼──────────────────────┐
│  Impl A ─────┘ → get("quant_state") │
│  Impl B ─────┘ → get("quant_state") │
│  Impl C ─────┘ → get("workspace")   │
└─────────────────────────────────────┘
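The sharing shown in the diagram can be sketched as two implementations that look up the same state by name. This is an illustration under assumptions: `impl_a` and `impl_b` are plain functions rather than registered IR impls, the state is a dict rather than tensors, and the `IrState` below is a condensed stand-in for the class in this RFC.

```python
class IrState:
    """Condensed stand-in for the RFC's registry (lazy creation, keyed by scope and name)."""

    def __init__(self):
        self._registry, self._cache = {}, {}

    def register_type(self, name, create_fn):
        self._registry[name] = create_fn

    def get(self, name, scope=None):
        key = (scope or "default", name)
        if key not in self._cache:
            self._cache[key] = self._registry[name]()  # KeyError if unregistered
        return self._cache[key]


ir_state = IrState()
ir_state.register_type("quant_state", lambda: {"amax": 0.0})


def impl_a(values):
    """Illustrative impl: updates the running absolute maximum."""
    state = ir_state.get("quant_state")
    state["amax"] = max(state["amax"], max(abs(v) for v in values))
    return state["amax"]


def impl_b(values):
    """Illustrative impl: scales values by the amax that impl_a recorded."""
    state = ir_state.get("quant_state")
    scale = state["amax"] or 1.0  # avoid dividing by zero before any update
    return [v / scale for v in values]


seen = impl_a([0.5, -4.0, 2.0])
scaled = impl_b([2.0, -4.0])
```

Neither function owns the state; both reach it through the registry, so either could be swapped for a different implementation without moving the data.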

Lifecycle

| Phase | Description | Overhead |
| --- | --- | --- |
| Registration | States registered alongside IR ops during model definition | One-time, at model init |
| Creation | Lazy creation on first `get()` call | Minimal (dict lookup + conditional) |
| Access | Direct pointer return from cache | Near-zero |

Design Advantages

High Cohesion: Each IR op manages its own state through the centralized registry.
Static Analysis: States can be analyzed and pre-allocated before compilation.
Easy Mocking: The registry pattern makes it straightforward to substitute mock states for testing.
Independence: Nearly orthogonal to existing IR op definitions.
Statelessness Preserved: IR ops remain semantically stateless since they declare data dependencies through the registry.
Flexible State Content: State can be any Python object. The IR system doesn't care what's inside.
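As one possible reading of the "Easy Mocking" point: because creation is lazy, a test can re-register a state type with a mock factory before the first `get()`. The sketch below assumes the condensed `IrState` stand-in shown here and a hypothetical `quantize` function; the amax values are arbitrary.

```python
class IrState:
    """Condensed stand-in for the RFC's registry (lazy creation, keyed by scope and name)."""

    def __init__(self):
        self._registry, self._cache = {}, {}

    def register_type(self, name, create_fn):
        self._registry[name] = create_fn

    def get(self, name, scope=None):
        key = (scope or "default", name)
        if key not in self._cache:
            self._cache[key] = self._registry[name]()  # KeyError if unregistered
        return self._cache[key]


ir_state = IrState()
# "Production" registration (illustrative value).
ir_state.register_type("quant_amax", lambda: {"amax": 8.0})


def quantize(x):
    """Hypothetical impl: maps x to an int8-style value using the registered amax."""
    amax = ir_state.get("quant_amax")["amax"]
    return round(x / amax * 127)


# Test setup: re-register before the first get(), so the mock factory is the one that runs.
ir_state.register_type("quant_amax", lambda: {"amax": 2.0})
result = quantize(2.0)
```

Note the limit of this approach: once a `(scope, name)` entry is cached, re-registering no longer affects it, so a mock has to be installed before first use or the API would need an explicit reset hook.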

State Scoping

Per-process: All code in the same process shares one global IrState.
Cross-process isolation: TP/DP/PP workers each have independent state (separate processes).
No cleanup needed: State lives with the process. Python GC handles it.
Recompilation-safe: TorchDynamo recompilation doesn't affect _cache.

FX Graph Interaction

Tensor state is traced into the graph and subject to AOTAutograd's functionalization (which may insert clone on in-place mutations).
Non-tensor state (dicts, handles, etc.) executes during tracing as Python side effects — not captured in the graph.

Clone handling: A StateCloneEliminationPass removes clone operations on tensors obtained from ir_state.get(), preserving memory identity for in-place operations.
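The RFC does not specify how StateCloneEliminationPass would be implemented. The idea can be illustrated on a toy graph representation: drop every `clone` node whose input comes from a state lookup and rewire its users to the state node. Everything below (`Node`, the op names, `eliminate_state_clones`) is hypothetical; a real pass would operate on the FX graph instead.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    op: str            # "state_get", "clone", or any other op name
    inputs: tuple = ()


def eliminate_state_clones(nodes):
    """Drop clone nodes whose input is a state_get node and rewire their users."""
    producers = {n.name: n for n in nodes}
    alias = {}  # removed clone name -> name of the state node it copied
    for n in nodes:
        if n.op == "clone" and producers[n.inputs[0]].op == "state_get":
            alias[n.name] = n.inputs[0]
    kept = []
    for n in nodes:
        if n.name in alias:
            continue  # this clone is eliminated
        rewired = tuple(alias.get(i, i) for i in n.inputs)
        kept.append(Node(n.name, n.op, rewired))
    return kept


graph = [
    Node("x", "input"),
    Node("ws", "state_get"),
    Node("ws_copy", "clone", ("ws",)),
    Node("x_copy", "clone", ("x",)),  # clone of a non-state value: must be kept
    Node("out", "fused_op", ("x_copy", "ws_copy")),
]
optimized = eliminate_state_clones(graph)
```

Only the clone of the state value is removed, so `fused_op` ends up operating on the state buffer itself, which is what preserves memory identity for in-place updates.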

CUDA Graph Compatibility

Why Replay Doesn't Re-execute Creation

torch.empty() is a CPU-side function call, not a GPU kernel launch. During CUDA Graph recording:

The CPU code executes once (including ir_state.get() and torch.empty())
Only GPU kernel launches are captured in the graph
During replay, the CPU code does NOT re-execute; the graph replays GPU commands using the original tensor addresses

Address Stability

During recording, tensor addresses are captured and remain valid during replay — as long as the tensor pointer itself is not modified. In-place content modification (.copy_(), [:] = ) is fine; pointer reassignment is not.
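The difference between in-place modification and reassignment can be shown without a GPU. In this sketch a `bytearray` stands in for a tensor, and holding a Python reference stands in for the address captured during recording; both are analogies, not CUDA behavior. The `IrState` class is a condensed stand-in for the one in this RFC, and the direct write to `_cache` exists only to simulate a reassignment.

```python
class IrState:
    """Condensed stand-in for the RFC's registry (lazy creation, keyed by scope and name)."""

    def __init__(self):
        self._registry, self._cache = {}, {}

    def register_type(self, name, create_fn):
        self._registry[name] = create_fn

    def get(self, name, scope=None):
        key = (scope or "default", name)
        if key not in self._cache:
            self._cache[key] = self._registry[name]()  # KeyError if unregistered
        return self._cache[key]


ir_state = IrState()
ir_state.register_type("scale_buf", lambda: bytearray(4))

# Analogy for recording: the "graph" keeps a reference to the buffer it saw.
captured = ir_state.get("scale_buf")

# In-place update: same object, so the captured reference sees the new contents.
buf = ir_state.get("scale_buf")
buf[:] = b"\x01\x02\x03\x04"
in_place_visible = bytes(captured) == b"\x01\x02\x03\x04"

# Reassignment (simulated): the registry now holds a different object,
# and the captured reference still points at the old one.
ir_state._cache[("default", "scale_buf")] = bytearray(b"\x09\x09\x09\x09")
rebound = ir_state.get("scale_buf")
stale = captured is not rebound
```

The same reasoning is why state updated between replays should be written with in-place operations on the object returned by `ir_state.get()`.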

Compatibility Conditions

CUDA Graph recording will fail, or replay will produce incorrect results, if the implementation contains:

CPU synchronization (.item(), torch.cuda.synchronize())
Dynamic control flow (if/for on tensor values)
Dynamic tensor creation that affects subsequent GPU operations

The algorithm should decide whether to support a CUDA Graph-compatible variant. We cannot force all implementations to support it.

Alternatives Considered

| Alternative | Why Not |
| --- | --- |
| Pass state as model parameters | Mixes implementation concerns with model semantics |
| Each impl manages its own state | No centralized view, hard to analyze |
| Pre-create all states at init | Many states need runtime info (shapes); wasteful |

Decision: Lazy create + clone elimination pass balances flexibility and correctness with negligible overhead.

Implementation Plan

1. Introduce IrState class with register/get interface
2. Update a few stateful ops (e.g., dynamic quantization) to use it
3. Implement StateCloneEliminationPass
4. Migrate more ops and refine API

Acknowledgments

Thank you to the vLLM community for reviewing. Open to all feedback and iteration.

Proposed Change.

skip

Feedback Period.

No response

CC List.

@ProExpertProg @harshaljanjani

Any Other Things.

No response
