vllm - 💡(How to fix) Fix [RFC]: Attention Backend Refactor

vllm, 2026-05-12 19:10:12

Root Cause

Today KVCacheSpec conflates block allocation and tensor layout. Because they're bundled, the UniformTypeKVCacheSpecs wrapper exists solely to group layers with matching allocation policy but different page sizes — and is_uniform_type() must be extended with a new isinstance branch for every new spec type.


Follow-up to RFC 42082 (Standardize KV-cache Layouts). The KV layout standardization can proceed independently (we should complete it first to avoid excessive thrash); this RFC proposes a more aggressive refactor of the attention backend and KV cache manager interfaces.

Motivation.

Each attention backend today requires four interlinked classes (~750–1000 lines per backend):

| Class | Instances | Role |
| --- | --- | --- |
| `AttentionBackend` | none (static) | capability queries, factory methods |
| `AttentionMetadataBuilder` | 1 per group | builds `AttentionMetadata` each step |
| `AttentionMetadata` | 1 per group per step | backend-specific batch state |
| `AttentionImpl` | 1 per layer | per-layer config; calls kernel in `forward()` |

Problems:

High boilerplate. Writing a new backend means implementing four classes. Adding a field to AttentionMetadata requires modifying AttentionMetadataBuilder, AttentionMetadata, and AttentionImpl.
Global manager registry. A hardcoded spec_manager_map (lines 679–686 of single_type_kv_cache_manager.py) decides which manager handles each spec type. Adding a new spec type or a model-specific eviction policy requires patching this shared map and extending the is_uniform_type() isinstance chain.
Coupled concerns in KVCacheSpec. Block allocation policy (how many blocks, when to recycle) and tensor layout (head dims, dtype, page layout) are bundled in one object. Layers that share allocation behavior but differ in tensor layout can't share a block table without the UniformTypeKVCacheSpecs wrapper.

Proposed Change.

Two independent changes:

1. Single `AttentionBackend` Class

Collapse the four classes into one. The backend manages its own state — no separate metadata type.

class AttentionBackend(ABC):

    def get_cache_config(self) -> tuple[
        type[SingleTypeKVCacheManager],
        ManagerConfig,
        KVCacheTensorSpec,
    ]:
        """Return the manager class, its config, and the tensor spec
        this backend needs. Layers with equal (cls, config) share a
        block table."""
        ...

    def bind_kv_caches(self, kv_caches: dict[str, Tensor],
                       block_table_map: dict[str, int]) -> None:
        """Bind allocated KV cache tensors and block-table mapping.
        block_table_map: layer_name → kv_cache_group_id."""
        ...

    def prep_forward(self, common_attn_metadata: CommonAttentionMetadata,
                     block_tables: dict[int, Tensor],
                     for_cudagraph_capture: bool = False) -> None:
        """Build backend-specific state once per step.
        block_tables: kv_cache_group_id → block table tensor.
        Replaces the MetadataBuilder + Metadata pair."""
        ...

    def forward(self, layer_config: LayerConfig,
                output: Tensor, query: Tensor, key: Tensor,
                value: Tensor, **kwargs) -> Tensor:
        """Execute attention for one layer. Per-layer differences
        (scale, sliding_window, num_heads) come via LayerConfig.
        KV cache is already bound via bind_kv_caches."""
        ...

LayerConfig is a frozen dataclass carrying per-layer parameters (scale, sliding_window, num_kv_heads, head_size) so there is no need for per-layer AttentionImpl instances.
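As a rough illustration of what such a dataclass might look like, here is a minimal sketch. The field names come from the sentence above; the field types and the convention that `sliding_window=None` means full attention are assumptions, not part of the RFC.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LayerConfig:
    """Per-layer attention parameters (field types are assumed)."""
    scale: float
    sliding_window: Optional[int]  # None = full attention (assumption)
    num_kv_heads: int
    head_size: int


# Two layers that differ only in their sliding window can share one
# backend instance and pass their own config to forward().
global_layer = LayerConfig(scale=128 ** -0.5, sliding_window=None,
                           num_kv_heads=8, head_size=128)
local_layer = LayerConfig(scale=128 ** -0.5, sliding_window=4096,
                          num_kv_heads=8, head_size=128)
```

Because the dataclass is frozen, instances are immutable and compare by value, so a backend could safely use them as cache keys if it needed to.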

Simplified Runner Loop

Today the runner iterates a nested 2D structure (KV cache groups × attention groups) and delegates to a complex _build_attn_group_metadata path that handles builder state management, metadata caching, and type translation:

# Current (~1289–1401 in gpu_model_runner.py): nested loop
for kv_cache_group_id, kv_cache_group_spec in enumerate(...):
    # ... build CommonAttentionMetadata ...
    for attn_group in self.attn_groups[kv_cache_group_id]:
        builder = attn_group.get_metadata_builder()
        attn_metadata_i = builder.build(
            common_prefix_len=...,
            common_attn_metadata=common_attn_metadata,
        )
        for layer_name in attn_group.layer_names:
            attn_metadata[layer_name] = attn_metadata_i

With backends managing their own state via prep_forward, the runner flattens to a single loop over attention groups. Each group can span multiple KV cache groups — relevant block tables are passed as a dict:

# Proposed: flat loop
for attn_group in self.attn_groups:
    attn_group.backend.prep_forward(
        common_metadata,
        block_tables={gid: block_tables[gid]
                      for gid in attn_group.kv_cache_group_ids},
    )

The metadata builder abstraction, metadata caching layer, and the attn_metadata dict that maps layer names to per-step metadata objects all go away.
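To make the flat loop concrete, the toy below runs it end to end. `ToyBackend`, `ToyAttnGroup`, and `prepare_step` are hypothetical stand-ins, plain dicts and lists replace `CommonAttentionMetadata` and block table tensors, and nothing here is vLLM code.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBackend:
    """Stand-in for a unified AttentionBackend (hypothetical)."""
    name: str
    step_state: dict = field(default_factory=dict)

    def prep_forward(self, common_metadata, block_tables):
        # Build per-step state once; forward() would read it later.
        self.step_state = {
            "num_tokens": common_metadata["num_tokens"],
            "group_ids": sorted(block_tables),
        }


@dataclass
class ToyAttnGroup:
    backend: ToyBackend
    kv_cache_group_ids: list


def prepare_step(attn_groups, common_metadata, block_tables):
    # Flat loop: one prep_forward call per attention group.
    for attn_group in attn_groups:
        attn_group.backend.prep_forward(
            common_metadata,
            block_tables={gid: block_tables[gid]
                          for gid in attn_group.kv_cache_group_ids},
        )


block_tables = {0: [[0, 1]], 1: [[2, 3]]}
groups = [ToyAttnGroup(ToyBackend("full"), [0]),
          ToyAttnGroup(ToyBackend("hybrid"), [0, 1])]
prepare_step(groups, {"num_tokens": 4}, block_tables)
```

The second group spans both KV cache groups, so its backend receives two block tables in a single call and no per-layer metadata dict is needed.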

Migration

A LegacyBackendAdapter wraps existing 4-class backends, forwarding prep_forward to the legacy builder's build() and forward to impl.forward. Backends can then migrate incrementally.
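One possible shape for the adapter is sketched below. The RFC gives only the name `LegacyBackendAdapter` and the forwarding idea; the constructor, the single wrapped impl (real backends have one impl per layer), the placeholder `common_prefix_len=0`, and the argument order of the legacy `impl.forward` are all assumptions. `ToyBuilder` and `ToyImpl` exist only so the sketch runs.

```python
class LegacyBackendAdapter:
    """Sketch: expose a legacy builder + impl pair via the unified interface."""

    def __init__(self, builder, impl):
        self._builder = builder   # legacy AttentionMetadataBuilder
        self._impl = impl         # legacy AttentionImpl (one layer, simplified)
        self._metadata = None     # per-step metadata, kept inside the adapter

    def prep_forward(self, common_attn_metadata, block_tables,
                     for_cudagraph_capture=False):
        # block_tables is unused here: the legacy builder is assumed to
        # read block tables from the common metadata.
        self._metadata = self._builder.build(
            common_prefix_len=0,  # placeholder value for the sketch
            common_attn_metadata=common_attn_metadata,
        )

    def forward(self, layer_config, output, query, key, value, **kwargs):
        # Argument order of the legacy forward() is an assumption.
        return self._impl.forward(query, key, value, self._metadata, output)


class ToyBuilder:
    def build(self, common_prefix_len, common_attn_metadata):
        return {"num_tokens": common_attn_metadata["num_tokens"]}


class ToyImpl:
    def forward(self, query, key, value, attn_metadata, output):
        output.append((query, attn_metadata["num_tokens"]))
        return output


adapter = LegacyBackendAdapter(ToyBuilder(), ToyImpl())
adapter.prep_forward({"num_tokens": 3}, block_tables={})
result = adapter.forward(None, [], "q", "k", "v")
```

The metadata object still exists in this arrangement, but it stays private to the adapter and the runner never sees it.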

2. Backends Declare Manager + Tensor Spec

Instead of bundling block allocation policy and tensor layout in KVCacheSpec, each backend returns the manager it needs, as a tuple:

(manager_cls, manager_config, kv_cache_tensor_spec)

manager_cls + manager_config — block allocation policy. Layers with equal (manager_cls, manager_config) share a block table. This replaces spec_manager_map, UniformTypeKVCacheSpecs, and is_uniform_type().
kv_cache_tensor_spec — physical tensor layout (the KVCacheTensorSpec from RFC 42082). Stays on the worker, never crosses to the scheduler.
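The grouping rule can be sketched in a few lines. This is an illustration only: the manager classes are bare stubs without the `SingleTypeKVCacheManager` base, and `group_layers` and the layer names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


class FullAttentionManager:
    @dataclass(frozen=True)
    class Config:
        pass


class SlidingWindowManager:
    @dataclass(frozen=True)
    class Config:
        sliding_window: int


def group_layers(layer_cache_configs):
    """Map each distinct (manager_cls, manager_config) to its layer names."""
    groups = defaultdict(list)
    for name, (manager_cls, manager_config) in layer_cache_configs.items():
        # Frozen dataclasses hash and compare by value, so the pair
        # works directly as a dict key.
        groups[(manager_cls, manager_config)].append(name)
    return dict(groups)


layers = {
    "layers.0.attn": (FullAttentionManager, FullAttentionManager.Config()),
    "layers.1.attn": (SlidingWindowManager, SlidingWindowManager.Config(4096)),
    "layers.2.attn": (FullAttentionManager, FullAttentionManager.Config()),
    "layers.3.attn": (SlidingWindowManager, SlidingWindowManager.Config(4096)),
}
groups = group_layers(layers)
```

Here the four layers collapse into two block-table groups without any isinstance checks, which is the property the proposal relies on.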

Manager Configs

Each manager defines a frozen Config dataclass carrying only allocation-policy fields. block_size and num_blocks are constructor parameters (global values determined by the worker), not part of the config.

class FullAttentionManager(SingleTypeKVCacheManager):
    @dataclass(frozen=True)
    class Config:
        pass

class SlidingWindowManager(SingleTypeKVCacheManager):
    @dataclass(frozen=True)
    class Config:
        sliding_window: int

class MambaManager(SingleTypeKVCacheManager):
    @dataclass(frozen=True)
    class Config:
        mamba_cache_mode: str
        num_speculative_blocks: int

Models can supply custom managers (e.g. a model-specific eviction policy) by returning a different manager_cls from get_cache_config() — no patching of shared registries.
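As a hedged example of that extension point, the sketch below defines a made-up manager and returns it from a backend. `PinnedPrefixManager`, its `pinned_prefix_tokens` field, and `MyModelAttentionBackend` are invented for illustration, and the base class is a local stub rather than vLLM's.

```python
from dataclasses import dataclass


class SingleTypeKVCacheManager:
    """Local stub standing in for vLLM's base class."""


class PinnedPrefixManager(SingleTypeKVCacheManager):
    """Hypothetical manager that never evicts a pinned prefix."""

    @dataclass(frozen=True)
    class Config:
        pinned_prefix_tokens: int


class MyModelAttentionBackend:
    def get_cache_config(self):
        # The tensor spec is None in this sketch; in the proposal it
        # would be the KVCacheTensorSpec from RFC 42082.
        return (PinnedPrefixManager,
                PinnedPrefixManager.Config(pinned_prefix_tokens=256),
                None)


backend = MyModelAttentionBackend()
manager_cls, manager_config, tensor_spec = backend.get_cache_config()
```

Nothing outside the model's own module has to change for the new manager to be picked up.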

Worker-Driven Allocation

The worker has the model, the backends, and the GPU. It owns grouping and tensor allocation — the engine only needs allocation policy and a block budget:

Worker:           collect (manager_cls, manager_config, tensor_spec) per backend
Worker:           group by (manager_cls, manager_config) equality
Worker:           compute block_size, kernel_block_size, num_blocks
Worker → Engine:  SchedulerKVCacheConfig { num_blocks, block_size, groups }
Engine:           min(num_blocks) across workers → broadcast agreed value
Worker:           allocate tensors locally with agreed num_blocks

The scheduler never sees KVCacheTensorSpec. Tensor layout plans no longer cross the process boundary.
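The engine-side step of this handshake is small enough to sketch. Only the name `SchedulerKVCacheConfig` and its three fields come from the flow above; the field types, the `agree_on_num_blocks` helper, and the example numbers are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchedulerKVCacheConfig:
    num_blocks: int
    block_size: int
    groups: tuple  # one entry per (manager_cls, manager_config) group


def agree_on_num_blocks(worker_configs):
    """Engine side: take the smallest block budget any worker reported."""
    return min(cfg.num_blocks for cfg in worker_configs)


reports = [
    SchedulerKVCacheConfig(num_blocks=8192, block_size=16, groups=("full",)),
    SchedulerKVCacheConfig(num_blocks=7936, block_size=16, groups=("full",)),
]
agreed = agree_on_num_blocks(reports)
```

Each worker would then allocate its tensors with `agreed` blocks, so the message carries only allocation policy and a block count.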

What Disappears

| Today | After |
| --- | --- |
| `spec_manager_map` (global dict) | Gone: `manager_cls` returned by backend |
| `UniformTypeKVCacheSpecs` + `is_uniform_type()` isinstance chain | Gone: `(cls, config)` equality |
| `KVCacheTensor` layout plan crossing to engine | Gone: worker allocates locally |
| 4-class backend structure (Backend + Builder + Metadata + Impl) | Gone: single `AttentionBackend` |
| `get_kv_cache_shape()` / `get_kv_cache_stride_order()` on backend | Gone: lives on `KVCacheTensorSpec` (RFC 42082) |
| Per-layer `AttentionImpl` instances | Gone: layer specifics passed to `forward()` |

Trade-offs

Model definitions must manage backend sharing. Today the runner creates one AttentionImpl per layer automatically. With the unified backend, the model is responsible for ensuring layers that should share a backend actually reference the same instance. This is more explicit but slightly more work for model authors.
LegacyBackendAdapter is needed during migration. All existing backends continue working behind the adapter, but the adapter adds one indirection layer until each backend is ported natively.
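The first trade-off might look like this in a model definition. The sketch is purely illustrative: `ToyBackend`, `ToyAttentionLayer`, and `build_layers` are hypothetical names, and the RFC does not specify how layers would obtain their backend.

```python
class ToyBackend:
    """Stand-in for a unified AttentionBackend (hypothetical)."""

    def __init__(self):
        self.bound_layers = []


class ToyAttentionLayer:
    def __init__(self, name, backend):
        self.name = name
        self.backend = backend
        backend.bound_layers.append(name)


def build_layers(num_layers):
    # The model creates one backend and hands the same instance to
    # every layer that should share it.
    shared = ToyBackend()
    return [ToyAttentionLayer(f"layers.{i}.attn", shared)
            for i in range(num_layers)]


layers = build_layers(3)
```

If the model created a fresh backend per layer instead, each would build its own per-step state, which is the mistake the explicit sharing is meant to make visible.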

Feedback Period.

No response

CC List.

@MatthewBonanni @WoosukKwon @heheda12345

Worktracking

TODO: start work following https://github.com/vllm-project/vllm/issues/42082
