vllm - [RFC]: Attention Backend Refactor


Follow-up to RFC 42082 (Standardize KV-cache Layouts). The KV layout standardization can proceed independently, and we should complete it first to avoid excessive thrash. This RFC proposes a more aggressive refactor of the attention backend and KV cache manager interfaces.

Motivation.

Each attention backend today requires four interlinked classes (~750–1000 lines per backend):

| Class | Instances | Role |
| --- | --- | --- |
| AttentionBackend | (none) static | capability queries, factory methods |
| AttentionMetadataBuilder | 1 per group | builds AttentionMetadata each step |
| AttentionMetadata | 1 per group per step | backend-specific batch state |
| AttentionImpl | 1 per layer | per-layer config; calls kernel in forward() |

Problems:

  1. High boilerplate. Writing a new backend means implementing four classes. Adding a field to AttentionMetadata requires modifying AttentionMetadataBuilder, AttentionMetadata, and AttentionImpl.
  2. Global manager registry. A hardcoded spec_manager_map (lines 679–686 in single_type_kv_cache_manager.py) decides which manager handles each spec type. Adding a new spec type or a model-specific eviction policy requires patching this shared map and extending the is_uniform_type() isinstance chain.
  3. Coupled concerns in KVCacheSpec. Block allocation policy (how many blocks, when to recycle) and tensor layout (head dims, dtype, page layout) are bundled in one object. Layers that share allocation behavior but differ in tensor layout can't share a block table without the UniformTypeKVCacheSpecs wrapper.

Proposed Change.

Two independent changes:

1. Single AttentionBackend Class

Collapse the four classes into one. The backend manages its own state — no separate metadata type.

class AttentionBackend(ABC):

    def get_cache_config(self) -> tuple[
        type[SingleTypeKVCacheManager],
        ManagerConfig,
        KVCacheTensorSpec,
    ]:
        """Return the manager class, its config, and the tensor spec
        this backend needs. Layers with equal (cls, config) share a
        block table."""
        ...

    def bind_kv_caches(self, kv_caches: dict[str, Tensor],
                       block_table_map: dict[str, int]) -> None:
        """Bind allocated KV cache tensors and block-table mapping.
        block_table_map: layer_name → kv_cache_group_id."""
        ...

    def prep_forward(self, common_attn_metadata: CommonAttentionMetadata,
                     block_tables: dict[int, Tensor],
                     for_cudagraph_capture: bool = False) -> None:
        """Build backend-specific state once per step.
        block_tables: kv_cache_group_id → block table tensor.
        Replaces the MetadataBuilder + Metadata pair."""
        ...

    def forward(self, layer_config: LayerConfig,
                output: Tensor, query: Tensor, key: Tensor,
                value: Tensor, **kwargs) -> Tensor:
        """Execute attention for one layer. Per-layer differences
        (scale, sliding_window, num_heads) come via LayerConfig.
        KV cache is already bound via bind_kv_caches."""
        ...

LayerConfig is a frozen dataclass carrying per-layer parameters (scale, sliding_window, num_kv_heads, head_size) so there is no need for per-layer AttentionImpl instances.
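
A minimal sketch of what such a LayerConfig could look like, assuming only the four fields named above. The default for sliding_window, the ToyBackend class, and its reduced forward signature are hypothetical, and plain Python lists stand in for tensors.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LayerConfig:
    """Per-layer parameters; field names follow the RFC text."""
    scale: float
    num_kv_heads: int
    head_size: int
    sliding_window: Optional[int] = None  # assumed default: None = full attention


class ToyBackend:
    """Hypothetical stand-in for a backend shared by several layers."""

    def forward(self, layer_config: LayerConfig, query: list[float]) -> list[float]:
        # A real backend would launch a kernel. Here we only apply the
        # per-layer scale to show that no per-layer object is required.
        return [q * layer_config.scale for q in query]


backend = ToyBackend()  # one instance, shared by both layers
layer0 = LayerConfig(scale=0.5, num_kv_heads=8, head_size=128)
layer1 = LayerConfig(scale=0.25, num_kv_heads=8, head_size=128, sliding_window=4096)

out0 = backend.forward(layer0, [2.0, 4.0])
out1 = backend.forward(layer1, [2.0, 4.0])
```

Because the dataclass is frozen, two configs with the same field values compare equal and hash identically, so they can be used as dictionary keys or deduplicated.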

Simplified Runner Loop

Today the runner iterates a nested 2D structure (KV cache groups × attention groups) and delegates to a complex _build_attn_group_metadata path that handles builder state management, metadata caching, and type translation:

# Current (~1289–1401 in gpu_model_runner.py): nested loop
for kv_cache_group_id, kv_cache_group_spec in enumerate(...):
    # ... build CommonAttentionMetadata ...
    for attn_group in self.attn_groups[kv_cache_group_id]:
        builder = attn_group.get_metadata_builder()
        attn_metadata_i = builder.build(
            common_prefix_len=...,
            common_attn_metadata=common_attn_metadata,
        )
        for layer_name in attn_group.layer_names:
            attn_metadata[layer_name] = attn_metadata_i

With backends managing their own state via prep_forward, the runner flattens to a single loop over attention groups. Each group can span multiple KV cache groups — relevant block tables are passed as a dict:

# Proposed: flat loop
for attn_group in self.attn_groups:
    attn_group.backend.prep_forward(
        common_metadata,
        block_tables={gid: block_tables[gid]
                      for gid in attn_group.kv_cache_group_ids},
    )

The metadata builder abstraction, metadata caching layer, and the attn_metadata dict that maps layer names to per-step metadata objects all go away.
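
To make the flat loop concrete, here is a toy, self-contained version. RecordingBackend, AttnGroup, and prepare_step are hypothetical names used only for illustration; dicts and lists stand in for CommonAttentionMetadata and block table tensors.

```python
from dataclasses import dataclass, field


class RecordingBackend:
    """Hypothetical backend that keeps the state built in prep_forward."""

    def __init__(self) -> None:
        self.step_state: dict = {}

    def prep_forward(self, common_metadata: dict,
                     block_tables: dict[int, list[int]]) -> None:
        # Build backend-specific state once per step and keep it on the
        # backend itself instead of returning a metadata object.
        self.step_state = {
            "num_tokens": common_metadata["num_tokens"],
            "block_tables": block_tables,
        }


@dataclass
class AttnGroup:
    backend: RecordingBackend
    kv_cache_group_ids: list[int] = field(default_factory=list)


def prepare_step(attn_groups, common_metadata, block_tables) -> None:
    # Single loop over attention groups; each group receives only the
    # block tables of the KV cache groups it spans.
    for attn_group in attn_groups:
        attn_group.backend.prep_forward(
            common_metadata,
            block_tables={gid: block_tables[gid]
                          for gid in attn_group.kv_cache_group_ids},
        )


all_block_tables = {0: [3, 7], 1: [5], 2: [9, 1]}
attn_groups = [
    AttnGroup(RecordingBackend(), kv_cache_group_ids=[0, 2]),
    AttnGroup(RecordingBackend(), kv_cache_group_ids=[1]),
]
prepare_step(attn_groups, {"num_tokens": 16}, all_block_tables)
```

After prepare_step, each backend holds its own per-step state, and the runner keeps no per-layer metadata dict.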

Migration

A LegacyBackendAdapter wraps existing 4-class backends, forwarding prep_forward → builder + build and forward → impl.forward. Backends migrate incrementally.
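
One way the adapter could be shaped is sketched below. This is an assumption, not the RFC's implementation: ToyLegacyBuilder and ToyLegacyImpl are hypothetical stand-ins with simplified signatures that do not match the real vLLM builder or impl classes, and the sketch wraps a single impl rather than one per layer.

```python
class ToyLegacyBuilder:
    """Hypothetical legacy metadata builder."""

    def build(self, common_attn_metadata: dict) -> dict:
        return {"max_seq_len": max(common_attn_metadata["seq_lens"])}


class ToyLegacyImpl:
    """Hypothetical legacy impl that expects metadata as an argument."""

    def forward(self, query: list[float], attn_metadata: dict) -> list[float]:
        return [q / attn_metadata["max_seq_len"] for q in query]


class LegacyBackendAdapter:
    """Exposes the proposed prep_forward/forward pair over legacy objects."""

    def __init__(self, builder, impl) -> None:
        self._builder = builder
        self._impl = impl
        self._metadata = None

    def prep_forward(self, common_attn_metadata, block_tables,
                     for_cudagraph_capture: bool = False) -> None:
        # block_tables and for_cudagraph_capture are ignored in this toy.
        # The built metadata is cached on the adapter for forward().
        self._metadata = self._builder.build(common_attn_metadata)

    def forward(self, layer_config, output, query, key, value, **kwargs):
        if self._metadata is None:
            raise RuntimeError("prep_forward must be called before forward")
        return self._impl.forward(query, self._metadata)


adapter = LegacyBackendAdapter(ToyLegacyBuilder(), ToyLegacyImpl())
adapter.prep_forward({"seq_lens": [2, 4]}, block_tables={})
result = adapter.forward(None, None, [8.0, 4.0], None, None)
```

The adapter is the only place that still knows about metadata objects, so the runner can treat legacy and ported backends the same way.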

2. Backends Declare Manager + Tensor Spec

Today KVCacheSpec conflates block allocation and tensor layout. Because they're bundled, the UniformTypeKVCacheSpecs wrapper exists solely to group layers with matching allocation policy but different page sizes — and is_uniform_type() must be extended with a new isinstance branch for every new spec type.

Instead, each backend returns the manager it needs:

(manager_cls, manager_config, kv_cache_tensor_spec)

  • manager_cls + manager_config: block allocation policy. Layers with equal (manager_cls, manager_config) share a block table. This replaces spec_manager_map, UniformTypeKVCacheSpecs, and is_uniform_type().
  • kv_cache_tensor_spec: physical tensor layout (the KVCacheTensorSpec from RFC 42082). Stays on the worker, never crosses to the scheduler.

Manager Configs

Each manager defines a frozen Config dataclass carrying only allocation-policy fields. block_size and num_blocks are constructor parameters (global values determined by the worker), not part of the config.

class FullAttentionManager(SingleTypeKVCacheManager):
    @dataclass(frozen=True)
    class Config:
        pass

class SlidingWindowManager(SingleTypeKVCacheManager):
    @dataclass(frozen=True)
    class Config:
        sliding_window: int

class MambaManager(SingleTypeKVCacheManager):
    @dataclass(frozen=True)
    class Config:
        mamba_cache_mode: str
        num_speculative_blocks: int

Models can supply custom managers (e.g. a model-specific eviction policy) by returning a different manager_cls from get_cache_config() — no patching of shared registries.
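
The sketch below illustrates how (manager_cls, manager_config) equality could drive grouping, and how a model-specific manager ends up in its own group without any registry. The base class here is an empty stand-in for the real one, and group_layers, the layer names, and MyModelEvictionManager are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


class SingleTypeKVCacheManager:
    """Empty stand-in for the real base class in vLLM."""


class FullAttentionManager(SingleTypeKVCacheManager):
    @dataclass(frozen=True)
    class Config:
        pass


class SlidingWindowManager(SingleTypeKVCacheManager):
    @dataclass(frozen=True)
    class Config:
        sliding_window: int


class MyModelEvictionManager(FullAttentionManager):
    """Hypothetical model-specific manager with a custom eviction policy."""


def group_layers(layer_cache_configs: dict) -> dict:
    # Frozen dataclasses are hashable and compare by value, so the
    # (manager_cls, manager_config) tuple works directly as a dict key.
    groups = defaultdict(list)
    for layer_name, (manager_cls, manager_config) in layer_cache_configs.items():
        groups[(manager_cls, manager_config)].append(layer_name)
    return dict(groups)


layer_cache_configs = {
    "layers.0.attn": (FullAttentionManager, FullAttentionManager.Config()),
    "layers.1.attn": (SlidingWindowManager, SlidingWindowManager.Config(sliding_window=1024)),
    "layers.2.attn": (FullAttentionManager, FullAttentionManager.Config()),
    "layers.3.attn": (SlidingWindowManager, SlidingWindowManager.Config(sliding_window=4096)),
    "layers.4.attn": (MyModelEvictionManager, MyModelEvictionManager.Config()),
}
layer_groups = group_layers(layer_cache_configs)
```

Layers 0 and 2 share a block table. The two sliding-window layers do not, because their configs differ. The custom manager forms its own group even though its Config is equal to FullAttentionManager.Config(), because the manager class is part of the key.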

Worker-Driven Allocation

The worker has the model, the backends, and the GPU. It owns grouping and tensor allocation — the engine only needs allocation policy and a block budget:

Worker:           collect (manager_cls, manager_config, tensor_spec) per backend
Worker:           group by (manager_cls, manager_config) equality
Worker:           compute block_size, kernel_block_size, num_blocks
Worker → Engine:  SchedulerKVCacheConfig { num_blocks, block_size, groups }
Engine:           min(num_blocks) across workers → broadcast agreed value
Worker:           allocate tensors locally with agreed num_blocks

The scheduler never sees KVCacheTensorSpec. Tensor layout plans no longer cross the process boundary.
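
The handshake above can be sketched as follows. Only the SchedulerKVCacheConfig name and its three fields come from the steps listed; the field types, agree_on_num_blocks, allocate_locally, and the bytes_per_token parameter are assumptions, and a bytearray stands in for GPU tensors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchedulerKVCacheConfig:
    """What a worker reports to the engine: policy and budget, no tensor layout."""
    num_blocks: int
    block_size: int
    groups: tuple  # allocation-policy groups only (assumed representation)


def agree_on_num_blocks(worker_configs: list[SchedulerKVCacheConfig]) -> int:
    # Engine side: the smallest budget wins so every worker can honor it.
    if not worker_configs:
        raise ValueError("at least one worker config is required")
    return min(cfg.num_blocks for cfg in worker_configs)


def allocate_locally(num_blocks: int, block_size: int, bytes_per_token: int) -> bytearray:
    # Worker side: stand-in for allocating KV cache tensors on the device.
    return bytearray(num_blocks * block_size * bytes_per_token)


reports = [
    SchedulerKVCacheConfig(num_blocks=1000, block_size=16, groups=("full",)),
    SchedulerKVCacheConfig(num_blocks=960, block_size=16, groups=("full",)),
    SchedulerKVCacheConfig(num_blocks=1024, block_size=16, groups=("full",)),
]
agreed = agree_on_num_blocks(reports)
kv_buffer = allocate_locally(agreed, block_size=16, bytes_per_token=2)
```

Only the agreed block count travels back to the workers; the layout-dependent size computation stays inside allocate_locally on each worker.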

What Disappears

| Today | After |
| --- | --- |
| spec_manager_map (global dict) | Gone: manager_cls returned by backend |
| UniformTypeKVCacheSpecs + is_uniform_type() isinstance chain | Gone: (cls, config) equality |
| KVCacheTensor layout plan crossing to engine | Gone: worker allocates locally |
| 4-class backend structure (Backend + Builder + Metadata + Impl) | Gone: single AttentionBackend |
| get_kv_cache_shape() / get_kv_cache_stride_order() on backend | Gone: lives on KVCacheTensorSpec (RFC 42082) |
| Per-layer AttentionImpl instances | Gone: layer specifics passed to forward() |

Trade-offs

  • Model definitions must manage backend sharing. Today the runner creates one AttentionImpl per layer automatically. With the unified backend, the model is responsible for ensuring layers that should share a backend actually reference the same instance. This is more explicit but slightly more work for model authors.
  • LegacyBackendAdapter is needed during migration. All existing backends continue working behind the adapter, but the adapter adds one indirection layer until each backend is ported natively.

Feedback Period.

No response

CC List.

@MatthewBonanni @WoosukKwon @heheda12345

Worktracking

TODO: start work following https://github.com/vllm-project/vllm/issues/42082
