vllm - ✅(Solved) Fix [Feature]: Support per-layer sliding window attention for Qwen3 [1 pull requests, 1 participants]

bzantium · 2026-04-10T16:34:28Z


vllm-project/vllm#39514•Fetched 2026-04-11 06:13:07

Root Cause

This feature is needed because:

We are training a new model based on the Qwen3 architecture with interleaved sliding window attention (e.g., 3:1 sliding:full pattern), and it will be released when training is done.
Transformers already supports this natively — vLLM should be consistent.
Other models in vLLM (Gemma2, Llama, CommandR, etc.) already implement this exact pattern.

Fix Action

Proposed

Proposed fix in PR: [Model] Support per-layer sliding window attention for Qwen3 (https://github.com/vllm-project/vllm/pull/39515). The PR was open and unmerged when this page was fetched.

PR fix notes

PR #39515: [Model] Support per-layer sliding window attention for Qwen3

Repository: vllm-project/vllm
Author: bzantium
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/39515

Description (problem / solution / changelog)

Summary

Add per-layer sliding window attention support for Qwen3 and Qwen3MoE models
Read layer_types from Qwen3Config and pass per_layer_sliding_window to Attention, following the same pattern as Gemma2
Enables interleaved sliding/full attention patterns (e.g., 3:1 sliding:full)

Fixes #39514

Motivation

Transformers' Qwen3Config already supports layer_types with "sliding_attention" / "full_attention" entries and a sliding_window size, but vLLM was not wiring this up. We are training a new model based on Qwen3 with interleaved sliding window attention and need vLLM to support it for inference.
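To make the interleaved pattern concrete, here is a minimal sketch that builds a `layer_types` list for a 3:1 sliding:full layout. The helper name `interleaved_layer_types`, the choice to place the full-attention layer last in each group of four, and the window size of 4096 are all assumptions for illustration; the PR does not say which position is full or what window the new model uses.

```python
def interleaved_layer_types(num_layers: int, sliding_per_full: int = 3) -> list[str]:
    """Build a layer_types list with `sliding_per_full` sliding layers before each full layer.

    Hypothetical helper: the placement of the full-attention layer is an assumption.
    """
    period = sliding_per_full + 1
    return [
        "full_attention" if (i + 1) % period == 0 else "sliding_attention"
        for i in range(num_layers)
    ]


# Illustrative config.json fragment for an 8-layer model (values are made up).
config_fragment = {
    "num_hidden_layers": 8,
    "sliding_window": 4096,
    "layer_types": interleaved_layer_types(8),
}
```

With eight layers this yields three sliding layers followed by one full layer, twice.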

Changes

vllm/model_executor/models/qwen3.py: Qwen3DecoderLayer now extracts layer_types[layer_idx] and computes per_layer_sliding_window, passing it through Qwen3Attention to the Attention layer.
vllm/model_executor/models/qwen3_moe.py: Same change applied to Qwen3MoeDecoderLayer and Qwen3MoeAttention.
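The per-layer lookup described above can be sketched as a standalone function. This is not the code from the PR: `resolve_sliding_window` is a hypothetical name, and `SimpleNamespace` stands in for `Qwen3Config` so the snippet runs without vLLM or Transformers installed.

```python
from types import SimpleNamespace
from typing import Optional


def resolve_sliding_window(config, layer_idx: int) -> Optional[int]:
    """Return the window size for a sliding layer, or None for full attention.

    Hypothetical helper mirroring the logic the PR describes.
    """
    layer_types = getattr(config, "layer_types", None)
    if layer_types is None:
        # Configs without layer_types keep full attention everywhere.
        return None
    if layer_types[layer_idx] == "sliding_attention":
        return getattr(config, "sliding_window", None)
    return None


# Stand-in config with a 3:1 sliding:full pattern (values are illustrative).
cfg = SimpleNamespace(
    sliding_window=4096,
    layer_types=["sliding_attention"] * 3 + ["full_attention"],
)
windows = [resolve_sliding_window(cfg, i) for i in range(4)]
```

The result is what each decoder layer would forward as `per_layer_sliding_window`: an integer for sliding layers and `None` for full-attention layers.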

Compatibility

Standard Qwen3 (all full_attention) is unaffected — per_layer_sliding_window remains None.
The existing is_interleaved() check correctly sets cache_config.sliding_window = None for mixed-type models, preventing full-attention layers from inheriting the global sliding window.
All standard attention backends (FlashAttention, FlashInfer, Triton, FlexAttention) support sliding window.
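The interleaving check mentioned above can be pictured with a simplified sketch. It is not vLLM's actual `is_interleaved()` implementation, whose details are not shown in this page; `has_mixed_layer_types` and `cache_wide_window` are hypothetical names that only illustrate why the cache-wide window is cleared when layers disagree.

```python
def has_mixed_layer_types(layer_types) -> bool:
    """True when both sliding and full attention layers are present."""
    if not layer_types:
        return False
    kinds = set(layer_types)
    return "sliding_attention" in kinds and "full_attention" in kinds


def cache_wide_window(layer_types, sliding_window):
    """Pick a cache-wide default window.

    For mixed models no single window fits every layer, so return None
    and let each layer carry its own per-layer value instead.
    """
    if has_mixed_layer_types(layer_types):
        return None
    # Uniform or unspecified layer_types: leave the configured value untouched.
    return sliding_window
```

Under this sketch a full-attention layer in a mixed model never inherits a global window, which is the property the compatibility note relies on.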

Test plan

Verify standard Qwen3 models (e.g., Qwen/Qwen3-8B) still work with no behavior change
Verify Qwen3 with interleaved layer_types config correctly applies per-layer sliding window
Run existing sliding window correctness tests (tests/v1/e2e/general/test_correctness_sliding_window.py)

Changed files

vllm/model_executor/models/qwen3.py (modified, +12/-0)
vllm/model_executor/models/qwen3_moe.py (modified, +12/-1)


Motivation

Qwen3's HuggingFace config (Qwen3Config) already supports per-layer sliding window attention via the layer_types attribute (with "sliding_attention" and "full_attention" entries) and sliding_window size. This is used when use_sliding_window=True — either through explicit layer_types in config.json or auto-generated from max_window_layers.
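The auto-generation path can be sketched as follows. The rule used here (layers with index at or above `max_window_layers` become sliding, and only when `use_sliding_window` is true) is an assumption based on my reading of Transformers and should be checked against the installed version; `default_layer_types` is a hypothetical name.

```python
def default_layer_types(
    num_hidden_layers: int,
    max_window_layers: int,
    sliding_window,
    use_sliding_window: bool,
) -> list[str]:
    """Derive layer_types when config.json does not list them explicitly.

    Assumed rule: the first `max_window_layers` layers stay full attention.
    """
    # The window only takes effect when sliding attention is switched on.
    effective_window = sliding_window if use_sliding_window else None
    return [
        "sliding_attention"
        if effective_window is not None and i >= max_window_layers
        else "full_attention"
        for i in range(num_hidden_layers)
    ]
```

With `use_sliding_window=False` every layer comes out as `"full_attention"`, which matches the statement that standard Qwen3 checkpoints are unaffected.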

However, vLLM's Qwen3 model implementation does not read layer_types or pass per_layer_sliding_window to the Attention layer, so all layers always use full attention regardless of the config.
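To show what is lost when every layer falls back to full attention, the sketch below builds both kinds of causal mask in plain Python. The window convention (a query sees itself plus the previous `sliding_window - 1` tokens) is an assumption; attention backends may count the window differently.

```python
def attention_mask(seq_len: int, sliding_window=None) -> list[list[bool]]:
    """mask[q][k] is True when query position q may attend to key position k."""
    mask = []
    for q in range(seq_len):
        row = []
        for k in range(seq_len):
            causal = k <= q
            # With no window, every earlier token stays visible.
            in_window = sliding_window is None or q - k < sliding_window
            row.append(causal and in_window)
        mask.append(row)
    return mask


full = attention_mask(5)                      # full causal attention
sliding = attention_mask(5, sliding_window=2)  # each query sees at most 2 keys
```

For the last query position, the full mask exposes all five keys while the sliding mask exposes only the final two, so ignoring `layer_types` changes which tokens a sliding layer can see once the sequence exceeds the window.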

This feature is needed because:

We are training a new model based on the Qwen3 architecture with interleaved sliding window attention (e.g., 3:1 sliding:full pattern), and it will be released when training is done.
Transformers already supports this natively — vLLM should be consistent.
Other models in vLLM (Gemma2, Llama, CommandR, etc.) already implement this exact pattern.

Proposed Change

Wire up config.layer_types in Qwen3Attention and Qwen3MoeAttention to pass per_layer_sliding_window to the Attention constructor, following the same pattern as Gemma2.

Additional Context

The is_interleaved() detection in vllm/transformers_utils/config.py already correctly identifies mixed layer_types, so cache_config.sliding_window is properly set to None for interleaved models (preventing full-attention layers from inheriting the global sliding window).
All standard attention backends (FlashAttention, FlashInfer, Triton, FlexAttention) support sliding window.
No changes are needed outside the model files — the existing infrastructure handles everything.

extent analysis

TL;DR

Update the Qwen3Attention and Qwen3MoeAttention classes to utilize the layer_types attribute from the Qwen3Config to enable per-layer sliding window attention.

Guidance

Review the Gemma2 model implementation to understand how it handles per-layer sliding window attention and replicate this pattern in Qwen3.
Modify the Qwen3Attention and Qwen3MoeAttention classes to accept and pass per_layer_sliding_window to the Attention constructor based on the layer_types config.
Verify that the is_interleaved() detection in vllm/transformers_utils/config.py correctly identifies mixed layer_types and sets cache_config.sliding_window to None for interleaved models.
Test the updated Qwen3 model implementation with different layer_types configurations to ensure correct behavior.

Example

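class Qwen3Attention(...):
    def __init__(self, config, layer_idx, ...):
        ...
        # layer_types is a list with one entry per layer, so index it by layer_idx
        layer_types = getattr(config, "layer_types", None)
        is_sliding = layer_types is not None and layer_types[layer_idx] == "sliding_attention"
        # Pass the window size (an int) for sliding layers and None for full-attention layers
        self.per_layer_sliding_window = config.sliding_window if is_sliding else None
        ...
        Attention(..., per_layer_sliding_window=self.per_layer_sliding_window)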

Notes

This fix assumes that the layer_types attribute in Qwen3Config is correctly set and that the is_interleaved() detection works as expected. Additional testing may be necessary to ensure the updated implementation works correctly for all possible configurations.

Recommendation

Update the Qwen3Attention and Qwen3MoeAttention classes to support per-layer sliding window attention. This is a feature addition rather than a workaround: it brings Qwen3 in line with other vLLM models that already follow this pattern and enables interleaved sliding/full attention configs.
