vllm - ✅(Solved) Fix [Feature]: Support per-layer sliding window attention for Qwen3 [1 pull requests, 1 participants]

GitHub stats
vllm-project/vllm#39514 (fetched 2026-04-11 06:13:07)
Comments: 0
Participants: 1
Timeline: 1 (cross-referenced ×1)
Reactions: 0

Root Cause

This feature is needed because:

  • We are training a new model based on the Qwen3 architecture with interleaved sliding window attention (e.g., 3:1 sliding:full pattern), and it will be released when training is done.
  • Transformers already supports this natively — vLLM should be consistent.
  • Other models in vLLM (Gemma2, Llama, CommandR, etc.) already implement this exact pattern.

Fix Action

Fixed

PR fix notes

PR #39515: [Model] Support per-layer sliding window attention for Qwen3

Description (problem / solution / changelog)

Summary

  • Add per-layer sliding window attention support for Qwen3 and Qwen3MoE models
  • Read layer_types from Qwen3Config and pass per_layer_sliding_window to Attention, following the same pattern as Gemma2
  • Enables interleaved sliding/full attention patterns (e.g., 3:1 sliding:full)
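
To make the sliding/full distinction concrete, here is a minimal sketch of which key positions a causal query can see in each layer type. The helper name `visible_key_positions` is hypothetical, and the convention that the window counts the query position itself is an assumption; real attention backends may define the boundary differently.

```python
from typing import List, Optional


def visible_key_positions(query_pos: int, window: Optional[int] = None) -> List[int]:
    """Key positions a causal query at `query_pos` may attend to.

    window=None models a full-attention layer. An integer models a
    sliding-attention layer that sees only the most recent `window`
    tokens, counting the query position itself (assumed convention).
    """
    start = 0 if window is None else max(0, query_pos - window + 1)
    return list(range(start, query_pos + 1))


full = visible_key_positions(5)               # full-attention layer
sliding = visible_key_positions(5, window=3)  # sliding-attention layer
```

In an interleaved model, most layers pay only for the short `sliding` range while the occasional full-attention layer still sees the whole prefix.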

Fixes #39514

Motivation

Transformers' Qwen3Config already supports layer_types with "sliding_attention" / "full_attention" entries and a sliding_window size, but vLLM was not wiring this up. We are training a new model based on Qwen3 with interleaved sliding window attention and need vLLM to support it for inference.

Changes

  • vllm/model_executor/models/qwen3.py: Qwen3DecoderLayer now extracts layer_types[layer_idx] and computes per_layer_sliding_window, passing it through Qwen3Attention to the Attention layer.
  • vllm/model_executor/models/qwen3_moe.py: Same change applied to Qwen3MoeDecoderLayer and Qwen3MoeAttention.
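
The per-layer lookup described above can be sketched roughly as follows. This is not the code from the PR: `resolve_per_layer_sliding_window` is a hypothetical helper, and a `SimpleNamespace` stands in for the real config object. Only the attribute names `layer_types` and `sliding_window` come from the description.

```python
from types import SimpleNamespace
from typing import Optional


def resolve_per_layer_sliding_window(config, layer_idx: int) -> Optional[int]:
    """Return the window size for a sliding layer, or None for full attention."""
    layer_types = getattr(config, "layer_types", None)
    if layer_types is None:
        # Configs without layer_types are treated as all full attention.
        return None
    if layer_types[layer_idx] == "sliding_attention":
        return getattr(config, "sliding_window", None)
    return None


# Stand-in config with a 3:1 sliding:full pattern (values are illustrative).
config = SimpleNamespace(
    sliding_window=4096,
    layer_types=["sliding_attention"] * 3 + ["full_attention"],
)
windows = [resolve_per_layer_sliding_window(config, i) for i in range(4)]
```

Each decoder layer would then forward its own value as `per_layer_sliding_window`, so full-attention layers receive `None`.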

Compatibility

  • Standard Qwen3 (all full_attention) is unaffected — per_layer_sliding_window remains None.
  • The existing is_interleaved() check correctly sets cache_config.sliding_window = None for mixed-type models, preventing full-attention layers from inheriting the global sliding window.
  • All standard attention backends (FlashAttention, FlashInfer, Triton, FlexAttention) support sliding window.
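
The interplay between mixed layer types and the global window can be illustrated with the sketch below. It is not vLLM's `is_interleaved()` implementation; both helper names are hypothetical, and the rule for uniform models (all sliding keeps the window, all full gets `None`) is an assumption made for illustration.

```python
from typing import List, Optional


def has_mixed_layer_types(layer_types: Optional[List[str]]) -> bool:
    """True when both sliding and full attention layers are present."""
    if not layer_types:
        return False
    kinds = set(layer_types)
    return "sliding_attention" in kinds and "full_attention" in kinds


def global_sliding_window(
    layer_types: Optional[List[str]], sliding_window: Optional[int]
) -> Optional[int]:
    """Global window to apply to every layer, or None if there is none."""
    if has_mixed_layer_types(layer_types):
        # Mixed model: leave the global window unset so full-attention
        # layers do not inherit it; windows are applied per layer instead.
        return None
    if layer_types and all(t == "sliding_attention" for t in layer_types):
        return sliding_window
    return None


mixed = ["sliding_attention"] * 3 + ["full_attention"]
uniform_full = ["full_attention"] * 4
```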

Test plan

  • Verify standard Qwen3 models (e.g., Qwen/Qwen3-8B) still work with no behavior change
  • Verify Qwen3 with interleaved layer_types config correctly applies per-layer sliding window
  • Run existing sliding window correctness tests (tests/v1/e2e/general/test_correctness_sliding_window.py)
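
For the interleaved check, a test config needs a `layer_types` list. A minimal sketch for generating one is shown below; the helper name `interleaved_layer_types` is hypothetical, and placing the full-attention layer last in each group of four is an assumption, since the issue gives only the 3:1 ratio.

```python
from typing import List


def interleaved_layer_types(num_layers: int, sliding_per_full: int = 3) -> List[str]:
    """Build a repeating pattern of N sliding layers followed by one full layer."""
    period = sliding_per_full + 1
    return [
        "full_attention" if (i + 1) % period == 0 else "sliding_attention"
        for i in range(num_layers)
    ]


pattern = interleaved_layer_types(8)
```

The resulting list could be written into the `layer_types` field of a test `config.json` alongside a `sliding_window` value.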

Changed files

  • vllm/model_executor/models/qwen3.py (modified, +12/-0)
  • vllm/model_executor/models/qwen3_moe.py (modified, +12/-1)

Motivation

Qwen3's HuggingFace config (Qwen3Config) already supports per-layer sliding window attention via the layer_types attribute (with "sliding_attention" and "full_attention" entries) and sliding_window size. This is used when use_sliding_window=True — either through explicit layer_types in config.json or auto-generated from max_window_layers.

However, vLLM's Qwen3 model implementation does not read layer_types or pass per_layer_sliding_window to the Attention layer, so all layers always use full attention regardless of the config.
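
As a rough illustration of the auto-generation from max_window_layers mentioned above, the sketch below assumes the rule is that layers at index max_window_layers and beyond use sliding attention whenever a window is set. The function name `default_layer_types` is hypothetical, and the exact rule should be checked against the Transformers source.

```python
from typing import List, Optional


def default_layer_types(
    num_hidden_layers: int,
    sliding_window: Optional[int],
    max_window_layers: int,
) -> List[str]:
    """Derive layer_types when config.json does not list them explicitly.

    Assumed rule: with a window set, layers with index >= max_window_layers
    are sliding; everything else is full attention.
    """
    return [
        "sliding_attention"
        if sliding_window is not None and i >= max_window_layers
        else "full_attention"
        for i in range(num_hidden_layers)
    ]


derived = default_layer_types(num_hidden_layers=4, sliding_window=4096, max_window_layers=2)
```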

Proposed Change

Wire up config.layer_types in Qwen3Attention and Qwen3MoeAttention to pass per_layer_sliding_window to the Attention constructor, following the same pattern as Gemma2.

Additional Context

  • The is_interleaved() detection in vllm/transformers_utils/config.py already correctly identifies mixed layer_types, so cache_config.sliding_window is properly set to None for interleaved models (preventing full-attention layers from inheriting the global sliding window).
  • All standard attention backends (FlashAttention, FlashInfer, Triton, FlexAttention) support sliding window.
  • No changes are needed outside the model files — the existing infrastructure handles everything.

extent analysis

TL;DR

Update the Qwen3Attention and Qwen3MoeAttention classes to utilize the layer_types attribute from the Qwen3Config to enable per-layer sliding window attention.

Guidance

  • Review the Gemma2 model implementation to understand how it handles per-layer sliding window attention and replicate this pattern in Qwen3.
  • Modify the Qwen3Attention and Qwen3MoeAttention classes to accept and pass per_layer_sliding_window to the Attention constructor based on the layer_types config.
  • Verify that the is_interleaved() detection in vllm/transformers_utils/config.py correctly identifies mixed layer_types and sets cache_config.sliding_window to None for interleaved models.
  • Test the updated Qwen3 model implementation with different layer_types configurations to ensure correct behavior.

Example

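class Qwen3Attention(...):
    def __init__(self, config, layer_idx, ...):
        ...
        # layer_types is a per-layer list, so index it with this layer's position.
        layer_types = getattr(config, "layer_types", None)
        is_sliding = layer_types is not None and layer_types[layer_idx] == "sliding_attention"
        # Sliding layers get the window size; full-attention layers get None.
        self.per_layer_sliding_window = config.sliding_window if is_sliding else None
        ...
        Attention(..., per_layer_sliding_window=self.per_layer_sliding_window)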

Notes

This fix assumes that the layer_types attribute in Qwen3Config is correctly set and that the is_interleaved() detection works as expected. Additional testing may be necessary to ensure the updated implementation works correctly for all possible configurations.

Recommendation

Apply the fix by updating the Qwen3 and Qwen3MoE model files to read layer_types and pass per_layer_sliding_window to the Attention layer, as PR #39515 does. This keeps Qwen3 consistent with other models in vLLM and enables interleaved sliding/full attention.
