vllm - ✅(Solved) Fix [Bug]: Qwen1 use_logn_attn may be unsupported in vLLM [1 pull request, 2 comments, 2 participants]

GitHub stats
vllm-project/vllm#36880 · Fetched 2026-04-08 00:43:50
Comments: 2 · Participants: 2 · Timeline events: 8 · Reactions: 0
Timeline (top): referenced ×4, commented ×2, cross-referenced ×1, labeled ×1

Fix Action

Fixed

PR fix notes

PR #37089: [Model] Support Qwen1 use_logn_attn and use_dynamic_ntk

Description (problem / solution / changelog)

Purpose

Fix #36880.

Qwen1 models (e.g. Qwen-7B) define two context-length extrapolation features — use_logn_attn and use_dynamic_ntk — that vLLM previously silently ignored. This PR adds support for both:

  • use_logn_attn: After RoPE, scales query vectors by log_{seq_length}(position) for positions beyond the training sequence length (seq_length), compensating for attention score dilution at long contexts. The scaling tensor is precomputed as a non-persistent registered buffer, compatible with torch.compile and tensor parallelism.
  • use_dynamic_ntk: Translates the Qwen1 config flag into DynamicNTKAlphaRotaryEmbedding parameters using the original NTK alpha formula (2^ceil(log2(max_pos/seq_len) + 1) - 1), reusing the existing vLLM RoPE class without any new infrastructure.

Both features are no-ops for inputs within the training length and only activate for longer contexts.
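
The NTK alpha formula quoted above can be checked with a few lines of Python. This is a minimal sketch, not the code from the PR: the function name `ntk_alpha` and the example lengths are hypothetical, and the clamp to a minimum of 1 is an assumption added so that inputs within the training length leave RoPE unscaled.

```python
import math


def ntk_alpha(max_pos: int, seq_len: int) -> int:
    """Evaluate 2^ceil(log2(max_pos / seq_len) + 1) - 1.

    max_pos: target context length (hypothetical parameter name).
    seq_len: training sequence length (seq_length in the Qwen1 config).
    """
    context_value = math.log2(max_pos / seq_len) + 1
    alpha = 2 ** math.ceil(context_value) - 1
    # Assumption: clamp so that short contexts map to alpha = 1 (no scaling).
    return max(alpha, 1)


if __name__ == "__main__":
    # Illustrative lengths only; they are not read from any model config.
    for max_pos in (4096, 8192, 16384, 32768):
        print(max_pos, ntk_alpha(max_pos, seq_len=8192))
```

With these illustrative inputs, alpha stays at 1 up to the training length and then grows in steps (3, 7, ...) each time the context doubles.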

Test Plan

  • Add new unit tests
  • E2E test: loaded Qwen/Qwen-7B (which has use_logn_attn: true and use_dynamic_ntk: true) and ran generation successfully

Test Result

E2E test (Qwen/Qwen-7B generation)

$ python3 -c "
from vllm import LLM, SamplingParams
llm = LLM(model='Qwen/Qwen-7B', trust_remote_code=True, gpu_memory_utilization=0.6, enforce_eager=True)
output = llm.generate(['Hello, how are you?'], SamplingParams(max_tokens=50))
for o in output:
    print('Generated:', o.outputs[0].text)
"

Generated:  Have you made up your mind about coming to our reunion?

Model loads and generates coherent text successfully with both use_logn_attn and use_dynamic_ntk enabled.

Changed files

  • tests/model_executor/test_qwen1_logn_attn.py (added, +313/-0)
  • vllm/model_executor/models/qwen.py (modified, +39/-1)

Code Example

"use_logn_attn": true

---

key_size = key[0].size(2) if self.use_cache_quantization else key.size(1)
if key_size > self.seq_length and self.use_logn_attn and not self.training:
    logn_tensor = self.logn_tensor[:, seq_start:seq_end, :, :].type_as(query)
    query = query * logn_tensor.expand_as(query)

Your current environment

None

🐛 Describe the bug

It looks like Qwen1's use_logn_attn behavior may be unsupported in vLLM.

Official Qwen1 configs (for example Qwen-7B) enable:

"use_logn_attn": true

and the original Qwen implementation applies an extra long-context query scaling when the attention length exceeds the configured seq_length:

  key_size = key[0].size(2) if self.use_cache_quantization else key.size(1)
  if key_size > self.seq_length and self.use_logn_attn and not self.training:
      logn_tensor = self.logn_tensor[:, seq_start:seq_end, :, :].type_as(query)
      query = query * logn_tensor.expand_as(query)
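
To make the scaling concrete, here is a small pure-Python sketch of the factor that the snippet above multiplies into the query. The helper names `logn_scale` and `scale_query`, and the 1-based position convention, are assumptions for illustration; the original code reads the factor from a precomputed logn_tensor rather than computing it per token.

```python
import math


def logn_scale(position: int, seq_length: int) -> float:
    """Return log_{seq_length}(position) beyond the training length, else 1.0.

    position is assumed to be 1-based here.
    """
    if position > seq_length:
        return math.log(position, seq_length)
    return 1.0


def scale_query(query, positions, seq_length):
    """Scale each per-token query vector by the factor for its position.

    query: list of per-token vectors (lists of floats).
    positions: 1-based position of each token, same length as query.
    """
    return [
        [x * logn_scale(pos, seq_length) for x in vec]
        for vec, pos in zip(query, positions)
    ]


if __name__ == "__main__":
    seq_length = 2048  # illustrative value, not read from a config
    print(logn_scale(100, seq_length))   # within training length: unchanged
    print(logn_scale(4096, seq_length))  # beyond training length: > 1
```

Positions at or below seq_length get a factor of exactly 1.0, which matches the description of the feature as a no-op for inputs within the training length.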

From computation-graph inspection:

  • the Transformers graph for official Qwen-7B with a long input (key_size > seq_length) clearly contains:
    • ...attn_buffers_logn_tensor_
    • query = query * logn_tensor.expand_as(query)
  • the vLLM graph for the same model contains:
    • positions
    • rotary cache
    • torch.ops.vllm.unified_attention_with_output(...)
  • but I could not find any corresponding logn_attn / logn_tensor path

I may be missing something on the vLLM side. It could be that vLLM does not hit the same key_size > seq_length path as the original Qwen implementation, or that the long-context scaling is already handled implicitly inside the Qwen1 attention backend. If so, a clarification would be very helpful. Otherwise, this seems to suggest that use_logn_attn is currently unsupported for Qwen1.


extent analysis

Fix Plan

To address use_logn_attn being unsupported in vLLM for Qwen1 models, the Qwen1 model implementation needs to apply the long-context query scaling itself, before the query reaches the vLLM attention backend.

Step-by-Step Solution:

  1. Check vLLM documentation: Verify whether vLLM already has a mechanism for long-context query scaling that can be reused.
  2. Modify Qwen1 implementation: If no such mechanism exists, apply the scaling to the query in the Qwen1 model code, ahead of the attention call.
  3. Update use_logn_attn logic: Read use_logn_attn and seq_length from the model config and adapt the key_size > seq_length condition to vLLM, which works from token positions and may need the scaling indexed by position instead.

Example Code Snippet:

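# Sketch only: applies the log-n query scaling and returns the scaled query.
# The call into vLLM's attention op is omitted because it takes more
# arguments than this example has available.
import math

import torch


class Qwen1Attention(torch.nn.Module):
    def __init__(self, seq_length, use_logn_attn, max_positions=32768):
        super().__init__()
        self.seq_length = seq_length
        self.use_logn_attn = use_logn_attn
        # log_{seq_length}(i) beyond the training length, 1.0 otherwise.
        # max_positions is an assumed table size for this sketch.
        logn = [
            math.log(i, seq_length) if i > seq_length else 1.0
            for i in range(1, max_positions)
        ]
        # Non-persistent buffer: follows the module's device, not saved in state_dict.
        self.register_buffer(
            "logn_tensor", torch.tensor(logn)[None, :, None, None], persistent=False
        )

    def forward(self, query, key, seq_start, seq_end):
        # Assumed layout: [batch, seq, heads, head_dim]
        key_size = key.size(1)
        if key_size > self.seq_length and self.use_logn_attn and not self.training:
            logn_tensor = self.logn_tensor[:, seq_start:seq_end, :, :].type_as(query)
            query = query * logn_tensor.expand_as(query)
        # Pass the scaled query on to the attention backend from here.
        return query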

Verification

To verify the fix, inspect the computation graph of the modified Qwen1 model in vLLM to ensure it includes the logn_attn path when use_logn_attn is enabled and the attention length exceeds the configured sequence length.

Extra Tips

  • Consult vLLM documentation and community resources for the most current information on integrating custom attention mechanisms.
  • Test the modified Qwen1 model thoroughly to ensure the long-context query scaling is correctly applied and does not introduce any regressions.
