vllm - ✅(Solved) Fix [Bug]: GPTBigCode scale_attn_weights config flag is ignored in vLLM [1 pull request, 1 comment, 2 participants]

GitHub stats
vllm-project/vllm#36565 (fetched 2026-04-08 00:36:17)
Comments: 1
Participants: 2
Timeline: 6
Reactions: 0
Timeline (top): commented ×1, cross-referenced ×1, labeled ×1, mentioned ×1

Fix Action

Fixed

PR fix notes

PR #36637: [Bugfix] Respect scale_attn_weights config flag in GPTBigCode

Description (problem / solution / changelog)

Purpose

Fix #36565.

The GPTBigCode implementation in vLLM hardcodes the attention scale as 1/sqrt(head_dim), ignoring the scale_attn_weights configuration flag. In HuggingFace Transformers, when scale_attn_weights=False, the scale is set to 1.0 (no scaling). This causes models relying on scale_attn_weights=False to produce different logits compared to Transformers.

This PR:

  • Makes GPTBigCodeAttention respect config.scale_attn_weights to match HuggingFace behavior.
  • Adds assertions for scale_attn_by_inverse_layer_idx and reorder_and_upcast_attn (unsupported in vLLM), consistent with what the GPT-2 model already does.
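
The two changes listed above can be sketched together. This is a minimal illustration and not the actual diff: `resolve_attention_scale` is a hypothetical helper, the config objects are plain `SimpleNamespace` stand-ins, and the `getattr` defaults of `False` are assumptions made for the sketch.

```python
from types import SimpleNamespace


def resolve_attention_scale(config, head_dim):
    """Hypothetical helper: choose the attention scale from config flags."""
    # Reject options that the PR notes describe as unsupported in vLLM.
    assert not getattr(config, "scale_attn_by_inverse_layer_idx", False)
    assert not getattr(config, "reorder_and_upcast_attn", False)
    # Same rule as the Transformers snippet quoted in the issue.
    return head_dim**-0.5 if config.scale_attn_weights else 1.0


scaled = SimpleNamespace(scale_attn_weights=True)
unscaled = SimpleNamespace(scale_attn_weights=False)
print(resolve_attention_scale(scaled, 64))    # 0.125
print(resolve_attention_scale(unscaled, 64))  # 1.0
```

Failing fast on the unsupported flags avoids silently producing outputs that differ from Transformers.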

Test Plan

Test Result

Changed files

  • vllm/model_executor/models/gpt_bigcode.py (modified, +5/-1)

Code Example

self.scale_attn_weights = config.scale_attn_weights
self.scaling = self.head_dim**-0.5 if config.scale_attn_weights else 1.0

Your current environment

None

🐛 Describe the bug

Hi, following up on the discussion in #35402.

I noticed that the current implementation of GPTBigCode in vLLM appears to ignore the scale_attn_weights configuration option.

In vllm/model_executor/models/gpt_bigcode.py, the attention scaling is always applied as 1 / sqrt(head_dim):

https://github.com/vllm-project/vllm/blob/e5ff140216272c529261b02b6fd13fc480713735/vllm/model_executor/models/gpt_bigcode.py#L75

However, in the HuggingFace Transformers implementation, this scaling depends on the scale_attn_weights flag in the config:

        self.scale_attn_weights = config.scale_attn_weights
        self.scaling = self.head_dim**-0.5 if config.scale_attn_weights else 1.0

As a result, models that rely on scale_attn_weights=False may produce different logits compared with Transformers.
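
To see why the scale matters, the following toy example (standard library only, with made-up vectors) computes attention weights for one query against two keys, once with the 1/sqrt(head_dim) scale and once with a scale of 1.0. The helper names are hypothetical and the numbers are illustrative, not taken from any real model.

```python
import math


def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys, scale):
    # Scaled dot product of the query with each key, then softmax.
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    return softmax(scores)


head_dim = 4
query = [1.0, 2.0, 0.0, -1.0]
keys = [[1.0, 0.0, 1.0, 0.0], [0.0, 2.0, 0.0, 1.0]]

with_scaling = attention_weights(query, keys, head_dim**-0.5)
without_scaling = attention_weights(query, keys, 1.0)
print([round(w, 4) for w in with_scaling])
print([round(w, 4) for w in without_scaling])
```

The unscaled variant yields a sharper distribution over the keys, and that difference propagates through every layer into the final logits.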

It might be worth checking whether the GPTBigCode implementation in vLLM should respect the scale_attn_weights config to maintain behavior parity with Transformers.


extent analysis

Fix Plan

To fix the issue, GPTBigCodeAttention in vllm/model_executor/models/gpt_bigcode.py must derive its attention scale from the scale_attn_weights configuration option instead of hardcoding 1/sqrt(head_dim).

Steps:

  • In __init__, read scale_attn_weights from the model config
  • Set the attention scale to head_dim**-0.5 when the flag is true and to 1.0 otherwise

Example Code:

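import torch

class GPTBigCodeAttention:
    # Simplified stand-in for the real vLLM module, for illustration only.
    def __init__(self, config):
        self.head_dim = config.hidden_size // config.num_attention_heads
        # Apply 1/sqrt(head_dim) only when the config enables it.
        self.scaling = self.head_dim**-0.5 if config.scale_attn_weights else 1.0

    def attention_scores(self, query, key):
        # Pre-softmax scores; masking and softmax are omitted here.
        return torch.matmul(query, key.transpose(-1, -2)) * self.scaling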

Verification

To verify the fix, compare the logits produced by the updated GPTBigCode implementation with those produced by the HuggingFace Transformers implementation for models that rely on scale_attn_weights=False.
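
One way to structure that comparison is a small tolerance check over two logit vectors. The sketch below is only an assumption about how such a check might look: `max_abs_diff` and `logits_match` are hypothetical helpers, the tolerance is arbitrary, and the numbers are placeholders rather than measured outputs. In practice the two lists would come from running the same prompt through Transformers and vLLM.

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two logit lists."""
    if len(reference) != len(candidate):
        raise ValueError("logit vectors must have the same length")
    return max(abs(r - c) for r, c in zip(reference, candidate))


def logits_match(reference, candidate, atol=1e-3):
    """True when every logit agrees within the absolute tolerance."""
    return max_abs_diff(reference, candidate) <= atol


# Placeholder values for illustration only, not real model outputs.
reference_logits = [0.10, -1.25, 3.40]
close_logits = [0.1002, -1.2499, 3.4001]
divergent_logits = [0.10, -1.25, 2.90]

print(logits_match(reference_logits, close_logits))
print(logits_match(reference_logits, divergent_logits))
```

A suitable tolerance depends on dtype and kernel choices, so the default here should be treated as a starting point.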

Extra Tips

  • Make sure to update the documentation to reflect the changed behavior.
  • Consider adding tests to ensure that the scale_attn_weights configuration option is correctly applied.
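
A test for the flag could look roughly like the sketch below. It exercises a hypothetical stand-in function, `attention_scale`, rather than the real vLLM class, so it only demonstrates the shape of the check; an actual test would construct the model with both config values.

```python
def attention_scale(scale_attn_weights, head_dim):
    """Hypothetical stand-in for the scale computed in the attention layer."""
    return head_dim**-0.5 if scale_attn_weights else 1.0


def test_scale_follows_flag():
    for head_dim in (64, 128):
        # Flag enabled: standard 1/sqrt(head_dim) scaling.
        assert abs(attention_scale(True, head_dim) - 1.0 / head_dim**0.5) < 1e-12
        # Flag disabled: scores are left unscaled.
        assert attention_scale(False, head_dim) == 1.0


test_scale_follows_flag()
```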
