vllm - ✅(Solved) Fix [Bug]: Phi qk_layernorm appears to be unsupported in vLLM [1 pull request, 1 participant]

GitHub stats
vllm-project/vllm#37852 · Fetched 2026-04-08 01:17:37
Comments: 0 · Participants: 1 · Timeline: 3 · Reactions: 0
Timeline (top): cross-referenced ×1, labeled ×1, referenced ×1

Fix Action

Fixed

PR fix notes

PR #37870: fix: add qk_layernorm support for Phi models

Description (problem / solution / changelog)

Summary

  • Fixes #37852 — Phi qk_layernorm was unsupported in vLLM, causing silent correctness issues for Phi checkpoints with config.qk_layernorm=True.
  • Adds conditional q_layernorm / k_layernorm (nn.LayerNorm(head_dim)) modules to PhiAttention, applied after QKV projection and split, before rotary embedding — matching the Transformers reference implementation exactly.
  • No-op for existing models where qk_layernorm is False (the default).

Transformers reference

In transformers/models/phi/modeling_phi.py, PhiAttention does:

# __init__
self.qk_layernorm = config.qk_layernorm
if self.qk_layernorm:
    self.q_layernorm = nn.LayerNorm(self.head_dim, ...)
    self.k_layernorm = nn.LayerNorm(self.head_dim, ...)

# forward
if self.qk_layernorm:
    query_states = self.q_layernorm(query_states)
    key_states = self.k_layernorm(key_states)

This PR mirrors the same behavior in vLLM, following the existing pattern used by PersimmonAttention in vllm/model_executor/models/persimmon.py.

Changes

In vllm/model_executor/models/phi.py (PhiAttention):

__init__: Read config.qk_layernorm (defaults to False). If True, create self.q_layernorm and self.k_layernorm as nn.LayerNorm(head_size).

forward: After qkv.chunk(3), if qk_layernorm is enabled:

  1. Reshape q/k from [seq_len, hidden_size] to [seq_len, num_heads, head_dim]
  2. Apply per-head LayerNorm
  3. Merge back to [seq_len, hidden_size]
  4. Then proceed to rotary embedding as before
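
The forward steps above can be sketched in plain PyTorch. This is a minimal illustration under stated assumptions, not the code from the PR: the helper name `apply_qk_layernorm` and the toy sizes are hypothetical, and the tensors use the flattened [seq_len, hidden_size] layout described in the steps.

```python
import torch
from torch import nn


def apply_qk_layernorm(
    q: torch.Tensor,
    k: torch.Tensor,
    q_layernorm: nn.LayerNorm,
    k_layernorm: nn.LayerNorm,
    num_heads: int,
    head_dim: int,
) -> tuple[torch.Tensor, torch.Tensor]:
    """Hypothetical helper: per-head LayerNorm on flattened q/k."""
    seq_len = q.shape[0]
    # 1. [seq_len, hidden_size] -> [seq_len, num_heads, head_dim]
    q = q.reshape(seq_len, num_heads, head_dim)
    k = k.reshape(seq_len, num_heads, head_dim)
    # 2. LayerNorm normalizes over the last dimension, i.e. within each head
    q = q_layernorm(q)
    k = k_layernorm(k)
    # 3. Merge back to [seq_len, hidden_size], the layout rotary embedding expects
    hidden_size = num_heads * head_dim
    return q.reshape(seq_len, hidden_size), k.reshape(seq_len, hidden_size)


# Toy sizes, chosen only for illustration
num_heads, head_dim, seq_len = 4, 8, 5
hidden_size = num_heads * head_dim
torch.manual_seed(0)
qkv = torch.randn(seq_len, 3 * hidden_size)
q, k, v = qkv.chunk(chunks=3, dim=-1)

q_ln, k_ln = nn.LayerNorm(head_dim), nn.LayerNorm(head_dim)
q_normed, k_normed = apply_qk_layernorm(q, k, q_ln, k_ln, num_heads, head_dim)
print(q_normed.shape)  # shape is unchanged: [seq_len, hidden_size]
```

The reshape matters because the norm modules are sized to head_dim: applying them directly to a [seq_len, hidden_size] tensor would fail on a shape mismatch whenever there is more than one head.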

Test plan

  • Verify with a Phi model checkpoint that has qk_layernorm=True (e.g., compare logits against HF Transformers output)
  • Verify existing Phi models (qk_layernorm=False) are unaffected (no new modules created, identical code path)
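
Both checks in the test plan can be prototyped on a toy module before running real checkpoints. The sketch below is an assumption-laden stand-in: `ToyQKNorm` is a hypothetical class, not vLLM or Transformers code, and the reference is computed head by head with `torch.nn.functional.layer_norm` rather than taken from an actual HF model.

```python
import torch
import torch.nn.functional as F
from torch import nn


class ToyQKNorm(nn.Module):
    """Hypothetical stand-in for the q/k normalization branch; not vLLM code."""

    def __init__(self, num_heads: int, head_dim: int, qk_layernorm: bool = False):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, head_dim
        self.qk_layernorm = qk_layernorm
        if qk_layernorm:
            # Modules are only created when the flag is set
            self.q_layernorm = nn.LayerNorm(head_dim)
            self.k_layernorm = nn.LayerNorm(head_dim)

    def forward(self, q: torch.Tensor, k: torch.Tensor):
        if not self.qk_layernorm:
            return q, k  # unchanged code path for existing models
        shape = (q.shape[0], self.num_heads, self.head_dim)
        q = self.q_layernorm(q.reshape(shape)).reshape(q.shape)
        k = self.k_layernorm(k.reshape(shape)).reshape(k.shape)
        return q, k


torch.manual_seed(0)
num_heads, head_dim, seq_len = 2, 4, 3
q = torch.randn(seq_len, num_heads * head_dim)
k = torch.randn(seq_len, num_heads * head_dim)

# Flag off: no new parameters, and the inputs pass through untouched
off = ToyQKNorm(num_heads, head_dim, qk_layernorm=False)
assert len(list(off.parameters())) == 0
q_off, k_off = off(q, k)
assert q_off is q and k_off is k

# Flag on: output matches an independent per-head reference
on = ToyQKNorm(num_heads, head_dim, qk_layernorm=True)
with torch.no_grad():
    q_on, _ = on(q, k)
    ref = torch.cat(
        [F.layer_norm(h, (head_dim,)) for h in q.split(head_dim, dim=-1)], dim=-1
    )
torch.testing.assert_close(q_on, ref)
```

For a real checkpoint the same `torch.testing.assert_close` comparison would be applied to logits from vLLM and HF Transformers, with tolerances loosened to account for kernel and dtype differences.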

🤖 Generated with Claude Code

Changed files

  • vllm/model_executor/models/phi.py (modified, +13/-0)


Your current environment

This appears to be a model-implementation / config-compliance issue.

🐛 Describe the bug

It looks like Phi's qk_layernorm behavior may be unsupported in vLLM.

In the Transformers Phi implementation, when config.qk_layernorm=True, the model creates per-head q_layernorm / k_layernorm modules and applies them before rotary embedding:

self.qk_layernorm = config.qk_layernorm
if self.qk_layernorm:
    self.q_layernorm = nn.LayerNorm(...)
    self.k_layernorm = nn.LayerNorm(...)


if self.qk_layernorm:
    query_states = self.q_layernorm(query_states)
    key_states = self.k_layernorm(key_states)

However, in vllm/model_executor/models/phi.py, the current Phi attention path appears to be:

  qkv, _ = self.qkv_proj(hidden_states)
  q, k, v = qkv.chunk(chunks=3, dim=-1)
  q, k = self.rotary_emb(position_ids, q, k)
  attn_output = self.attn(q, k, v)

There is no corresponding q_layernorm / k_layernorm branch.

As a result, Phi configs/checkpoints with qk_layernorm=True may silently produce different attention behavior from Transformers.


extent analysis

Fix Plan

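To fix the issue, add the missing q_layernorm and k_layernorm modules to PhiAttention in vllm/model_executor/models/phi.py and apply them per head before rotary embedding.

Here are the steps:

  • In PhiAttention.__init__, create q_layernorm and k_layernorm when config.qk_layernorm is True:
self.qk_layernorm = config.qk_layernorm
if self.qk_layernorm:
    self.q_layernorm = nn.LayerNorm(...)
    self.k_layernorm = nn.LayerNorm(...)
  • In forward, apply them per head before rotary embedding. The norms are sized to the head dimension, so q and k must be viewed as [seq_len, num_heads, head_dim] first and merged back afterwards:
qkv, _ = self.qkv_proj(hidden_states)
q, k, v = qkv.chunk(chunks=3, dim=-1)
if self.qk_layernorm:
    # Split heads, normalize each head, then merge back to [seq_len, hidden_size].
    # self.num_heads and self.head_size are assumed attribute names.
    q = self.q_layernorm(q.reshape(-1, self.num_heads, self.head_size)).reshape(q.shape)
    k = self.k_layernorm(k.reshape(-1, self.num_heads, self.head_size)).reshape(k.shape)
q, k = self.rotary_emb(position_ids, q, k)
attn_output = self.attn(q, k, v)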

Verification

To verify the fix, you can:

  • Run a Phi checkpoint with config.qk_layernorm=True and compare its logits against the Transformers implementation.
  • Run an existing Phi model with qk_layernorm=False and confirm that its outputs are unchanged, so the fix introduces no regression.

Extra Tips

  • Update the documentation to reflect the new qk_layernorm behavior of PhiAttention.
  • Consider adding tests that cover both qk_layernorm=True and qk_layernorm=False.
