vllm - ✅(Solved) Fix [Bug]: Phi qk_layernorm appears to be unsupported in vLLM [1 pull requests, 1 participants]

Qi-Zhan · 2026-03-23T05:05:07Z



GitHub stats

vllm-project/vllm#37852•Fetched 2026-04-08 01:17:37


Author

Qi-Zhan

Participants

Qi-Zhan

Timeline (top)

cross-referenced ×1, labeled ×1, referenced ×1

Fix Action

Fixed

Fixed by PR: fix: add qk_layernorm support for Phi models (https://github.com/vllm-project/vllm/pull/37870)

PR fix notes

PR #37870: fix: add qk_layernorm support for Phi models

Repository: vllm-project/vllm
Author: gambletan
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/37870

Description (problem / solution / changelog)

Summary

Fixes #37852 — Phi qk_layernorm was unsupported in vLLM, causing silent correctness issues for Phi checkpoints with config.qk_layernorm=True.
Adds conditional q_layernorm / k_layernorm (nn.LayerNorm(head_dim)) modules to PhiAttention, applied after QKV projection and split, before rotary embedding — matching the Transformers reference implementation exactly.
No-op for existing models where qk_layernorm is False (the default).

Transformers reference

In transformers/models/phi/modeling_phi.py, PhiAttention does:

# __init__
self.qk_layernorm = config.qk_layernorm
if self.qk_layernorm:
    self.q_layernorm = nn.LayerNorm(self.head_dim, ...)
    self.k_layernorm = nn.LayerNorm(self.head_dim, ...)

# forward
if self.qk_layernorm:
    query_states = self.q_layernorm(query_states)
    key_states = self.k_layernorm(key_states)

This PR mirrors the same behavior in vLLM, following the existing pattern used by PersimmonAttention in vllm/model_executor/models/persimmon.py.

Changes

In vllm/model_executor/models/phi.py (PhiAttention):

__init__: Read config.qk_layernorm (defaults to False). If True, create self.q_layernorm and self.k_layernorm as nn.LayerNorm(head_size).

forward: After qkv.chunk(3), if qk_layernorm is enabled:

1. Reshape q/k from [seq_len, hidden_size] to [seq_len, num_heads, head_dim]
2. Apply per-head LayerNorm
3. Merge back to [seq_len, hidden_size]
4. Then proceed to rotary embedding as before
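
The reshape, normalize, merge sequence can be illustrated without any framework. The following is a minimal, dependency-free sketch, not the PR's code: the helper names `layer_norm` and `per_head_layer_norm` are hypothetical, and the learned LayerNorm weight and bias are assumed to be 1 and 0. It shows that normalizing each head_dim slice separately gives a different result from normalizing the whole hidden_size row at once, which is why the reshape matters.

```python
import math
from typing import List


def layer_norm(x: List[float], eps: float = 1e-5) -> List[float]:
    """LayerNorm over one vector, assuming weight=1 and bias=0 (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # biased variance
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std for v in x]


def per_head_layer_norm(
    rows: List[List[float]], num_heads: int, head_dim: int
) -> List[List[float]]:
    """Normalize every head_dim slice of every [hidden_size] row independently."""
    out = []
    for row in rows:
        assert len(row) == num_heads * head_dim
        # Step 1: split [hidden_size] into [num_heads, head_dim].
        heads = [row[h * head_dim:(h + 1) * head_dim] for h in range(num_heads)]
        # Step 2: apply LayerNorm to each head separately.
        normed = [layer_norm(head) for head in heads]
        # Step 3: merge back to a flat [hidden_size] row.
        out.append([v for head in normed for v in head])
    return out


# One token, two heads of size two; the second head has a much larger scale.
q = [[1.0, 3.0, 10.0, 30.0]]
per_head = per_head_layer_norm(q, num_heads=2, head_dim=2)
full_width = [layer_norm(q[0])]  # what skipping the reshape would compute
print(per_head)
print(full_width)
```

With per-head normalization both heads end up on the same scale, while full-width normalization lets the larger head dominate. In PyTorch the same three steps would typically be a view to [seq_len, num_heads, head_dim], a call to nn.LayerNorm(head_dim), and a view back.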

Test plan

Verify with a Phi model checkpoint that has qk_layernorm=True (e.g., compare logits against HF Transformers output)
Verify existing Phi models (qk_layernorm=False) are unaffected (no new modules created, identical code path)
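
One way to make the logits comparison in the test plan concrete is a tolerance check. This is only a sketch under stated assumptions: the helper names `max_abs_diff` and `logits_match` are hypothetical, the tolerance of 1e-3 is an arbitrary choice rather than a value from the PR, and the logit values are placeholders, not measurements from any model.

```python
from typing import Sequence


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two logit vectors."""
    if len(a) != len(b):
        raise ValueError("logit vectors must have the same length")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def logits_match(
    candidate: Sequence[float], reference: Sequence[float], atol: float = 1e-3
) -> bool:
    """True if candidate logits agree with the reference within atol (assumed tolerance)."""
    return max_abs_diff(candidate, reference) <= atol


# Placeholder values for illustration only.
reference_logits = [2.10, -0.53, 0.07]     # e.g. from the reference implementation
candidate_logits = [2.1004, -0.5302, 0.0699]  # small numerical noise
drifted_logits = [1.80, -0.10, 0.40]       # what a missing layernorm might look like

print(logits_match(candidate_logits, reference_logits))
print(logits_match(drifted_logits, reference_logits))
```

For the second test-plan item (qk_layernorm=False), the same helper could be run with a tolerance of zero against outputs recorded before the change.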

🤖 Generated with Claude Code

Changed files

vllm/model_executor/models/phi.py (modified, +13/-0)

Code Example

self.qk_layernorm = config.qk_layernorm
if self.qk_layernorm:
    self.q_layernorm = nn.LayerNorm(...)
    self.k_layernorm = nn.LayerNorm(...)

if self.qk_layernorm:
    query_states = self.q_layernorm(query_states)
    key_states = self.k_layernorm(key_states)

---

qkv, _ = self.qkv_proj(hidden_states)
q, k, v = qkv.chunk(chunks=3, dim=-1)
q, k = self.rotary_emb(position_ids, q, k)
attn_output = self.attn(q, k, v)


Your current environment

This appears to be a model-implementation / config-compliance issue.

🐛 Describe the bug

It looks like Phi's qk_layernorm behavior may be unsupported in vLLM.

In the Transformers Phi implementation, when config.qk_layernorm=True, the model creates per-head q_layernorm / k_layernorm modules and applies them before rotary embedding:

self.qk_layernorm = config.qk_layernorm
if self.qk_layernorm:
    self.q_layernorm = nn.LayerNorm(...)
    self.k_layernorm = nn.LayerNorm(...)


if self.qk_layernorm:
    query_states = self.q_layernorm(query_states)
    key_states = self.k_layernorm(key_states)

However, in vllm/model_executor/models/phi.py, the current Phi attention path appears to be:

  qkv, _ = self.qkv_proj(hidden_states)
  q, k, v = qkv.chunk(chunks=3, dim=-1)
  q, k = self.rotary_emb(position_ids, q, k)
  attn_output = self.attn(q, k, v)

There is no corresponding q_layernorm / k_layernorm branch.

As a result, Phi configs/checkpoints with qk_layernorm=True may silently produce different attention behavior from Transformers.
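
The risk described here is that the mismatch is silent. As an illustration only, a loader could refuse a config it cannot honor instead of running with different attention behavior. The helper names `needs_qk_layernorm` and `check_qk_layernorm_supported` below are hypothetical and are not part of vLLM or Transformers; the config is a stand-in object, and the default of False for a missing attribute is an assumption.

```python
from types import SimpleNamespace


def needs_qk_layernorm(config) -> bool:
    """True if the config asks for Q/K layernorm; a missing attribute is treated as False."""
    return bool(getattr(config, "qk_layernorm", False))


def check_qk_layernorm_supported(config, supported: bool) -> None:
    """Fail loudly when the config requests qk_layernorm but the model code lacks it."""
    if needs_qk_layernorm(config) and not supported:
        raise NotImplementedError(
            "config.qk_layernorm=True is not supported by this attention implementation"
        )


# Stand-in configs for illustration.
plain_config = SimpleNamespace(qk_layernorm=False)
normed_config = SimpleNamespace(qk_layernorm=True)

check_qk_layernorm_supported(plain_config, supported=False)  # passes silently
try:
    check_qk_layernorm_supported(normed_config, supported=False)
except NotImplementedError as exc:
    print(f"rejected: {exc}")
```

A guard like this trades a hard failure for the guarantee that outputs never quietly diverge from the reference; implementing the branch, as the linked PR proposes, removes the need for it.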

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

To fix the issue, we need to add the missing q_layernorm and k_layernorm modules to the Phi attention path in vllm/model_executor/models/phi.py.

Here are the steps:

Add q_layernorm and k_layernorm modules to the PhiAttention class, in __init__:

self.qk_layernorm = config.qk_layernorm
if self.qk_layernorm:
    self.q_layernorm = nn.LayerNorm(...)
    self.k_layernorm = nn.LayerNorm(...)

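Apply q_layernorm and k_layernorm per head, before rotary embedding. The modules normalize over head_dim, so q and k must be viewed as [seq_len, num_heads, head_dim] first and merged back afterwards; calling them directly on the [seq_len, hidden_size] tensors would not match the Transformers behavior:

qkv, _ = self.qkv_proj(hidden_states)
q, k, v = qkv.chunk(chunks=3, dim=-1)
if self.qk_layernorm:
    # Per-head LayerNorm: view as [seq_len, num_heads, head_dim], normalize,
    # then merge back. The num_heads / head_size attribute names are assumed.
    q = self.q_layernorm(q.reshape(-1, self.num_heads, self.head_size)).reshape(q.shape)
    k = self.k_layernorm(k.reshape(-1, self.num_heads, self.head_size)).reshape(k.shape)
q, k = self.rotary_emb(position_ids, q, k)
attn_output = self.attn(q, k, v)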

Verification

To verify the fix, you can:

Run a Phi checkpoint with config.qk_layernorm=True and check that the logits match the Transformers implementation within numerical tolerance.
Run an existing Phi checkpoint (qk_layernorm=False) before and after the change and confirm the outputs are identical, so the fix introduces no regressions.

Extra Tips

Make sure to update the documentation to reflect the changes to PhiAttention and its behavior.
Consider adding tests to ensure that the qk_layernorm branch is correctly applied in different scenarios.
