Fix Action

PR fix notes

PR #41277: Fix error in Dynamic NTK scaling

Repository: vllm-project/vllm
Author: maxdebayser
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/41277

Description (problem / solution / changelog)

This is a fix for https://github.com/vllm-project/vllm/issues/41236

The scaling formula was wrong, resulting in a constant scaling factor. The Nomic embedding models are trained with a shorter sequence length, usually indicated by max_position_embeddings or max_trained_positions, but can be scaled up to a longer length, indicated by n_positions.

Here are the configurations for some of the models:

nomic-ai/nomic-embed-text-v1:
  max_position_embeddings: 8192,
  n_positions: 8192,

nomic-ai/nomic-embed-text-v1.5:
  max_position_embeddings: 2048,
  max_trained_positions: 2048,
  n_positions: 8192,

nomic-ai/CodeRankEmbed:
  max_trained_positions: 2048,
  n_positions: 8192,

nomic-ai/nomic-embed-text-v2-moe:
  max_trained_positions: 2048,
  n_positions: 2048,

Snowflake/snowflake-arctic-embed-m-long:
  max_trained_positions: 2048,
  n_positions: 8192,

The Nomic text embedding models are trained at a maximum sequence length of 2048 but are fine-tuned up to 8192 with dynamic NTK scaling. In this kind of scaling, the RoPE base theta is scaled by a factor s, which is the extended sequence length divided by the training sequence length. This is further modulated by a factor α, according to the formula (α * s) - (α - 1). When s is 1, the formula evaluates to 1, meaning that there is no scaling.
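As a minimal sketch of the formula above (not vLLM's implementation), the factor can be computed from the two lengths. The function name `ntk_base_factor` is hypothetical and α = 2.0 is an arbitrary illustrative value, not taken from any model config; the lengths are the ones listed in the configurations above.

```python
def ntk_base_factor(extended_len: int, trained_len: int, alpha: float) -> float:
    """Return (alpha * s) - (alpha - 1), where s = extended_len / trained_len."""
    s = extended_len / trained_len
    return alpha * s - (alpha - 1)


# Arbitrary illustrative value for alpha (an assumption, not a config value).
ALPHA = 2.0

# s = 1, e.g. n_positions == max_trained_positions == 2048: no scaling.
no_scaling = ntk_base_factor(2048, 2048, ALPHA)

# s = 4, e.g. n_positions = 8192 and max_trained_positions = 2048.
scaled = ntk_base_factor(8192, 2048, ALPHA)

print(no_scaling, scaled)  # 1.0 7.0
```

With s = 1 the factor is exactly 1 for any α, which matches the statement that no scaling is applied in that case.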

With the changes in this PR, nomic-ai/nomic-embed-text-v1 is loaded automatically at 8K, as is the case with the sentence-transformers library.

Here is an example script for vLLM:

from transformers import AutoTokenizer
from vllm import LLM

model = "nomic-ai/nomic-embed-text-v1"

with open("gulliver.md") as f:
    gulliver = f.read()

tokenizer = AutoTokenizer.from_pretrained(model)

text = gulliver[:35000]
tokens = tokenizer.encode(text)
print(len(tokens))

llm = LLM(model, trust_remote_code=True)  # optionally add enforce_eager=True
embedding = llm.embed(text)
print(f"{embedding[0].outputs.embedding=}")

And here is the equivalent script for sentence-transformers:

from transformers import AutoTokenizer
from sentence_transformers import SentenceTransformer

model = "nomic-ai/nomic-embed-text-v1"

with open("gulliver.md") as f:
    gulliver = f.read()

tokenizer = AutoTokenizer.from_pretrained(model)

text = gulliver[:35000]
tokens = tokenizer.encode(text)
print(len(tokens))

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
embeddings = model.encode([text])
print(embeddings)

The example text above is 8041 tokens long. Eyeballing the results, the two embeddings appear very close:

sentence-transformers:

[[ 3.00992536e-03  9.70433839e-03 -1.21263387e-02 -5.75934537e-02
   4.10853215e-02  1.50936209e-02  4.91085239e-02  2.09600739e-02
...
   5.93488850e-03 -1.33249192e-02  2.76992060e-02  3.74853015e-02
  -2.70690396e-02 -5.02558425e-03  3.29380073e-02 -2.22619344e-02]]

vLLM:

[0.003500884398818016, 0.010343857109546661, -0.010787059552967548, -0.05693305656313896, 0.04164784774184227, 0.015060468576848507, 0.04998728260397911, 0.021168265491724014,
....
 0.006456942297518253, -0.014560637064278126, 0.02623901702463627, 0.03875217214226723, -0.02746732160449028, -0.00527398893609643, 0.034382082521915436, -0.022968538105487823]
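A cosine similarity would give a more quantitative comparison than eyeballing. The sketch below uses only the standard library; `cosine_similarity` is a hypothetical helper, and the two short vectors are just the first four values of each output shown above, so the result says nothing definitive about the full embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# First four components only, copied from the outputs above (illustrative).
st_head = [3.00992536e-03, 9.70433839e-03, -1.21263387e-02, -5.75934537e-02]
vllm_head = [0.003500884398818016, 0.010343857109546661,
             -0.010787059552967548, -0.05693305656313896]

similarity = cosine_similarity(st_head, vllm_head)
print(f"cosine similarity of the first four components: {similarity:.4f}")
```

In practice the full vectors from `embedding[0].outputs.embedding` and `embeddings[0]` would be passed in instead of the truncated lists.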

cc: @noooop @taneem-ibrahim

Changed files

tests/models/language/pooling/test_nomic_max_model_len.py (modified, +18/-89)
vllm/model_executor/layers/rotary_embedding/__init__.py (modified, +4/-0)
vllm/model_executor/layers/rotary_embedding/dynamic_ntk_scaling_rope.py (modified, +8/-3)
vllm/model_executor/models/config.py (modified, +11/-70)

PR #41301: Fix Dynamic NTK RoPE scaling formula

Repository: vllm-project/vllm
Author: rt23-dev
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/41301

Description (problem / solution / changelog)

Purpose

Fix #41236. The Dynamic NTK RoPE scaling formula in dynamic_ntk_scaling_rope.py was computing a constant instead of scaling with sequence length.

max_len is defined as scaling_factor * max_position_embeddings, so substituting it back in as scaling_factor * max_len / max_position_embeddings simplifies to scaling_factor² — independent of sequence length. The correct formula per the original NTK-Aware scaling paper is (α * seq_len / max_position_embeddings) - (α - 1), where α = scaling_factor. Removing the extra scaling_factor * multiplier restores correct behavior.
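The difference can be sketched in isolation. The two functions below are hypothetical stand-ins that reproduce only the multiplicative factor described above, not the rest of the base computation in dynamic_ntk_scaling_rope.py; the values of `alpha` and `max_pos` are assumptions chosen for illustration.

```python
def buggy_factor(scaling_factor: float, max_position_embeddings: int) -> float:
    # max_len is derived from scaling_factor itself, so the actual
    # sequence length never enters the result.
    max_len = max_position_embeddings * scaling_factor
    return (scaling_factor * max_len / max_position_embeddings) - (scaling_factor - 1)


def corrected_factor(
    scaling_factor: float, seq_len: int, max_position_embeddings: int
) -> float:
    # Uses the sequence length directly, without the extra multiplier.
    return (scaling_factor * seq_len / max_position_embeddings) - (scaling_factor - 1)


alpha = 2.0      # illustrative value (assumption)
max_pos = 2048   # illustrative training length (assumption)

# The buggy factor equals alpha**2 - (alpha - 1) regardless of sequence length.
constant = buggy_factor(alpha, max_pos)

# The corrected factor grows with the sequence length.
growing = [corrected_factor(alpha, n, max_pos) for n in (2048, 4096, 8192)]

print(constant)  # 3.0
print(growing)   # [1.0, 3.0, 7.0]
```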

Affected models include nomic-ai/nomic-embed-text-v1 and other Nomic embedding models.

Test Plan

from vllm import LLM
llm = LLM(model="nomic-ai/nomic-embed-text-v1", trust_remote_code=True)
output = llm.encode(["Hello world"])
print(output)

Test Result

Model loads and encodes correctly without errors. Base value now scales properly with sequence length rather than returning a constant.

Changed files

vllm/model_executor/layers/rotary_embedding/dynamic_ntk_scaling_rope.py (modified, +1/-1)

max_len = self.max_position_embeddings * self.scaling_factor
base = self.base * (
    (self.scaling_factor * max_len / self.max_position_embeddings)
    - (self.scaling_factor - 1)

extent analysis

TL;DR

The issue can be fixed by correcting the formula for NTK Dynamic scaling in the dynamic_ntk_scaling_rope.py file to match the original formula.

Guidance

Verify the original formula for NTK Dynamic scaling is correctly implemented as (α * current sequence length / original model context length) - (α - 1).
Compare this with the current implementation in dynamic_ntk_scaling_rope.py to identify the discrepancy.
Update the code to use the correct formula, ensuring that α (represented by self.scaling_factor) is applied correctly in relation to the sequence length.
Test the affected models, such as nomic-ai/nomic-embed-text-v1, to confirm the fix resolves the issue.

Example

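# Corrected formula implementation.
# current_sequence_length is a placeholder for the length being processed;
# any remaining terms of the expression in the source file are omitted here.
max_len = self.max_position_embeddings
base = self.base * (
    (self.scaling_factor * current_sequence_length / max_len)
    - (self.scaling_factor - 1)
)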

Notes

The correction assumes that current_sequence_length represents the current sequence length being processed, which should be used in place of max_len in the original incorrect formula.

Recommendation

Apply the fix by correcting the formula in dynamic_ntk_scaling_rope.py as described, so that Dynamic NTK scaling depends on the sequence length for affected models.

vllm - ✅(Solved) Fix [Bug]: Dynamic NTK RoPE scaling is wrong [2 pull requests, 7 comments, 5 participants]
