vllm - 💡(How to fix) Fix [RFC]: Add `max_tokens_per_doc` support for rerank and scoring endpoints [2 comments, 2 participants]

GitHub stats
vllm-project/vllm#38651 · Fetched 2026-04-08 01:58:46
Comments: 2 · Participants: 2 · Timeline events: 10 · Reactions: 0
Timeline (top): mentioned ×3, subscribed ×3, commented ×2, assigned ×1

Motivation

vLLM's reranking and scoring endpoints currently have no mechanism to truncate documents before processing. When users send long documents that exceed the model's context window, requests fail or produce degraded results.

Both the Cohere Rerank API and Jina Reranker API support a max_tokens_per_doc (or equivalent) parameter that truncates each document to a specified token limit before scoring. This is a standard feature in production reranking APIs that vLLM should support.

PR #33315 by @hustxiayang introduced an initial implementation of this feature. This RFC formalizes the design — particularly around offline support, PoolingParams integration, and score template compatibility — to align on the approach and move toward merging.

Proposed Change

Add a max_tokens_per_doc parameter to the rerank/score request schema that truncates each document's token representation before model inference. The implementation should support both online (HTTP API) and offline (LLM.score() / LLM.embed()) usage paths.

API Surface

Online (HTTP): Add max_tokens_per_doc: Optional[int] to the rerank and score request schemas.

{
  "model": "BAAI/bge-reranker-v2-m3",
  "query": "What is deep learning?",
  "documents": ["A very long document...", "Another long document..."],
  "max_tokens_per_doc": 512
}
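
As a rough illustration of the schema change, the sketch below models the request with a standard-library dataclass. The class name `RerankRequestSketch`, the helper `parse_rerank_request`, and the rejection of values below 1 are assumptions made for this example; vLLM's actual request models live in its protocol module and are not reproduced here.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class RerankRequestSketch:
    """Hypothetical stand-in for the rerank request schema."""

    model: str
    query: str
    documents: list[str]
    # Proposed field: None means "do not truncate" (backward compatible).
    max_tokens_per_doc: Optional[int] = None

    def __post_init__(self) -> None:
        # Assumption: non-positive limits are rejected; the RFC does not specify this.
        if self.max_tokens_per_doc is not None and self.max_tokens_per_doc < 1:
            raise ValueError("max_tokens_per_doc must be a positive integer")


def parse_rerank_request(body: str) -> RerankRequestSketch:
    """Parse a JSON request body into the sketch schema."""
    return RerankRequestSketch(**json.loads(body))
```

A request body that omits `max_tokens_per_doc` parses with the field set to None, which matches the backward-compatible default the RFC asks for.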

Offline: Expose via PoolingParams so it is available in both online and offline scenarios:

from vllm import LLM, PoolingParams

llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score")
# max_tokens_per_doc is the field proposed by this RFC; PoolingParams does not accept it yet.
results = llm.score(
    "What is deep learning?",
    ["A very long document...", "Another long document..."],
    pooling_params=PoolingParams(max_tokens_per_doc=512),
)

Truncation Behavior

  • Tokenize each document using tokenizer.encode() (per DarkLight1337's suggestion in #33315)
  • Truncate the token list to max_tokens_per_doc tokens
  • Apply truncation before prompt/template assembly to ensure it respects the per-document limit
  • If max_tokens_per_doc is None or not set, no truncation occurs (backward compatible)
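
The steps above can be sketched as a small helper. This is a minimal illustration, not the implementation from PR #33315: `truncate_doc_tokens` and the toy `WhitespaceTokenizer` are hypothetical names, the tokenizer is a stand-in for any object exposing `encode()`, and raising on limits below 1 is an assumption the RFC does not state.

```python
from typing import Optional


class WhitespaceTokenizer:
    """Toy stand-in tokenizer (hypothetical): one id per whitespace-separated word."""

    def __init__(self) -> None:
        self.vocab: dict[str, int] = {}

    def encode(self, text: str) -> list[int]:
        # Assign ids in order of first appearance.
        return [self.vocab.setdefault(w, len(self.vocab)) for w in text.split()]


def truncate_doc_tokens(
    tokenizer, document: str, max_tokens_per_doc: Optional[int] = None
) -> list[int]:
    """Tokenize one document and keep at most max_tokens_per_doc token ids."""
    token_ids = tokenizer.encode(document)
    if max_tokens_per_doc is None:
        # No limit given: leave the document untouched (backward compatible).
        return token_ids
    if max_tokens_per_doc < 1:
        # Assumption: reject non-positive limits instead of slicing with them.
        raise ValueError("max_tokens_per_doc must be a positive integer")
    return token_ids[:max_tokens_per_doc]
```

With a real tokenizer, whether `encode()` adds special tokens affects how many content tokens survive the cut, so that behavior would need to be pinned down in the actual implementation.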

Files to Modify

File | Change
--- | ---
vllm/entrypoints/pooling/score/serving.py | Add truncation logic before scoring
vllm/pooling_params.py | Add max_tokens_per_doc field to PoolingParams
vllm/entrypoints/openai/protocol.py | Add field to rerank/score request models
tests/models/language/pooling/ | Add tests for truncation with cross-encoders and bi-encoders

Compatibility

This must work correctly with score templates introduced in #30550 and #31335. Specifically:

  • Truncation must happen per-document, before template expansion — the template tokens (query prefix, separators, etc.) should not count toward the max_tokens_per_doc limit
  • Tests should cover both template-based and non-template scoring models
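
To make the ordering requirement concrete, the sketch below truncates the document ids first and only then wraps them in template tokens, so the template never eats into the per-document budget. The function `build_pair_input` and the prefix/separator layout are hypothetical; real score templates from #30550 and #31335 may assemble the prompt differently.

```python
from typing import Optional


def build_pair_input(
    query_ids: list[int],
    doc_ids: list[int],
    prefix_ids: list[int],
    sep_ids: list[int],
    max_tokens_per_doc: Optional[int] = None,
) -> list[int]:
    """Assemble prefix + query + sep + document + sep for a cross-encoder.

    The document is truncated before assembly, so prefix and separator
    tokens do not count toward max_tokens_per_doc.
    """
    if max_tokens_per_doc is not None:
        doc_ids = doc_ids[:max_tokens_per_doc]
    return [*prefix_ids, *query_ids, *sep_ids, *doc_ids, *sep_ids]
```

A test for template compatibility could assert that the number of document tokens in the assembled input equals the limit regardless of how many template tokens surround them.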

Feedback Period.

No response

CC List.

@noooop @DarkLight1337 @hustxiayang

Any Other Things.

  • Prior art: PR #33315 contains a working implementation that handles bi-encoder and cross-encoder architectures. The core truncation logic can be reused.
  • The Cohere /v2/rerank documentation lists a default of 4096 for max_tokens_per_doc. vLLM should not set a default; leaving the parameter as None preserves backward compatibility.
  • Future extension: this pattern could also apply to embedding endpoints (/v1/embeddings) for long-document truncation, but that is out of scope for this RFC.

extent analysis

TL;DR

Adding a max_tokens_per_doc parameter to the rerank and score request schemas, so that each document is truncated before processing, would address the failures and degraded results caused by documents that exceed the model's context window.

Guidance

  • Add a max_tokens_per_doc parameter to the rerank and score request schemas for both online (HTTP API) and offline (LLM.score() / LLM.embed()) usage paths.
  • Implement truncation logic using tokenizer.encode() to truncate each document's token representation before model inference.
  • Ensure truncation occurs before prompt/template assembly and respects the per-document limit.
  • Update relevant files, including vllm/entrypoints/pooling/score/serving.py, vllm/pooling_params.py, vllm/entrypoints/openai/protocol.py, and tests/models/language/pooling/, to support the new parameter and truncation logic.

Example

from vllm import LLM, PoolingParams

llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score")
# max_tokens_per_doc is the field proposed by this RFC; PoolingParams does not accept it yet.
results = llm.score(
    "What is deep learning?",
    ["A very long document...", "Another long document..."],
    pooling_params=PoolingParams(max_tokens_per_doc=512),
)

Notes

The implementation should preserve backward compatibility by not setting a default value for max_tokens_per_doc and leaving it as None if not provided.

Recommendation

Implement the max_tokens_per_doc parameter and its truncation logic so that long documents no longer cause requests to fail or produce degraded results. This is a proposed feature rather than a workaround available today.
