vllm - 💡(How to fix) Fix [RFC]: Add `max_tokens_per_doc` support for rerank and scoring endpoints [2 comments, 2 participants]

jefp · 2026-03-31T17:39:01Z



Motivation

vLLM's reranking and scoring endpoints currently have no mechanism to truncate documents before processing. When users send long documents that exceed the model's context window, requests fail or produce degraded results.

Both the Cohere Rerank API (https://docs.cohere.com/reference/rerank) and the Jina Reranker API (https://jina.ai/en-US/reranker/) support a max_tokens_per_doc (or equivalent) parameter that truncates each document to a specified token limit before scoring. Production reranking APIs commonly offer this feature, and vLLM should support it too.

PR #33315 by @hustxiayang introduced an initial implementation of this feature. This RFC formalizes the design — particularly around offline support, PoolingParams integration, and score template compatibility — to align on the approach and move toward merging.

Proposed Change

Add a max_tokens_per_doc parameter to the rerank/score request schema that truncates each document's token representation before model inference. The implementation should support both online (HTTP API) and offline (LLM.score() / LLM.embed()) usage paths.

API Surface

Online (HTTP): Add max_tokens_per_doc: Optional[int] to the rerank and score request schemas.

{
  "model": "BAAI/bge-reranker-v2-m3",
  "query": "What is deep learning?",
  "documents": ["A very long document...", "Another long document..."],
  "max_tokens_per_doc": 512
}
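As a rough client-side illustration, the sketch below builds the request body shown above and leaves the field out entirely when no limit is requested, which mirrors the proposed `None` default. The helper name `build_rerank_payload` is hypothetical, and `max_tokens_per_doc` is the field proposed in this RFC, not one the server is confirmed to accept yet.

```python
import json
from typing import Any, Optional


def build_rerank_payload(
    model: str,
    query: str,
    documents: list[str],
    max_tokens_per_doc: Optional[int] = None,
) -> dict[str, Any]:
    """Build a rerank request body (hypothetical helper, not a vLLM API)."""
    # Assumption: non-positive limits are rejected; the RFC does not say.
    if max_tokens_per_doc is not None and max_tokens_per_doc < 1:
        raise ValueError("max_tokens_per_doc must be a positive integer")

    payload: dict[str, Any] = {
        "model": model,
        "query": query,
        "documents": documents,
    }
    # Leave the key out when no limit is requested, so the request is
    # identical to one sent before the parameter existed.
    if max_tokens_per_doc is not None:
        payload["max_tokens_per_doc"] = max_tokens_per_doc
    return payload


body = build_rerank_payload(
    "BAAI/bge-reranker-v2-m3",
    "What is deep learning?",
    ["A very long document...", "Another long document..."],
    max_tokens_per_doc=512,
)
print(json.dumps(body, indent=2))
```

Omitting the key, rather than sending `null`, keeps existing clients and servers interoperable while the parameter is still under discussion.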

Offline: Expose via PoolingParams so it is available in both online and offline scenarios:

from vllm import LLM, PoolingParams

llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score")
results = llm.score(
    "What is deep learning?",
    ["A very long document...", "Another long document..."],
    # max_tokens_per_doc is the field proposed in this RFC; it is not yet part of PoolingParams.
    pooling_params=PoolingParams(max_tokens_per_doc=512),
)

Truncation Behavior

- Tokenize each document using tokenizer.encode() (per DarkLight1337's suggestion in #33315)
- Truncate the token list to at most max_tokens_per_doc tokens
- Apply truncation before prompt/template assembly, so that the limit counts only the document's own tokens
- If max_tokens_per_doc is None or not set, no truncation occurs (backward compatible)
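The steps above can be sketched as a small helper. This is a minimal sketch under stated assumptions, not the code from PR #33315: `truncate_doc_tokens` is a hypothetical name, the token IDs are assumed to come from `tokenizer.encode()` on the raw document text, and rejecting non-positive limits is a choice made here. Whether special tokens added by the tokenizer should count toward the limit is not specified in the RFC, so the sketch simply keeps the first `max_tokens_per_doc` IDs.

```python
from typing import Optional, Sequence


def truncate_doc_tokens(
    token_ids: Sequence[int],
    max_tokens_per_doc: Optional[int],
) -> list[int]:
    """Keep at most max_tokens_per_doc leading token IDs (hypothetical helper)."""
    # None means "no limit": return the document unchanged.
    if max_tokens_per_doc is None:
        return list(token_ids)
    # Assumption: zero or negative limits are treated as invalid input.
    if max_tokens_per_doc < 1:
        raise ValueError("max_tokens_per_doc must be a positive integer")
    # Slicing is safe even when the document is already shorter than the limit.
    return list(token_ids[:max_tokens_per_doc])


# Toy token IDs standing in for the output of tokenizer.encode(document).
doc_ids = [101, 7, 8, 9, 10, 11, 102]
print(truncate_doc_tokens(doc_ids, 3))     # [101, 7, 8]
print(truncate_doc_tokens(doc_ids, None))  # [101, 7, 8, 9, 10, 11, 102]
```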

Files to Modify

| File | Change |
|------|--------|
| `vllm/entrypoints/pooling/score/serving.py` | Add truncation logic before scoring |
| `vllm/pooling_params.py` | Add `max_tokens_per_doc` field to `PoolingParams` |
| `vllm/entrypoints/openai/protocol.py` | Add field to rerank/score request models |
| `tests/models/language/pooling/` | Add tests for truncation with cross-encoders and bi-encoders |

Compatibility

This must work correctly with score templates introduced in #30550 and #31335. Specifically:

- Truncation must happen per-document, before template expansion: the template tokens (query prefix, separators, etc.) should not count toward the max_tokens_per_doc limit
- Tests should cover both template-based and non-template scoring models
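To make the ordering requirement concrete, here is a toy sketch in which the document is truncated first and the template is wrapped around it afterwards. Everything in it is illustrative: `assemble_pair`, the prefix/separator/suffix arguments, and the token IDs are hypothetical, and the real score templates from #30550 and #31335 may be structured quite differently.

```python
from typing import Optional, Sequence


def assemble_pair(
    query_ids: Sequence[int],
    doc_ids: Sequence[int],
    max_tokens_per_doc: Optional[int],
    *,
    prefix_ids: Sequence[int] = (),
    separator_ids: Sequence[int] = (),
    suffix_ids: Sequence[int] = (),
) -> list[int]:
    """Truncate the document, then expand a toy template around it."""
    # Step 1: apply the limit to the document tokens only.
    if max_tokens_per_doc is not None:
        doc_ids = doc_ids[:max_tokens_per_doc]
    # Step 2: add template tokens afterwards, so they never count
    # toward max_tokens_per_doc.
    return [*prefix_ids, *query_ids, *separator_ids, *doc_ids, *suffix_ids]


# Toy IDs: 1 = prefix, 2 = separator, 3 = suffix.
pair = assemble_pair(
    query_ids=[11, 12],
    doc_ids=[21, 22, 23, 24, 25],
    max_tokens_per_doc=3,
    prefix_ids=[1],
    separator_ids=[2],
    suffix_ids=[3],
)
print(pair)  # [1, 11, 12, 2, 21, 22, 23, 3]
```

Because the limit covers only the document, the assembled input can still exceed the model's context window when the query or the template is long; a test for that case may be worth adding alongside the template and non-template cases.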

Feedback Period.

No response

CC List.

@noooop @DarkLight1337 @hustxiayang

Any Other Things.

- Prior art: PR #33315 contains a working implementation that handles bi-encoder and cross-encoder architectures. The core truncation logic can be reused.
- The Cohere /v2/rerank documentation lists a default of 4096 for max_tokens_per_doc. vLLM should not set a default; leaving it as None preserves backward compatibility.
- Future extension: this pattern could also apply to embedding endpoints (/v1/embeddings) for long-document truncation, but that is out of scope for this RFC.


extent analysis

TL;DR

Adding a max_tokens_per_doc parameter to the rerank and score request schemas would truncate each document before processing, so that long documents exceeding the model's context window no longer cause requests to fail or produce degraded results.

Guidance

- Add a max_tokens_per_doc parameter to the rerank and score request schemas for both online (HTTP API) and offline (LLM.score() / LLM.embed()) usage paths.
- Tokenize each document with tokenizer.encode() and truncate the resulting token list before model inference.
- Ensure truncation occurs before prompt/template assembly, so that template tokens do not count toward the per-document limit.
- Update the relevant files, including vllm/entrypoints/pooling/score/serving.py, vllm/pooling_params.py, vllm/entrypoints/openai/protocol.py, and tests/models/language/pooling/, to support the new parameter and truncation logic.

Example

from vllm import LLM, PoolingParams

llm = LLM(model="BAAI/bge-reranker-v2-m3", task="score")
results = llm.score(
    "What is deep learning?",
    ["A very long document...", "Another long document..."],
    # max_tokens_per_doc is the field proposed in this RFC; it is not yet part of PoolingParams.
    pooling_params=PoolingParams(max_tokens_per_doc=512),
)

Notes

The implementation should preserve backward compatibility by not setting a default value for max_tokens_per_doc and leaving it as None if not provided.

Recommendation

Implement the max_tokens_per_doc parameter and its truncation logic as proposed, so that long documents are truncated to the requested limit instead of causing requests to fail or produce degraded results.
