vllm - 💡(How to fix) Fix [Bug]: `/v1/rerank` ignores `chat_template_kwargs` - Qwen3-Reranker per-task `Instruct` cannot be set per request [1 pull requests]

vllm2026-05-12 07:56:49

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

Fix Action

Fixed

Fixed by PR: [Bugfix] Fix /v1/rerank ignoring chat_template_kwargs (e.g. Qwen3-Reranker instructions) (https://github.com/vllm-project/vllm/pull/42412)

Code Example

==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : 11.4.0
CMake version                : 3.22.1
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.10.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8

==============================
      Python Environment
==============================
Python version               : 3.10.12 (64-bit runtime)
Python platform              : Linux-6.8.0-110-generic-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : False   (this host is the API client; the affected server runs on a separate 8× H200 node)
Nvidia driver version        : N/A on this host

==============================
Versions of relevant libraries
==============================
torch==2.10.0
transformers==4.57.6
triton==3.6.0
nvidia-nccl-cu12==2.27.5
flashinfer-python==0.6.6

==============================
         vLLM Info
==============================
vLLM Version                 : 0.18.0
ROCM Version                 : Could not collect
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; XPU: Disabled

---

<Instruct>: {{ messages | selectattr("role", "eq", "system") | map(attribute="content") | first
              | default("Given a web search query, retrieve relevant passages that answer the query") }}

---

safe_apply_chat_template(
       model_config, tokenizer,
       [{"role": "query",    "content": prompt_1},
        {"role": "document", "content": prompt_2}],
       chat_template=score_template, tools=None, tokenize=False,
   )

RAW_BUFFERClick to expand / collapse

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : 11.4.0
CMake version                : 3.22.1
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.10.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8

==============================
      Python Environment
==============================
Python version               : 3.10.12 (64-bit runtime)
Python platform              : Linux-6.8.0-110-generic-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : False   (this host is the API client; the affected server runs on a separate 8× H200 node)
Nvidia driver version        : N/A on this host

==============================
Versions of relevant libraries
==============================
torch==2.10.0
transformers==4.57.6
triton==3.6.0
nvidia-nccl-cu12==2.27.5
flashinfer-python==0.6.6

==============================
         vLLM Info
==============================
vLLM Version                 : 0.18.0
ROCM Version                 : Could not collect
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; XPU: Disabled

</details>

🐛 Describe the bug

The shipped Qwen3-Reranker chat template (examples/pooling/score/template/qwen3_reranker.jinja) contains a placeholder for the per-task Instruct:

<Instruct>: {{ messages | selectattr("role", "eq", "system") | map(attribute="content") | first
              | default("Given a web search query, retrieve relevant passages that answer the query") }}

But there is no way to populate that placeholder per request through /v1/rerank:

RerankRequest inherits chat_template_kwargs (via ClassifyRequestMixin), so clients can already send it - but the rerank handler does not forward it to apply_chat_template. See vllm/entrypoints/pooling/score/utils.py (get_score_prompt): it calls
```
safe_apply_chat_template(
    model_config, tokenizer,
    [{"role": "query",    "content": prompt_1},
     {"role": "document", "content": prompt_2}],
    chat_template=score_template, tools=None, tokenize=False,
)
```
request.chat_template_kwargs is never threaded through.
The rerank handler constructs the messages list internally as [query, document] only - no system role produced.
The Qwen3-Reranker model card (https://huggingface.co/Qwen/Qwen3-Reranker-4B) explicitly recommends per-task Instructs as a primary feature, and the offline example (examples/pooling/score/qwen3_reranker_offline.py) formats them manually. The online example (qwen3_reranker_online.py) has no instruction field at all.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #memory management #API rate limit #retriever error #indexing error

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

vllm - 💡(How to fix) Fix [Bug]: `/v1/rerank` ignores `chat_template_kwargs` - Qwen3-Reranker per-task `Instruct` cannot be set per request [1 pull requests]

Recommended Tools

GitHub issue graph ai analysis

Fix Action

Fixed

Code Example

Your current environment

🐛 Describe the bug

Still need to ship something?

TRENDING

vllm - 💡(How to fix) Fix [Bug]: `/v1/rerank` ignores `chat_template_kwargs` - Qwen3-Reranker per-task `Instruct` cannot be set per request [1 pull requests]

Recommended Tools

GitHub issue graph ai analysis

Fix Action

Fixed

Code Example

Your current environment

🐛 Describe the bug

Still need to ship something?

RELATED_DISCOVERY

TRENDING