vllm - 💡(How to fix) Fix [Bug]: `/v1/rerank` ignores `chat_template_kwargs` - Qwen3-Reranker per-task `Instruct` cannot be set per request [1 pull requests]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

Fix Action

Fixed

Code Example

==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : 11.4.0
CMake version                : 3.22.1
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.10.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8

==============================
      Python Environment
==============================
Python version               : 3.10.12 (64-bit runtime)
Python platform              : Linux-6.8.0-110-generic-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : False   (this host is the API client; the affected server runs on a separate 8× H200 node)
Nvidia driver version        : N/A on this host

==============================
Versions of relevant libraries
==============================
torch==2.10.0
transformers==4.57.6
triton==3.6.0
nvidia-nccl-cu12==2.27.5
flashinfer-python==0.6.6

==============================
         vLLM Info
==============================
vLLM Version                 : 0.18.0
ROCM Version                 : Could not collect
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; XPU: Disabled

---

<Instruct>: {{ messages | selectattr("role", "eq", "system") | map(attribute="content") | first
              | default("Given a web search query, retrieve relevant passages that answer the query") }}

---

safe_apply_chat_template(
       model_config, tokenizer,
       [{"role": "query",    "content": prompt_1},
        {"role": "document", "content": prompt_2}],
       chat_template=score_template, tools=None, tokenize=False,
   )
RAW_BUFFERClick to expand / collapse

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : 11.4.0
CMake version                : 3.22.1
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.10.0+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8

==============================
      Python Environment
==============================
Python version               : 3.10.12 (64-bit runtime)
Python platform              : Linux-6.8.0-110-generic-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : False   (this host is the API client; the affected server runs on a separate 8× H200 node)
Nvidia driver version        : N/A on this host

==============================
Versions of relevant libraries
==============================
torch==2.10.0
transformers==4.57.6
triton==3.6.0
nvidia-nccl-cu12==2.27.5
flashinfer-python==0.6.6

==============================
         vLLM Info
==============================
vLLM Version                 : 0.18.0
ROCM Version                 : Could not collect
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled; XPU: Disabled
</details>

🐛 Describe the bug

The shipped Qwen3-Reranker chat template (examples/pooling/score/template/qwen3_reranker.jinja) contains a placeholder for the per-task Instruct:

<Instruct>: {{ messages | selectattr("role", "eq", "system") | map(attribute="content") | first
              | default("Given a web search query, retrieve relevant passages that answer the query") }}

But there is no way to populate that placeholder per request through /v1/rerank:

  1. RerankRequest inherits chat_template_kwargs (via ClassifyRequestMixin), so clients can already send it - but the rerank handler does not forward it to apply_chat_template. See vllm/entrypoints/pooling/score/utils.py (get_score_prompt): it calls
    safe_apply_chat_template(
        model_config, tokenizer,
        [{"role": "query",    "content": prompt_1},
         {"role": "document", "content": prompt_2}],
        chat_template=score_template, tools=None, tokenize=False,
    )
    request.chat_template_kwargs is never threaded through.
  2. The rerank handler constructs the messages list internally as [query, document] only - no system role produced.
  3. The Qwen3-Reranker model card (https://huggingface.co/Qwen/Qwen3-Reranker-4B) explicitly recommends per-task Instructs as a primary feature, and the offline example (examples/pooling/score/qwen3_reranker_offline.py) formats them manually. The online example (qwen3_reranker_online.py) has no instruction field at all.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

vllm - 💡(How to fix) Fix [Bug]: `/v1/rerank` ignores `chat_template_kwargs` - Qwen3-Reranker per-task `Instruct` cannot be set per request [1 pull requests]