vllm - 💡(How to fix) Fix [Bug]: ReRank API online inference doesn't work well with given template [3 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
vllm-project/vllm#39784Fetched 2026-04-16 06:36:40
View on GitHub
Comments
3
Participants
2
Timeline
5
Reactions
0
Assignees
Timeline (top)
commented ×3assigned ×1labeled ×1

Code Example

CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen3-Reranker-4B \
  --host 127.0.0.1 \
  --port 8021 \
  --runner pooling \
  --hf_overrides '{"architectures":["Qwen3ForSequenceClassification"],"classifier_from_token":["no","yes"],"is_original_qwen3_reranker":true}' \
  --chat-template qwen3_reranker.jinja \
  --enable-log-requests

---

import requests

url = "http://127.0.0.1:8021/v1/rerank"
payload = {
    "model": "Qwen/Qwen3-Reranker-4B",
    "query": "星之卡比",
    "documents": [
        "key: kirby-hero-20260310-234352.png_1408x768\n\ndesc: A cute pink Kirby character from Nintendo game, round pink ball with big eyes, cheerful expression, simple cartoon style, on pastel pink background, kawaii style, clean illustration"
    ]
}
resp = requests.post(url, json=payload, timeout=120)
print(resp.status_code)
print(resp.text)
RAW_BUFFERClick to expand / collapse

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary> ROCM Version : Could not collect vLLM Version : 0.11.2 vLLM Build Flags: CUDA Archs: Not Set; ROCm: Disabled; XPU: Disabled GPU Topology: GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 CPU Affinity NUMA Affinity GPU NUMA ID GPU0 X NODE NODE NODE SYS SYS SYS SYS 0-175 0 N/A GPU1 NODE X PIX NODE SYS SYS SYS SYS 0-175 0 N/A GPU2 NODE PIX X NODE SYS SYS SYS SYS 0-175 0 N/A GPU3 NODE NODE NODE X SYS SYS SYS SYS 0-175 0 N/A GPU4 SYS SYS SYS SYS X NODE NODE NODE 176-351 1 N/A GPU5 SYS SYS SYS SYS NODE X PIX NODE 176-351 1 N/A GPU6 SYS SYS SYS SYS NODE PIX X NODE 176-351 1 N/A GPU7 SYS SYS SYS SYS NODE NODE NODE X 176-351 1 N/A

Legend:

X = Self SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI) NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU) PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge) PIX = Connection traversing at most a single PCIe bridge NV# = Connection traversing a bonded set of # NVLinks

</details>

🐛 Describe the bug

Describe the bug

I am serving Qwen/Qwen3-Reranker-4B with vllm serve and a custom --chat-template intended for Qwen3-Reranker scoring. The template is successfully loaded at startup, but during inference:

  1. The reranking scores are identical to the case where no template is specified.
  2. The logged prompt does not appear to be the rendered/template-expanded prompt.
  3. In debug / warning logs, I can see the template source being loaded, but the Jinja variables (for example {{ messages | ... }}) are still printed literally instead of being rendered into the final prompt.

This makes it look like the score template is loaded but not actually applied during /rerank or /score inference.

According to the docs/examples, Qwen3-Reranker should support score templates in vLLM, and the official example shows using vllm serve ... --runner pooling --chat-template ... for online scoring/reranking.

How I start my online vllm

CUDA_VISIBLE_DEVICES=0 vllm serve Qwen/Qwen3-Reranker-4B \
  --host 127.0.0.1 \
  --port 8021 \
  --runner pooling \
  --hf_overrides '{"architectures":["Qwen3ForSequenceClassification"],"classifier_from_token":["no","yes"],"is_original_qwen3_reranker":true}' \
  --chat-template qwen3_reranker.jinja \
  --enable-log-requests

My request is below:

import requests

url = "http://127.0.0.1:8021/v1/rerank"
payload = {
    "model": "Qwen/Qwen3-Reranker-4B",
    "query": "星之卡比",
    "documents": [
        "key: kirby-hero-20260310-234352.png_1408x768\n\ndesc: A cute pink Kirby character from Nintendo game, round pink ball with big eyes, cheerful expression, simple cartoon style, on pastel pink background, kawaii style, clean illustration"
    ]
}
resp = requests.post(url, json=payload, timeout=120)
print(resp.status_code)
print(resp.text)

What I saw in logs:

<img width="1540" height="58" alt="Image" src="https://github.com/user-attachments/assets/ec2dabce-9c29-4201-891e-a98c252cb178" />

No template infos at all, and the ** scores is exactly the same as no template start up.**

Expected behavior

I would expect one of the following:

The score template is actually rendered and applied during reranking/scoring, which should make the final prompt differ from the no-template case. The logged prompt should reflect the rendered prompt (or there should be a documented way to inspect the rendered prompt). If score templates are not currently applied to /rerank in this setup, the docs should clarify that.

Why I think this may be a bug

The docs/examples indicate that:

Qwen3-Reranker supports score templates in vLLM. There is an official online score example using vllm serve, --runner pooling, and --chat-template. The Qwen3-Reranker example template is documented.

However, in my setup the template seems to be loaded but has no observable effect on prompt rendering or scores.

Additional notes

This is not about Chat Completions API. I understand Qwen3-Reranker should be used with /score or /rerank, not /v1/chat/completions. The main issue is that --chat-template appears to be ignored or not observable in the rerank/score execution path. If this is intended behavior and the logged prompt is expected to be the pre-rendered/raw input, could the docs clarify how to verify that the score template is actually applied?

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

The issue is likely due to the score template not being properly applied during reranking/scoring, and the logged prompt not reflecting the rendered prompt.

Guidance

  • Verify that the --chat-template flag is correctly loaded by checking the startup logs for the template source being loaded.
  • Check the documentation for any specific requirements or limitations for using score templates with Qwen3-Reranker and the /rerank or /score endpoints.
  • Inspect the qwen3_reranker.jinja template file to ensure it is correctly formatted and contains the expected Jinja variables.
  • Consider adding additional logging or debugging statements to the code to verify that the score template is being applied during reranking/scoring.

Example

No code example is provided as the issue is more related to configuration and template rendering.

Notes

The issue may be related to a misunderstanding of how score templates are applied in the Qwen3-Reranker model or a limitation in the current implementation. Further investigation is needed to determine the root cause.

Recommendation

Apply a workaround by modifying the qwen3_reranker.jinja template to include additional logging or debugging statements to verify that the template is being rendered correctly. If the issue persists, consider reaching out to the vLLM community or documentation for further clarification on using score templates with Qwen3-Reranker.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

FAQ

Expected behavior

I would expect one of the following:

The score template is actually rendered and applied during reranking/scoring, which should make the final prompt differ from the no-template case. The logged prompt should reflect the rendered prompt (or there should be a documented way to inspect the rendered prompt). If score templates are not currently applied to /rerank in this setup, the docs should clarify that.

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING