[Bug]: vLLM v1 with prefix caching: first request differs from subsequent identical requests at temperature=0

vllm · 2026-04-26 02:33:34

vllm-project/vllm#40896•Fetched 2026-04-27 05:29:30

Author

Yunzez

With --enable-prefix-caching (the v1 default for supported models), the same /v1/completions request sent multiple times sequentially to the same vLLM server returns 2 distinct outputs at temperature=0:

Run 1 on a freshly started server returns output A.
Runs 2..N return output B ≠ A, but stable across runs.

After a server restart, the first request returns A again. Disabling prefix caching with --no-enable-prefix-caching makes the output deterministic across all runs.

This is a correctness bug, since the documented behavior at temperature=0 is deterministic decoding.

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

Waiting on our institution's cluster to get back to me: someone submitted a large number of jobs that are blocking this simple script. I will update as soon as the output is available. The bug should be reproducible without the environment details, since it appears right at startup :)

</details>

🐛 Describe the bug

This bug can be circumvented by sending a few warm-up requests after startup, but I am still reporting it here for the record 😶‍🌫️
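The report does not say what the warm-up rounds consist of. A minimal sketch, assuming that warm-up simply means sending the same request a few times and discarding the results; `send`, `warm_up`, and `complete_after_warm_up` are hypothetical names, and `send` stands in for whatever function performs one completion request:

```python
from typing import Callable, List


def warm_up(send: Callable[[str], str], prompt: str, rounds: int = 3) -> List[str]:
    """Send `prompt` `rounds` times and return the throwaway outputs."""
    if rounds < 0:
        raise ValueError("rounds must be non-negative")
    return [send(prompt) for _ in range(rounds)]


def complete_after_warm_up(
    send: Callable[[str], str], prompt: str, rounds: int = 3
) -> str:
    """Discard `rounds` warm-up responses, then return the next response."""
    warm_up(send, prompt, rounds)
    return send(prompt)
```

This only hides the first-request discrepancy from callers; it does not address why the first response differs.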

Brief Env:

vLLM version: 0.19.0
Model: Qwen/Qwen3-8B
GPU: NVIDIA H100 80GB (SXM)
OS: Linux (RHEL 8)
CUDA: 12.x
PyTorch: (matches vllm 0.19 requirements)

Server (default v1 config; prefix caching is auto-enabled for Qwen3-8B)

python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-8B --port 8002 \
    --max-model-len 4096 --gpu-memory-utilization 0.4

Note: in vLLM v1, enable_prefix_caching defaults to model_config.is_prefix_caching_supported (True for Qwen3-8B). Omitting --enable-prefix-caching from the command line does NOT disable prefix caching; only passing --no-enable-prefix-caching does.
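The behavior described in the note matches a tri-state flag: unset, explicitly on, or explicitly off. The following is an illustrative sketch of that pattern using the standard library's `argparse.BooleanOptionalAction`; it is not vLLM's actual argument-parsing code, and `build_parser`, `resolve_prefix_caching`, and `model_supports_it` are hypothetical names:

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # BooleanOptionalAction also registers --no-enable-prefix-caching.
    # default=None lets us tell "flag omitted" apart from an explicit choice.
    parser.add_argument(
        "--enable-prefix-caching",
        action=argparse.BooleanOptionalAction,
        default=None,
    )
    return parser


def resolve_prefix_caching(argv: List[str], model_supports_it: bool) -> bool:
    """Fall back to the model's capability when the flag is omitted."""
    flag: Optional[bool] = build_parser().parse_args(argv).enable_prefix_caching
    return model_supports_it if flag is None else flag
```

Under this pattern, an empty command line resolves to the model's capability, which is why only the explicit negative flag turns the feature off.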

Reproducer script

repro_for_issue.py sends the same prompt 5 times sequentially with temperature=0, logprobs=5, stream=False, and reports the number of unique outputs:

python repro_for_issue.py --base-url http://localhost:8002 --prompt-len 32
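The contents of repro_for_issue.py are not shown in the report. A minimal sketch of a comparable check, assuming the server exposes the OpenAI-compatible /v1/completions endpoint used above; the function names, the `max_tokens` value, and the placeholder prompt are assumptions, not the reporter's script:

```python
import json
import urllib.request
from collections import Counter
from typing import Callable, List


def complete(base_url: str, model: str, prompt: str, max_tokens: int = 64) -> str:
    """POST one non-streaming request to /v1/completions and return its text."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,  # assumed value, not taken from the report
        "temperature": 0,
        "logprobs": 5,
        "stream": False,
    }
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]


def run_repeated(send: Callable[[], str], runs: int = 5) -> List[str]:
    """Call `send` sequentially `runs` times and collect the outputs."""
    return [send() for _ in range(runs)]


def summarize(outputs: List[str]) -> str:
    """Report how many distinct outputs were seen."""
    return f"unique outputs: {len(Counter(outputs))} / {len(outputs)}"


def main(base_url: str = "http://localhost:8002", model: str = "Qwen/Qwen3-8B") -> None:
    prompt = "placeholder prompt"  # hypothetical; the report's prompt is not given
    outputs = run_repeated(lambda: complete(base_url, model, prompt))
    for index, text in enumerate(outputs, start=1):
        print(f"run {index}/{len(outputs)}: {text[:70]!r}")
    print(summarize(outputs))
```

Calling `main()` against a freshly started server is the interesting case, since the report says the discrepancy only appears on the first request after startup.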

Observed Output (default config, prefix caching ON)

runs:       5, prompt_seed: 'repro-issue-001'
temperature: 0, logprobs: 5

  run 1/5:  \n\nOkay, let me try to figure out what's going on here. So, there's th
  run 2/5:  \nThe next completion should be deterministic because temperature is z
  run 3/5:  \nThe next completion should be deterministic because temperature is z
  run 4/5:  \nThe next completion should be deterministic because temperature is z
  run 5/5:  \nThe next completion should be deterministic because temperature is z

unique outputs: 2 / 5
First divergence at character offset 2 (output token 0):
  variant 1 token = ' \n'
  variant 2 token = ' \n\n'

Re-running the script on the same server (no restart) returns unique = 1 (variant 1 only). Restarting the server brings the two-output pattern back.
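The "first divergence" line in the output above reports a character offset. A small sketch of how such an offset could be computed from two response strings; `first_divergence` is a hypothetical helper, and it works on characters only, whereas the report additionally maps the offset to a token index using the returned logprobs:

```python
from typing import Optional


def first_divergence(a: str, b: str) -> Optional[int]:
    """Return the first index where `a` and `b` differ, or None if identical."""
    for index, (char_a, char_b) in enumerate(zip(a, b)):
        if char_a != char_b:
            return index
    if len(a) != len(b):
        # One string is a strict prefix of the other.
        return min(len(a), len(b))
    return None


# Strings shaped like the beginnings of the two variants in the report.
print(first_divergence(" \n\nOkay", " \nThe"))  # 2
```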

extent analysis

TL;DR

Disabling prefix caching with --no-enable-prefix-caching may resolve the non-deterministic output issue at temperature=0.

Guidance

The issue seems to be related to prefix caching, as disabling it makes the output deterministic across all runs.
To verify, run the server with --no-enable-prefix-caching and check if the output remains the same for multiple runs.
The provided reproducer script can be used to test the issue and verify the fix.
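The issue is only observable at temperature=0, where decoding is expected to be deterministic. A non-zero temperature samples randomly by design, so it would mask the discrepancy rather than work around it. The reporter notes that a few warm-up requests after startup circumvent the problem.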

Example

No code snippet is provided as the issue is more related to configuration and command-line arguments.

Notes

The issue was reported with Qwen3-8B on vLLM 0.19.0. The report does not establish whether other models or versions are affected, so the workaround may not carry over.

Recommendation

Apply the workaround: disable prefix caching with --no-enable-prefix-caching to get deterministic output at temperature=0. The report confirms that this makes the output identical across all runs; it is a workaround, not a fix for the underlying discrepancy.
