[Bug]: vLLM v1 with prefix caching: first request differs from subsequent identical requests at temperature=0

vllm-project/vllm#40896 (fetched 2026-04-27 05:29:30)



Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Waiting on our institution's cluster to get back to me; someone submitted a ton of jobs that are blocking this simple script. I will update as soon as it returns. This bug should be reproducible without the environment details, since it occurs on startup :)
</details>

🐛 Describe the bug

This bug can be circumvented with a few warm-up rounds, but I am still reporting it here for the record 😶‍🌫️

Brief Env:

  • vLLM version: 0.19.0
  • Model: Qwen/Qwen3-8B
  • GPU: NVIDIA H100 80GB (SXM)
  • OS: Linux (RHEL 8)
  • CUDA: 12.x
  • PyTorch: (matches vllm 0.19 requirements)

Summary

With --enable-prefix-caching (the v1 default for supported models), the same /v1/completions request sent multiple times sequentially to the same vLLM server returns 2 distinct outputs at temperature=0:

  • Run 1 on a freshly started server returns output A.
  • Runs 2..N return output B ≠ A, but stable across runs.

Restarting the server returns the first request to A. Disabling prefix caching with --no-enable-prefix-caching makes the output deterministic across all runs.

This is a correctness bug, since the documented behavior at temperature=0 is deterministic decoding.

Server (default v1 config; prefix caching is auto-enabled for Qwen3-8B)

python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen3-8B --port 8002 \
    --max-model-len 4096 --gpu-memory-utilization 0.4

Note: in vLLM v1, enable_prefix_caching defaults to model_config.is_prefix_caching_supported (True for Qwen3-8B). Omitting --enable-prefix-caching from the command line does NOT disable it; only --no-enable-prefix-caching does.
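The tri-state behavior described in the note can be illustrated with a small standalone sketch. This is not vLLM's actual argument-parsing code; it only uses Python's standard `argparse.BooleanOptionalAction` to show how an unset flag can fall back to a model-dependent default. The helper name `resolve_prefix_caching` and the `model_supports_it` parameter are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # BooleanOptionalAction registers both --enable-prefix-caching and
    # --no-enable-prefix-caching. default=None means "not specified".
    parser.add_argument(
        "--enable-prefix-caching",
        action=argparse.BooleanOptionalAction,
        default=None,
    )
    return parser


def resolve_prefix_caching(cli_value, model_supports_it: bool) -> bool:
    # Hypothetical helper: an unset flag falls back to the model default,
    # so only an explicit --no-enable-prefix-caching turns caching off
    # for a model that supports it.
    return model_supports_it if cli_value is None else cli_value


if __name__ == "__main__":
    parser = build_parser()
    for argv in ([], ["--enable-prefix-caching"], ["--no-enable-prefix-caching"]):
        args = parser.parse_args(argv)
        print(argv, "->", resolve_prefix_caching(args.enable_prefix_caching, True))
```

Under this sketch, an empty command line and an explicit `--enable-prefix-caching` both resolve to enabled for a supporting model, which matches the behavior the note describes.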

Reproducer script

repro_for_issue.py sends the same prompt 5 times sequentially with temperature=0, logprobs=5, stream=False, and reports the unique-output count:

python repro_for_issue.py --base-url http://localhost:8002 --prompt-len 32
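The script itself is not included in the report. The following is a minimal sketch of what such a reproducer could look like, not the author's actual code. It assumes the OpenAI-compatible `/v1/completions` endpoint returns the generated text in `choices[0].text`; the prompt string, `max_tokens` value, and all function names are placeholders, since the report does not show how the prompt is built from `--prompt-len` and the prompt seed.

```python
import json
import urllib.request
from typing import Callable, List


def make_payload(model: str, prompt: str, max_tokens: int = 64) -> dict:
    # Sampling settings taken from the report; max_tokens is an assumption.
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
        "logprobs": 5,
        "stream": False,
    }


def post_completion(base_url: str, payload: dict) -> str:
    # POST the payload to the OpenAI-compatible completions endpoint.
    req = urllib.request.Request(
        base_url.rstrip("/") + "/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


def run_repro(send: Callable[[dict], str], payload: dict, runs: int = 5) -> List[str]:
    # Send the identical payload sequentially and collect every output.
    return [send(payload) for _ in range(runs)]


def main(base_url: str = "http://localhost:8002") -> None:
    payload = make_payload("Qwen/Qwen3-8B", "placeholder prompt")
    outputs = run_repro(lambda p: post_completion(base_url, p), payload)
    for i, text in enumerate(outputs, 1):
        print(f"  run {i}/{len(outputs)}: {text[:70]!r}")
    print(f"unique outputs: {len(set(outputs))} / {len(outputs)}")
```

Calling `main()` against a freshly started server should show whether the first run differs from the rest. Passing the sender as a callable keeps `run_repro` testable without a live server.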

Observed Output (default config, prefix caching ON)

runs:       5, prompt_seed: 'repro-issue-001'
temperature: 0, logprobs: 5

  run 1/5:  \n\nOkay, let me try to figure out what's going on here. So, there's th
  run 2/5:  \nThe next completion should be deterministic because temperature is z
  run 3/5:  \nThe next completion should be deterministic because temperature is z
  run 4/5:  \nThe next completion should be deterministic because temperature is z
  run 5/5:  \nThe next completion should be deterministic because temperature is z

unique outputs: 2 / 5
First divergence at character offset 2 (output token 0):
  variant 1 token = ' \n'
  variant 2 token = ' \n\n'

Re-running the script on the same server (no restart) returns unique = 1 (variant 1 only). Restarting the server brings back the two-output pattern.
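The "first divergence" line in the output above can be computed with a simple character-level comparison. A minimal sketch follows; the function name `first_divergence` is hypothetical, and the report does not show how the original script maps the character offset to a token index, so that part is omitted.

```python
from itertools import zip_longest
from typing import Optional


def first_divergence(a: str, b: str) -> Optional[int]:
    """Return the first character offset where a and b differ, or None if equal."""
    # zip_longest pads the shorter string with None, so a pure length
    # difference is reported at the end of the shorter string.
    for i, (ca, cb) in enumerate(zip_longest(a, b)):
        if ca != cb:
            return i
    return None


if __name__ == "__main__":
    cold = " \n\nOkay, let me try to figure out"
    warm = " \nThe next completion should be"
    print(first_divergence(cold, warm))  # prints 2
```

The two sample strings are the beginnings of the run 1 and run 2 outputs from the report; they share the leading space and first newline and differ at offset 2, consistent with the reported divergence.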


extent analysis

TL;DR

Disabling prefix caching with --no-enable-prefix-caching may resolve the non-deterministic output issue at temperature=0.

Guidance

  • The issue seems to be related to prefix caching, as disabling it makes the output deterministic across all runs.
  • To verify, run the server with --no-enable-prefix-caching and check if the output remains the same for multiple runs.
  • The provided reproducer script can be used to test the issue and verify the fix.
  • The report only tests temperature=0. A non-zero temperature samples stochastically, so it would not restore determinism; the author instead notes that a few warm-up rounds circumvent the issue.
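The issue author mentions that a few warm-up rounds circumvent the problem. A minimal sketch of that idea follows, assuming a `send` callable that posts a completion request and returns its text. The function names and the default of three rounds are hypothetical, and the report does not establish whether the warm-up must use the same prompt as the real request; this sketch simply reuses the same payload.

```python
from typing import Callable


def warm_up(send: Callable[[dict], str], payload: dict, rounds: int = 3) -> None:
    """Send the payload `rounds` times and discard the responses."""
    for _ in range(rounds):
        send(payload)


def warmed_completion(send: Callable[[dict], str], payload: dict, rounds: int = 3) -> str:
    # Discard the first responses from a fresh server, then return the
    # output of the next request, which the report observed to be stable.
    warm_up(send, payload, rounds)
    return send(payload)
```

This trades a few extra requests at startup for stable outputs afterwards. It does not address the underlying cause, and it leaves prefix caching enabled.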

Example

No code snippet is provided as the issue is more related to configuration and command-line arguments.

Notes

The issue was reported with the Qwen3-8B model on vLLM version 0.19.0; the report does not establish whether other models or versions are affected, so the workaround may not apply elsewhere.

Recommendation

Apply workaround: Disable prefix caching with --no-enable-prefix-caching to ensure deterministic output at temperature=0. This is a reasonable workaround given the provided information and the fact that disabling prefix caching resolves the issue.
