vllm - ✅(Solved) Fix [Bug] Garbage output for long prompts after #35216 [1 pull request, 1 participant]

GitHub stats
vllm-project/vllm#37732 (fetched 2026-04-08 01:08:30)
Comments: 0 · Participants: 1 · Timeline events: 5 · Reactions: 2
Timeline (top): cross-referenced ×1, labeled ×1, mentioned ×1, renamed ×1

Commit 12fd17eb5 (#35216, "[compile] Initialize passes at VllmBackend init") causes garbage logits for long prompts (roughly 2048 tokens and up; the reproduction in the issue body measures the switch at about 2340 to 2350 prompt tokens) when using the -O3 compilation level. The regression was identified via git bisect across 33 commits.

Root Cause

The commit moves configure_post_pass() from __call__ (torch.compile time) to __init__ (backend init). With -O3, this triggers AllReduceFusionPass.__init__() which calls get_fi_ar_workspace() to allocate FlashInfer GPU workspace.

The problem is allocation timing relative to memory profiling:

| Step | Before (#35216) | After (#35216) |
|---|---|---|
| 1. VllmBackend.__init__() | No workspace alloc | configure_post_pass() → allocates FlashInfer workspace |
| 2. Model profiling | Clears/reorganizes GPU memory | Clears/reorganizes GPU memory → workspace invalidated |
| 3. KV cache allocation | Normal | Normal |
| 4. torch.compile __call__() | configure_post_pass() → allocates workspace (valid) | Skipped (already called) |
| 5. Inference | Valid workspace ✅ | Dangling pointer → garbage |

Short prompts may fit within memory regions that happen to not be overwritten; longer prompts exercise more of the workspace and hit the corrupted memory.
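
The ordering problem can be illustrated with a toy model that needs no GPU. This is only a sketch of the reasoning above: ToyDevicePool, its generation counter, and run() are hypothetical stand-ins, not vLLM or FlashInfer APIs, and real GPU memory does not track validity this way.

```python
class ToyDevicePool:
    """Hypothetical stand-in for GPU memory; NOT a vLLM or FlashInfer API."""

    def __init__(self):
        # Bumped every time memory is cleared or reorganized.
        self.generation = 0

    def allocate(self):
        # A handle remembers the generation it was allocated in.
        return {"generation": self.generation}

    def profile(self):
        # Models step 2: profiling clears/reorganizes memory.
        self.generation += 1

    def is_valid(self, handle):
        # A handle from an older generation is treated as dangling.
        return handle["generation"] == self.generation


def run(allocate_in_init):
    """Walk through steps 1-5 and report whether the workspace is still valid."""
    pool = ToyDevicePool()
    workspace = None
    if allocate_in_init:
        # Step 1 after #35216: workspace allocated at backend init.
        workspace = pool.allocate()
    # Step 2: model profiling.
    pool.profile()
    if workspace is None:
        # Step 4 before #35216: workspace allocated at torch.compile time.
        workspace = pool.allocate()
    # Step 5: inference uses whatever handle exists.
    return pool.is_valid(workspace)


print("before #35216:", run(allocate_in_init=False))
print("after  #35216:", run(allocate_in_init=True))
```

In this model the only difference between the two runs is whether allocation happens before or after the profiling step, which is the same difference the table describes.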

Fix Action

Fixed

PR fix notes

PR #37733: Revert "[compile] Initialize passes at VllmBackend init"

Description (problem / solution / changelog)

Reverts vllm-project/vllm#35216

see #37732

Changed files

  • tests/test_config.py (modified, +2/-2)
  • vllm/compilation/backends.py (modified, +3/-12)
  • vllm/compilation/decorators.py (modified, +0/-5)


Environment

  • vLLM version: 0.17.2rc1.dev195 (commit c57d38d60)
  • Model: nvidia/Kimi-K2.5-NVFP4 (MLA architecture, kv_lora_rank=512)
  • Hardware: NVIDIA GB200
  • Launch flags: -O3 --compilation_config.pass_config.enable_qk_norm_rope_fusion true

Reproduction

# Start vLLM with -O3
vllm serve nvidia/Kimi-K2.5-NVFP4 -O3

# Works (short prompt, ~1500 tokens):
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "nvidia/Kimi-K2.5-NVFP4",
  "messages": [{"role": "system", "content": "You are helpful."},
               {"role": "user", "content": "What is 2+2?"}],
  "max_tokens": 50
}'
# → Coherent response ✅

# Fails (long prompt, ~2500+ tokens):
python3 -c "
import requests
body = {
    'model': 'nvidia/Kimi-K2.5-NVFP4',
    'messages': [
        {'role': 'system', 'content': 'You are a helpful assistant. ' * 400},
        {'role': 'user', 'content': 'What is 2+2?'}
    ],
    'max_tokens': 50, 'temperature': 0
}
r = requests.post('http://localhost:8000/v1/chat/completions', json=body).json()
print(r['choices'][0]['message'])
"
# → Garbage output: random Chinese/English fragments, "_poly", "fab", "Purdue" etc. ❌

The threshold is sharp: ~2340 prompt tokens = coherent, ~2350 = garbage. Garbage persists even at temperature=0, confirming corrupt logits rather than a sampling issue.
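
A sharp threshold like this can be located with a binary search over the number of times the system sentence is repeated. The sketch below assumes a single coherent-to-garbage switch between the two starting points; is_coherent is a hypothetical callback you would supply (for example, one that sends the request and inspects the reply), and build_body simply mirrors the request shape used in the reproduction.

```python
def build_body(repeats, model="nvidia/Kimi-K2.5-NVFP4"):
    """Request body with the system sentence repeated `repeats` times."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant. " * repeats},
            {"role": "user", "content": "What is 2+2?"},
        ],
        "max_tokens": 50,
        "temperature": 0,
    }


def first_failing_repeat(is_coherent, lo, hi):
    """Return the smallest repeat count in (lo, hi] that is not coherent.

    Assumes is_coherent(lo) is True, is_coherent(hi) is False, and that the
    output switches from coherent to garbage exactly once in between.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_coherent(mid):
            lo = mid  # threshold lies above mid
        else:
            hi = mid  # threshold is at or below mid
    return hi
```

To translate a repeat count into prompt tokens, the usage.prompt_tokens field of the chat completion response can be read, assuming the server reports it.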

Bisect Result

ed359c497 (good) → c57d38d60 (bad), 33 commits

12fd17eb5198708523008dda6809143d0f7234ed is the first bad commit
[compile] Initialize passes at VllmBackend init (#35216)

Reverting this single commit on c57d38d60 fixes the issue completely.

Suggested Fix

Quick fix: Revert the configure_post_pass() move back to __call__:

# backends.py __init__: remove early configure_post_pass()
# backends.py __call__: restore self.configure_post_pass() before graph processing

Proper fix: Keep pass registration in __init__ for latency reporting, but make FlashInfer workspace allocation lazy — defer get_fi_ar_workspace() until first use (after profiling completes).

cc @angelayi

extent analysis

Fix Plan

The issue body proposes two options. PR #37733 takes the first one by reverting #35216 outright:

  • Revert the configure_post_pass() move: call configure_post_pass() from __call__ again, so that the FlashInfer workspace is allocated after memory profiling.
  • Allocate the FlashInfer workspace lazily: keep pass registration in __init__, but defer get_fi_ar_workspace() until first use, after profiling completes, so the workspace is never invalidated.

Code Changes

The following sketch illustrates both options. It is not the actual diff; the real class bodies and signatures are elided:

# backends.py (illustrative sketch; real signatures and bodies elided)
class VllmBackend:
    def __init__(self, *args, **kwargs):
        # Do not call self.configure_post_pass() here: with -O3 it would
        # allocate the FlashInfer workspace before memory profiling.
        ...

    def __call__(self, *args, **kwargs):
        # Restore self.configure_post_pass() before graph processing
        self.configure_post_pass()
        # ... rest of the method remains the same

# Lazy allocation of FlashInfer workspace (illustrative sketch)
class AllReduceFusionPass:
    def __init__(self, *args, **kwargs):
        # Do not allocate the workspace here
        self._workspace = None

    def get_workspace(self):
        # Allocate on first use, after profiling has completed
        if self._workspace is None:
            self._workspace = get_fi_ar_workspace()
        return self._workspace
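
The deferral itself can be checked without a GPU. The following is a minimal sketch in which fake_get_workspace is a hypothetical stand-in for get_fi_ar_workspace() and LazyWorkspacePass is an invented class name; it only records the order of events to show that allocation happens on first access rather than at construction.

```python
calls = []


def fake_get_workspace():
    # Hypothetical stand-in for get_fi_ar_workspace(); records when it runs.
    calls.append("allocate")
    return object()


class LazyWorkspacePass:
    """Invented example class: holds an allocator and calls it on first use."""

    def __init__(self, allocator):
        calls.append("init")
        self._allocator = allocator
        self._workspace = None  # nothing allocated yet

    @property
    def workspace(self):
        # Allocate once, on first access, then reuse the same object.
        if self._workspace is None:
            self._workspace = self._allocator()
        return self._workspace


fusion_pass = LazyWorkspacePass(fake_get_workspace)
calls.append("profile")          # stands in for the memory profiling step
first = fusion_pass.workspace    # first use triggers the allocation
second = fusion_pass.workspace   # later uses reuse the same workspace
print(calls)  # ['init', 'profile', 'allocate']
```

Initializing the attribute to None in __init__ rather than probing with hasattr keeps the object's state explicit, which is the same choice made in the sketch above.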

Verification

To verify that the fix worked, you can run the reproduction script with the modified code and check that the output is coherent for both short and long prompts.

# Run vLLM with -O3 and modified code
vllm serve nvidia/Kimi-K2.5-NVFP4 -O3

# Test with short prompt
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "nvidia/Kimi-K2.5-NVFP4",
  "messages": [{"role": "system", "content": "You are helpful."},
               {"role": "user", "content": "What is 2+2?"}],
  "max_tokens": 50
}'

# Test with long prompt
python3 -c "
import requests
body = {
    'model': 'nvidia/Kimi-K2.5-NVFP4',
    'messages': [
        {'role': 'system', 'content': 'You are a helpful assistant. ' * 400},
        {'role': 'user', 'content': 'What is 2+2?'}
    ],
    'max_tokens': 50, 'temperature': 0
}
r = requests.post('http://localhost:8000/v1/chat/completions', json=body).json()
print(r['choices'][0]['message'])
"
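
Both checks can also be wrapped in one small script. This is a sketch under assumptions: the "contains 4" test is a crude heuristic for coherence chosen for this example, not something the issue specifies, and the post parameter is an invented hook so that the logic can be exercised without a running server.

```python
import json
import urllib.request

URL = "http://localhost:8000/v1/chat/completions"
MODEL = "nvidia/Kimi-K2.5-NVFP4"


def build_body(system_prompt):
    """Same request shape as the reproduction, with temperature fixed at 0."""
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": "What is 2+2?"},
        ],
        "max_tokens": 50,
        "temperature": 0,
    }


def http_post(body):
    """POST the body as JSON to the local server and return the parsed reply."""
    request = urllib.request.Request(
        URL,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def check(post=http_post):
    """Return {'short': bool, 'long': bool}; True means the reply mentions 4."""
    prompts = {
        "short": "You are helpful.",
        "long": "You are a helpful assistant. " * 400,
    }
    results = {}
    for name, system_prompt in prompts.items():
        reply = post(build_body(system_prompt))
        text = reply["choices"][0]["message"].get("content") or ""
        # Heuristic (assumption): a coherent answer to 2+2 contains "4".
        results[name] = "4" in text
    return results
```

With the server running, print(check()) should report True for both prompts once the fix is in place; on the affected commit the long prompt is expected to fail the heuristic.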

Extra Tips

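  • Test prompts on both sides of the reported threshold (about 2340 and 2350 prompt tokens) at temperature=0, so that any remaining garbage can be attributed to corrupt logits rather than sampling.
  • Consider logging when the FlashInfer workspace is allocated relative to memory profiling, to make ordering regressions like this one easier to spot.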
