vllm - ✅(Solved) Fix [Bug] Garbage output for long prompts after #35216 [1 pull requests, 1 participants]

vllm · 2026-03-21 03:26:06

vllm-project/vllm#37732•Fetched 2026-04-08 01:08:30

Author

esmeetu

Participants

esmeetu

Commit 12fd17eb5 (#35216, "[compile] Initialize passes at VllmBackend init") causes garbage logits for prompts exceeding ~2048 tokens when using -O3 compilation level. The regression was identified via git bisect across 33 commits.

Root Cause

The commit moves configure_post_pass() from __call__ (torch.compile time) to __init__ (backend init). With -O3, this triggers AllReduceFusionPass.__init__(), which calls get_fi_ar_workspace() to allocate the FlashInfer GPU workspace.

The problem is allocation timing relative to memory profiling:

| Step | Before (#35216) | After (#35216) |
| --- | --- | --- |
| 1. `VllmBackend.__init__()` | No workspace alloc | `configure_post_pass()` → allocates FlashInfer workspace |
| 2. Model profiling | Clears/reorganizes GPU memory | Clears/reorganizes GPU memory → workspace invalidated |
| 3. KV cache allocation | Normal | Normal |
| 4. `torch.compile __call__()` | `configure_post_pass()` → allocates workspace (valid) | Skipped (already called) |
| 5. Inference | Valid workspace ✅ | Dangling pointer → garbage ❌ |

Short prompts may fit within memory regions that happen to not be overwritten; longer prompts exercise more of the workspace and hit the corrupted memory.
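
The ordering problem can be illustrated with a toy model. This is a minimal sketch, not vLLM or FlashInfer code: `ToyGpu`, its `generation` counter, and `run` are hypothetical names, and the assumption is simply that profiling invalidates any workspace handle allocated before it.

```python
class ToyGpu:
    """Toy stand-in for GPU memory; not vLLM or FlashInfer code."""

    def __init__(self) -> None:
        self.generation = 0  # bumped whenever memory is cleared/reorganized

    def allocate_workspace(self) -> dict:
        # A handle remembers which memory generation it was allocated in.
        return {"generation": self.generation}

    def profile(self) -> None:
        # Assumption: profiling clears/reorganizes memory, so older handles go stale.
        self.generation += 1

    def is_valid(self, workspace: dict) -> bool:
        return workspace["generation"] == self.generation


def run(allocate_in_init: bool) -> bool:
    """Walk through steps 1-5 and report whether the workspace is still valid."""
    gpu = ToyGpu()
    # Step 1: backend init (allocates only in the "after #35216" ordering).
    workspace = gpu.allocate_workspace() if allocate_in_init else None
    # Step 2: model profiling. Step 3 (KV cache allocation) is omitted here.
    gpu.profile()
    # Step 4: compile-time call allocates only if nothing was allocated earlier.
    if workspace is None:
        workspace = gpu.allocate_workspace()
    # Step 5: inference uses whatever handle it holds.
    return gpu.is_valid(workspace)


print("before #35216:", run(allocate_in_init=False))  # True
print("after  #35216:", run(allocate_in_init=True))   # False
```

In the toy model the only difference between the two runs is when the allocation happens, which mirrors the two table columns.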

Fix Action

Fixed

Fixed by PR: Revert "[compile] Initialize passes at VllmBackend init" (https://github.com/vllm-project/vllm/pull/37733)

PR fix notes

PR #37733: Revert "[compile] Initialize passes at VllmBackend init"

Repository: vllm-project/vllm
Author: simon-mo
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/37733

Description (problem / solution / changelog)

Reverts vllm-project/vllm#35216

see #37732

Changed files

tests/test_config.py (modified, +2/-2)
vllm/compilation/backends.py (modified, +3/-12)
vllm/compilation/decorators.py (modified, +0/-5)

Summary

Environment

vLLM version: 0.17.2rc1.dev195 (commit c57d38d60)
Model: nvidia/Kimi-K2.5-NVFP4 (MLA architecture, kv_lora_rank=512)
Hardware: NVIDIA GB200
Launch flags: -O3 --compilation_config.pass_config.enable_qk_norm_rope_fusion true

Reproduction

# Start vLLM with -O3
vllm serve nvidia/Kimi-K2.5-NVFP4 -O3

# Works (short prompt, ~1500 tokens):
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "nvidia/Kimi-K2.5-NVFP4",
  "messages": [{"role": "system", "content": "You are helpful."},
               {"role": "user", "content": "What is 2+2?"}],
  "max_tokens": 50
}'
# → Coherent response ✅

# Fails (long prompt, ~2500+ tokens):
python3 -c "
import json, requests
body = {
    'model': 'nvidia/Kimi-K2.5-NVFP4',
    'messages': [
        {'role': 'system', 'content': 'You are a helpful assistant. ' * 400},
        {'role': 'user', 'content': 'What is 2+2?'}
    ],
    'max_tokens': 50, 'temperature': 0
}
r = requests.post('http://localhost:8000/v1/chat/completions', json=body).json()
print(r['choices'][0]['message'])
"
# → Garbage output: random Chinese/English fragments, "_poly", "fab", "Purdue" etc. ❌

The threshold is sharp: ~2340 prompt tokens = coherent, ~2350 = garbage. Garbage persists even at temperature=0, confirming corrupt logits rather than a sampling issue.
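
A threshold like this can be narrowed down with a binary search over prompt length. The sketch below is one possible way to do it, assuming that coherence is monotone in prompt length (coherent below the threshold, garbage above it). `first_bad_length` and the `is_coherent` callback are hypothetical; in practice `is_coherent(n)` would send a prompt of about `n` tokens to the server and judge the reply.

```python
from typing import Callable


def first_bad_length(good: int, bad: int, is_coherent: Callable[[int], bool]) -> int:
    """Return the smallest length in (good, bad] for which is_coherent is False.

    Assumes is_coherent(good) is True, is_coherent(bad) is False, and that
    the result flips exactly once between the two.
    """
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_coherent(mid):
            good = mid  # threshold lies above mid
        else:
            bad = mid   # threshold is at or below mid
    return bad


# Demonstration with a made-up cutoff of 2345 tokens (illustrative only,
# not a measured value):
print(first_bad_length(1500, 2500, lambda n: n < 2345))  # 2345
```

Each probe halves the search interval, so a 1000-token range needs about ten requests.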

Bisect Result

ed359c497 (good) → c57d38d60 (bad), 33 commits

12fd17eb5198708523008dda6809143d0f7234ed is the first bad commit
[compile] Initialize passes at VllmBackend init (#35216)

Reverting this single commit on c57d38d60 fixes the issue completely.

Suggested Fix

Quick fix: Revert the configure_post_pass() move back to __call__:

# backends.py __init__: remove early configure_post_pass()
# backends.py __call__: restore self.configure_post_pass() before graph processing

Proper fix: Keep pass registration in __init__ for latency reporting, but make FlashInfer workspace allocation lazy — defer get_fi_ar_workspace() until first use (after profiling completes).

cc @angelayi

extent analysis

Fix Plan

The issue proposes two alternative fixes; only one of them is needed:

Revert the configure_post_pass() move (quick fix, merged as PR #37733): Move configure_post_pass() back to __call__ so that the FlashInfer workspace is allocated after memory profiling.
Lazy allocation of the FlashInfer workspace (proper fix): Keep pass registration in __init__, but defer get_fi_ar_workspace() until first use, after profiling completes, so the workspace is never invalidated.

Code Changes

Here's an example of how you can implement the fixes:

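# backends.py (sketch; the real constructor and call signatures are elided in the issue)
# Option 1: revert, so passes are configured at compile time again
class VllmBackend:
    def __init__(self, *args, **kwargs):
        # Early call removed: it allocated the FlashInfer workspace before
        # memory profiling, which later invalidated it.
        # self.configure_post_pass()
        pass

    def __call__(self, *args, **kwargs):
        # Restore self.configure_post_pass() before graph processing, so the
        # workspace is allocated after profiling.
        self.configure_post_pass()
        # ... rest of the method remains the same

# Option 2: lazy allocation of the FlashInfer workspace
class AllReduceFusionPass:
    def __init__(self, *args, **kwargs):
        # Do not allocate the workspace here
        # self.workspace = get_fi_ar_workspace()
        self._workspace = None

    def get_workspace(self):
        # Allocate the workspace lazily, on first use (after profiling completes)
        if self._workspace is None:
            # get_fi_ar_workspace() is the helper named in the issue; its
            # import path and arguments are not shown there
            self._workspace = get_fi_ar_workspace()
        return self._workspace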
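
The lazy-allocation idea can be checked in isolation with a stub allocator. This is a self-contained sketch, not vLLM code: `fake_get_workspace` is a hypothetical stand-in for get_fi_ar_workspace(), `LazyPass` is a hypothetical pass class, and `functools.cached_property` is one possible way to express "allocate on first use".

```python
from functools import cached_property

calls = []  # records the order in which events happen


def fake_get_workspace() -> object:
    # Hypothetical stand-in for get_fi_ar_workspace(); records when it runs.
    calls.append("alloc")
    return object()


class LazyPass:
    """Hypothetical pass whose workspace is allocated on first access."""

    @cached_property
    def workspace(self) -> object:
        return fake_get_workspace()


lazy_pass = LazyPass()        # construction allocates nothing
calls.append("profiling")     # stands in for memory profiling
first = lazy_pass.workspace   # first use triggers the allocation
second = lazy_pass.workspace  # later uses reuse the cached object

print(calls)            # ['profiling', 'alloc']
print(first is second)  # True
```

The recorded order shows the allocation landing after the profiling step even though the pass object was created before it.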

Verification

To verify that the fix worked, you can run the reproduction script with the modified code and check that the output is coherent for both short and long prompts.

# Run vLLM with -O3 and modified code
vllm serve nvidia/Kimi-K2.5-NVFP4 -O3

# Test with short prompt
curl http://localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "nvidia/Kimi-K2.5-NVFP4",
  "messages": [{"role": "system", "content": "You are helpful."},
               {"role": "user", "content": "What is 2+2?"}],
  "max_tokens": 50
}'

# Test with long prompt
python3 -c "
import json, requests
body = {
    'model': 'nvidia/Kimi-K2.5-NVFP4',
    'messages': [
        {'role': 'system', 'content': 'You are a helpful assistant. ' * 400},
        {'role': 'user', 'content': 'What is 2+2?'}
    ],
    'max_tokens': 50, 'temperature': 0
}
r = requests.post('http://localhost:8000/v1/chat/completions', json=body).json()
print(r['choices'][0]['message'])
"

Extra Tips

Test prompts on both sides of the ~2340-token threshold at temperature=0 with -O3, since short prompts can pass even when the workspace is corrupted.
Consider logging when the FlashInfer workspace is allocated relative to memory profiling, to help catch similar ordering regressions.
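
The "output is coherent" check from the Verification step can be partly automated. The function below is a rough heuristic sketch, assuming the prompt is the "What is 2+2?" question from the reproduction and that a coherent reply mentions the answer; `answer_looks_coherent` is a hypothetical helper, and a reply truncated by max_tokens could fail it even when the logits are fine.

```python
def answer_looks_coherent(message: dict) -> bool:
    """Heuristic: a coherent reply to 'What is 2+2?' should mention 4."""
    content = message.get("content") or ""  # content may be missing or None
    return "4" in content or "four" in content.lower()


# Example inputs shaped like r['choices'][0]['message'] from the reproduction;
# the reply texts here are illustrative, not captured server output.
print(answer_looks_coherent({"role": "assistant", "content": "2 + 2 equals 4."}))   # True
print(answer_looks_coherent({"role": "assistant", "content": "_poly fab Purdue"}))  # False
```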
