vllm - 💡(How to fix) Fix [Bug]: Frontend Abort Fails to Stop Qwen3.5-122B Generation Loop, vLLM Backend Runs Indefinitely with Near-Full GPU Memory Utilization [5 comments, 3 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
vllm-project/vllm#37658Fetched 2026-04-08 01:04:19
View on GitHub
Comments
5
Participants
3
Timeline
13
Reactions
1
Timeline (top)
commented ×5subscribed ×4mentioned ×3labeled ×1

Fix Action

Fix / Workaround

Are there any effective configuration parameters, engine settings, or workaround methods to control this kind of infinite generation dead loop?

Code Example

(APIServer pid=3729549) INFO 03-20 16:08:21 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 3.00, Accepted throughput: 30.60 tokens/s, Drafted throughput: 30.60 tokens/s, Accepted: 306 tokens, Drafted: 306 tokens, Per-position acceptance rate: 1.000, 1.000, Avg Draft acceptance rate: 100.0%
(APIServer pid=3729549) INFO 03-20 16:08:31 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 45.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.9%, Prefix cache hit rate: 0.0%
(APIServer pid=3729549) INFO 03-20 16:08:31 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 3.00, Accepted throughput: 30.40 tokens/s, Drafted throughput: 30.40 tokens/s, Accepted: 304 tokens, Drafted: 304 tokens, Per-position acceptance rate: 1.000, 1.000, Avg Draft acceptance rate: 100.0%
(APIServer pid=3729549) INFO 03-20 16:08:41 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 45.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.9%, Prefix cache hit rate: 0.0%

---

+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA RTX 6000D               On  |   00000000:81:00.0 Off |                    0 |
| N/A   42C    P0            126W /  600W |   84188MiB /  85651MiB |     98%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA RTX 6000D               On  |   00000000:91:00.0 Off |                    0 |
| N/A   42C    P0            127W /  600W |   84188MiB /  85651MiB |     99%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA RTX 6000D               On  |   00000000:E1:00.0 Off |                    0 |
| N/A   44C    P0            135W /  600W |   84188MiB / 85651MiB |     99%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA RTX 6000D               On  |   00000000:F1:00.0 Off |                    0 |
| N/A   43C    P0            127W /  600W |   84186MiB / 85651MiB |     99%      Default |
|                                         |                        |             Disabled |
RAW_BUFFERClick to expand / collapse

Your current environment

  • Model: Qwen/Qwen3.5-122B-A10B

  • Inference Framework: vLLM 0.17.1 (official stable release)

  • GPU Hardware: NVIDIA RTX 6000D (multi-GPU tensor parallelism enabled)

  • Deployment Mode: vLLM OpenAI-compatible API Server

🐛 Describe the bug

When deploying and running the Qwen/Qwen3.5-122B-A10B model via the vLLM 0.17.1 API server, I have encountered a severe blocking issue: sending an abort or stop generation request from the frontend fails to terminate the backend inference process, and the model falls into an uncontrollable infinite generation loop instead.

The detailed abnormal behavior is as follows:

  1. The frontend triggers a normal generation request to the vLLM API server, and the model starts generating tokens normally.

  2. During the model generation process, I click the abort output/stop generation button on the frontend to terminate the current request immediately.

  3. The frontend stops displaying the output stream as expected, but thebackend vLLM engine does not terminate the generation task at all — the model falls into an infinite generation loop and keeps outputting tokens non-stop.

  4. The backend keeps printing real-time generation metrics logs continuously, showing 1 running request all the time with stable generation throughput, and the request is never cleared from the engine queue.

  5. GPU memory utilization spikes to nearly 100% and remains stuck at this high level, causing extreme resource occupation and affecting subsequent normal requests.

Backend Logs (Abnormal Continuous Output)

(APIServer pid=3729549) INFO 03-20 16:08:21 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 3.00, Accepted throughput: 30.60 tokens/s, Drafted throughput: 30.60 tokens/s, Accepted: 306 tokens, Drafted: 306 tokens, Per-position acceptance rate: 1.000, 1.000, Avg Draft acceptance rate: 100.0%
(APIServer pid=3729549) INFO 03-20 16:08:31 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 45.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.9%, Prefix cache hit rate: 0.0%
(APIServer pid=3729549) INFO 03-20 16:08:31 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 3.00, Accepted throughput: 30.40 tokens/s, Drafted throughput: 30.40 tokens/s, Accepted: 304 tokens, Drafted: 304 tokens, Per-position acceptance rate: 1.000, 1.000, Avg Draft acceptance rate: 100.0%
(APIServer pid=3729549) INFO 03-20 16:08:41 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 45.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.9%, Prefix cache hit rate: 0.0%

GPU Memory & Utilization Status

+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA RTX 6000D               On  |   00000000:81:00.0 Off |                    0 |
| N/A   42C    P0            126W /  600W |   84188MiB /  85651MiB |     98%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA RTX 6000D               On  |   00000000:91:00.0 Off |                    0 |
| N/A   42C    P0            127W /  600W |   84188MiB /  85651MiB |     99%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA RTX 6000D               On  |   00000000:E1:00.0 Off |                    0 |
| N/A   44C    P0            135W /  600W |   84188MiB / 85651MiB |     99%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA RTX 6000D               On  |   00000000:F1:00.0 Off |                    0 |
| N/A   43C    P0            127W /  600W |   84186MiB / 85651MiB |     99%      Default |
|                                         |                        |             Disabled |

Core Question

Are there any effective configuration parameters, engine settings, or workaround methods to control this kind of infinite generation dead loop?

I need a reliable way to:

  • Forcefully terminate the stuck generation request immediately when receiving an abort signal from the frontend

  • Prevent the model from entering an endless token output loop

  • Release occupied GPU resources and clear the stuck request from the vLLM engine queue

Any official suggestions, configuration tuning tips or temporary fixes for this issue would be highly appreciated.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

To address the infinite generation loop issue, we need to implement a mechanism to forcefully terminate the stuck generation request and release occupied GPU resources. Here are the steps:

  • Modify the vLLM API server to handle abort requests:
    • Implement a signal handler to catch the abort signal from the frontend and terminate the generation process.
    • Use a library like signal in Python to handle signals.
  • Use a timeout mechanism to prevent infinite loops:
    • Set a timeout for each generation request using a library like timeout-decorator in Python.
    • If the generation process exceeds the timeout, terminate it and release resources.
  • Implement a queue management system:
    • Use a library like queue in Python to manage the generation requests.
    • When an abort signal is received, remove the request from the queue and terminate the generation process.
  • Release occupied GPU resources:
    • Use the torch.cuda module to release GPU resources when a generation process is terminated.

Example code snippet:

import signal
import timeout_decorator
import torch

# Define a signal handler to catch abort signals
def signal_handler(sig, frame):
    # Terminate the generation process and release resources
    generation_process.terminate()
    torch.cuda.empty_cache()

# Set up the signal handler
signal.signal(signal.SIGABRT, signal_handler)

# Define a timeout decorator to prevent infinite loops
@timeout_decorator.timeout(60)  # 1-minute timeout
def generate_tokens():
    # Generation code here
    pass

# Implement a queue management system
from queue import Queue

generation_queue = Queue()

def generate_tokens():
    # Get the next request from the queue
    request = generation_queue.get()
    try:
        # Generate tokens
        generate_tokens()
    finally:
        # Release resources and remove the request from the queue
        torch.cuda.empty_cache()
        generation_queue.task_done()

# Start the generation process
generation_process = multiprocessing.Process(target=generate_tokens)
generation_process.start()

Verification

To verify that the fix worked, test the following scenarios:

  • Send an abort signal from the frontend and verify that the generation process terminates immediately.
  • Verify that the GPU resources are released after the generation process is terminated.
  • Test the timeout mechanism by setting a short timeout and verifying that the generation process is terminated after the timeout expires.

Extra Tips

  • Make sure to handle exceptions and errors properly to prevent the generation process from getting stuck in an infinite loop.
  • Consider implementing a retry mechanism to handle failed generation requests.
  • Monitor the GPU resources and adjust the timeout and queue management system as needed to prevent resource exhaustion.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING