vllm - 💡(How to fix) Fix [Bug]: Frontend Abort Fails to Stop Qwen3.5-122B Generation Loop, vLLM Backend Runs Indefinitely with Near-Full GPU Memory Utilization [5 comments, 3 participants]

vllm2026-03-20 08:17:32

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

vllm-project/vllm#37658•Fetched 2026-04-08 01:04:19

View on GitHub

Comments

Participants

Timeline

Reactions

Author

Participants

Timeline (top)

commented ×5subscribed ×4mentioned ×3labeled ×1

Fix Action

Fix / Workaround

Are there any effective configuration parameters, engine settings, or workaround methods to control this kind of infinite generation dead loop?

Code Example

(APIServer pid=3729549) INFO 03-20 16:08:21 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 3.00, Accepted throughput: 30.60 tokens/s, Drafted throughput: 30.60 tokens/s, Accepted: 306 tokens, Drafted: 306 tokens, Per-position acceptance rate: 1.000, 1.000, Avg Draft acceptance rate: 100.0%
(APIServer pid=3729549) INFO 03-20 16:08:31 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 45.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.9%, Prefix cache hit rate: 0.0%
(APIServer pid=3729549) INFO 03-20 16:08:31 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 3.00, Accepted throughput: 30.40 tokens/s, Drafted throughput: 30.40 tokens/s, Accepted: 304 tokens, Drafted: 304 tokens, Per-position acceptance rate: 1.000, 1.000, Avg Draft acceptance rate: 100.0%
(APIServer pid=3729549) INFO 03-20 16:08:41 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 45.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.9%, Prefix cache hit rate: 0.0%

---

+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA RTX 6000D               On  |   00000000:81:00.0 Off |                    0 |
| N/A   42C    P0            126W /  600W |   84188MiB /  85651MiB |     98%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA RTX 6000D               On  |   00000000:91:00.0 Off |                    0 |
| N/A   42C    P0            127W /  600W |   84188MiB /  85651MiB |     99%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA RTX 6000D               On  |   00000000:E1:00.0 Off |                    0 |
| N/A   44C    P0            135W /  600W |   84188MiB / 85651MiB |     99%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA RTX 6000D               On  |   00000000:F1:00.0 Off |                    0 |
| N/A   43C    P0            127W /  600W |   84186MiB / 85651MiB |     99%      Default |
|                                         |                        |             Disabled |

RAW_BUFFERClick to expand / collapse

Your current environment

Model: Qwen/Qwen3.5-122B-A10B
Inference Framework: vLLM 0.17.1 (official stable release)
GPU Hardware: NVIDIA RTX 6000D (multi-GPU tensor parallelism enabled)
Deployment Mode: vLLM OpenAI-compatible API Server

🐛 Describe the bug

When deploying and running the Qwen/Qwen3.5-122B-A10B model via the vLLM 0.17.1 API server, I have encountered a severe blocking issue: sending an abort or stop generation request from the frontend fails to terminate the backend inference process, and the model falls into an uncontrollable infinite generation loop instead.

The detailed abnormal behavior is as follows:

The frontend triggers a normal generation request to the vLLM API server, and the model starts generating tokens normally.
During the model generation process, I click the abort output/stop generation button on the frontend to terminate the current request immediately.
The frontend stops displaying the output stream as expected, but thebackend vLLM engine does not terminate the generation task at all — the model falls into an infinite generation loop and keeps outputting tokens non-stop.
The backend keeps printing real-time generation metrics logs continuously, showing 1 running request all the time with stable generation throughput, and the request is never cleared from the engine queue.
GPU memory utilization spikes to nearly 100% and remains stuck at this high level, causing extreme resource occupation and affecting subsequent normal requests.

Backend Logs (Abnormal Continuous Output)

(APIServer pid=3729549) INFO 03-20 16:08:21 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 3.00, Accepted throughput: 30.60 tokens/s, Drafted throughput: 30.60 tokens/s, Accepted: 306 tokens, Drafted: 306 tokens, Per-position acceptance rate: 1.000, 1.000, Avg Draft acceptance rate: 100.0%
(APIServer pid=3729549) INFO 03-20 16:08:31 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 45.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.9%, Prefix cache hit rate: 0.0%
(APIServer pid=3729549) INFO 03-20 16:08:31 [metrics.py:100] SpecDecoding metrics: Mean acceptance length: 3.00, Accepted throughput: 30.40 tokens/s, Drafted throughput: 30.40 tokens/s, Accepted: 304 tokens, Drafted: 304 tokens, Per-position acceptance rate: 1.000, 1.000, Avg Draft acceptance rate: 100.0%
(APIServer pid=3729549) INFO 03-20 16:08:41 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 45.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 11.9%, Prefix cache hit rate: 0.0%

GPU Memory & Utilization Status

+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA RTX 6000D               On  |   00000000:81:00.0 Off |                    0 |
| N/A   42C    P0            126W /  600W |   84188MiB /  85651MiB |     98%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA RTX 6000D               On  |   00000000:91:00.0 Off |                    0 |
| N/A   42C    P0            127W /  600W |   84188MiB /  85651MiB |     99%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA RTX 6000D               On  |   00000000:E1:00.0 Off |                    0 |
| N/A   44C    P0            135W /  600W |   84188MiB / 85651MiB |     99%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA RTX 6000D               On  |   00000000:F1:00.0 Off |                    0 |
| N/A   43C    P0            127W /  600W |   84186MiB / 85651MiB |     99%      Default |
|                                         |                        |             Disabled |

Core Question

Are there any effective configuration parameters, engine settings, or workaround methods to control this kind of infinite generation dead loop?

I need a reliable way to:

Forcefully terminate the stuck generation request immediately when receiving an abort signal from the frontend
Prevent the model from entering an endless token output loop
Release occupied GPU resources and clear the stuck request from the vLLM engine queue

Any official suggestions, configuration tuning tips or temporary fixes for this issue would be highly appreciated.

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

To address the infinite generation loop issue, we need to implement a mechanism to forcefully terminate the stuck generation request and release occupied GPU resources. Here are the steps:

Modify the vLLM API server to handle abort requests:
- Implement a signal handler to catch the abort signal from the frontend and terminate the generation process.
- Use a library like signal in Python to handle signals.
Use a timeout mechanism to prevent infinite loops:
- Set a timeout for each generation request using a library like timeout-decorator in Python.
- If the generation process exceeds the timeout, terminate it and release resources.
Implement a queue management system:
- Use a library like queue in Python to manage the generation requests.
- When an abort signal is received, remove the request from the queue and terminate the generation process.
Release occupied GPU resources:
- Use the torch.cuda module to release GPU resources when a generation process is terminated.

Example code snippet:

import signal
import timeout_decorator
import torch

# Define a signal handler to catch abort signals
def signal_handler(sig, frame):
    # Terminate the generation process and release resources
    generation_process.terminate()
    torch.cuda.empty_cache()

# Set up the signal handler
signal.signal(signal.SIGABRT, signal_handler)

# Define a timeout decorator to prevent infinite loops
@timeout_decorator.timeout(60)  # 1-minute timeout
def generate_tokens():
    # Generation code here
    pass

# Implement a queue management system
from queue import Queue

generation_queue = Queue()

def generate_tokens():
    # Get the next request from the queue
    request = generation_queue.get()
    try:
        # Generate tokens
        generate_tokens()
    finally:
        # Release resources and remove the request from the queue
        torch.cuda.empty_cache()
        generation_queue.task_done()

# Start the generation process
generation_process = multiprocessing.Process(target=generate_tokens)
generation_process.start()

Verification

To verify that the fix worked, test the following scenarios:

Send an abort signal from the frontend and verify that the generation process terminates immediately.
Verify that the GPU resources are released after the generation process is terminated.
Test the timeout mechanism by setting a short timeout and verifying that the generation process is terminated after the timeout expires.

Extra Tips

Make sure to handle exceptions and errors properly to prevent the generation process from getting stuck in an infinite loop.
Consider implementing a retry mechanism to handle failed generation requests.
Monitor the GPU resources and adjust the timeout and queue management system as needed to prevent resource exhaustion.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #ssr #installation #tensor shape #autograd error #tool integration #LLM response #prompt template #agent execution

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

vllm - 💡(How to fix) Fix [Bug]: Frontend Abort Fails to Stop Qwen3.5-122B Generation Loop, vLLM Backend Runs Indefinitely with Near-Full GPU Memory Utilization [5 comments, 3 participants]

Recommended Tools

GitHub issue graph ai analysis

Fix Action

Fix / Workaround

Code Example

Your current environment

🐛 Describe the bug

Backend Logs (Abnormal Continuous Output)

GPU Memory & Utilization Status

Core Question

Before submitting a new issue...

extent analysis

Fix Plan

Verification

Extra Tips

Still need to ship something?

TRENDING

vllm - 💡(How to fix) Fix [Bug]: Frontend Abort Fails to Stop Qwen3.5-122B Generation Loop, vLLM Backend Runs Indefinitely with Near-Full GPU Memory Utilization [5 comments, 3 participants]

Recommended Tools

GitHub issue graph ai analysis

Fix Action

Fix / Workaround

Code Example

Your current environment

🐛 Describe the bug

Backend Logs (Abnormal Continuous Output)

GPU Memory & Utilization Status

Core Question

Before submitting a new issue...

extent analysis

Fix Plan

Verification

Extra Tips

Still need to ship something?

RELATED_DISCOVERY

TRENDING