vllm - ✅ (Solved) Fix [Bug]: benchmark_serving_multi_turn.py deadlocks after clients exit when --max-num-requests is used [1 pull request, 1 participant]

vllm · 2026-05-10 12:06:55

vllm-project/vllm#42226 • Fetched 2026-05-11 03:13:44

Author

idan-friedman

Participants

idan-friedman

Timeline (top)

cross-referenced ×1

Root Cause

Root cause: task_queue.join_thread() (line ~1059) deadlocks because:

- The drain loop uses mp.Queue.empty(), which is documented as unreliable: it can return True while the feeder thread still has items in its internal buffer.
- join_thread() waits for the feeder thread to finish flushing to the OS pipe, but the pipe is full and no process is reading from it.
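
The first point can be observed directly. The following is a minimal sketch, not code from the benchmark: `put_then_observe` is a hypothetical helper that puts one item, samples `empty()` immediately, and then fetches the item with a bounded `get()`. The value of `empty()` at that moment depends on whether the feeder thread has already written to the pipe, so the sketch reports it without relying on it.

```python
import multiprocessing as mp
import queue


def put_then_observe(payload, timeout_sec=5.0):
    """Put one item, sample empty() right away, then fetch with a bounded wait.

    Hypothetical helper for illustration only.
    """
    q = mp.Queue()
    q.put(payload)  # returns once the item sits in the queue's internal buffer
    empty_right_after_put = q.empty()  # racy: may be True or False here
    try:
        # A bounded get() waits for the feeder thread to flush the item.
        item = q.get(timeout=timeout_sec)
    except queue.Empty:
        item = None
    q.close()
    q.join_thread()  # safe here: the only item has already been consumed
    return empty_right_after_put, item


if __name__ == "__main__":
    seen_empty, item = put_then_observe("hello")
    print(f"empty() right after put(): {seen_empty}; get() returned: {item!r}")
```

Because `empty()` may report True while data is still in flight, a `while not q.empty()` drain loop can stop before the queue is really drained.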

Fix Action

Fix proposed (the PR is open and not merged as of the fetch time)

Proposed fix: PR [Bugfix] Fix queue cleanup deadlock in multi-turn benchmark (https://github.com/vllm-project/vllm/pull/42227)

PR fix notes

PR #42227: [Bugfix] Fix queue cleanup deadlock in multi-turn benchmark

Repository: vllm-project/vllm
Author: idan-friedman
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/42227

Description (problem / solution / changelog)

Summary

Fixes #42226

When --max-num-requests causes early termination in benchmark_serving_multi_turn.py, unconsumed items in task_queue overflow the OS pipe buffer. join_thread() then deadlocks waiting for the feeder thread, which is blocked writing to a full pipe with no readers.

Changes:

- Replace the unreliable mp.Queue.empty()-based drain with a get_nowait() + double-check loop
- Use cancel_join_thread() before close() on all three queues to prevent the deadlock
- Remove join_thread() calls (unnecessary, since all data is already collected by this point)
- Drain result_queue with the same robust pattern to avoid losing in-flight metrics
- Use await asyncio.sleep() instead of time.sleep() (the function is async)
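
The cleanup pattern in this list can be sketched as below. This is an illustration under assumptions, not the PR's actual diff: `drain_queue`, `close_without_join`, and `settle_seconds` are hypothetical names, and the sketch is synchronous for brevity, whereas the PR notes the real function is async and uses `await asyncio.sleep()`.

```python
import multiprocessing as mp
import queue
import time


def drain_queue(q, settle_seconds=0.1):
    """Return every item that can be retrieved from q (hypothetical helper).

    On the first Empty, wait briefly and try once more, because the feeder
    thread may still be moving buffered items into the pipe.
    """
    items = []
    while True:
        try:
            items.append(q.get_nowait())
        except queue.Empty:
            time.sleep(settle_seconds)  # give the feeder thread time to flush
            try:
                items.append(q.get_nowait())
            except queue.Empty:
                break  # two consecutive misses: treat the queue as drained
    return items


def close_without_join(q):
    """Close q without waiting for its feeder thread (hypothetical helper)."""
    q.cancel_join_thread()  # process exit will not block on unflushed data
    q.close()


if __name__ == "__main__":
    q = mp.Queue()
    for i in range(20):
        q.put(i)
    drained = drain_queue(q)
    close_without_join(q)
    print(f"drained {len(drained)} items")
```

The double check is a heuristic: it narrows the window in which a late item is missed but cannot rule it out. cancel_join_thread() is what removes the deadlock, at the cost of discarding anything still buffered, which the PR considers acceptable because all needed data has been collected by then.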

AI disclosure: This PR was developed with assistance from Claude (Anthropic). All code has been reviewed and validated by the submitter.

Testing

End-to-end verification (A100 80GB, NousResearch/Hermes-3-Llama-3.1-8B, 500 ShareGPT conversations):

| Test | Code | Result |
| --- | --- | --- |
| `--max-num-requests 100` + ShareGPT | Old (unfixed) | Hung 16+ min after "All clients exited" (deadlock) |
| `--max-num-requests 100` + ShareGPT | New (this PR) | Completed, printed full stats summary |
| No `--max-num-requests` (regression) | New (this PR) | 21 conversations processed without issue |

Isolated reproducer test (proves the mechanism):

| Test | Result |
| --- | --- |
| Old pattern (close + join_thread) with pipe overflow | Deadlocks (3s timeout hit) |
| New pattern (cancel_join_thread + close) with pipe overflow | Completes instantly |
| New pattern on empty queue | Completes instantly |

<details> <summary>Reproducer test script (not part of this PR)</summary>

import multiprocessing as mp
import time

PIPE_OVERFLOW_ITEM_COUNT = 150
ITEM_PAYLOAD = "x" * 500

def _scenario_old_cleanup(item_count):
    q = mp.Queue()
    for i in range(item_count):
        q.put((i, ITEM_PAYLOAD))
    while not q.empty():
        q.get()
    q.close()
    q.join_thread()  # DEADLOCK

def _scenario_new_cleanup(item_count):
    q = mp.Queue()
    for i in range(item_count):
        q.put((i, ITEM_PAYLOAD))
    while True:
        try:
            q.get_nowait()
        except Exception:
            time.sleep(0.1)
            try:
                q.get_nowait()
            except Exception:
                break
    q.cancel_join_thread()
    q.close()

def _scenario_new_cleanup_empty():
    q = mp.Queue()
    for i in range(10):
        q.put((i, ITEM_PAYLOAD))
    time.sleep(0.2)
    while True:
        try:
            q.get_nowait()
        except Exception:
            break
    q.cancel_join_thread()
    q.close()

def _run_with_timeout(target, timeout_sec, args=()):
    # Run target in a child process and return (completed, elapsed_seconds).
    # Passing a module-level function plus args (instead of a lambda) keeps
    # the target picklable under the spawn and forkserver start methods.
    start = time.monotonic()
    p = mp.Process(target=target, args=args)
    p.start()
    p.join(timeout=timeout_sec)
    elapsed = time.monotonic() - start
    if p.is_alive():
        # Still running after the timeout: treat as deadlocked and kill it.
        p.terminate()
        p.join()
        return False, elapsed
    return True, elapsed

def test_old_pattern_deadlocks_on_full_pipe():
    completed, _ = _run_with_timeout(
        _scenario_old_cleanup, timeout_sec=3, args=(PIPE_OVERFLOW_ITEM_COUNT,))
    assert not completed

def test_new_pattern_completes_on_full_pipe():
    completed, elapsed = _run_with_timeout(
        _scenario_new_cleanup, timeout_sec=5, args=(PIPE_OVERFLOW_ITEM_COUNT,))
    assert completed and elapsed < 2.0

def test_new_pattern_works_on_empty_queue():
    completed, elapsed = _run_with_timeout(
        _scenario_new_cleanup_empty, timeout_sec=3)
    assert completed and elapsed < 1.0

</details>

Changed files

benchmarks/multi_turn/benchmark_serving_multi_turn.py (modified, +31/-10)

Code Example

python benchmarks/multi_turn/benchmark_serving_multi_turn.py \
    --model NousResearch/Hermes-3-Llama-3.1-8B \
    --url http://localhost:8000 \
    --input-file sharegpt_500.json \
    --num-clients 8 \
    --max-num-requests 2000 \
    --output-file results.json

---

All 8 clients exited (successfully finished 191 out of 500 conversations)

Your current environment

<details> <summary>Environment</summary>

vLLM main branch, commit 48698b1b9 (2026-05-10)
OS: Linux (RHEL 9, kernel 5.14.0)
Python: 3.13
Hardware: Not relevant (benchmark client-side bug, not GPU/inference related)

</details>

🐛 Describe the bug

What happens: When --max-num-requests is used with large conversations (e.g., ShareGPT multi-turn), benchmark_serving_multi_turn.py hangs forever after printing "All N clients exited". No stats summary is produced, no output JSON is written.

How to reproduce:

python benchmarks/multi_turn/benchmark_serving_multi_turn.py \
    --model NousResearch/Hermes-3-Llama-3.1-8B \
    --url http://localhost:8000 \
    --input-file sharegpt_500.json \
    --num-clients 8 \
    --max-num-requests 2000 \
    --output-file results.json

Conditions that trigger it:

- --max-num-requests causes early termination (many unconsumed tasks remain in the queue)
- Conversations are large enough that unconsumed items overflow the OS pipe buffer (~64KB on Linux)
- More conversations queued than will be consumed (num_conversations >> max_num_requests / avg_turns)
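
The overflow condition can be estimated with some rough arithmetic. The sketch below is an approximation under stated assumptions: it takes the ~64KB Linux pipe capacity quoted above as a constant, assumes each queued item costs its pickled size plus a 4-byte length header, and uses the item shape from the isolated reproducer (150 tuples carrying a 500-character string). `estimated_pipe_bytes` and `may_overflow_pipe` are hypothetical helpers, not part of the benchmark.

```python
import pickle

# Assumption: pipe capacity on Linux, as quoted in the issue (~64KB).
ASSUMED_PIPE_CAPACITY_BYTES = 64 * 1024
# Assumption: per-message length prefix added when an item is written.
ASSUMED_HEADER_BYTES = 4


def estimated_pipe_bytes(items):
    """Approximate the bytes the feeder thread must write for these items."""
    return sum(len(pickle.dumps(item)) + ASSUMED_HEADER_BYTES for item in items)


def may_overflow_pipe(items, capacity=ASSUMED_PIPE_CAPACITY_BYTES):
    """True if the unconsumed items would not fit in the pipe at once."""
    return estimated_pipe_bytes(items) > capacity


if __name__ == "__main__":
    large_backlog = [(i, "x" * 500) for i in range(150)]
    small_backlog = [(i, "x" * 500) for i in range(10)]
    print("150 items overflow:", may_overflow_pipe(large_backlog))
    print("10 items overflow:", may_overflow_pipe(small_backlog))
```

With 150 items of roughly 500 bytes each, the backlog is about 75KB and exceeds the assumed capacity, while 10 items stay far below it. This matches the observation that small synthetic conversations do not trigger the hang.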

Conditions where it does NOT trigger:

- Small/synthetic conversations (zipf) that fit in the pipe buffer
- No --max-num-requests (all conversations consumed)

Expected behavior: After "All N clients exited", the benchmark prints the statistics summary and writes the output JSON.

Actual behavior: Hangs indefinitely after:

All 8 clients exited (successfully finished 191 out of 500 conversations)

Root cause: task_queue.join_thread() (line ~1059) deadlocks because:

- The drain loop uses mp.Queue.empty(), which is documented as unreliable: it can return True while the feeder thread still has items in its internal buffer.
- join_thread() waits for the feeder thread to finish flushing to the OS pipe, but the pipe is full and no process is reading from it.

This is the classic multiprocessing deadlock pattern described in the Python docs:

"a process that puts items in a queue will wait before terminating until all the buffered items are fed by the 'feeder' thread to the underlying pipe"

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
