vllm - ✅ (Solved) Fix [Bug]: benchmark_serving_multi_turn.py deadlocks after clients exit when --max-num-requests is used [1 pull request, 1 participant]

GitHub stats

vllm-project/vllm#42226 (fetched 2026-05-11 03:13:44)

  • Comments: 0
  • Participants: 1
  • Timeline events: 1 (cross-referenced ×1)
  • Reactions: 0
Root Cause

Root cause: task_queue.join_thread() (line ~1059) deadlocks because:

  1. The drain loop uses mp.Queue.empty() which is documented as unreliable — it can return True while the feeder thread still has items in its internal buffer
  2. join_thread() waits for the feeder thread to finish flushing to the OS pipe, but the pipe is full and no process is reading from it
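
The first point can be observed directly. The following is a minimal sketch, not code from the benchmark: `count_premature_empty` is a hypothetical helper that checks `empty()` immediately after `put()` on a fresh queue. The count it returns depends on thread scheduling, so it will vary between runs and machines and may well be zero.

```python
import multiprocessing as mp


def count_premature_empty(trials=50):
    """Count how often empty() is True right after put() on a fresh queue."""
    premature = 0
    for _ in range(trials):
        q = mp.Queue()
        q.put("item")
        # put() only hands the item to a background feeder thread, so the
        # pipe that empty() polls may not contain it yet.
        if q.empty():
            premature += 1
        q.get()  # blocks until the item has actually arrived
        q.close()
        q.join_thread()  # safe here: nothing is left for the feeder to flush
    return premature


if __name__ == "__main__":
    hits = count_premature_empty()
    print(f"empty() was True right after put() in {hits} of 50 trials")
```

Any non-zero count means a drain loop guarded by `empty()` could have stopped while an item was still on its way to the pipe.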

Fix Action

Fixed

PR fix notes

PR #42227: [Bugfix] Fix queue cleanup deadlock in multi-turn benchmark

Description (problem / solution / changelog)

Summary

Fixes #42226

When --max-num-requests causes early termination in benchmark_serving_multi_turn.py, unconsumed items in task_queue overflow the OS pipe buffer. join_thread() then deadlocks waiting for the feeder thread, which is blocked writing to a full pipe with no readers.

Changes:

  • Replace unreliable mp.Queue.empty()-based drain with get_nowait() + double-check loop
  • Use cancel_join_thread() before close() on all three queues to prevent deadlock
  • Remove join_thread() calls (unnecessary — all data is already collected by this point)
  • Drain result_queue with the same robust pattern to avoid losing in-flight metrics
  • Use await asyncio.sleep() instead of time.sleep() (function is async)
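
As an illustration of how these changes fit together inside an async function, here is a hedged sketch. It is not the code from the PR: `drain_queue`, `release_queue`, and `settle_sec` are hypothetical names, and the 0.1 s pause is an assumed value.

```python
import asyncio
import multiprocessing as mp
import queue


async def drain_queue(q, settle_sec=0.1):
    """Collect all retrievable items from q without blocking the event loop."""
    items = []
    while True:
        try:
            items.append(q.get_nowait())
        except queue.Empty:
            # Empty can be a false alarm while the feeder thread is still
            # flushing, so yield to the event loop and check once more.
            await asyncio.sleep(settle_sec)
            try:
                items.append(q.get_nowait())
            except queue.Empty:
                break
    return items


def release_queue(q):
    """Close q without waiting for its feeder thread to finish."""
    q.cancel_join_thread()
    q.close()


async def main():
    q = mp.Queue()
    # Same load as the reproducer: enough data to overflow a ~64KB pipe.
    for i in range(150):
        q.put((i, "x" * 500))
    drained = await drain_queue(q)
    release_queue(q)
    return len(drained)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Using `await asyncio.sleep()` keeps other coroutines running during the pause, whereas `time.sleep()` would stall the whole event loop.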

AI disclosure: This PR was developed with assistance from Claude (Anthropic). All code has been reviewed and validated by the submitter.

Testing

End-to-end verification (A100 80GB, NousResearch/Hermes-3-Llama-3.1-8B, 500 ShareGPT conversations):

| Test | Code | Result |
| --- | --- | --- |
| --max-num-requests 100 + ShareGPT | Old (unfixed) | Hung 16+ min after "All clients exited" (deadlock) |
| --max-num-requests 100 + ShareGPT | New (this PR) | Completed, printed full stats summary |
| No --max-num-requests (regression) | New (this PR) | 21 conversations processed without issue |

Isolated reproducer test (proves the mechanism):

| Test | Result |
| --- | --- |
| Old pattern (close + join_thread) with pipe overflow | Deadlocks (3s timeout hit) |
| New pattern (cancel_join_thread + close) with pipe overflow | Completes instantly |
| New pattern on empty queue | Completes instantly |
<details> <summary>Reproducer test script (not part of this PR)</summary>
import multiprocessing as mp
import queue
import time

# 150 items of ~500 bytes each exceed the ~64KB Linux pipe buffer.
PIPE_OVERFLOW_ITEM_COUNT = 150
ITEM_PAYLOAD = "x" * 500

def _scenario_old_cleanup(item_count):
    q = mp.Queue()
    for i in range(item_count):
        q.put((i, ITEM_PAYLOAD))
    # empty() can return True while the feeder thread still holds items.
    while not q.empty():
        q.get()
    q.close()
    q.join_thread()  # DEADLOCK

def _scenario_new_cleanup(item_count):
    q = mp.Queue()
    for i in range(item_count):
        q.put((i, ITEM_PAYLOAD))
    while True:
        try:
            q.get_nowait()
        except queue.Empty:
            # Double-check after a short pause in case items are in flight.
            time.sleep(0.1)
            try:
                q.get_nowait()
            except queue.Empty:
                break
    # Close without waiting for the feeder thread.
    q.cancel_join_thread()
    q.close()

def _scenario_new_cleanup_empty():
    q = mp.Queue()
    for i in range(10):
        q.put((i, ITEM_PAYLOAD))
    # Give the feeder thread time to flush the small batch to the pipe.
    time.sleep(0.2)
    while True:
        try:
            q.get_nowait()
        except queue.Empty:
            break
    q.cancel_join_thread()
    q.close()

def _run_with_timeout(target, timeout_sec):
    start = time.monotonic()
    p = mp.Process(target=target)
    p.start()
    p.join(timeout=timeout_sec)
    elapsed = time.monotonic() - start
    if p.is_alive():
        p.terminate()
        p.join()
        return False, elapsed
    return True, elapsed

def test_old_pattern_deadlocks_on_full_pipe():
    completed, _ = _run_with_timeout(
        lambda: _scenario_old_cleanup(PIPE_OVERFLOW_ITEM_COUNT), timeout_sec=3)
    assert not completed

def test_new_pattern_completes_on_full_pipe():
    completed, elapsed = _run_with_timeout(
        lambda: _scenario_new_cleanup(PIPE_OVERFLOW_ITEM_COUNT), timeout_sec=5)
    assert completed and elapsed < 2.0

def test_new_pattern_works_on_empty_queue():
    completed, elapsed = _run_with_timeout(
        _scenario_new_cleanup_empty, timeout_sec=3)
    assert completed and elapsed < 1.0
</details>

Changed files

  • benchmarks/multi_turn/benchmark_serving_multi_turn.py (modified, +31/-10)

Code Example

python benchmarks/multi_turn/benchmark_serving_multi_turn.py \
    --model NousResearch/Hermes-3-Llama-3.1-8B \
    --url http://localhost:8000 \
    --input-file sharegpt_500.json \
    --num-clients 8 \
    --max-num-requests 2000 \
    --output-file results.json

---

All 8 clients exited (successfully finished 191 out of 500 conversations)

Your current environment

<details> <summary>Environment</summary>
  • vLLM main branch, commit 48698b1b9 (2026-05-10)
  • OS: Linux (RHEL 9, kernel 5.14.0)
  • Python: 3.13
  • Hardware: Not relevant (benchmark client-side bug, not GPU/inference related)
</details>

🐛 Describe the bug

What happens: When --max-num-requests is used with large conversations (e.g., ShareGPT multi-turn), benchmark_serving_multi_turn.py hangs forever after printing "All N clients exited". No stats summary is produced, no output JSON is written.

How to reproduce:

python benchmarks/multi_turn/benchmark_serving_multi_turn.py \
    --model NousResearch/Hermes-3-Llama-3.1-8B \
    --url http://localhost:8000 \
    --input-file sharegpt_500.json \
    --num-clients 8 \
    --max-num-requests 2000 \
    --output-file results.json

Conditions that trigger it:

  • --max-num-requests causes early termination (many unconsumed tasks remain in the queue)
  • Conversations are large enough that unconsumed items overflow the OS pipe buffer (~64KB on Linux)
  • More conversations queued than will be consumed (num_conversations >> max_num_requests / avg_turns)
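
To make the pipe-buffer condition concrete, the following rough sketch estimates how many bytes a batch of queued items would occupy once serialized, using the item count and payload size from the reproducer script. The 64 KiB capacity and the 4-byte per-message length prefix are assumptions about a typical Linux setup, and `queued_bytes` is a hypothetical helper, not part of the benchmark.

```python
import pickle

PIPE_CAPACITY_BYTES = 64 * 1024  # assumed default Linux pipe capacity
HEADER_BYTES = 4                 # assumed per-message length prefix


def queued_bytes(item_count, payload):
    """Approximate the bytes needed to hold item_count pickled items."""
    return sum(
        len(pickle.dumps((i, payload))) + HEADER_BYTES
        for i in range(item_count)
    )


if __name__ == "__main__":
    total = queued_bytes(150, "x" * 500)
    print(f"{total} bytes queued, overflow: {total > PIPE_CAPACITY_BYTES}")
```

Real ShareGPT conversations are much larger than 500 bytes each, so far fewer unconsumed conversations would be enough to exceed the buffer.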

Conditions where it does NOT trigger:

  • Small/synthetic conversations (zipf) that fit in the pipe buffer
  • No --max-num-requests (all conversations consumed)

Expected behavior: After "All N clients exited", the benchmark prints the statistics summary and writes the output JSON.

Actual behavior: Hangs indefinitely after:

All 8 clients exited (successfully finished 191 out of 500 conversations)

Root cause: task_queue.join_thread() (line ~1059) deadlocks because:

  1. The drain loop uses mp.Queue.empty() which is documented as unreliable — it can return True while the feeder thread still has items in its internal buffer
  2. join_thread() waits for the feeder thread to finish flushing to the OS pipe, but the pipe is full and no process is reading from it

This is the classic multiprocessing deadlock pattern described in the Python docs:

"a process that puts items in a queue will wait before terminating until all the buffered items are fed by the 'feeder' thread to the underlying pipe"

