vllm - 💡(How to fix) Fix [Bug]: Race condition in AsyncPoolingOutput.get_output() — pooler_output_cpu accessed before copy_event.synchronize() [1 comments, 1 participants]

vllm-project/vllm#36942 (fetched 2026-04-08 00:43:29)


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

N/A - this is a code-level race condition identified during experimentation, not a runtime environment issue.

</details>

🐛 Describe the bug

There is a race condition in AsyncPoolingOutput.get_output() in vllm/v1/worker/gpu/async_utils.py. The method accesses self.pooler_output_cpu (via .unbind()) before calling self.copy_event.synchronize(), meaning the asynchronous GPU→CPU copy may not yet be complete when the tensor data is consumed.

Current code (incorrect ordering):

# vllm/v1/worker/gpu/async_utils.py — AsyncPoolingOutput.get_output()

def get_output(self) -> ModelRunnerOutput:
    pooler_output = list(self.pooler_output_cpu.unbind(dim=0))  # ❌ accesses tensor BEFORE sync
    self.copy_event.synchronize()                                # sync happens too late
    if self.is_valid_cpu is not None:
        is_valid_cpu = self.is_valid_cpu.tolist()
        for i, is_valid in enumerate(is_valid_cpu):
            if not is_valid:
                pooler_output[i] = None
    self.model_runner_output.pooler_output = pooler_output
    return self.model_runner_output

The sibling class AsyncOutput in the same file gets this right. Its get_output() calls self.copy_event.synchronize() first, then accesses the CPU tensors:

# vllm/v1/worker/gpu/async_utils.py — AsyncOutput.get_output()

def get_output(self) -> ModelRunnerOutput:
    self.copy_event.synchronize()  # ✅ sync FIRST
    sampled_token_ids: list[list[int]] = self.sampled_token_ids.tolist()  # access AFTER
    # ...

Context: In __init__, the GPU→CPU transfer is started as a non-blocking copy on a separate CUDA stream, and copy_event is recorded after it:

with stream(copy_stream, main_stream):
    copy_stream.wait_stream(main_stream)
    self.pooler_output_cpu = self.pooler_output.to("cpu", non_blocking=True)
    # ...
    self.copy_event.record(copy_stream)

copy_event.synchronize() is the mechanism that guarantees the host waits until the copy completes. Accessing the destination tensor before this call means reading from a buffer that may be partially filled.
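The ordering hazard can be illustrated without a GPU. The following is a minimal CPU-only sketch, not vLLM or PyTorch code: FakeAsyncCopy is a hypothetical stand-in in which a background thread plays the copy stream and a threading.Event plays copy_event. The gate attribute exists only so the demo can control when the copy runs. The early read copies values out of the buffer right away, which is the case where the ordering matters most.

```python
import threading


class FakeAsyncCopy:
    """Hypothetical CPU-only stand-in for a non-blocking device-to-host copy.

    `done` plays the role of copy_event; `gate` exists only so this demo
    can decide when the background copy is allowed to run.
    """

    def __init__(self, src):
        self.dst = [0.0] * len(src)        # destination buffer, initially zeroed
        self.done = threading.Event()      # set once the copy has finished
        self.gate = threading.Event()      # demo-only: holds the copy back
        self._worker = threading.Thread(target=self._copy, args=(list(src),))
        self._worker.start()

    def _copy(self, src):
        self.gate.wait()                   # wait until the demo releases the copy
        for i, value in enumerate(src):
            self.dst[i] = value
        self.done.set()

    def synchronize(self):
        self.done.wait()                   # block the caller until the copy is done
        self._worker.join()


src = [1.0, 2.0, 3.0, 4.0]

# Wrong order: values are read out of dst before waiting on the event.
copy_a = FakeAsyncCopy(src)
early = list(copy_a.dst)                   # copy has not run yet: all zeros
copy_a.gate.set()
copy_a.synchronize()

# Right order: wait on the event first, then read.
copy_b = FakeAsyncCopy(src)
copy_b.gate.set()
copy_b.synchronize()
late = list(copy_b.dst)

print(early)  # [0.0, 0.0, 0.0, 0.0]
print(late)   # [1.0, 2.0, 3.0, 4.0]
```

In a real CUDA setting the early read is not reliably wrong. It depends on timing, which is why this kind of bug tends to show up only intermittently and under load.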

Steps to reproduce:

  1. Use a pooling/embedding model with AsyncLLM.
  2. Set VLLM_USE_V2_MODEL_RUNNER=1.
  3. Call embed() with a batch of prompts under GPU load.
  4. Observe that returned embedding vectors are sometimes partially zeroed or contain stale values.

Expected behavior:

Embedding vectors should always be complete and correct.

Actual behavior:

Embedding vectors can be partially filled with incorrect values when the GPU→CPU copy has not finished by the time unbind() executes.

Suggested fix:

Move self.copy_event.synchronize() before the tensor access, matching the AsyncOutput pattern:

def get_output(self) -> ModelRunnerOutput:
    self.copy_event.synchronize()                                # ✅ sync FIRST
    pooler_output = list(self.pooler_output_cpu.unbind(dim=0))   # access AFTER
    if self.is_valid_cpu is not None:
        is_valid_cpu = self.is_valid_cpu.tolist()
        for i, is_valid in enumerate(is_valid_cpu):
            if not is_valid:
                pooler_output[i] = None
    self.model_runner_output.pooler_output = pooler_output
    return self.model_runner_output

extent analysis

Fix Plan

To resolve the race condition in AsyncPoolingOutput.get_output(), follow these steps:

  • Move self.copy_event.synchronize() before accessing self.pooler_output_cpu.
  • Ensure that the synchronization call is made before any operations that rely on the completion of the GPU→CPU copy.

Example code:

def get_output(self) -> ModelRunnerOutput:
    self.copy_event.synchronize()  # Ensure copy is complete
    pooler_output = list(self.pooler_output_cpu.unbind(dim=0))  # Access after sync
    if self.is_valid_cpu is not None:
        is_valid_cpu = self.is_valid_cpu.tolist()
        for i, is_valid in enumerate(is_valid_cpu):
            if not is_valid:
                pooler_output[i] = None
    self.model_runner_output.pooler_output = pooler_output
    return self.model_runner_output

Verification

To verify that the fix worked:

  • Run the embed() function with a batch of prompts under GPU load.
  • Check that the returned embedding vectors are complete and correct.
  • Repeat the test multiple times to ensure consistency.
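The repeated check above could be scripted roughly as follows. This is a sketch under stated assumptions: find_inconsistent_runs and fake_embed are hypothetical names, and embed_fn stands for whatever callable wraps the real embed() call and returns one list of floats per prompt. It also assumes the model is deterministic for identical inputs, up to the tolerance tol.

```python
def find_inconsistent_runs(embed_fn, prompts, runs=20, tol=1e-6):
    """Call embed_fn repeatedly; return indices of runs that differ from the first."""
    reference = embed_fn(prompts)          # first run is treated as the baseline
    bad_runs = []
    for run in range(1, runs):
        result = embed_fn(prompts)
        same = len(result) == len(reference) and all(
            len(a) == len(b) and all(abs(x - y) <= tol for x, y in zip(a, b))
            for a, b in zip(result, reference)
        )
        if not same:
            bad_runs.append(run)
    return bad_runs


def fake_embed(prompts):
    # Hypothetical deterministic stand-in: one 2-element vector per prompt.
    return [[float(len(p)), float(sum(map(ord, p)) % 97)] for p in prompts]


bad = find_inconsistent_runs(fake_embed, ["hello", "world"])
print(bad)  # [] because fake_embed always returns the same vectors
```

Because the first run serves as the baseline, a corrupted first run would go unnoticed. Comparing against vectors produced by a known-good path would be a stronger check.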

Extra Tips

  • Always ensure that synchronization calls are made before accessing data that is being transferred asynchronously.
  • Use synchronization mechanisms like copy_event.synchronize() to guarantee that the host waits until the copy completes.
  • Refer to the AsyncOutput.get_output() method in the same file for a correct example of synchronization before accessing CPU tensors.
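One way to make the first tip hard to violate is to put the wait inside the only accessor, so callers cannot read the buffer early. The sketch below is a CPU-only illustration with hypothetical names (SyncedResult, producer). A threading.Event stands in for copy_event, and none of this is vLLM's actual code.

```python
import threading


class SyncedResult:
    """Wraps a buffer that is filled asynchronously; reading always waits first."""

    def __init__(self, buffer, ready_event):
        self._buffer = buffer
        self._ready = ready_event

    def get(self):
        self._ready.wait()          # analogous to copy_event.synchronize()
        return list(self._buffer)   # values are read only after the wait


buffer = [0, 0, 0]
ready = threading.Event()


def producer():
    # Fills the buffer, then signals completion (the "copy" in this analogy).
    for i in range(len(buffer)):
        buffer[i] = i + 1
    ready.set()


result = SyncedResult(buffer, ready)
worker = threading.Thread(target=producer)
worker.start()
values = result.get()               # blocks until producer has finished
worker.join()

print(values)  # [1, 2, 3]
```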
