vllm - 💡(How to fix) Fix [Bug]: Race condition in AsyncPoolingOutput.get_output() — pooler_output_cpu accessed before copy_event.synchronize() [1 comments, 1 participants]

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

N/A - this is a code-level race condition identified during experimentation, not a runtime environment issue.

</details>

🐛 Describe the bug

There is a race condition in AsyncPoolingOutput.get_output() in vllm/v1/worker/gpu/async_utils.py. The method accesses self.pooler_output_cpu (via .unbind()) before calling self.copy_event.synchronize(), meaning the asynchronous GPU→CPU copy may not yet be complete when the tensor data is consumed.

Current code (incorrect ordering):

# vllm/v1/worker/gpu/async_utils.py — AsyncPoolingOutput.get_output()

def get_output(self) -> ModelRunnerOutput:
    pooler_output = list(self.pooler_output_cpu.unbind(dim=0))  # ❌ accesses tensor BEFORE sync
    self.copy_event.synchronize()                                # sync happens too late
    if self.is_valid_cpu is not None:
        is_valid_cpu = self.is_valid_cpu.tolist()
        for i, is_valid in enumerate(is_valid_cpu):
            if not is_valid:
                pooler_output[i] = None
    self.model_runner_output.pooler_output = pooler_output
    return self.model_runner_output

The sibling class AsyncOutput.get_output() in the same file does this correctly - it calls self.copy_event.synchronize() first, then accesses the CPU tensors:

# vllm/v1/worker/gpu/async_utils.py — AsyncOutput.get_output()

def get_output(self) -> ModelRunnerOutput:
    self.copy_event.synchronize()  # ✅ sync FIRST
    sampled_token_ids: list[list[int]] = self.sampled_token_ids.tolist()  # access AFTER
    # ...

Context: In __init__, the GPU→CPU transfer is started as a non-blocking copy on a separate CUDA stream, and copy_event is recorded on that stream after the copy is enqueued:

with stream(copy_stream, main_stream):
    copy_stream.wait_stream(main_stream)
    self.pooler_output_cpu = self.pooler_output.to("cpu", non_blocking=True)
    # ...
    self.copy_event.record(copy_stream)

copy_event.synchronize() is the mechanism that guarantees the host waits until the copy completes. Accessing the destination tensor before this call means reading from a buffer that may be partially filled.
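To make the ordering concrete, here is a minimal CPU-only analogy, not vLLM or CUDA code. A background thread stands in for the copy stream and a threading.Event stands in for copy_event. The class name AsyncCopy and its release() test hook are hypothetical and exist only so the example is deterministic.

```python
import threading


class AsyncCopy:
    """CPU-only stand-in for a non-blocking device-to-host copy (hypothetical)."""

    def __init__(self, source):
        self.dest = [0.0] * len(source)   # destination buffer, like pooler_output_cpu
        self.done = threading.Event()     # plays the role of copy_event
        self._gate = threading.Event()    # test hook: holds the copy until released
        self._thread = threading.Thread(
            target=self._run, args=(list(source),), daemon=True
        )
        self._thread.start()

    def _run(self, source):
        self._gate.wait()                 # simulate a copy that has not started yet
        for i, value in enumerate(source):
            self.dest[i] = value
        self.done.set()                   # like copy_event being reached on the stream

    def release(self):
        """Let the simulated copy proceed."""
        self._gate.set()

    def read_unsynchronized(self):
        # Reads the buffer without waiting: may see unfilled slots.
        return list(self.dest)

    def read_synchronized(self):
        # Waits for completion first, like copy_event.synchronize().
        self.done.wait()
        return list(self.dest)


if __name__ == "__main__":
    copy = AsyncCopy([1.0, 2.0, 3.0])
    print("before sync:", copy.read_unsynchronized())
    copy.release()
    print("after sync: ", copy.read_synchronized())
```

The analogy only models the host-side ordering. It says nothing about how CUDA streams schedule work or about the exact moment real tensor data is read.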

Steps to reproduce:

Use a pooling/embedding model with AsyncLLM.
Set VLLM_USE_V2_MODEL_RUNNER=1.
Call embed() with a batch of prompts under GPU load.
Observe that returned embedding vectors are sometimes partially zeroed or contain stale values.

Expected behavior:

Embedding vectors should always be complete and correct.

Actual behavior:

Embedding vectors can be partially filled with incorrect values when the GPU→CPU copy has not finished by the time unbind() executes.

Suggested fix:

Move self.copy_event.synchronize() before the tensor access, matching the AsyncOutput pattern:

def get_output(self) -> ModelRunnerOutput:
    self.copy_event.synchronize()                                # ✅ sync FIRST
    pooler_output = list(self.pooler_output_cpu.unbind(dim=0))   # access AFTER
    if self.is_valid_cpu is not None:
        is_valid_cpu = self.is_valid_cpu.tolist()
        for i, is_valid in enumerate(is_valid_cpu):
            if not is_valid:
                pooler_output[i] = None
    self.model_runner_output.pooler_output = pooler_output
    return self.model_runner_output

extent analysis

Fix Plan

To resolve the race condition issue in AsyncPoolingOutput.get_output(), follow these steps:

Move self.copy_event.synchronize() to the top of the method, before the first access to self.pooler_output_cpu (the unbind() call).
Keep every later read of the copied data, including self.is_valid_cpu.tolist(), after the synchronization call.

Example code:

def get_output(self) -> ModelRunnerOutput:
    self.copy_event.synchronize()  # Ensure copy is complete
    pooler_output = list(self.pooler_output_cpu.unbind(dim=0))  # Access after sync
    if self.is_valid_cpu is not None:
        is_valid_cpu = self.is_valid_cpu.tolist()
        for i, is_valid in enumerate(is_valid_cpu):
            if not is_valid:
                pooler_output[i] = None
    self.model_runner_output.pooler_output = pooler_output
    return self.model_runner_output
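One way to guard the ordering against regressions is a small test that records the order of calls on a mock object. The sketch below is an assumption-laden illustration: get_output_fixed is a local stand-in that mirrors the suggested fix rather than the method imported from vLLM, and call_order and sync_precedes_access are hypothetical helper names. It uses only unittest.mock from the standard library.

```python
from unittest.mock import MagicMock


def get_output_fixed(self):
    # Local stand-in mirroring the suggested fix; not imported from vLLM.
    self.copy_event.synchronize()
    pooler_output = list(self.pooler_output_cpu.unbind(dim=0))
    if self.is_valid_cpu is not None:
        for i, is_valid in enumerate(self.is_valid_cpu.tolist()):
            if not is_valid:
                pooler_output[i] = None
    self.model_runner_output.pooler_output = pooler_output
    return self.model_runner_output


def call_order(get_output):
    """Run get_output against a mock and return (call names, pooler_output)."""
    fake = MagicMock()
    fake.pooler_output_cpu.unbind.return_value = ("row0", "row1")
    fake.is_valid_cpu.tolist.return_value = [True, False]
    result = get_output(fake)
    # mock_calls records calls on child mocks in the order they happened.
    names = [call[0] for call in fake.mock_calls]
    return names, result.pooler_output


def sync_precedes_access(names):
    """True if the event was synchronized before the CPU tensor was touched."""
    return names.index("copy_event.synchronize") < names.index(
        "pooler_output_cpu.unbind"
    )


if __name__ == "__main__":
    names, output = call_order(get_output_fixed)
    print(names)
    print(sync_precedes_access(names), output)
```

A test like this checks only call ordering, not the actual transfer, so it complements rather than replaces a run on real hardware.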

Verification

To verify that the fix worked:

Run the embed() function with a batch of prompts under GPU load.
Check that the returned embedding vectors are complete and correct.
Repeat the test multiple times to ensure consistency.
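The check for "complete and correct" vectors can be automated by comparing each run against a trusted reference batch. The following is a minimal sketch under stated assumptions: the embeddings are available as plain lists of floats, the reference comes from a run you trust, and the helper names find_corrupted_rows and count_failures are hypothetical. The tolerances are placeholders to tune for your model and dtype.

```python
import math


def find_corrupted_rows(reference, candidate, rel_tol=1e-5, abs_tol=1e-6):
    """Return indices of vectors in candidate that differ from reference."""
    if len(reference) != len(candidate):
        raise ValueError("batches must contain the same number of vectors")
    bad = []
    for idx, (ref_vec, cand_vec) in enumerate(zip(reference, candidate)):
        same_length = len(ref_vec) == len(cand_vec)
        if not same_length or not all(
            math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol)
            for r, c in zip(ref_vec, cand_vec)
        ):
            bad.append(idx)
    return bad


def count_failures(reference, runs, **tolerances):
    """Count how many runs contain at least one mismatching vector."""
    return sum(
        1 for run in runs if find_corrupted_rows(reference, run, **tolerances)
    )


if __name__ == "__main__":
    reference = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
    partially_zeroed = [[0.1, 0.2, 0.3], [0.4, 0.0, 0.0]]
    print(find_corrupted_rows(reference, partially_zeroed))
    print(count_failures(reference, [reference, partially_zeroed, reference]))
```

Because the reported failure is intermittent, a zero failure count over a few runs is weak evidence; running many iterations under load gives a more meaningful signal.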

Extra Tips

Always synchronize before reading data that is being transferred asynchronously.
Use the event recorded after the copy, here copy_event.synchronize(), so the host waits until the copy completes.
Refer to the AsyncOutput.get_output() method in the same file for a correct example of synchronization before accessing CPU tensors.
