vllm - ✅(Solved) Fix Why is an assertion used here? [1 pull requests, 4 comments, 4 participants]

vllm • 2026-03-23 02:26:19


GitHub stats

vllm-project/vllm#37837 • Fetched 2026-04-08 01:17:47


Fix Action

Fixed

Fixed by PR: [Bugfix][Core] Ignore stale KV xfer callbacks after request cleanup (#37837) (https://github.com/vllm-project/vllm/pull/37859)

PR fix notes

PR #37859: [Bugfix][Core] Ignore stale KV xfer callbacks after request cleanup (#37837)

Repository: vllm-project/vllm
Author: rohithj7
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/37859

Description (problem / solution / changelog)

Purpose

#37837: in prefill/decode (PD) setups, a request can be aborted on the decode side while a KV transfer is still finishing. Cleanup removes the request from self.requests, and then _update_from_kv_xfer_finished() receives a late finished_recving / finished_sending notification for that id. The code asserted that the id was still in self.requests, which no longer holds in that case, so the decode node dies.

The fix is to treat that as a stale callback and skip it instead of asserting. Added a couple of tests for the late-notification path.

Test Plan

Linux + Python 3.12 + requirements/test.txt is what matches CI (the lockfile is Linux/CUDA oriented).

Lint:

uv pip install -r requirements/lint.txt
pre-commit install
pre-commit run --all-files

Tests (focused):

python -m pytest tests/v1/core/test_scheduler.py -v \
  -k "abort_request_waiting_for_remote_kvs or abort_request_finished_recving or ignore_late_finished"

Whole file if you feel like it:

python -m pytest tests/v1/core/test_scheduler.py -v

Commit with -s for DCO (Signed-off-by).

Test Result

Will paste pytest output here after I run the above on Linux.

Before: n/a (or describe how you repro’d it)

After:

<will add this soon>


Changed files


https://github.com/vllm-project/vllm/blob/43877a620bf629d3625c870ef787e590101e0518/vllm/v1/core/sched/scheduler.py#L2103

My current online test scenario is a prefill/decode separation setup. After the prefill node processes a request and forwards it to the decode node via the Proxy, the connection between the Proxy and the decode node is immediately closed, which triggers an abort on the decode side. In this scenario, when the scheduler executes _update_from_kv_xfer_finished to update the KV state, it finds that the request's req_id is no longer in self.requests. The assertion then fails and the D node crashes completely.
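The ordering of events in this report can be reproduced in isolation with a toy model. The sketch below is not vLLM code: `ToyScheduler`, `add`, `abort`, and `on_kv_xfer_finished` are hypothetical names that only mimic the sequence described here (abort first, transfer notification second).

```python
class ToyScheduler:
    """Hypothetical stand-in for a scheduler; not vLLM's real class."""

    def __init__(self):
        self.requests = {}  # req_id -> request state

    def add(self, req_id):
        self.requests[req_id] = "WAITING_FOR_REMOTE_KVS"

    def abort(self, req_id):
        # Cleanup on abort removes the request immediately.
        self.requests.pop(req_id, None)

    def on_kv_xfer_finished(self, req_id):
        # Mirrors the kind of assertion described in the issue.
        assert req_id in self.requests, f"unknown request {req_id}"
        self.requests[req_id] = "READY"


def reproduce():
    sched = ToyScheduler()
    sched.add("req-1")
    sched.abort("req-1")  # the proxy closed the connection
    try:
        sched.on_kv_xfer_finished("req-1")  # late notification arrives
    except AssertionError as exc:
        return f"crash: {exc}"
    return "ok"
```

In this toy the exception is caught so the outcome can be inspected. In the scenario reported here nothing catches it, which is why a single aborted request takes down the whole decode node.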

extent analysis

Fix Plan

The fix involves modifying the _update_from_kv_xfer_finished method to handle cases where the request ID is no longer in self.requests.

Steps

Check if the request ID exists in self.requests before attempting to update its state.
If the request ID does not exist, skip the update operation to prevent the assertion failure.

Example Code

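# Simplified sketch: the parameters are illustrative, not vLLM's actual signature.
def _update_from_kv_xfer_finished(self, finished_recving, finished_sending):
    for req_id in (*finished_recving, *finished_sending):
        if req_id not in self.requests:
            # Stale callback for a request that was already cleaned up; skip it.
            continue
        # Existing update logic for req_id goes here.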
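For a version that runs outside vLLM, here is a minimal sketch of the same guard. The `ToyScheduler` class, its `update_from_kv_xfer_finished` method, and the state strings are assumptions made for illustration. Only the idea of skipping ids that are missing from `self.requests` comes from the fix plan.

```python
class ToyScheduler:
    """Hypothetical scheduler used only to illustrate the guard."""

    def __init__(self):
        self.requests = {}  # req_id -> state
        self.freed = []     # ids whose resources were released

    def update_from_kv_xfer_finished(self, finished_recving=(), finished_sending=()):
        """Apply transfer notifications and return the stale ids that were skipped."""
        stale = []
        for req_id in finished_recving:
            if req_id not in self.requests:
                stale.append(req_id)  # request already cleaned up
                continue
            self.requests[req_id] = "READY"
        for req_id in finished_sending:
            if req_id not in self.requests:
                stale.append(req_id)
                continue
            self.freed.append(req_id)
        return stale


sched = ToyScheduler()
sched.requests = {"a": "WAITING", "b": "FINISHED"}
stale = sched.update_from_kv_xfer_finished(
    finished_recving=["a", "gone-1"],
    finished_sending=["b", "gone-2"],
)
```

Using `continue` rather than `return` matters when one batch of notifications carries several ids: a stale id should not prevent the live ones in the same batch from being processed.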

Verification

To verify the fix, run the online test scenario again and check that the assertion failure and subsequent crash no longer occur.

Extra Tips

Consider adding logging to track cases where the request ID is no longer found in self.requests to help identify potential issues.
Review the request lifecycle management to ensure that request IDs are properly removed from self.requests when they are completed or aborted.
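The logging tip could look roughly like the following. This is a sketch using Python's standard `logging` module; the logger name, the `handle_finished` helper, and the `stale_counts` counter are hypothetical and not part of vLLM.

```python
import logging
from collections import Counter

logger = logging.getLogger("kv_xfer_demo")  # hypothetical logger name
stale_counts = Counter()                    # callback kind -> stale count


def handle_finished(requests, req_id, kind):
    """Return True if the callback was applied, False if it was stale."""
    if req_id not in requests:
        stale_counts[kind] += 1
        logger.debug("Ignoring stale %s callback for request %s", kind, req_id)
        return False
    # Real update logic would go here.
    return True


requests = {"live": object()}
applied = handle_finished(requests, "live", "finished_recving")
skipped = handle_finished(requests, "aborted", "finished_sending")
```

Logging at debug level keeps routine aborts from flooding the logs, while the counter still shows how often the race occurs.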
