vllm - ✅(Solved) Fix Why is an assertion used here? [1 pull request, 4 comments, 4 participants]

GitHub stats
vllm-project/vllm#37837 (fetched 2026-04-08 01:17:47)
Comments: 4 · Participants: 4 · Timeline: 9 · Reactions: 0
Timeline (top): commented ×4, subscribed ×2, closed ×1, cross-referenced ×1

Fix Action

Fixed

PR fix notes

PR #37859: [Bugfix][Core] Ignore stale KV xfer callbacks after request cleanup (#37837)

Description (problem / solution / changelog)

Purpose

#37837: in PD setups you can abort a request on the decode side while a KV transfer is still finishing. Cleanup removes the request from self.requests, then _update_from_kv_xfer_finished() gets a late finished_recving / finished_sending for that id. The code asserted the id was still in self.requests, which isn't true in that case, so the decode node dies.

Fix is to treat that as a stale callback and skip it instead of asserting. Added a couple tests for the late-notification path.
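
The ordering problem described above can be reproduced with a toy model. This is a minimal sketch, not vLLM's code: ToyScheduler, abort, and both on_xfer_finished_* methods are hypothetical names, and the real scheduler tracks far more state. It only shows why asserting membership fails once cleanup has run, and how skipping the stale id avoids the crash.

```python
from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    """Hypothetical stand-in for the scheduler; not vLLM's real class."""

    requests: dict[str, str] = field(default_factory=dict)
    freed: list[str] = field(default_factory=list)

    def abort(self, req_id: str) -> None:
        # Cleanup drops the request before the transfer has reported back.
        self.requests.pop(req_id, None)

    def on_xfer_finished_strict(self, finished_ids: set[str]) -> None:
        # Old behavior: assumes every reported id is still tracked.
        for req_id in finished_ids:
            assert req_id in self.requests
            self.freed.append(req_id)

    def on_xfer_finished_tolerant(self, finished_ids: set[str]) -> None:
        # Fixed behavior: a missing id means a stale callback, so skip it.
        for req_id in finished_ids:
            if req_id not in self.requests:
                continue
            self.freed.append(req_id)


def demo() -> tuple[bool, list[str]]:
    sched = ToyScheduler(requests={"a": "running", "b": "running"})
    sched.abort("a")  # decode-side abort while the transfer is in flight
    try:
        sched.on_xfer_finished_strict({"a"})
        crashed = False
    except AssertionError:
        crashed = True
    # The tolerant path ignores "a" and still handles the live request "b".
    sched.on_xfer_finished_tolerant({"a", "b"})
    return crashed, sched.freed
```

In this sketch the strict handler raises on the late notification for "a", while the tolerant handler processes only "b".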

Test Plan

Linux + Python 3.12 + requirements/test.txt is what matches CI (the lockfile is Linux/CUDA oriented).

Lint:

uv pip install -r requirements/lint.txt
pre-commit install
pre-commit run --all-files

Tests (focused):

python -m pytest tests/v1/core/test_scheduler.py -v \
  -k "abort_request_waiting_for_remote_kvs or abort_request_finished_recving or ignore_late_finished"

Whole file if you feel like it:

python -m pytest tests/v1/core/test_scheduler.py -v

Commit with -s for DCO (Signed-off-by).

Test Result

Will paste pytest output here after I run the above on Linux.

Before: n/a (or describe how you repro’d it)

After:

<will add this soon>


Changed files


https://github.com/vllm-project/vllm/blob/43877a620bf629d3625c870ef787e590101e0518/vllm/v1/core/sched/scheduler.py#L2103

My current online test scenario is a prefill/decode separation setup. After the prefill node processes a request and forwards it to the decode node via the Proxy, the connection between the Proxy and the decode node is closed immediately, which triggers an abort on the decode side. In this scenario, when the scheduler runs _update_from_kv_xfer_finished to update the KV state, it finds that the request's req_id is no longer in self.requests. The assertion then fails and the D node crashes completely.

extent analysis

Fix Plan

The fix involves modifying the _update_from_kv_xfer_finished method to handle cases where the request ID is no longer in self.requests.

Steps

  • Check if the request ID exists in self.requests before attempting to update its state.
  • If the request ID does not exist, skip the update operation to prevent the assertion failure.

Example Code

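# Illustrative sketch only: the parameters are simplified and may not
# match the real method signature in scheduler.py.
def _update_from_kv_xfer_finished(self, req_id, kv_xfer):
    if req_id not in self.requests:
        # The request was already cleaned up (e.g. aborted), so this is
        # a stale callback: skip it instead of asserting.
        return
    # Existing update logic for requests that are still tracked goes here.
    ...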

Verification

To verify the fix, run the online test scenario again and check that the assertion failure and subsequent crash no longer occur.

Extra Tips

  • Consider adding logging to track cases where the request ID is no longer found in self.requests to help identify potential issues.
  • Review the request lifecycle management to ensure that request IDs are properly removed from self.requests when they are completed or aborted.
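
The logging tip can be sketched as a small filter that separates live ids from stale ones and records the stale ones at debug level. Everything here is an assumption for illustration: drop_stale_ids, its parameters, and the logger name are hypothetical and not part of vLLM.

```python
import logging

logger = logging.getLogger("kv_xfer_sketch")


def drop_stale_ids(finished_ids, tracked_requests, kind):
    """Return only the ids still tracked; log the rest at debug level.

    All names here are illustrative, not vLLM's API.
    """
    live = []
    # Sort so the result order is deterministic regardless of set ordering.
    for req_id in sorted(finished_ids):
        if req_id in tracked_requests:
            live.append(req_id)
        else:
            logger.debug("Ignoring stale %s callback for request %s", kind, req_id)
    return live
```

Debug level keeps the signal available for diagnosing lifecycle issues without flooding production logs when aborts are frequent.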
