vllm - ✅(Solved) Fix [Bug] Preemption + async scheduling race can corrupt prompt-token accounting and crash Prometheus counters [2 pull requests, 3 comments, 3 participants]

vllm, 2026-03-11 07:49:36


vllm-project/vllm#36755•Fetched 2026-04-08 00:35:02


Timeline (top): commented ×3, referenced ×3, cross-referenced ×2, added_to_project_v2 ×1

Error Message

ValueError: Counters can only be incremented by non-negative amounts.

Root Cause

The deterministic root cause appears to be:

update_from_output() / EngineCoreOutput(...) reads mutable fields from Request at output-materialization time, instead of using per-step snapshots captured at scheduling time.

This race is exposed much more easily with async scheduling, but the stale num_cached_tokens issue also exists independently on the preemption path.

Fix Action

Fixed

Fixed by PR: Fix(Scheduler): Reset num_cached_tokens on preemption to prevent acco… (https://github.com/vllm-project/vllm/pull/36757)
Fixed by PR: [Metrics] Temporary band-aid for "Counters can only be incremented by non-negative amounts" (https://github.com/vllm-project/vllm/pull/36812)

PR fix notes

PR #36757: Fix(Scheduler): Reset num_cached_tokens on preemption to prevent acco…

Repository: vllm-project/vllm
Author: xueliangyang-oeuler
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/36757

Description (problem / solution / changelog)

…unting crash (#36755)


Changed files

vllm/model_executor/layers/fused_moe/experts/trtllm_fp8_moe.py (modified, +6/-0)
vllm/v1/core/sched/scheduler.py (modified, +1/-0)

PR #36812: [Metrics] Temporary band-aid for "Counters can only be incremented by non-negative amounts"

Repository: vllm-project/vllm
Author: markmc
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/36812

Description (problem / solution / changelog)

Since num_computed_tokens, num_cached_tokens, and num_external_computed_tokens accounting seems quite brittle currently - with preemption reset bugs and P/D disaggregation accounting issues - add a defensive check to detect and prevent instances of Prometheus counter errors:

ValueError: Counters can only be incremented by non-negative amounts

The invariant check enforces:

prompt_len >= num_cached_tokens >= num_external_computed_tokens >= 0

with the additional nuance that when all tokens are cached, the scheduler forces recomputation of the last token, so the upper bound on num_external_computed_tokens is relaxed to:

num_external_computed_tokens <= num_cached_tokens + recomputed

When the invariant is violated, we log a warning once with diagnostic details, and discard the suspect cache metrics.

Obviously, the accounting should be fixed and made more robust and future-proof, at which point we can remove this check (perhaps replacing with a simple assertion).

Related to issues #36533, #36755 and PRs #36638, #36752, #36757.
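
The invariant described above can be sketched roughly as follows. This is an illustration only, not the code from PR #36812: the function name `cache_metrics_consistent`, the `recomputed` parameter (assumed to be 1 when the scheduler forced recomputation of the last token, otherwise 0), and the module-level warn-once flag are hypothetical.

```python
import logging

logger = logging.getLogger(__name__)
_warned_once = False  # hypothetical warn-once flag, not a vLLM name


def cache_metrics_consistent(prompt_len, num_cached_tokens,
                             num_external_computed_tokens, recomputed=0):
    """Return True if the counts satisfy the invariant quoted above.

    Assumption: `recomputed` is 1 when all tokens were cached and the
    scheduler forced recomputation of the last token, else 0.
    """
    global _warned_once
    ok = (
        0 <= num_cached_tokens <= prompt_len
        and 0 <= num_external_computed_tokens <= num_cached_tokens + recomputed
    )
    if not ok and not _warned_once:
        # Log once so a persistent accounting bug does not flood the logs.
        _warned_once = True
        logger.warning(
            "Inconsistent cache accounting: prompt_len=%d cached=%d "
            "external=%d recomputed=%d; discarding cache metrics",
            prompt_len, num_cached_tokens,
            num_external_computed_tokens, recomputed,
        )
    return ok
```

With the values from the diagnostic line in the issue (cached=-1, ext=0, prompt_len=3601) this check returns False, so under this sketch the cache metrics would be dropped before reaching the Prometheus counter.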

Changed files

tests/v1/metrics/test_stats.py (modified, +82/-6)
vllm/v1/metrics/stats.py (modified, +66/-2)


Current environment

vLLM V1 engine
local checkout: v0.17.0rc0-134-g58928475e
KV connector in use (PegaKVConnector)
8 workers
model: Qwen3-8B
block size: 64
async scheduling enabled in the failing configuration

Describe the bug

Under load, workers crash with:

ValueError: Counters can only be incremented by non-negative amounts.

Crash path:

output_handler
-> PrometheusStatLogger.record()
-> counter_prompt_tokens_by_source["local_cache_hit"].inc(...)
-> ValueError
-> propagate_error()
-> all in-flight streaming requests return 500
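
To make the first step of this crash path concrete, here is a minimal stand-in for the counter. `ToyCounter` and `record` are hypothetical names; the stand-in only mirrors the non-negative check that produces the error message above, and the subtraction used for the local cache hit count is an assumption about how the value is derived.

```python
class ToyCounter:
    """Stand-in that mirrors the non-negative check of a Prometheus counter."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError(
                "Counters can only be incremented by non-negative amounts."
            )
        self.value += amount


def record(counter, num_cached_tokens, num_external_computed_tokens):
    # Assumed formula: local hits = cached tokens minus externally loaded ones.
    counter.inc(num_cached_tokens - num_external_computed_tokens)


counter = ToyCounter()
record(counter, num_cached_tokens=128, num_external_computed_tokens=64)  # fine

try:
    # Values from the diagnostic line in this issue: cached=-1, ext=0.
    record(counter, num_cached_tokens=-1, num_external_computed_tokens=0)
    raised = False
except ValueError:
    raised = True
```

Because the exception is raised inside the shared stats-recording path, a single inconsistent request is enough to fail every in-flight request, which is why the issue also suggests metrics-side hardening.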

Findings

There appear to be two related problems.

1. Preemption state reset is incomplete

_preempt_request() resets request.num_computed_tokens = 0 but does not reset request.num_cached_tokens.

Without resetting num_cached_tokens, a resumed request can recompute / requery external KV state with a fresh num_external_computed_tokens, while num_cached_tokens still holds an older pre-preemption value. This can make prompt-token source accounting inconsistent and can produce negative values for local_cache_hit.

A minimal fix for this part is:

# scheduler.py::_preempt_request()
request.num_computed_tokens = 0
request.num_cached_tokens = -1
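
A toy model of why the stale value matters, under two loudly stated assumptions: that -1 acts as a "not recorded yet" sentinel so the field is only written while it is negative, and that the local cache hit count is derived as `num_cached_tokens - num_external_computed_tokens`. `ToyRequest`, `schedule`, `preempt`, and `run` are hypothetical names, not vLLM code.

```python
from dataclasses import dataclass


@dataclass
class ToyRequest:
    num_computed_tokens: int = 0
    num_cached_tokens: int = -1  # assumed sentinel: "not recorded yet"
    num_external_computed_tokens: int = 0


def schedule(req, local_hit, external_hit):
    total = local_hit + external_hit
    # The external count is refreshed on every (re)schedule...
    req.num_external_computed_tokens = external_hit
    # ...but the cached count is only recorded while it is still unset.
    if req.num_cached_tokens < 0:
        req.num_cached_tokens = total
    req.num_computed_tokens = total


def preempt(req, reset_cached):
    req.num_computed_tokens = 0
    if reset_cached:
        req.num_cached_tokens = -1


def run(reset_cached):
    req = ToyRequest()
    schedule(req, local_hit=64, external_hit=64)   # cached becomes 128
    preempt(req, reset_cached)
    schedule(req, local_hit=0, external_hit=256)   # resume: more external KV
    return req.num_cached_tokens - req.num_external_computed_tokens


stale_result = run(reset_cached=False)  # 128 - 256
reset_result = run(reset_cached=True)   # 256 - 256
```

Without the reset, the pre-preemption value 128 is compared against a fresh external count of 256 and the difference goes negative; with the reset, both values come from the same scheduling pass.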

2. Async scheduling introduces a race on live Request state

Even with the reset above, async scheduling still crashes.

The deeper issue is that schedule(N+1) and update_from_output(N) can overlap. If a request is preempted during schedule(N+1), mutable fields on the live Request object are updated (num_cached_tokens, num_computed_tokens, status, etc.), while update_from_output(N) still reads the same object to build EngineCoreOutput.

Diagnostics consistently captured outputs where the request state had already been mutated by preemption:

DIAGNOSTIC: req=... cached=-1 ext=0 prompt_len=3601 computed=0 preemptions=1 status=PREEMPTED

Seeing status=PREEMPTED while materializing output strongly suggests that output accounting is reading live mutable Request state, not a schedule-time snapshot.
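
The difference between reading live state and reading a schedule-time snapshot can be shown without any threads. The sketch below is deterministic and uses hypothetical stand-ins (`LiveRequest`, `StepSnapshot`, `schedule_step`, `preempt`); it only illustrates the ordering problem, not the actual scheduler.

```python
from dataclasses import dataclass


@dataclass
class LiveRequest:
    """Toy stand-in for the mutable request object."""
    num_cached_tokens: int
    num_external_computed_tokens: int
    status: str = "RUNNING"


@dataclass(frozen=True)
class StepSnapshot:
    """Immutable copy of the accounting fields, taken at schedule time."""
    num_cached_tokens: int
    num_external_computed_tokens: int


def schedule_step(req):
    # Capture the values that belong to this step before anyone mutates them.
    return StepSnapshot(req.num_cached_tokens, req.num_external_computed_tokens)


def preempt(req):
    req.num_cached_tokens = -1
    req.status = "PREEMPTED"


req = LiveRequest(num_cached_tokens=128, num_external_computed_tokens=64)
snap_n = schedule_step(req)  # schedule(N)
preempt(req)                 # schedule(N+1) preempts the same request

# update_from_output(N) runs after the mutation:
live_view = (req.num_cached_tokens, req.num_external_computed_tokens)
snap_view = (snap_n.num_cached_tokens, snap_n.num_external_computed_tokens)
```

Here `live_view` carries the post-preemption sentinel while `snap_view` still holds the values that were true when step N was scheduled, which is the behavior the suggested fix direction aims for.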

Root cause

The deterministic root cause appears to be:

update_from_output() / EngineCoreOutput(...) reads mutable fields from Request at output-materialization time, instead of using per-step snapshots captured at scheduling time.

This race is exposed much more easily with async scheduling, but the stale num_cached_tokens issue also exists independently on the preemption path.

Observed behavior matrix

Configuration	Result
async scheduling + no fix	crash
async scheduling + `num_cached_tokens = -1` in `_preempt_request()`	still crashes
async scheduling disabled + `num_cached_tokens = -1` in `_preempt_request()`	stable in our testing
async scheduling disabled + no fix	crash

Suggested fix direction

Required: reset request.num_cached_tokens = -1 in _preempt_request()
Required: stop reading num_cached_tokens / num_external_computed_tokens from the live mutable Request during output materialization; instead, snapshot them per request in SchedulerOutput (or equivalent step-local metadata)
Optional hardening: clamp / guard metrics-side counter increments so bad accounting cannot crash the entire engine

Reproduction conditions

external KV connector enabled
high concurrency / high memory pressure causing frequent preemption
async scheduling makes the race much easier to trigger

extent analysis

Fix Plan

To resolve the issue, follow these steps:

1. **Reset `num_cached_tokens` in `_preempt_request()`**:
   Update the `_preempt_request()` method in `scheduler.py` to reset `num_cached_tokens` to -1:
   ```python
   # scheduler.py::_preempt_request()
   request.num_computed_tokens = 0
   request.num_cached_tokens = -1
   ```

2. **Snapshot request metrics in `SchedulerOutput`**:
   Extend `SchedulerOutput` (or equivalent step-local metadata) with per-request snapshots of `num_cached_tokens` and `num_external_computed_tokens`, so that output materialization uses the values captured at scheduling time rather than the live `Request` fields. The class below is a simplified illustration, not the actual vLLM `SchedulerOutput` definition:
   ```python
   class SchedulerOutput:
       def __init__(self, request, num_cached_tokens, num_external_computed_tokens):
           self.request = request
           # Values copied at scheduling time; later preemption must not change them.
           self.num_cached_tokens = num_cached_tokens
           self.num_external_computed_tokens = num_external_computed_tokens
   ```
   Update the scheduling logic to capture these snapshots when creating `SchedulerOutput` instances.

3. **Update `update_from_output()` to use snapshots**:
   Modify `update_from_output()` to read the snapshot values from `SchedulerOutput` instead of the live `Request` object. The body is left as a placeholder:
   ```python
   def update_from_output(output):
       # Use output.num_cached_tokens and output.num_external_computed_tokens
       # instead of request.num_cached_tokens and request.num_external_computed_tokens
       pass
   ```

4. **Optional: Clamp metrics counter increments**:
   To prevent bad accounting from crashing the engine, guard counter increments so that they are never negative:
   ```python
   def increment_counter(counter, value):
       if value < 0:
           # Handle or log the error, and optionally set value to 0
           value = 0
       counter.inc(value)
   ```

Verification

To verify the fix, test the system under the same conditions that previously caused crashes, including:

External KV connector enabled
High concurrency and memory pressure
Async scheduling enabled

Monitor the system for crashes and verify that the ValueError exception is no longer raised.

Extra Tips

Regularly review and test the system under various loads and configurations to catch similar issues early.
Consider implementing additional logging or monitoring to detect and report inconsistencies in request metrics and counter increments.
