vllm - ✅(Solved) Fix [Bug] Preemption + async scheduling race can corrupt prompt-token accounting and crash Prometheus counters [2 pull requests, 3 comments, 3 participants]

GitHub stats
vllm-project/vllm#36755 (fetched 2026-04-08 00:35:02)
Comments: 3
Participants: 3
Timeline: 10
Reactions: 0
Timeline (top): commented ×3, referenced ×3, cross-referenced ×2, added_to_project_v2 ×1

Error Message

ValueError: Counters can only be incremented by non-negative amounts.

Root Cause

The deterministic root cause appears to be:

update_from_output() / EngineCoreOutput(...) reads mutable fields from Request at output-materialization time, instead of using per-step snapshots captured at scheduling time.

This race is exposed much more easily with async scheduling, but the stale num_cached_tokens issue also exists independently on the preemption path.

Fix Action

Fixed

PR fix notes

PR #36757: Fix(Scheduler): Reset num_cached_tokens on preemption to prevent acco…

Description (problem / solution / changelog)

…unting crash (#36755)

Changed files

  • vllm/model_executor/layers/fused_moe/experts/trtllm_fp8_moe.py (modified, +6/-0)
  • vllm/v1/core/sched/scheduler.py (modified, +1/-0)

PR #36812: [Metrics] Temporary band-aid for "Counters can only be incremented by non-negative amounts"

Description (problem / solution / changelog)

Since num_computed_tokens, num_cached_tokens, and num_external_computed_tokens accounting seems quite brittle currently - with preemption reset bugs and P/D disaggregation accounting issues - add a defensive check to detect and prevent instances of Prometheus counter errors:

ValueError: Counters can only be incremented by non-negative amounts

The invariant check enforces:

prompt_len >= num_cached_tokens >= num_external_computed_tokens >= 0

with the additional nuance that, when all tokens are cached, the scheduler forces recomputation of the last token, so the bound on external tokens is relaxed to:

num_external_computed_tokens <= num_cached_tokens + recomputed

When the invariant is violated, we log a warning once with diagnostic details and discard the suspect cache metrics.

Obviously, the accounting should be fixed and made more robust and future-proof, at which point we can remove this check (perhaps replacing with a simple assertion).
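
The invariant described in this PR can be sketched roughly as follows. This is a minimal illustration, not the code from vllm/v1/metrics/stats.py: the function name `cache_metrics_are_consistent` and the `recomputed` parameter (assumed to be 1 when the scheduler forced recomputation of the last token, otherwise 0) are hypothetical.

```python
def cache_metrics_are_consistent(
    prompt_len: int,
    num_cached_tokens: int,
    num_external_computed_tokens: int,
    recomputed: int = 0,
) -> bool:
    """Return True if the token counts satisfy the invariant from the PR text.

    Hypothetical helper: the name and signature are illustrative only.
    `recomputed` is assumed to be 1 when all tokens were cached and the
    scheduler forced recomputation of the last token, otherwise 0.
    """
    if num_external_computed_tokens < 0:
        return False
    if num_cached_tokens > prompt_len:
        return False
    # Relaxed bound: external tokens may exceed cached tokens by `recomputed`.
    return num_external_computed_tokens <= num_cached_tokens + recomputed


# Values reported in the issue's diagnostic output: cached=-1 ext=0 prompt_len=3601
suspect = cache_metrics_are_consistent(3601, -1, 0)
# Made-up healthy values for comparison.
healthy = cache_metrics_are_consistent(3601, 1024, 512)
print(suspect, healthy)
```

A caller that gets `False` would skip the cache-related counters for that request rather than pass a possibly negative value to a counter.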

Related to issues #36533, #36755 and PRs #36638, #36752, #36757.

Changed files

  • tests/v1/metrics/test_stats.py (modified, +82/-6)
  • vllm/v1/metrics/stats.py (modified, +66/-2)


Current environment

  • vLLM V1 engine
  • local checkout: v0.17.0rc0-134-g58928475e
  • KV connector in use (PegaKVConnector)
  • 8 workers
  • model: Qwen3-8B
  • block size: 64
  • async scheduling enabled in the failing configuration

Describe the bug

Under load, workers crash with:

ValueError: Counters can only be incremented by non-negative amounts.

Crash path:

output_handler
-> PrometheusStatLogger.record()
-> counter_prompt_tokens_by_source["local_cache_hit"].inc(...)
-> ValueError
-> propagate_error()
-> all in-flight streaming requests return 500
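
The crash path hinges on one property of Prometheus counters: they are monotonic, so `inc()` rejects negative amounts. The sketch below uses a small stand-in class instead of the real `prometheus_client.Counter` so that it runs without dependencies. `StandInCounter` and `record_local_cache_hit` are hypothetical names; only the error message is taken from the issue.

```python
class StandInCounter:
    """Stand-in for a monotonic Prometheus counter (not the real class)."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        # Monotonic counters reject negative increments.
        if amount < 0:
            raise ValueError(
                "Counters can only be incremented by non-negative amounts."
            )
        self.value += amount


def record_local_cache_hit(counter: StandInCounter, local_cache_hit: int) -> None:
    # Hypothetical, unguarded recording step: a bad value raises here and the
    # exception escapes to whoever called the stats logger.
    counter.inc(local_cache_hit)


counter = StandInCounter()
record_local_cache_hit(counter, 128)  # valid increment

error_message = None
try:
    record_local_cache_hit(counter, -1)  # corrupted accounting
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

Because nothing between the recording step and the output handler catches the exception, a single bad value is enough to take down the handler that serves every in-flight request.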

Findings

There appear to be two related problems.

1. Preemption state reset is incomplete

_preempt_request() resets request.num_computed_tokens = 0 but does not reset request.num_cached_tokens.

Without resetting num_cached_tokens, a resumed request can recompute / requery external KV state with a fresh num_external_computed_tokens, while num_cached_tokens still holds an older pre-preemption value. This can make prompt-token source accounting inconsistent and can produce negative values for local_cache_hit.

A minimal fix for this part is:

# scheduler.py::_preempt_request()
request.num_computed_tokens = 0
request.num_cached_tokens = -1
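
To see how a stale value turns into a negative increment, consider the simplified accounting below. It assumes `local_cache_hit` is derived as `num_cached_tokens - num_external_computed_tokens`, which is a simplification for illustration and may not match the exact formula in vLLM. The token counts are made up.

```python
def local_cache_hit(num_cached_tokens: int, num_external_computed_tokens: int) -> int:
    # Assumed, simplified derivation: tokens served from cache minus the
    # portion that came from the external KV connector.
    return num_cached_tokens - num_external_computed_tokens


# Before preemption: 512 tokens cached, 256 of them loaded externally.
stale_cached = 512
before = local_cache_hit(stale_cached, 256)
print(before)  # 256

# After preemption without the reset: the connector is queried again and now
# reports 1024 external tokens, but num_cached_tokens still holds 512.
fresh_external = 1024
after = local_cache_hit(stale_cached, fresh_external)
print(after)  # -512
```

The second result is the kind of value that a monotonic counter refuses to accept.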

2. Async scheduling introduces a race on live Request state

Even with the reset above, async scheduling still crashes.

The deeper issue is that schedule(N+1) and update_from_output(N) can overlap. If a request is preempted during schedule(N+1), mutable fields on the live Request object are updated (num_cached_tokens, num_computed_tokens, status, etc.), while update_from_output(N) still reads the same object to build EngineCoreOutput.

Diagnostics consistently captured outputs where the request state had already been mutated by preemption:

DIAGNOSTIC: req=... cached=-1 ext=0 prompt_len=3601 computed=0 preemptions=1 status=PREEMPTED

Seeing status=PREEMPTED while materializing output strongly suggests that output accounting is reading live mutable Request state, not a schedule-time snapshot.
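
The difference between reading live state and reading a schedule-time snapshot can be shown without threads by replaying the interleaving by hand. The classes below (`Request`, `StepSnapshot`) are simplified stand-ins invented for this sketch, not vLLM's types, and the token counts are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Simplified stand-in for the mutable per-request state."""
    num_cached_tokens: int
    num_external_computed_tokens: int
    status: str = "RUNNING"


@dataclass(frozen=True)
class StepSnapshot:
    """Immutable copy of the fields that output accounting needs."""
    num_cached_tokens: int
    num_external_computed_tokens: int


def preempt(request: Request) -> None:
    # Mutates the live object, as preemption does in the issue description.
    request.num_cached_tokens = -1
    request.status = "PREEMPTED"


request = Request(num_cached_tokens=1024, num_external_computed_tokens=512)

# schedule(N): capture what step N actually used.
snapshot_n = StepSnapshot(
    request.num_cached_tokens, request.num_external_computed_tokens
)

# schedule(N+1) preempts the request before step N's output is materialized.
preempt(request)

# update_from_output(N), two ways of reading the counts:
live_read = request.num_cached_tokens - request.num_external_computed_tokens
snapshot_read = (
    snapshot_n.num_cached_tokens - snapshot_n.num_external_computed_tokens
)
print(live_read)      # -513
print(snapshot_read)  # 512
```

The frozen dataclass is a deliberate choice: any later attempt to mutate the snapshot raises instead of silently changing what step N reports.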

Root cause

The deterministic root cause appears to be:

update_from_output() / EngineCoreOutput(...) reads mutable fields from Request at output-materialization time, instead of using per-step snapshots captured at scheduling time.

This race is exposed much more easily with async scheduling, but the stale num_cached_tokens issue also exists independently on the preemption path.

Observed behavior matrix

| Configuration | Result |
| --- | --- |
| async scheduling + no fix | crash |
| async scheduling + num_cached_tokens = -1 in _preempt_request() | still crashes |
| async scheduling disabled + num_cached_tokens = -1 in _preempt_request() | stable in our testing |
| async scheduling disabled + no fix | crash |

Suggested fix direction

  1. Required: reset request.num_cached_tokens = -1 in _preempt_request()
  2. Required: stop reading num_cached_tokens / num_external_computed_tokens from the live mutable Request during output materialization; instead, snapshot them per request in SchedulerOutput (or equivalent step-local metadata)
  3. Optional hardening: clamp / guard metrics-side counter increments so bad accounting cannot crash the entire engine

Reproduction conditions

  • external KV connector enabled
  • high concurrency / high memory pressure causing frequent preemption
  • async scheduling makes the race much easier to trigger

extent analysis

Fix Plan

To resolve the issue, follow these steps:

1. **Reset `num_cached_tokens` in `_preempt_request()`**:
   Update the `_preempt_request()` method in `scheduler.py` to reset `num_cached_tokens` to -1:
   ```python
   # scheduler.py::_preempt_request()
   request.num_computed_tokens = 0
   request.num_cached_tokens = -1
   ```

2. **Snapshot request metrics in `SchedulerOutput`**:
   Extend `SchedulerOutput` (or equivalent step-local metadata) to carry per-request snapshots of `num_cached_tokens` and `num_external_computed_tokens`, so that output materialization reads values that later scheduling steps cannot change. The class below is a simplified illustration, not vLLM's actual `SchedulerOutput` definition:
   ```python
   from dataclasses import dataclass

   # Illustrative sketch only; the real SchedulerOutput has many more fields.
   @dataclass(frozen=True)
   class SchedulerOutput:
       request: object
       num_cached_tokens: int             # snapshot taken at scheduling time
       num_external_computed_tokens: int  # snapshot taken at scheduling time
   ```
   Update the scheduling logic to capture these snapshots when creating `SchedulerOutput` instances.

3. **Update `update_from_output()` to use snapshots**:
   Modify `update_from_output()` to use the snapshot values from `SchedulerOutput` instead of reading from the live `Request` object:
   ```python
   def update_from_output(output):
       # Use output.num_cached_tokens and output.num_external_computed_tokens
       # instead of request.num_cached_tokens and
       # request.num_external_computed_tokens.
       pass
   ```

4. **Optional: Clamp metrics counter increments**:
   To prevent crashes caused by bad accounting, consider checking that counter increments are non-negative:
   ```python
   def increment_counter(counter, value):
       if value < 0:
           # Handle or log the error, and optionally set value to 0
           value = 0
       counter.inc(value)
   ```

Verification

To verify the fix, test the system under the same conditions that previously caused crashes, including:

  • External KV connector enabled
  • High concurrency and memory pressure
  • Async scheduling enabled

Monitor the system for crashes and verify that the ValueError exception is no longer raised.

Extra Tips

  • Regularly review and test the system under various loads and configurations to catch similar issues early.
  • Consider implementing additional logging or monitoring to detect and report inconsistencies in request metrics and counter increments.
