vllm - ✅(Solved) Fix [Bug]: Negative prompt token counter crashes engine under CPU KV offloading + high concurrency [2 pull requests, 2 comments, 3 participants]

GitHub stats
vllm-project/vllm#36533 · Fetched 2026-04-08 00:36:21
Comments: 2 · Participants: 3 · Timeline events: 14 · Reactions: 0
Timeline (top): referenced ×4, cross-referenced ×3, commented ×2, added_to_project_v2 ×1

Error Message

(APIServer pid=1327397) INFO 03-09 14:34:35 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 770.3 tokens/s, Running: 256 reqs, Waiting: 764 reqs, GPU KV cache usage: 97.5%, Prefix cache hit rate: 4.3%, External prefix cache hit rate: 0.0%
(APIServer pid=1327397) INFO 03-09 14:34:35 [metrics.py:103] KV Transfer metrics: CPU_to_GPU_total_bytes=810024960, CPU_to_GPU_total_time=0.19331251902878285, GPU_to_CPU_total_bytes=55050240, GPU_to_CPU_total_time=0.0010710399970412254
(APIServer pid=1327397) INFO:     127.0.0.1:58324 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1327397) INFO:     127.0.0.1:58324 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710] AsyncLLM output_handler failed.
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710] Traceback (most recent call last):
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 703, in output_handler
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710]     logger_manager.record(
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/metrics/loggers.py", line 1309, in record
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710]     logger.record(
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/metrics/loggers.py", line 1113, in record
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710]     self.counter_prompt_tokens_by_source[source][engine_idx].inc(
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710]   File "/usr/local/lib/python3.12/dist-packages/prometheus_client/metrics.py", line 339, in inc
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710]     raise ValueError('Counters can only be incremented by non-negative amounts.')
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710] ValueError: Counters can only be incremented by non-negative amounts.
(APIServer pid=1327397) ERROR 03-09 14:34:37 [serving.py:1386] Error in chat completion stream generator.
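
The last frame of the traceback shows the counter rejecting a negative increment. The sketch below uses a hypothetical stand-in class (StandInCounter is not part of prometheus_client or vLLM) that mirrors only the check and message visible in the traceback, to show why a single negative token count is enough to raise.

```python
class StandInCounter:
    """Hypothetical stand-in mirroring the check seen in the traceback."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1) -> None:
        # Same condition and message as the raise statement in the traceback.
        if amount < 0:
            raise ValueError(
                "Counters can only be incremented by non-negative amounts."
            )
        self.value += amount


counter = StandInCounter()
counter.inc(128)  # a normal prompt-token increment succeeds

try:
    counter.inc(-16)  # a negative token count reproduces the failure
except ValueError as exc:
    error_message = str(exc)
```

Because the exception is raised inside the output handler, one bad value affects the whole handler rather than a single request.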

Fix Action

Fixed

PR fix notes

PR #36638: [WIP][Bugfix] Fix negative prompt token counter crash under KV offloading

Description (problem / solution / changelog)

Purpose

Fix engine crash (ValueError: Counters can only be incremented by non-negative amounts) that occurs under high concurrency with CPU KV offloading enabled (GitHub issue #36533).

Root cause: In _update_requests_with_invalid_blocks (scheduler.py), when KV cache blocks fail to load, num_affected_tokens (which includes both locally-cached and externally-loaded tokens) is subtracted entirely from request.num_external_computed_tokens, driving it negative. This negative value propagates through EngineCoreOutput → PromptTokenStats → PrometheusStatLogger.record() → Counter.inc(), where it causes the crash.

Secondary issue: request.num_cached_tokens is set once during initial scheduling and never reset when a request is freed for retry after KV failure or preemption. On reschedule, num_external_computed_tokens may be re-queried to a new value while num_cached_tokens stays stale, creating a mismatch that can also produce negative local_cache_hit.

Fix:

  1. Clamp num_external_computed_tokens to non-negative after subtraction in _update_requests_with_invalid_blocks
  2. Reset num_cached_tokens to -1 (sentinel) in preemption and KV failure retry paths so it is re-captured consistently on reschedule
  3. Add defensive guards in PromptTokenStats.update_from_output to clamp inputs and maintain invariants
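
The first two fix items can be sketched as follows. This is a minimal illustration, not the scheduler code: RequestSketch, handle_invalid_blocks, and NOT_CAPTURED are hypothetical names, and the token numbers are invented. Only the field names and the -1 sentinel come from the PR description.

```python
from dataclasses import dataclass

NOT_CAPTURED = -1  # sentinel: num_cached_tokens must be re-captured on reschedule


@dataclass
class RequestSketch:
    """Hypothetical stand-in holding only the two fields discussed above."""

    num_external_computed_tokens: int = 0
    num_cached_tokens: int = NOT_CAPTURED


def handle_invalid_blocks(req: RequestSketch, num_affected_tokens: int) -> None:
    # num_affected_tokens may include locally cached tokens, so the raw
    # difference can be negative; clamp it at zero.
    req.num_external_computed_tokens = max(
        0, req.num_external_computed_tokens - num_affected_tokens
    )
    # Drop the stale cached-token count so it is captured again consistently.
    req.num_cached_tokens = NOT_CAPTURED


req = RequestSketch(num_external_computed_tokens=64, num_cached_tokens=96)
unclamped = req.num_external_computed_tokens - 128  # what the old subtraction gave
handle_invalid_blocks(req, 128)
```

In this example the unclamped subtraction yields -64, while the clamped request ends with zero external tokens and an unset cached-token count.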

Test Plan

  • Added test_prompt_token_stats_negative_external_clamped — verifies negative num_external_computed_tokens is clamped to 0
  • Added test_prompt_token_stats_external_exceeds_cached — verifies num_external_computed_tokens > num_cached_tokens is clamped
  • Added test_prompt_token_stats_negative_cached_clamped — verifies negative num_cached_tokens is clamped to 0
  • Added test_prompt_token_stats_all_non_negative (parametrized, 6 cases) — fuzz-style check that all PromptTokenStats fields remain non-negative across edge cases

Test Result

tests/v1/metrics/test_stats.py::test_prompt_token_stats_negative_external_clamped PASSED
tests/v1/metrics/test_stats.py::test_prompt_token_stats_external_exceeds_cached PASSED
tests/v1/metrics/test_stats.py::test_prompt_token_stats_negative_cached_clamped PASSED
tests/v1/metrics/test_stats.py::test_prompt_token_stats_all_non_negative[0-0-100] PASSED
tests/v1/metrics/test_stats.py::test_prompt_token_stats_all_non_negative[50--50-100] PASSED
tests/v1/metrics/test_stats.py::test_prompt_token_stats_all_non_negative[-10-0-100] PASSED
tests/v1/metrics/test_stats.py::test_prompt_token_stats_all_non_negative[50-100-100] PASSED
tests/v1/metrics/test_stats.py::test_prompt_token_stats_all_non_negative[-5--5-100] PASSED
tests/v1/metrics/test_stats.py::test_prompt_token_stats_all_non_negative[99-100-100] PASSED

All 19 tests in test_stats.py pass. All 3 tests in test_invalid_blocks_correctness.py pass.
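
The clamping behaviour these tests describe can be approximated with a small self-contained check. The sketch below is an assumption-heavy illustration: the breakdown function and the PromptBreakdown fields other than local_cache_hit are hypothetical, and the parameter order (num_cached_tokens, num_external_computed_tokens, prompt_len) is guessed from the test IDs above, not confirmed by the PR.

```python
from dataclasses import dataclass


@dataclass
class PromptBreakdown:
    computed: int
    local_cache_hit: int
    external_kv_transfer: int


def breakdown(num_cached_tokens: int, num_external_computed_tokens: int,
              prompt_len: int) -> PromptBreakdown:
    # Clamp cached tokens into [0, prompt_len].
    cached = min(max(num_cached_tokens, 0), prompt_len)
    # Clamp external tokens into [0, cached].
    external = min(max(num_external_computed_tokens, 0), cached)
    return PromptBreakdown(
        computed=prompt_len - cached,
        local_cache_hit=cached - external,
        external_kv_transfer=external,
    )


# Triples taken from the parametrized test IDs; the order is an assumption.
CASES = [(0, 0, 100), (50, -50, 100), (-10, 0, 100),
         (50, 100, 100), (-5, -5, 100), (99, 100, 100)]

results = [breakdown(*case) for case in CASES]
all_non_negative = all(
    min(r.computed, r.local_cache_hit, r.external_kv_transfer) >= 0
    for r in results
)
```

With both clamps in place, every field stays non-negative and the three fields always sum to prompt_len, so no negative value can reach a counter.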

Changed files

  • tests/v1/metrics/test_stats.py (modified, +81/-0)
  • vllm/v1/core/sched/scheduler.py (modified, +8/-1)
  • vllm/v1/metrics/stats.py (modified, +8/-0)

PR #36812: [Metrics] Temporary band-aid for "Counters can only be incremented by non-negative amounts"

Description (problem / solution / changelog)

Since num_computed_tokens, num_cached_tokens, and num_external_computed_tokens accounting seems quite brittle currently - with preemption reset bugs and P/D disaggregation accounting issues - add a defensive check to detect and prevent instances of Prometheus counter errors:

ValueError: Counters can only be incremented by non-negative amounts

The invariant check enforces:

prompt_len >= num_cached_tokens >= num_external_computed_tokens >= 0

with the additional nuance that when all tokens are cached, the scheduler forces recomputation of the last token, so the upper bound on external tokens is relaxed to:

num_external_computed_tokens <= num_cached_tokens + recomputed

When the invariant is violated, we log a warning once with diagnostic details, and discard suspect cache metrics.

Obviously, the accounting should be fixed and made more robust and future-proof, at which point we can remove this check (perhaps replacing with a simple assertion).
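
A minimal sketch of such an invariant check, under stated assumptions: sanitize_cache_counts and the logger name are hypothetical, recomputed is modelled as a caller-supplied integer (0 or 1), and "discard suspect cache metrics" is interpreted here as reporting zero cached and zero external tokens. The PR may implement each of these differently.

```python
import logging

logger = logging.getLogger("prompt_token_stats_sketch")  # hypothetical name
_warned_once = False


def sanitize_cache_counts(prompt_len: int, num_cached_tokens: int,
                          num_external_computed_tokens: int,
                          recomputed: int = 0) -> tuple[int, int]:
    """Return (cached, external) if the invariant holds, else (0, 0)."""
    global _warned_once
    valid = (
        0 <= num_cached_tokens <= prompt_len
        and 0 <= num_external_computed_tokens <= num_cached_tokens + recomputed
    )
    if valid:
        return num_cached_tokens, num_external_computed_tokens
    if not _warned_once:
        # Log diagnostics only the first time to avoid flooding the log.
        logger.warning(
            "Invalid prompt token counts: prompt_len=%d cached=%d external=%d",
            prompt_len, num_cached_tokens, num_external_computed_tokens,
        )
        _warned_once = True
    # Assumption: discarded cache metrics are reported as zero.
    return 0, 0
```

For example, sanitize_cache_counts(100, 99, 100, recomputed=1) passes the relaxed bound, while the same counts with recomputed=0 are discarded.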

Related to issues #36533, #36755 and PRs #36638, #36752, #36757.

Changed files

  • tests/v1/metrics/test_stats.py (modified, +82/-6)
  • vllm/v1/metrics/stats.py (modified, +66/-2)


Your current environment

The output of python collect_env.py:

vLLM 0.16.0

🐛 Describe the bug

Run server with:

vllm serve nvidia/Llama-3.3-70B-Instruct-FP4 --host 0.0.0.0 --port 8888 --config /workspace/results/config.yaml --max-num-seqs 1024 --gpu-memory-utilization 0.9 --tensor-parallel-size 1 --attention-config.use_trtllm_attention=0 --kv_offloading_backend native --kv_offloading_size 100 --disable-hybrid-kv-cache-manager

Run basic multiturn benchmark client against server with concurrency equal to 2048 (and --max-num-seqs set accordingly).

Observe error:

(APIServer pid=1327397) INFO 03-09 14:34:35 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 770.3 tokens/s, Running: 256 reqs, Waiting: 764 reqs, GPU KV cache usage: 97.5%, Prefix cache hit rate: 4.3%, External prefix cache hit rate: 0.0%
(APIServer pid=1327397) INFO 03-09 14:34:35 [metrics.py:103] KV Transfer metrics: CPU_to_GPU_total_bytes=810024960, CPU_to_GPU_total_time=0.19331251902878285, GPU_to_CPU_total_bytes=55050240, GPU_to_CPU_total_time=0.0010710399970412254
(APIServer pid=1327397) INFO:     127.0.0.1:58324 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1327397) INFO:     127.0.0.1:58324 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710] AsyncLLM output_handler failed.
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710] Traceback (most recent call last):
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 703, in output_handler
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710]     logger_manager.record(
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/metrics/loggers.py", line 1309, in record
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710]     logger.record(
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/metrics/loggers.py", line 1113, in record
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710]     self.counter_prompt_tokens_by_source[source][engine_idx].inc(
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710]   File "/usr/local/lib/python3.12/dist-packages/prometheus_client/metrics.py", line 339, in inc
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710]     raise ValueError('Counters can only be incremented by non-negative amounts.')
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710] ValueError: Counters can only be incremented by non-negative amounts.
(APIServer pid=1327397) ERROR 03-09 14:34:37 [serving.py:1386] Error in chat completion stream generator.

Note that this is definitely related to the KV offload connector metrics as it only occurs when KV offloading (native) is enabled. Also, this only appears to occur when the offload cache is full (indicated by warnings as shown below):

(EngineCore_DP0 pid=1327677) WARNING 03-09 14:33:09 [offloading_connector.py:435] Request chatcmpl-95b358ced4243ecd-8fd7556a: cannot store 579 blocks

extent analysis

Fix Plan

To resolve the issue, we need to stop negative values from reaching the Prometheus counter: Counter.inc() raises ValueError when asked to increment by a negative amount. Per the PR notes above, the negative value comes from the scheduler's token accounting (num_external_computed_tokens and num_cached_tokens) after KV block load failures or preemption. The issue report observes the crash only with the native KV offloading backend enabled and the offload cache full.

Here are the steps to fix the issue:

  • In scheduler.py, clamp num_external_computed_tokens to zero after subtracting num_affected_tokens in _update_requests_with_invalid_blocks, and reset num_cached_tokens to the -1 sentinel on the preemption and KV failure retry paths (PR #36638).
  • In the metrics path (stats.py in both PRs), guard the token counts so that an inconsistent value produces a warning instead of a negative counter increment (PR #36638, PR #36812).

Code Changes

# Illustrative sketch only: it follows the PR descriptions above, not the actual vLLM source.
import logging

logger = logging.getLogger(__name__)

# scheduler.py: clamp after subtracting the affected tokens
def remaining_external_tokens(num_external_computed_tokens, num_affected_tokens):
    # num_affected_tokens can include locally cached tokens, so the
    # difference may be negative; never let it drop below zero
    return max(0, num_external_computed_tokens - num_affected_tokens)

# metrics path: skip negative increments instead of letting the counter raise
def safe_inc(counter, value):
    if value < 0:
        logger.warning("Skipping negative counter increment: %d", value)
        return
    counter.inc(value)

Verification

To verify that the fix worked, run the server with the same configuration and benchmark client. The ValueError exception should no longer occur, and the server should keep serving requests while the offload cache is full (that is, while "cannot store N blocks" warnings are being logged).

Extra Tips

  • Prefer the changes from the linked PRs (scheduler.py and stats.py) over hand-written patches; the sketch above only illustrates the idea.
  • Re-run the reproduction at the original concurrency, since the issue was reported under high load with the offload cache full.
  • Consider adding additional logging and monitoring to detect similar issues in the future.
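
As one way to monitor for a recurrence, a small log scanner can count crash lines and total unstored blocks. This is a sketch under assumptions: scan_log is a hypothetical helper, and the patterns are taken only from the log lines quoted in this issue, so they may need adjusting for other vLLM versions.

```python
import re

CRASH_MARKER = "ValueError: Counters can only be incremented by non-negative amounts"
STORE_WARNING = re.compile(r"cannot store (\d+) blocks")


def scan_log(lines):
    """Return (crash_count, total_unstored_blocks) for an iterable of log lines."""
    crashes = 0
    unstored_blocks = 0
    for line in lines:
        # Matches the final "ValueError: ..." line, not the "raise ValueError(" frame.
        if CRASH_MARKER in line:
            crashes += 1
        match = STORE_WARNING.search(line)
        if match:
            unstored_blocks += int(match.group(1))
    return crashes, unstored_blocks


sample = [
    "(EngineCore_DP0 pid=1327677) WARNING 03-09 14:33:09 [offloading_connector.py:435] "
    "Request chatcmpl-95b358ced4243ecd-8fd7556a: cannot store 579 blocks",
    "(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710]     "
    "raise ValueError('Counters can only be incremented by non-negative amounts.')",
    "(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710] "
    "ValueError: Counters can only be incremented by non-negative amounts.",
]
crashes, unstored_blocks = scan_log(sample)
```

After a fixed build has run the benchmark, the crash count should stay at zero even when the unstored-block total is non-zero.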
