vllm - ✅(Solved) Fix [Bug]: Negative prompt token counter crashes engine under CPU KV offloading + high concurrency [2 pull requests, 2 comments, 3 participants]

vllm · 2026-03-09 18:56:45

GitHub stats

vllm-project/vllm#36533•Fetched 2026-04-08 00:36:21

Timeline (top): referenced ×4, cross-referenced ×3, commented ×2, added_to_project_v2 ×1

Error Message

(APIServer pid=1327397) INFO 03-09 14:34:35 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 770.3 tokens/s, Running: 256 reqs, Waiting: 764 reqs, GPU KV cache usage: 97.5%, Prefix cache hit rate: 4.3%, External prefix cache hit rate: 0.0%
(APIServer pid=1327397) INFO 03-09 14:34:35 [metrics.py:103] KV Transfer metrics: CPU_to_GPU_total_bytes=810024960, CPU_to_GPU_total_time=0.19331251902878285, GPU_to_CPU_total_bytes=55050240, GPU_to_CPU_total_time=0.0010710399970412254
(APIServer pid=1327397) INFO:     127.0.0.1:58324 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1327397) INFO:     127.0.0.1:58324 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710] AsyncLLM output_handler failed.
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710] Traceback (most recent call last):
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 703, in output_handler
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710]     logger_manager.record(
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/metrics/loggers.py", line 1309, in record
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710]     logger.record(
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/metrics/loggers.py", line 1113, in record
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710]     self.counter_prompt_tokens_by_source[source][engine_idx].inc(
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710]   File "/usr/local/lib/python3.12/dist-packages/prometheus_client/metrics.py", line 339, in inc
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710]     raise ValueError('Counters can only be incremented by non-negative amounts.')
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710] ValueError: Counters can only be incremented by non-negative amounts.
(APIServer pid=1327397) ERROR 03-09 14:34:37 [serving.py:1386] Error in chat completion stream generator.
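The traceback ends inside a Prometheus counter's inc() call, which rejects negative increments because counters are monotonic. The sketch below reproduces that behavior with a simplified stand-in class. StandInCounter and try_inc are illustrative names, not part of prometheus_client or vLLM.

```python
class StandInCounter:
    """Simplified stand-in for a Prometheus counter (not the real class)."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        # Counters are monotonic, so a negative increment is rejected.
        if amount < 0:
            raise ValueError(
                "Counters can only be incremented by non-negative amounts."
            )
        self.value += amount


def try_inc(counter: StandInCounter, amount: float) -> str:
    """Return "ok" if the increment is accepted, otherwise the error text."""
    try:
        counter.inc(amount)
        return "ok"
    except ValueError as exc:
        return str(exc)


counter = StandInCounter()
print(try_inc(counter, 128))   # accepted
print(try_inc(counter, -579))  # rejected with the message seen in the traceback
```

In the reported crash nothing catches this exception on the metrics path, so it surfaces as "AsyncLLM output_handler failed."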

Fix Action

Fixed

Linked PR (closed, not merged): [WIP][Bugfix] Fix negative prompt token counter crash under KV offloading (https://github.com/vllm-project/vllm/pull/36638)
Linked PR (open, not merged): [Metrics] Temporary band-aid for "Counters can only be incremented by non-negative amounts" (https://github.com/vllm-project/vllm/pull/36812)

PR fix notes

PR #36638: [WIP][Bugfix] Fix negative prompt token counter crash under KV offloading

Repository: vllm-project/vllm
Author: haosdent
State: closed | merged: False
Link: https://github.com/vllm-project/vllm/pull/36638

Description (problem / solution / changelog)

Purpose

Fix engine crash (ValueError: Counters can only be incremented by non-negative amounts) that occurs under high concurrency with CPU KV offloading enabled (GitHub issue #36533).

Root cause: In _update_requests_with_invalid_blocks (scheduler.py), when KV cache blocks fail to load, num_affected_tokens — which includes both locally-cached and externally-loaded tokens — is subtracted entirely from request.num_external_computed_tokens, driving it negative. This negative value propagates through EngineCoreOutput → PromptTokenStats → PrometheusStatLogger.record() → Counter.inc() crash.

Secondary issue: request.num_cached_tokens is set once during initial scheduling and never reset when a request is freed for retry after KV failure or preemption. On reschedule, num_external_computed_tokens may be re-queried to a new value while num_cached_tokens stays stale, creating a mismatch that can also produce negative local_cache_hit.

Fix:

- Clamp num_external_computed_tokens to non-negative after subtraction in _update_requests_with_invalid_blocks
- Reset num_cached_tokens to -1 (sentinel) in preemption and KV failure retry paths so it is re-captured consistently on reschedule
- Add defensive guards in PromptTokenStats.update_from_output to clamp inputs and maintain invariants
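The first two points of the fix can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the PR: ToyRequest, handle_invalid_blocks and NOT_CAPTURED are hypothetical names, and only the two field names come from the description above.

```python
from dataclasses import dataclass

NOT_CAPTURED = -1  # sentinel meaning "capture again on the next schedule"


@dataclass
class ToyRequest:
    """Hypothetical stand-in for a scheduler request; not vLLM's Request."""

    num_external_computed_tokens: int = 0
    num_cached_tokens: int = NOT_CAPTURED


def handle_invalid_blocks(req: ToyRequest, num_affected_tokens: int) -> None:
    # num_affected_tokens may also cover locally cached tokens, so an
    # unclamped subtraction can drop below zero.
    remaining = req.num_external_computed_tokens - num_affected_tokens
    req.num_external_computed_tokens = max(0, remaining)
    # Drop the stale snapshot so it is captured again on reschedule.
    req.num_cached_tokens = NOT_CAPTURED


req = ToyRequest(num_external_computed_tokens=64, num_cached_tokens=96)
handle_invalid_blocks(req, num_affected_tokens=160)
print(req)
```

Without the max(0, ...) clamp, the example request would end with num_external_computed_tokens equal to -96, which is the kind of value that later reaches the counter.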

Test Plan

- Added test_prompt_token_stats_negative_external_clamped: verifies negative num_external_computed_tokens is clamped to 0
- Added test_prompt_token_stats_external_exceeds_cached: verifies num_external_computed_tokens > num_cached_tokens is clamped
- Added test_prompt_token_stats_negative_cached_clamped: verifies negative num_cached_tokens is clamped to 0
- Added test_prompt_token_stats_all_non_negative (parametrized, 6 cases): fuzz-style check that all PromptTokenStats fields remain non-negative across edge cases

Test Result

tests/v1/metrics/test_stats.py::test_prompt_token_stats_negative_external_clamped PASSED
tests/v1/metrics/test_stats.py::test_prompt_token_stats_external_exceeds_cached PASSED
tests/v1/metrics/test_stats.py::test_prompt_token_stats_negative_cached_clamped PASSED
tests/v1/metrics/test_stats.py::test_prompt_token_stats_all_non_negative[0-0-100] PASSED
tests/v1/metrics/test_stats.py::test_prompt_token_stats_all_non_negative[50--50-100] PASSED
tests/v1/metrics/test_stats.py::test_prompt_token_stats_all_non_negative[-10-0-100] PASSED
tests/v1/metrics/test_stats.py::test_prompt_token_stats_all_non_negative[50-100-100] PASSED
tests/v1/metrics/test_stats.py::test_prompt_token_stats_all_non_negative[-5--5-100] PASSED
tests/v1/metrics/test_stats.py::test_prompt_token_stats_all_non_negative[99-100-100] PASSED

All 19 tests in test_stats.py pass. All 3 tests in test_invalid_blocks_correctness.py pass.
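The six parametrized cases can be replayed against a small clamping function to see what "all fields remain non-negative" means in practice. This is a sketch, not the PR's test code. It assumes the parameter order in the test IDs is (num_cached_tokens, num_external_computed_tokens, prompt_len), which the page does not state, and the output keys other than local_cache_hit are hypothetical.

```python
def clamp_prompt_token_stats(
    num_cached: int, num_external: int, prompt_len: int
) -> dict[str, int]:
    """Split a prompt into token sources after clamping suspect inputs."""
    # Cached tokens must lie in [0, prompt_len].
    cached = min(max(num_cached, 0), prompt_len)
    # External tokens must lie in [0, cached].
    external = min(max(num_external, 0), cached)
    return {
        "computed": prompt_len - cached,
        "local_cache_hit": cached - external,
        "external_kv_transfer": external,
    }


# Values taken from the parametrized test IDs; the order is an assumption.
CASES = [
    (0, 0, 100),
    (50, -50, 100),
    (-10, 0, 100),
    (50, 100, 100),
    (-5, -5, 100),
    (99, 100, 100),
]

for case in CASES:
    stats = clamp_prompt_token_stats(*case)
    assert all(value >= 0 for value in stats.values()), (case, stats)
    assert sum(stats.values()) == case[2]
    print(case, stats)
```

Clamping in this order keeps the three parts summing to prompt_len, so none of them can be negative.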

Changed files

tests/v1/metrics/test_stats.py (modified, +81/-0)
vllm/v1/core/sched/scheduler.py (modified, +8/-1)
vllm/v1/metrics/stats.py (modified, +8/-0)

PR #36812: [Metrics] Temporary band-aid for "Counters can only be incremented by non-negative amounts"

Repository: vllm-project/vllm
Author: markmc
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/36812

Description (problem / solution / changelog)

Since num_computed_tokens, num_cached_tokens, and num_external_computed_tokens accounting seems quite brittle currently - with preemption reset bugs and P/D disaggregation accounting issues - add a defensive check to detect and prevent instances of Prometheus counter errors:

ValueError: Counters can only be incremented by non-negative amounts

The invariant check enforces:

prompt_len >= num_cached_tokens >= num_external_computed_tokens >= 0

with the additional nuance that when all tokens are cached, the scheduler forces recomputation of the last token, so the upper bound on external tokens is relaxed to:

num_external_computed_tokens <= num_cached_tokens + recomputed

When the invariant is violated, we log a warning once with diagnostic details, and discard suspect cache metrics.

Obviously, the accounting should be fixed and made more robust and future-proof, at which point we can remove this check (perhaps replacing with a simple assertion).

Related to issues #36533, #36755 and PRs #36638, #36752, #36757.

Changed files

tests/v1/metrics/test_stats.py (modified, +82/-6)
vllm/v1/metrics/stats.py (modified, +66/-2)

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

vLLM 0.16.0

</details>

🐛 Describe the bug

Run server with:

vllm serve nvidia/Llama-3.3-70B-Instruct-FP4 --host 0.0.0.0 --port 8888 --config /workspace/results/config.yaml --max-num-seqs 1024 --gpu-memory-utilization 0.9 --tensor-parallel-size 1 --attention-config.use_trtllm_attention=0 --kv_offloading_backend native --kv_offloading_size 100 --disable-hybrid-kv-cache-manager

Run a basic multi-turn benchmark client against the server with concurrency set to 2048 (and --max-num-seqs set accordingly).

Observe error:

(APIServer pid=1327397) INFO 03-09 14:34:35 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 770.3 tokens/s, Running: 256 reqs, Waiting: 764 reqs, GPU KV cache usage: 97.5%, Prefix cache hit rate: 4.3%, External prefix cache hit rate: 0.0%
(APIServer pid=1327397) INFO 03-09 14:34:35 [metrics.py:103] KV Transfer metrics: CPU_to_GPU_total_bytes=810024960, CPU_to_GPU_total_time=0.19331251902878285, GPU_to_CPU_total_bytes=55050240, GPU_to_CPU_total_time=0.0010710399970412254
(APIServer pid=1327397) INFO:     127.0.0.1:58324 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1327397) INFO:     127.0.0.1:58324 - "GET /metrics HTTP/1.1" 200 OK
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710] AsyncLLM output_handler failed.
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710] Traceback (most recent call last):
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/engine/async_llm.py", line 703, in output_handler
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710]     logger_manager.record(
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/metrics/loggers.py", line 1309, in record
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710]     logger.record(
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710]   File "/usr/local/lib/python3.12/dist-packages/vllm/v1/metrics/loggers.py", line 1113, in record
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710]     self.counter_prompt_tokens_by_source[source][engine_idx].inc(
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710]   File "/usr/local/lib/python3.12/dist-packages/prometheus_client/metrics.py", line 339, in inc
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710]     raise ValueError('Counters can only be incremented by non-negative amounts.')
(APIServer pid=1327397) ERROR 03-09 14:34:36 [async_llm.py:710] ValueError: Counters can only be incremented by non-negative amounts.
(APIServer pid=1327397) ERROR 03-09 14:34:37 [serving.py:1386] Error in chat completion stream generator.

Note that this is definitely related to the KV offload connector metrics as it only occurs when KV offloading (native) is enabled. Also, this only appears to occur when the offload cache is full (indicated by warnings as shown below):

(EngineCore_DP0 pid=1327677) WARNING 03-09 14:33:09 [offloading_connector.py:435] Request chatcmpl-95b358ced4243ecd-8fd7556a: cannot store 579 blocks
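To check whether the offload cache was full before a crash, the warning above can be pulled out of a server log with a regular expression. The pattern below is a sketch built only from the single sample line in this report, so other vLLM versions may word the message differently. parse_cannot_store is a hypothetical helper name.

```python
import re

# Pattern derived from the one warning line shown in the report.
CANNOT_STORE = re.compile(
    r"Request (?P<request_id>\S+): cannot store (?P<blocks>\d+) blocks"
)


def parse_cannot_store(line: str):
    """Return (request_id, block_count) for a matching line, else None."""
    match = CANNOT_STORE.search(line)
    if match is None:
        return None
    return match.group("request_id"), int(match.group("blocks"))


SAMPLE = (
    "(EngineCore_DP0 pid=1327677) WARNING 03-09 14:33:09 "
    "[offloading_connector.py:435] Request "
    "chatcmpl-95b358ced4243ecd-8fd7556a: cannot store 579 blocks"
)
print(parse_cannot_store(SAMPLE))
```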

extent analysis

Fix Plan

To resolve the issue, we need to address the ValueError raised when a Prometheus counter is incremented by a negative amount. In the report above, the error appears only when the native KV offloading backend is enabled and the offload cache is full. The PR notes above trace the negative value to token accounting in scheduler.py, so the steps below are a defensive mitigation rather than a root-cause fix.

Here are the steps to fix the issue:

1. Modify the offloading_connector.py file to handle the case when the offload cache is full, by adding a check before attempting to store blocks in the cache.
2. Update the loggers.py file so that a negative value can no longer crash metric recording.

Code Changes

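# Illustrative sketch only: method names, attributes and signatures below are
# hypothetical and do not mirror vLLM's actual code.
import logging

logger = logging.getLogger(__name__)

# offloading_connector.py
def store_blocks(self, blocks):
    if self.cache_full():
        # The offload cache has no room: skip the store instead of failing.
        logger.warning("Offload cache is full. Cannot store %d blocks.", len(blocks))
        return
    # Store blocks in the cache
    ...

def cache_full(self):
    # True once the (hypothetical) size counter reaches the configured limit.
    return self.cache_size >= self.max_cache_size

# loggers.py
def record(self, source, engine_idx, value):
    # Prometheus counters reject negative increments, so guard before inc().
    if value < 0:
        logger.warning("Dropping negative increment %d for source %s", value, source)
        value = 0
    self.counter_prompt_tokens_by_source[source][engine_idx].inc(value)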

Verification

To verify that the fix worked, run the server with the same configuration and benchmark client. The ValueError exception should no longer occur, and the server should handle the case when the offload cache is full without errors.
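One way to make that check repeatable is to scan the captured server log for the crash signature after the benchmark finishes. The helper below is a sketch that works on any iterable of log lines. count_crash_lines is a hypothetical name, and how the log is captured is left to the reader.

```python
from typing import Iterable

CRASH_SIGNATURE = "Counters can only be incremented by non-negative amounts"


def count_crash_lines(log_lines: Iterable[str]) -> int:
    """Count log lines that contain the counter crash message."""
    return sum(1 for line in log_lines if CRASH_SIGNATURE in line)


before_fix = [
    "INFO 03-09 14:34:35 [loggers.py:259] Engine 000: Running: 256 reqs",
    "ERROR 03-09 14:34:36 [async_llm.py:710] ValueError: Counters can only "
    "be incremented by non-negative amounts.",
]
after_fix = [
    "INFO 03-09 14:34:35 [loggers.py:259] Engine 000: Running: 256 reqs",
]

print(count_crash_lines(before_fix))
print(count_crash_lines(after_fix))
```

A count of zero over a full benchmark run, with "cannot store" warnings still present, suggests the offload-cache-full path no longer crashes the engine.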

Extra Tips

Make sure to update the offloading_connector.py and loggers.py files with the correct changes.
Test the changes thoroughly to ensure that the issue is resolved.
Consider adding additional logging and monitoring to detect similar issues in the future.
