[Bug]: OverflowError in mamba_utils.collect_mamba_copy_meta on XPU when device pointer ≥ 2^63 (hybrid models with align-mode prefix caching)

vllm · 2026-05-06 12:51:27

vllm-project/vllm#41817•Fetched 2026-05-07 03:32:46

Author

florianhoffmann305

Error Message

ERROR EngineCore encountered a fatal error.
Traceback (most recent call last):
  File "vllm/v1/engine/core.py", line 1129, in run_engine_core
    engine_core.run_busy_loop()
  File "vllm/v1/engine/core.py", line 1170, in run_busy_loop
    self._process_engine_step()
  File "vllm/v1/engine/core.py", line 1209, in _process_engine_step
    outputs, model_executed = self.step_fn()
  File "vllm/v1/engine/core.py", line 473, in step_with_batch_queue
    exec_future = self.model_executor.execute_model(...)
  ...
  File "vllm/v1/worker/gpu_model_runner.py", line 3956, in execute_model
    mamba_utils.preprocess_mamba(...)
  File "vllm/v1/worker/mamba_utils.py", line 207, in preprocess_mamba
    collect_mamba_copy_meta(...)
  File "vllm/v1/worker/mamba_utils.py", line 128, in collect_mamba_copy_meta
    src_ptrs_np[offset] = copy_spec.start_addr
OverflowError: Python int too large to convert to C long

Root Cause

vllm/v1/worker/mamba_utils.py lines 89-90:

src_ptrs=make_buffer(n, dtype=torch.int64),
dst_ptrs=make_buffer(n, dtype=torch.int64),

Lines 128-129:

src_ptrs_np[offset] = copy_spec.start_addr            # <-- crashes
dst_ptrs_np[offset] = state[dest_block_id].data_ptr()

copy_spec.start_addr is src_state.data_ptr(), an unsigned device address. The numpy view is signed. The Triton kernel batch_memcpy_kernel casts loaded integers to tl.pointer_type(tl.uint8) — a bitwise reinterpretation — so storing the pointer as a bit-identical int64 value is safe for kernel correctness. The problem is purely in the numpy-level assignment.
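
The bit-pattern argument above can be checked without a GPU. The following is a minimal sketch using only the standard library; the helper names and the example address are made up for illustration and are not part of vLLM or Triton.

```python
import struct


def as_signed64(addr: int) -> int:
    """Reinterpret an unsigned 64-bit value as signed, keeping the same 8 bytes."""
    return struct.unpack("<q", struct.pack("<Q", addr))[0]


def as_unsigned64(value: int) -> int:
    """Inverse of as_signed64: reinterpret a signed 64-bit value as unsigned."""
    return struct.unpack("<Q", struct.pack("<q", value))[0]


# Hypothetical device address with the most-significant bit set.
high_addr = 0xFFFF_8000_1234_5000
signed = as_signed64(high_addr)

# The signed reading is negative, but the underlying bytes are identical,
# so converting back recovers the original address.
print(signed < 0)
print(as_unsigned64(signed) == high_addr)
print(struct.pack("<q", signed) == struct.pack("<Q", high_addr))
```

This is why a kernel that only reinterprets the stored 64 bits as a pointer would not care which signedness the host-side array used.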


Your current environment

<details> <summary>Click to expand environment</summary>

vLLM version: 0.20.1rc1.dev105+g83fec0428
Platform:     XPU (Intel Arc Pro B70, BMG-G31 / Xe2 Battlemage, 32 GiB)
OS:           Ubuntu 25.10, kernel 6.x
Python:       3.12.13
PyTorch:      2.11.0+xpu
vllm-xpu-kernels: v0.1.7
triton-xpu:   3.7.0
oneAPI:       2025.x
Model:        Qwen3.6-35B-A3B (architecture: Qwen3_5MoeForConditionalGeneration)
              uniform W4A16 compressed-tensors quantization

</details>

🐛 Describe the bug

collect_mamba_copy_meta in vllm/v1/worker/mamba_utils.py deterministically crashes on XPU at the first multi-block mamba state copy during decode for any hybrid (attention + GDN/Mamba) model with prefix caching enabled in align mode.

The buffers src_ptrs and dst_ptrs are allocated as torch.int64 (signed). On XPU, device pointers returned by tensor.data_ptr() can have the most-significant bit set — i.e., values in the range [2^63, 2^64 − 1]. These are valid 64-bit pointers when interpreted as unsigned, but they overflow np.int64 when numpy tries to assign them via PyLong_AsLong():

File "vllm/v1/worker/mamba_utils.py", line 128, in collect_mamba_copy_meta
    src_ptrs_np[offset] = copy_spec.start_addr
OverflowError: Python int too large to convert to C long
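
The failure mode can be reproduced in isolation with NumPy alone. This is a sketch under stated assumptions: the array is a stand-in for the int64-backed pointer buffer, and both addresses are invented for illustration. The exact wording of the exception message may vary by NumPy version and platform, so the sketch only relies on the exception type.

```python
import numpy as np

# Stand-in for the int64-backed pointer buffer; not vLLM's actual buffer type.
ptrs = np.zeros(4, dtype=np.int64)

low_addr = 0x7F00_0000_1000        # below 2**47, fits in signed int64
high_addr = 0xFFFF_8000_0000_1000  # hypothetical address at or above 2**63

ptrs[0] = low_addr  # succeeds

try:
    ptrs[1] = high_addr  # out of range for np.int64
    overflowed = False
except OverflowError as exc:
    overflowed = True
    print(f"OverflowError: {exc}")
```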

Why this is XPU-specific

The bug is in platform-agnostic Python code, but only triggers on platforms whose memory allocator returns addresses in the upper half of the 64-bit address space:

CUDA on x86_64 Linux: virtual addresses are typically below 2^47, so they fit in signed int64 and the bug stays latent.
XPU on Intel Arc discrete GPUs: device-local allocations can land at addresses ≥ 2^63, which overflow signed int64.
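
To make the two address ranges concrete, here is a small standard-library sketch. The addresses are illustrative only (real values depend on the driver and allocator), and `fits_signed_int64` is a hypothetical helper, not a vLLM function.

```python
INT64_MAX = 2**63 - 1


def fits_signed_int64(addr: int) -> bool:
    """True if a non-negative address can be stored in a signed 64-bit integer."""
    return 0 <= addr <= INT64_MAX


# Illustrative addresses only.
cuda_like = 0x7F3A_1C00_0000      # below 2**47
xpu_like = 0xFFFF_8000_0020_0000  # at or above 2**63

for name, addr in [("cuda_like", cuda_like), ("xpu_like", xpu_like)]:
    print(f"{name}: bit_length={addr.bit_length()} "
          f"fits_int64={fits_signed_int64(addr)}")
```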

The dtype=torch.int64 allocation traces back to the original align-mode introduction in #30877, where buffers were initially CUDA tensors. The signed-int64 assumption has been present since day one and worked for CUDA; XPU exposes it.

Reproduction

Required configuration:

Any hybrid attention + Mamba/GDN model on XPU
--enable-prefix-caching
--mamba-cache-mode align (this is the default for Qwen3_5MoeForConditionalGeneration)

Trigger: First multi-block mamba state copy during decode. With the auto-computed block_size=2112 for Qwen3.6-35B-A3B on XPU, this fires deterministically once a request crosses 2112 cumulative tokens (prefill + generated).

Determinism: The XPU allocator returns the same high-MSB addresses on every run, so the crash reproduces at the same scheduler step on every invocation.

Scheduler dump at crash

num_computed_tokens=[2112]
num_output_tokens=[1312]
new_block_ids=[([5], [6], [7], [8])]
new_block_ids_to_zero=[8]
total_num_scheduled_tokens=1

The num_computed_tokens=2112 exactly matches the architecture's auto-computed block_size, confirming the crash fires precisely at the first block-boundary crossing.
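
The boundary arithmetic behind that observation can be sketched as follows. This assumes 0-based token positions and blocks that hold exactly block_size tokens; it is a reading of the scheduler dump, not vLLM's scheduler code, and `block_index` is a hypothetical helper.

```python
BLOCK_SIZE = 2112  # auto-computed block size reported in the issue


def block_index(token_position: int, block_size: int = BLOCK_SIZE) -> int:
    """Index of the block that holds the token at a 0-based position."""
    return token_position // block_size


num_computed_tokens = 2112  # from the scheduler dump
num_scheduled_tokens = 1    # total_num_scheduled_tokens in the dump

last_computed = num_computed_tokens - 1                      # position 2111
next_token = num_computed_tokens + num_scheduled_tokens - 1  # position 2112

# The scheduled token is the first one that falls into a new block.
crosses_boundary = block_index(next_token) != block_index(last_computed)
print(block_index(last_computed), block_index(next_token), crosses_boundary)
```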

Proposed fix

Reinterpret the numpy view as uint64 so that pointer values are stored by bit pattern rather than rejected as signed-int overflow. The underlying torch.int64 tensor is unchanged.

--- a/vllm/v1/worker/mamba_utils.py
+++ b/vllm/v1/worker/mamba_utils.py
@@ -1,6 +1,7 @@
 # SPDX-License-Identifier: Apache-2.0
 # SPDX-FileCopyrightText: Copyright contributors to the vLLM project
 import dataclasses
 import itertools
+import numpy as np
 from collections.abc import Callable
 from typing import Any
@@ -108,8 +109,10 @@ def collect_mamba_copy_meta(
     if src_block_idx == dest_block_idx and accept_token_bias == 0:
         return
 
-    src_ptrs_np = copy_bufs.src_ptrs.np
-    dst_ptrs_np = copy_bufs.dst_ptrs.np
+    # Use uint64 view so that device pointers with the MSB set (addr >= 2^63),
+    # which occur on XPU, are stored by bit pattern rather than rejected as
+    # signed-int64 overflow. Triton reads opaque integers and casts to
+    # tl.pointer_type, so signedness at the numpy layer doesn't affect the kernel.
+    src_ptrs_np = copy_bufs.src_ptrs.np.view(np.uint64)
+    dst_ptrs_np = copy_bufs.dst_ptrs.np.view(np.uint64)
     sizes_np = copy_bufs.sizes.np
     offset = copy_bufs.offset
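
The effect of the `.view(np.uint64)` change can be illustrated outside vLLM. In this sketch the int64 array stands in for the storage behind the pointer buffer, and the address is hypothetical; only NumPy calls are used.

```python
import numpy as np

# Stand-in for the int64 storage behind the pointer buffer; not vLLM's class.
storage = np.zeros(4, dtype=np.int64)

# Reinterpret the same memory as unsigned. No copy is made.
ptrs_u64 = storage.view(np.uint64)

high_addr = 0xFFFF_8000_0000_1000  # hypothetical address with the MSB set
ptrs_u64[0] = high_addr            # accepted: the value fits in uint64

print(np.shares_memory(storage, ptrs_u64))  # both arrays alias one buffer
print(int(storage[0]))                      # signed reading of the same bits
print(int(ptrs_u64[0]) == high_addr)        # unsigned reading is the address
```

Because the view aliases the original memory, whatever consumes the int64 storage afterwards sees the same 64 bits that were written through the unsigned view.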

Validation

Patch verified locally on Intel Arc Pro B70 with Qwen3.6-35B-A3B (uniform W4A16). After the fix:

Decode crosses the 2112-token block boundary cleanly, with no OverflowError.
Sustained throughput is ~63 t/s for decode and 5800–7600 t/s for prefill (similar to the baseline without prefix caching).
The server-side "Prefix cache hit rate" metric climbs from 0.0% on the first request to 26–44% on subsequent requests that share prefixes, confirming the cache is now operational on this code path.
The stack survives at least several thousand cumulative decode tokens and multiple sequential requests with shared long prefixes.

Risk assessment

| Concern | Assessment |
| --- | --- |
| CUDA path | `view(np.uint64)` is a no-op for values that already fit in signed int64 (CUDA addresses on x86_64 Linux are typically < 2^47). Behavior unchanged. |
| Other hybrid models (Jamba, NemotronH, Plamo2, Bailing-MoE, IBM Granite 4) | Same fix applies uniformly via `MambaCopyBuffers`; no model-specific risk. |
| Triton kernel correctness | Safe. `batch_memcpy_kernel` casts to `tl.pointer_type(tl.uint8)`, a bitwise reinterpretation; signedness is irrelevant at the kernel boundary. |
| `sizes_np` buffer (dtype int32) | Not affected. Element counts are small positive integers. |
| Future torch.uint64 migration | If `CpuGpuBuffer` ever switches to `torch.uint64`, the `.view()` call becomes a no-op and can be cleanly removed. |

Test coverage gap

The existing tests tests/v1/e2e/general/test_mamba_prefix_cache.py and tests/v1/worker/test_mamba_utils.py cover the align-mode copy path but run on CPU/CUDA where addresses fit in signed int64. A targeted unit test mocking data_ptr() to return a value ≥ 2^63 would catch this class of bug independent of hardware. Happy to contribute such a test alongside the fix if there's interest.
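
A hardware-independent test along those lines might look like the sketch below. It is not taken from the vLLM test suite: `record_ptr` is a hypothetical stand-in for the pointer store in collect_mamba_copy_meta, and the tensor is replaced with a `unittest.mock.Mock` whose `data_ptr()` returns an address with the top bit set.

```python
from unittest import mock

import numpy as np

HIGH_ADDR = (1 << 63) + 0x1000  # address with the most-significant bit set


def record_ptr(ptr_buf: np.ndarray, offset: int, tensor) -> None:
    """Hypothetical stand-in for the pointer store in collect_mamba_copy_meta."""
    ptr_buf.view(np.uint64)[offset] = tensor.data_ptr()


def test_high_address_is_stored_by_bit_pattern() -> None:
    fake_tensor = mock.Mock()
    fake_tensor.data_ptr.return_value = HIGH_ADDR

    ptr_buf = np.zeros(2, dtype=np.int64)
    record_ptr(ptr_buf, 0, fake_tensor)  # must not raise OverflowError

    # Unsigned reading is the address; signed reading is the same bits.
    assert int(ptr_buf.view(np.uint64)[0]) == HIGH_ADDR
    assert int(ptr_buf[0]) == HIGH_ADDR - (1 << 64)


test_high_address_is_stored_by_bit_pattern()
```

A real test would call the actual function with mocked tensors rather than a stand-in, but the assertion pattern would be the same.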

Related: mamba_block_size CLI flag silently ignored in align mode

While investigating I found that --mamba-block-size is silently swallowed when mamba_cache_mode == "align". In vllm/platforms/interface.py (around lines 596–613), the align branch unconditionally resets cache_config.mamba_block_size = cache_config.block_size regardless of user_specified_mamba_block_size. This is a separate UX issue — the flag is accepted on the CLI and stored in config but has no observable effect. Should I file a separate issue for that?
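
The behavior described above, and one possible way to surface it, can be sketched with a simplified stand-in. Nothing here is vLLM's actual code: `CacheConfigSketch` and `resolve_mamba_block_size` are hypothetical, the field names merely mirror those mentioned in the report, and the warning is a suggestion rather than existing behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheConfigSketch:
    """Hypothetical, simplified stand-in; not vLLM's CacheConfig."""
    block_size: int
    mamba_cache_mode: str = "align"
    mamba_block_size: Optional[int] = None
    user_specified_mamba_block_size: bool = False


def resolve_mamba_block_size(cfg: CacheConfigSketch) -> list[str]:
    """Apply the align-mode override and return any warnings it produced."""
    warnings: list[str] = []
    if cfg.mamba_cache_mode == "align":
        if (cfg.user_specified_mamba_block_size
                and cfg.mamba_block_size != cfg.block_size):
            # Suggested behavior: tell the user the flag has no effect.
            warnings.append(
                f"--mamba-block-size={cfg.mamba_block_size} is ignored in "
                f"align mode; using block_size={cfg.block_size}"
            )
        cfg.mamba_block_size = cfg.block_size  # the unconditional reset
    return warnings


cfg = CacheConfigSketch(block_size=2112, mamba_block_size=1024,
                        user_specified_mamba_block_size=True)
msgs = resolve_mamba_block_size(cfg)
print(cfg.mamba_block_size, msgs)
```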

Additional context

Investigation report and full root-cause analysis available on request.
Architectures known to be affected: any hybrid model using MambaCopyBuffers on XPU. Confirmed on Qwen3.6-35B-A3B (Qwen3_5MoeForConditionalGeneration); by architecture, Qwen3-Next, Jamba, NemotronH, Plamo2, Bailing-MoE, and Granite 4 should be affected as well.
XPU prefix caching for hybrid models has been a blocker for the Intel Arc Pro B70 production deployment of Qwen3.6 specifically; this fix unblocks it.
