vllm - [Bug]: extract_hidden_states CUDA graph padding writes PADDING_SLOT_ID=-1 into last hidden-state KV slot

Environment

  • vLLM commit: 284e6f543d462016fc80c055ccbf088832c63129
  • vLLM version: 0.1.dev1+g284e6f543
  • Python: 3.10.12
  • PyTorch: 2.11.0+cu130
  • CUDA runtime reported by PyTorch: 13.0
  • GPU: NVIDIA A100-SXM4-40GB, 40960 MiB
  • NVIDIA driver: 580.159.03
  • Machine used for repro: GCP a2-highgpu-1g Spot VM in asia-northeast1-c

Bug

ExtractHiddenStatesProposer pads slot mappings with PADDING_SLOT_ID = -1 for CUDA graph execution. The cache-only attention path then writes hidden states with:

kv_cache[slot_mapping // block_size, slot_mapping % block_size] = to_cache

That treats padded slot_mapping == -1 as a real negative index, so padding rows write to kv_cache[-1, block_size - 1]. In the repro below, CUDA graph padding takes a 64-token drafter call to 128 rows, and the 64 padded rows overwrite the real hidden state in physical slot 79.
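The index arithmetic behind that aliasing can be checked without a GPU. The sketch below uses plain Python integers and a nested list as a stand-in for the cache tensor, on the assumption (consistent with the behavior described above) that tensor `//` and `%` floor the same way Python's integer operators do. `toy_cache` is an illustrative name, not vLLM code.

```python
# Stand-in for a [num_blocks, block_size] cache; values are slot labels.
BLOCK_SIZE = 16
NUM_BLOCKS = 5
PADDING_SLOT_ID = -1  # sentinel used for padded rows, per the report

toy_cache = [[None] * BLOCK_SIZE for _ in range(NUM_BLOCKS)]

# Assumption: tensor // and % floor like Python ints for this input.
block_idx = PADDING_SLOT_ID // BLOCK_SIZE   # floor division gives -1
offset = PADDING_SLOT_ID % BLOCK_SIZE       # modulo gives 15

# A negative block index wraps to the last block, so the write lands in
# the final physical slot instead of being dropped.
toy_cache[block_idx][offset] = "padding"
last_physical_slot = NUM_BLOCKS * BLOCK_SIZE - 1

print(block_idx, offset, last_physical_slot)
print(toy_cache[NUM_BLOCKS - 1][BLOCK_SIZE - 1])
```

With five 16-slot blocks, the padded write therefore targets block 4, offset 15, which is physical slot 79.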

This may be related to #39247, but this report is specifically about deterministic hidden-state corruption rather than an illegal memory access.

Exact flags / shape

  • speculative_config.method = "extract_hidden_states"
  • num_speculative_tokens = 1
  • ExampleHiddenStatesConnector as KV producer
  • CUDA graph enabled: enforce_eager=False
  • compilation_config={"cudagraph_capture_sizes": [1, 2, 4, 8, 16, 32, 128], "max_cudagraph_capture_size": 128, "cudagraph_num_of_warmups": 0}
  • max_model_len=65, max_num_batched_tokens=128, max_num_seqs=2
  • block_size=16, num_gpu_blocks_override=5
  • prompts of lengths [64, 16]; the first prompt maps row 63 to sentinel physical slot 79
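The shape numbers in this list fit together as follows. This is a sketch of the arithmetic only, assuming CUDA graph padding rounds the token count up to the smallest captured size that can hold it. `padded_size` is a hypothetical helper, not a vLLM function.

```python
from bisect import bisect_left

CAPTURE_SIZES = [1, 2, 4, 8, 16, 32, 128]  # cudagraph_capture_sizes from the report
BLOCK_SIZE = 16
NUM_GPU_BLOCKS = 5


def padded_size(num_tokens: int, capture_sizes: list[int]) -> int:
    """Smallest captured size >= num_tokens (hypothetical helper)."""
    sizes = sorted(capture_sizes)
    idx = bisect_left(sizes, num_tokens)
    if idx == len(sizes):
        raise ValueError(f"{num_tokens} exceeds the largest capture size")
    return sizes[idx]


num_tokens = 64
# There is no 64 in the capture list, so the next size up is used.
num_input_tokens = padded_size(num_tokens, CAPTURE_SIZES)
padding_count = num_input_tokens - num_tokens

total_slots = NUM_GPU_BLOCKS * BLOCK_SIZE
sentinel_slot = total_slots - 1
sentinel_block, sentinel_offset = divmod(sentinel_slot, BLOCK_SIZE)

print(num_input_tokens, padding_count, sentinel_slot, sentinel_block, sentinel_offset)
```

Under that assumption a 64-token call is padded to 128 rows (64 of them padding), and the last physical slot of the 80-slot cache is slot 79, i.e. block 4, offset 15.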

Instrumented evidence

Current code, CUDA graph path:

[EHS_REPRO] ExtractHiddenStatesProposer._get_slot_mapping num_actual=128 num_input_tokens=128 padding_count=64 layer_count=1
[EHS_REPRO] basic_cache before seq_len=128 kv_blocks=5 block_size=16 contains_padding=True num_padding=64 sentinel_slot=79 real_rows_targeting_sentinel=[63] padding_rows_targeting_sentinel=[64, ..., 127] sentinel_norm_before=0.000000 sentinel_sum_before=0.000000
[EHS_REPRO] basic_cache padded_row row=127 norm=0.000000 sum=0.000000 mean=0.000000
[EHS_REPRO] basic_cache real_sentinel_row row=63 norm=181.725067 sum=4352.000000 mean=5.666667
[EHS_REPRO] basic_cache after sentinel_norm_after=0.000000 sentinel_sum_after=0.000000 sentinel_mean_after=0.000000
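The `real_sentinel_row` statistics are what the predictable model should produce. In the repro script each auxiliary layer fills its slice with its own layer index, so a cached row holds 256 copies each of 5, 2 and 10. The check below recomputes the logged values in plain Python; it assumes the cached row is the concatenation of the three layer slices.

```python
import math

LAYER_IDS = [5, 2, 10]
HIDDEN_SIZE = 256

# One cached row: HIDDEN_SIZE copies of each layer's fill value
# (assumed layout; only the multiset of values matters for these stats).
row = [float(layer_id) for layer_id in LAYER_IDS for _ in range(HIDDEN_SIZE)]

row_sum = sum(row)
row_mean = row_sum / len(row)
row_norm = math.sqrt(sum(v * v for v in row))

print(f"norm={row_norm:.6f} sum={row_sum:.6f} mean={row_mean:.6f}")
```

The recomputed norm, sum and mean agree with the `row=63` trace line, while the all-zero `sentinel_*_after` values show that this row was replaced by a padded row.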

Saved hidden-state comparison:

eager:         max_abs_diff=0.0,  mismatch_rows=[]
graph-current: max_abs_diff=10.0, mismatch_rows=[63], last_row_mean=0.0, last_row_sum=0.0
graph-masked:  max_abs_diff=0.0,  mismatch_rows=[]
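The `graph-current` line follows directly from a zeroed row. Below is a minimal sketch of the comparison, using nested lists in place of the saved tensors and collapsing each layer slice to a single value for brevity. `compare` is an illustrative helper, not part of the repro script.

```python
LAYER_IDS = [5, 2, 10]
PROMPT_LEN = 64


def compare(saved: list[list[float]], expected: list[list[float]]):
    """Return (max_abs_diff, mismatch_rows) over per-row layer values."""
    max_abs_diff = 0.0
    mismatch_rows = []
    for row_idx, (got, want) in enumerate(zip(saved, expected)):
        row_diff = max(abs(g - w) for g, w in zip(got, want))
        max_abs_diff = max(max_abs_diff, row_diff)
        if row_diff > 0:
            mismatch_rows.append(row_idx)
    return max_abs_diff, mismatch_rows


expected = [[float(v) for v in LAYER_IDS] for _ in range(PROMPT_LEN)]

# Simulate the corrupted save: the last prompt row comes back as zeros.
corrupted = [list(row) for row in expected]
corrupted[-1] = [0.0] * len(LAYER_IDS)

clean_result = compare(expected, expected)
corrupted_result = compare(corrupted, expected)
print(clean_result)
print(corrupted_result)
```

The largest expected fill value is 10 (layer 10), so a row of zeros yields `max_abs_diff=10.0` with row 63 as the only mismatch.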

With the local mask patch enabled, the same padded -1 rows are present, but the sentinel slot keeps the real row's values:

[EHS_REPRO] basic_cache after sentinel_norm_after=181.725067 sentinel_sum_after=4352.000000 sentinel_mean_after=5.666667

Repro script

I ran this from the vLLM checkout after applying the optional instrumentation/mask patch. On unpatched vLLM, compare --case eager and --case graph-current; --case graph-masked requires the local diff below.

Run command:

EHS_REPRO_TRACE=1 python repro/ehs_padding_repro.py --run-all --work-dir /tmp/ehs-padding-repro
<details> <summary>ehs_padding_repro.py</summary>
#!/usr/bin/env python3
"""Focused repro for extract_hidden_states CUDA graph padding.

Run from a vLLM checkout after applying the instrumentation patch:

    EHS_REPRO_TRACE=1 python repro/ehs_padding_repro.py --run-all

The graph case constrains the hidden-state cache to five 16-token blocks and
uses two prompts with 64 and 16 tokens. The two prefills fill all 80 physical
slots, then CUDA graph padding raises the drafter input size to 128. Every
padded -1 slot mapping aliases kv_cache[-1, 15], which is also the real slot
for the first request's last prompt row (row 63).
"""

from __future__ import annotations

import argparse
import gc
import json
import os
import subprocess
import sys
import tempfile
from collections.abc import Iterable
from pathlib import Path
from typing import Any

import torch
import torch.nn as nn
from safetensors import safe_open
from transformers import LlamaConfig
from vllm.model_executor.models.interfaces import EagleModelMixin


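LAYER_IDS = [5, 2, 10]  # aux hidden-state layers; each is filled with its own index
HIDDEN_SIZE = 256
PROMPT_LENGTHS = [64, 16]
BLOCK_SIZE = 16
NUM_GPU_BLOCKS = 5  # 5 * 16 = 80 physical slots, so the last slot is 79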


class PredictableLlamaModel(nn.Module, EagleModelMixin):
    def __init__(self, *, vllm_config, prefix: str = ""):
        super().__init__()
        self.config = vllm_config.model_config.hf_config

        from vllm.model_executor.layers.vocab_parallel_embedding import (
            VocabParallelEmbedding,
        )
        from vllm.model_executor.models.utils import (
            make_empty_intermediate_tensors_factory,
        )

        self.embed_tokens = VocabParallelEmbedding(
            self.config.vocab_size,
            self.config.hidden_size,
        )
        self.make_empty_intermediate_tensors = make_empty_intermediate_tensors_factory(
            ["hidden_states", "residual"], self.config.hidden_size
        )

    def embed_input_ids(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.embed_tokens(input_ids)

    def forward(
        self,
        input_ids: torch.Tensor | None,
        positions: torch.Tensor,
        intermediate_tensors: Any | None,
        inputs_embeds: torch.Tensor | None = None,
        **extra_layer_kwargs: Any,
    ) -> torch.Tensor | tuple[torch.Tensor, list[torch.Tensor]]:
        if inputs_embeds is not None:
            seq_len = inputs_embeds.shape[0]
            device = inputs_embeds.device
        elif input_ids is not None:
            seq_len = input_ids.shape[0] if input_ids.ndim == 1 else input_ids.shape[-1]
            device = input_ids.device
        else:
            raise ValueError("Either input_ids or inputs_embeds must be provided")

        dtype = torch.bfloat16
        hidden_states = torch.full(
            (seq_len, self.config.hidden_size),
            fill_value=float(self.config.num_hidden_layers),
            device=device,
            dtype=dtype,
        )
        if len(self.aux_hidden_state_layers) > 0:
            aux_hidden_states = [
                torch.full(
                    (seq_len, self.config.hidden_size),
                    fill_value=float(layer_idx),
                    device=device,
                    dtype=dtype,
                )
                for layer_idx in self.aux_hidden_state_layers
            ]
            return hidden_states, aux_hidden_states
        return hidden_states

    def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
        return set()


def _register_predictable_model() -> None:
    from vllm import ModelRegistry
    from vllm.model_executor.models.llama import LlamaForCausalLM

    if not issubclass(PredictableLlamaModel, EagleModelMixin):
        raise RuntimeError("PredictableLlamaModel must implement EagleModelMixin")

    class PredictableLlamaForCausalLM(LlamaForCausalLM):
        def _init_model(
            self,
            vllm_config,
            prefix: str = "",
            layer_type: type[nn.Module] | None = None,
        ):
            return PredictableLlamaModel(vllm_config=vllm_config, prefix=prefix)

        def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]) -> set[str]:
            return set()

    if "PredictableLlamaForCausalLM" not in ModelRegistry.get_supported_archs():
        ModelRegistry.register_model(
            "PredictableLlamaForCausalLM", PredictableLlamaForCausalLM
        )


def _write_model_config(model_dir: Path) -> None:
    model_dir.mkdir(parents=True, exist_ok=True)
    config = LlamaConfig(
        vocab_size=1000,
        hidden_size=HIDDEN_SIZE,
        intermediate_size=512,
        num_hidden_layers=24,
        num_attention_heads=4,
        num_key_value_heads=4,
        max_position_embeddings=128,
        architectures=["PredictableLlamaForCausalLM"],
        torch_dtype="bfloat16",
    )
    config.save_pretrained(model_dir)


def _load_hidden_states(path: str) -> tuple[torch.Tensor, torch.Tensor]:
    with safe_open(path, "pt") as f:
        token_ids = f.get_tensor("token_ids")
        hidden_states = f.get_tensor("hidden_states")
    return token_ids, hidden_states


def _case_config(case: str) -> tuple[bool, bool]:
    """Return (enforce_eager, mask_padding) for the given case name."""
    if case == "eager":
        return True, False
    if case == "graph-current":
        return False, False
    if case == "graph-masked":
        return False, True
    raise ValueError(f"unknown case {case!r}")


def run_case(case: str, work_dir: Path) -> dict[str, Any]:
    enforce_eager, mask_padding = _case_config(case)
    os.environ.setdefault("VLLM_USE_V1", "1")
    os.environ.setdefault("VLLM_WORKER_MULTIPROC_METHOD", "fork")
    os.environ["EHS_REPRO_MASK_PADDING"] = "1" if mask_padding else "0"

    _register_predictable_model()

    from vllm import LLM, SamplingParams

    model_dir = work_dir / "predictable_llama"
    storage_dir = work_dir / f"hidden_states_{case}"
    _write_model_config(model_dir)
    storage_dir.mkdir(parents=True, exist_ok=True)

    compilation_config = None
    if not enforce_eager:
        compilation_config = {
            "cudagraph_capture_sizes": [1, 2, 4, 8, 16, 32, 128],
            "max_cudagraph_capture_size": 128,
            "cudagraph_num_of_warmups": 0,
        }

    llm = LLM(
        model=str(model_dir),
        skip_tokenizer_init=True,
        speculative_config={
            "method": "extract_hidden_states",
            "num_speculative_tokens": 1,
            "draft_model_config": {
                "hf_config": {"eagle_aux_hidden_state_layer_ids": LAYER_IDS}
            },
        },
        kv_transfer_config={
            "kv_connector": "ExampleHiddenStatesConnector",
            "kv_role": "kv_producer",
            "kv_connector_extra_config": {
                "shared_storage_path": str(storage_dir),
                "use_synchronization_lock": False,
            },
        },
        max_model_len=65,
        max_num_batched_tokens=128,
        max_num_seqs=2,
        block_size=BLOCK_SIZE,
        num_gpu_blocks_override=NUM_GPU_BLOCKS,
        enforce_eager=enforce_eager,
        enable_chunked_prefill=False,
        trust_remote_code=True,
        load_format="dummy",
        dtype="bfloat16",
        gpu_memory_utilization=0.25,
        compilation_config=compilation_config,
    )

    prompt_token_ids_by_request = [
        list(range(10, 10 + PROMPT_LENGTHS[0])),
        list(range(200, 200 + PROMPT_LENGTHS[1])),
    ]
    sampling_params = SamplingParams(
        max_tokens=1,
        temperature=0.0,
        detokenize=False,
    )
    outputs = llm.generate(
        [{"prompt_token_ids": ids} for ids in prompt_token_ids_by_request],
        sampling_params,
    )

    request_results = []
    max_abs_diff = 0.0
    mismatch_requests = []
    for req_idx, (output, prompt_token_ids) in enumerate(
        zip(outputs, prompt_token_ids_by_request)
    ):
        hidden_states_path = output.kv_transfer_params["hidden_states_path"]
        token_ids, hidden_states = _load_hidden_states(hidden_states_path)

        prompt_len = len(prompt_token_ids)
        expected = torch.empty(
            (prompt_len, len(LAYER_IDS), HIDDEN_SIZE),
            dtype=hidden_states.dtype,
        )
        for idx, layer_id in enumerate(LAYER_IDS):
            expected[:, idx, :] = float(layer_id)
        diff = (hidden_states - expected).float().abs()
        req_max_abs_diff = float(diff.max().item())
        max_abs_diff = max(max_abs_diff, req_max_abs_diff)
        mismatch_rows = torch.nonzero(diff.amax(dim=(1, 2)) > 0).flatten().tolist()
        if mismatch_rows:
            mismatch_requests.append(req_idx)
        last_row = hidden_states[-1].float()
        request_results.append(
            {
                "request_index": req_idx,
                "prompt_len": prompt_len,
                "token_ids_match": torch.equal(
                    token_ids, torch.tensor(prompt_token_ids)
                ),
                "hidden_states_shape": list(hidden_states.shape),
                "max_abs_diff": req_max_abs_diff,
                "mismatch_rows": mismatch_rows,
                "last_row_sum": float(last_row.sum().item()),
                "last_row_mean": float(last_row.mean().item()),
                "hidden_states_path": hidden_states_path,
            }
        )

    result = {
        "case": case,
        "enforce_eager": enforce_eager,
        "mask_padding": mask_padding,
        "prompt_lengths": PROMPT_LENGTHS,
        "block_size": BLOCK_SIZE,
        "num_gpu_blocks_override": NUM_GPU_BLOCKS,
        "total_prompt_tokens": sum(PROMPT_LENGTHS),
        "max_abs_diff": max_abs_diff,
        "mismatch_requests": mismatch_requests,
        "requests": request_results,
        "torch": torch.__version__,
        "cuda_runtime": torch.version.cuda,
        "device": torch.cuda.get_device_name(0) if torch.cuda.is_available() else None,
    }

    del llm
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    return result


def run_all(work_dir: Path) -> int:
    cases = ["eager", "graph-current", "graph-masked"]
    all_results = []
    for case in cases:
        env = os.environ.copy()
        env.setdefault("EHS_REPRO_TRACE", "1")
        cmd = [
            sys.executable,
            str(Path(__file__).resolve()),
            "--case",
            case,
            "--work-dir",
            str(work_dir),
        ]
        print(f"[EHS_REPRO] running {' '.join(cmd)}", flush=True)
        proc = subprocess.run(cmd, env=env, text=True, capture_output=True)
        print(proc.stdout, end="")
        print(proc.stderr, end="", file=sys.stderr)
        if proc.returncode != 0:
            return proc.returncode
        marker = "[EHS_REPRO_RESULT] "
        result_lines = [line for line in proc.stdout.splitlines() if line.startswith(marker)]
        if not result_lines:
            print(f"missing result line for {case}", file=sys.stderr)
            return 2
        all_results.append(json.loads(result_lines[-1][len(marker) :]))

    print("[EHS_REPRO_ALL_RESULTS] " + json.dumps(all_results, sort_keys=True))
    return 0


def main() -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--case", choices=["eager", "graph-current", "graph-masked"])
    parser.add_argument("--run-all", action="store_true")
    parser.add_argument("--work-dir", type=Path)
    args = parser.parse_args()

    if args.work_dir is None:
        args.work_dir = Path(tempfile.mkdtemp(prefix="ehs-padding-repro-"))
    args.work_dir.mkdir(parents=True, exist_ok=True)

    if args.run_all:
        return run_all(args.work_dir)
    if args.case is None:
        parser.error("use --run-all or --case")

    result = run_case(args.case, args.work_dir)
    print("[EHS_REPRO_RESULT] " + json.dumps(result, sort_keys=True), flush=True)
    return 0


if __name__ == "__main__":
    raise SystemExit(main())
</details>

Minimal diff

diff --git a/vllm/model_executor/models/extract_hidden_states.py b/vllm/model_executor/models/extract_hidden_states.py
index 000000000..000000000 100644
--- a/vllm/model_executor/models/extract_hidden_states.py
+++ b/vllm/model_executor/models/extract_hidden_states.py
@@
 def basic_cache(
     to_cache: torch.Tensor,  # shape: [seq_len, num_heads, head_size]
     kv_cache: torch.Tensor,  # shape: [num_blocks, block_size, num_heads, head_size]
     slot_mapping: torch.Tensor,  # shape: [seq_len]
 ):
     block_size = kv_cache.shape[1]
-    kv_cache[slot_mapping // block_size, slot_mapping % block_size] = to_cache
+    valid = slot_mapping >= 0
+    kv_cache[
+        slot_mapping[valid] // block_size,
+        slot_mapping[valid] % block_size,
+    ] = to_cache[valid]
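
To see why the mask is sufficient, the following sketch replays the scenario on a small nested list. It is a pure-Python model, not the vLLM code path: the slot layout (real rows 0..63 mapped to slots 16..79) is an assumption chosen so that row 63 lands on slot 79 as in the trace, and the sequential loop makes the last write win, whereas the ordering of duplicate-index writes on a tensor is not something this model captures.

```python
BLOCK_SIZE = 16
NUM_BLOCKS = 5
PADDING_SLOT_ID = -1


def basic_cache_model(to_cache, kv_cache, slot_mapping, mask_padding):
    """Write rows into a [num_blocks][block_size] nested list."""
    for row, slot in zip(to_cache, slot_mapping):
        if mask_padding and slot < 0:
            continue  # masked variant: skip padded rows
        # Unmasked: -1 // 16 == -1 and -1 % 16 == 15, i.e. the last slot.
        kv_cache[slot // BLOCK_SIZE][slot % BLOCK_SIZE] = row


def run(mask_padding):
    kv_cache = [[None] * BLOCK_SIZE for _ in range(NUM_BLOCKS)]
    real_rows = [f"real-{i}" for i in range(64)]
    padded_rows = ["zeros"] * 64
    # Assumed layout: real row i -> slot 16 + i, so row 63 -> slot 79.
    slot_mapping = [16 + i for i in range(64)] + [PADDING_SLOT_ID] * 64
    basic_cache_model(real_rows + padded_rows, kv_cache, slot_mapping, mask_padding)
    return kv_cache[-1][-1]  # contents of physical slot 79


unmasked_result = run(mask_padding=False)
masked_result = run(mask_padding=True)
print("unmasked:", unmasked_result)
print("masked:  ", masked_result)
```

The model skips padded rows rather than redirecting them, mirroring the `valid` mask in the diff: without the mask slot 79 ends up holding a padded row, and with it the slot keeps row 63.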

Expected behavior

Padded slot mappings should be ignored by the hidden-state cache update. CUDA graph padding should not mutate any real KV-cache slot or alter saved connector hidden states.

Actual behavior

Padded -1 slot mappings write to the real last slot of the cache. In this repro, the connector saves row 63 of the first prompt as zeros under CUDA graph execution, while eager execution and the masked local patch both save the expected hidden states.
