vllm - ✅(Solved) Fix [Feature]: Built-in debug tensor dump for intermediate activations [1 pull requests, 1 participants]

GitHub stats
vllm-project/vllm#36502 (fetched 2026-04-08 00:36:29)
Comments: 0 · Participants: 1 · Timeline events: 2 · Reactions: 0
Timeline (top): cross-referenced ×1, labeled ×1

Fix Action

Fix / Workaround

When debugging model accuracy issues (e.g. comparing outputs across different frameworks like SGLang vs vLLM, or before/after quantization), it is extremely useful to dump intermediate activations from every layer. Currently there is no built-in way to do this in vLLM — users have to manually patch gpu_model_runner.py to inject forward hooks, which is fragile and breaks across versions.

  • SGLang's tensor_dump_forward_hook.py provides similar functionality.
  • This was previously done via external monkey-patching scripts, which break across vLLM versions.

PR fix notes

PR #36576: [Feature] Add debug tensor dump for intermediate activations

Description (problem / solution / changelog)

Summary

  • Add built-in support for dumping intermediate activations during inference, gated by environment variables
  • When VLLM_DEBUG_TENSOR_DUMP_OUTPUT_FOLDER is set, forward hooks are registered on every leaf module to capture activations into .pt files
  • torch.compile and CUDAGraph are automatically disabled when enabled; zero overhead when disabled

Closes #36502

Changes

  • vllm/debug/tensor_dump.py: TensorDumper class and maybe_setup_tensor_dump() entry point
  • vllm/envs.py: Add VLLM_DEBUG_TENSOR_DUMP_OUTPUT_FOLDER, VLLM_DEBUG_TENSOR_DUMP_LAYERS, VLLM_DEBUG_TENSOR_DUMP_SKIP_PASSES
  • vllm/v1/worker/gpu_model_runner.py: Call maybe_setup_tensor_dump() in load_model()
  • tests/debug/test_tensor_dump.py: Unit tests (13 cases)
  • docs/features/tensor_dump.md: Usage documentation

Test plan

  • pre-commit run --files all passed
  • pytest tests/debug/test_tensor_dump.py -v 13/13 passed on GPU server
  • End-to-end: VLLM_DEBUG_TENSOR_DUMP_OUTPUT_FOLDER=./dump python -m vllm.entrypoints.openai.api_server --model TinyLlama/TinyLlama-1.1B-Chat-v1.0

Related

  • Inspired by SGLang's tensor_dump_forward_hook.py
  • Useful for cross-framework activation comparison and post-quantization debugging
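
For the cross-framework comparison use case, a minimal sketch of how two dumps might be diffed is shown here. It assumes each dump has already been loaded into a dict mapping module names to CPU tensors, as the output format describes. The function name compare_activations is hypothetical, and module names from different frameworks will usually need to be mapped onto each other before a comparison like this is meaningful.

```python
import torch


def compare_activations(reference, candidate):
    """Return {module name: max absolute difference} for comparable entries.

    Only names present in both dicts with identical, non-empty shapes are
    compared; everything else is skipped silently.
    """
    report = {}
    for name in sorted(reference.keys() & candidate.keys()):
        a = reference[name].float()
        b = candidate[name].float()
        if a.shape != b.shape or a.numel() == 0:
            continue
        report[name] = (a - b).abs().max().item()
    return report


# Tiny hand-made example standing in for two loaded Pass*.pt files.
ref = {"x": torch.zeros(2), "y": torch.ones(2)}
cand = {"x": torch.tensor([0.0, 0.5]), "z": torch.ones(1)}
report = compare_activations(ref, cand)
print(report)
```

Casting to float32 before subtracting avoids comparing tensors of different dtypes directly, which matters for the before/after quantization case.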

Changed files

  • docs/features/tensor_dump.md (added, +66/-0)
  • tests/debug/__init__.py (added, +2/-0)
  • tests/debug/test_tensor_dump.py (added, +154/-0)
  • vllm/debug/__init__.py (added, +2/-0)
  • vllm/debug/tensor_dump.py (added, +330/-0)
  • vllm/envs.py (modified, +28/-0)
  • vllm/v1/worker/gpu_model_runner.py (modified, +5/-0)

Code Example

VLLM_DEBUG_TENSOR_DUMP_OUTPUT_FOLDER=./dump \
      python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B

---

dump/
    TP0_PP0_Rank0_pid12345/
      Pass00000.pt   # dict mapping module names -> cpu tensors
      Pass00001.pt

🚀 The feature, motivation and pitch

Motivation

When debugging model accuracy issues (e.g. comparing outputs across different frameworks like SGLang vs vLLM, or before/after quantization), it is extremely useful to dump intermediate activations from every layer. Currently there is no built-in way to do this in vLLM — users have to manually patch gpu_model_runner.py to inject forward hooks, which is fragile and breaks across versions.

SGLang has a similar feature (tensor_dump_forward_hook.py) that has proven very useful in practice. I'd like to propose adding native support in vLLM.

Proposed Design

Environment Variables (opt-in, zero overhead when disabled)

| Variable | Type | Default | Description |
| --- | --- | --- | --- |
| VLLM_DEBUG_TENSOR_DUMP_OUTPUT_FOLDER | str | (unset) | Output directory. Feature is disabled when unset. |
| VLLM_DEBUG_TENSOR_DUMP_LAYERS | str | (unset) | Comma-separated layer indices to dump (e.g. "0,1,31"). All layers when unset. |
| VLLM_DEBUG_TENSOR_DUMP_SKIP_PASSES | int | 0 | Number of initial forward passes to skip (useful for skipping warmup). |
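
As an illustration of how these three variables could be parsed, the sketch below reads them from a mapping and returns None when the feature is disabled. The TensorDumpSettings class and read_tensor_dump_settings function are hypothetical names, not part of vLLM, and treating an empty string as "unset" is an assumption of this sketch.

```python
import os
from dataclasses import dataclass
from typing import FrozenSet, Mapping, Optional


@dataclass(frozen=True)
class TensorDumpSettings:  # hypothetical container, not a vLLM class
    output_folder: str
    layers: Optional[FrozenSet[int]]  # None means "dump every layer"
    skip_passes: int


def read_tensor_dump_settings(
    env: Mapping[str, str] = os.environ,
) -> Optional[TensorDumpSettings]:
    """Return the parsed settings, or None when the feature is disabled."""
    folder = env.get("VLLM_DEBUG_TENSOR_DUMP_OUTPUT_FOLDER")
    if not folder:
        return None  # unset (or empty): feature stays off
    raw_layers = env.get("VLLM_DEBUG_TENSOR_DUMP_LAYERS")
    layers = None
    if raw_layers:
        # "0,1,31" -> frozenset({0, 1, 31}); blank entries are ignored.
        layers = frozenset(
            int(part) for part in raw_layers.split(",") if part.strip()
        )
    skip = int(env.get("VLLM_DEBUG_TENSOR_DUMP_SKIP_PASSES", "0"))
    return TensorDumpSettings(folder, layers, skip)


settings = read_tensor_dump_settings(
    {
        "VLLM_DEBUG_TENSOR_DUMP_OUTPUT_FOLDER": "./dump",
        "VLLM_DEBUG_TENSOR_DUMP_LAYERS": "0,1,31",
    }
)
print(settings)
```

Passing the environment in as a mapping keeps the parser easy to test without touching the real process environment.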

Usage

VLLM_DEBUG_TENSOR_DUMP_OUTPUT_FOLDER=./dump \
    python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B

Output structure:

  dump/
    TP0_PP0_Rank0_pid12345/
      Pass00000.pt   # dict mapping module names -> cpu tensors
      Pass00001.pt

Implementation Overview

  • New module vllm/debug/tensor_dump.py with a TensorDumper class that registers forward hooks (via register_forward_hook) on leaf modules.
  • A single call site in gpu_model_runner.py → load_model(), gated by the env var check.
  • When enabled, torch.compile and CUDAGraph are automatically disabled (they are incompatible with .cpu() / torch.save inside hooks).
  • The custom __call__ injected by @support_torch_compile is removed so that nn.Module.__call__() properly fires hooks.
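
The leaf-module hook idea can be sketched in plain PyTorch as follows. This is a standalone illustration on a toy nn.Sequential, not the TensorDumper from the PR; register_leaf_hooks is a hypothetical helper name, and only plain tensor outputs are captured.

```python
import torch
from torch import nn


def register_leaf_hooks(model: nn.Module, captured: dict) -> list:
    """Attach a forward hook to every leaf module and return the handles."""
    handles = []
    for name, module in model.named_modules():
        if next(module.children(), None) is not None:
            continue  # container module, not a leaf

        def hook(mod, inputs, output, name=name):
            # Copy tensor outputs to CPU, keyed by the module's qualified name.
            if isinstance(output, torch.Tensor):
                captured[name] = output.detach().cpu()

        handles.append(module.register_forward_hook(hook))
    return handles


model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
captured = {}
handles = register_leaf_hooks(model, captured)
with torch.no_grad():
    model(torch.randn(3, 4))
for handle in handles:
    handle.remove()  # detach the hooks once the debug run is over
print({name: tuple(t.shape) for name, t in captured.items()})
```

Forward hooks only fire when a module is invoked through nn.Module.__call__, which is why a replaced __call__ has to be restored before hooks like these can observe anything.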

Key Considerations

  • Zero cost when disabled: the only added code is a single env var check during load_model(), which runs at model load time rather than on every forward pass.
  • Performance when enabled: tensors are copied to CPU inside hooks, so this is strictly a debug tool, not for production.
  • Multi-GPU: output is organized by TP/PP rank and PID.
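
The per-rank layout can be illustrated with two small path helpers that reproduce the naming shown in the output structure (TP0_PP0_Rank0_pid12345/Pass00000.pt). The helper names are hypothetical, and how the TP, PP, and global ranks are obtained inside vLLM is left to the caller here.

```python
import os
from typing import Optional


def dump_dir(root: str, tp_rank: int, pp_rank: int, rank: int,
             pid: Optional[int] = None) -> str:
    """Directory for one worker process, e.g. dump/TP0_PP0_Rank0_pid12345."""
    pid = os.getpid() if pid is None else pid
    return os.path.join(root, f"TP{tp_rank}_PP{pp_rank}_Rank{rank}_pid{pid}")


def pass_file(directory: str, pass_index: int) -> str:
    """File for one forward pass, zero-padded to five digits."""
    return os.path.join(directory, f"Pass{pass_index:05d}.pt")


worker_dir = dump_dir("dump", tp_rank=0, pp_rank=0, rank=0, pid=12345)
print(pass_file(worker_dir, 1))
```

Including the PID in the directory name keeps repeated runs with the same rank layout from overwriting each other.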

Draft Implementation

I have a working implementation ready. Happy to open a PR once the design direction is confirmed.

Related Work

  • SGLang's tensor_dump_forward_hook.py provides similar functionality.
  • This was previously done via external monkey-patching scripts, which break across vLLM versions.


extent analysis

Fix Plan

To implement the proposed design, follow these steps:

  • Create a new module vllm/debug/tensor_dump.py with a TensorDumper class that registers register_forward_hook on leaf modules.
  • In gpu_model_runner.py, add a call site in load_model() to instantiate TensorDumper when the environment variable VLLM_DEBUG_TENSOR_DUMP_OUTPUT_FOLDER is set.
  • Disable torch.compile and CUDAGraph when the debug feature is enabled.

Example code:

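# vllm/debug/tensor_dump.py (illustrative sketch, not the code from PR #36576)
import os
import re

import torch


class TensorDumper:
    def __init__(self, output_folder, layers=None, skip_passes=0):
        self.output_folder = output_folder
        # Set of layer indices to keep, or None to keep every layer.
        self.layers = set(layers) if layers is not None else None
        self.skip_passes = skip_passes
        self.pass_count = 0
        self.captured = {}  # module name -> CPU tensor for the current pass

    def _selected(self, name):
        # Assumes decoder layers are named like "model.layers.3.mlp.down_proj".
        if self.layers is None:
            return True
        match = re.search(r"(?:^|\.)layers\.(\d+)(?:\.|$)", name)
        return match is not None and int(match.group(1)) in self.layers

    def register_hooks(self, model):
        for name, module in model.named_modules():
            # Leaf modules only: those without child modules.
            is_leaf = next(module.children(), None) is None
            if is_leaf and self._selected(name):
                module.register_forward_hook(self._make_hook(name))
        # The root module's hook fires after all others and ends the pass.
        model.register_forward_hook(self._end_of_pass)

    def _make_hook(self, name):
        def hook(module, inputs, output):
            if self.pass_count >= self.skip_passes and isinstance(output, torch.Tensor):
                self.captured[name] = output.detach().cpu()
        return hook

    def _end_of_pass(self, module, inputs, output):
        if self.pass_count >= self.skip_passes:
            # The TP/PP rank prefix from the proposal is omitted; only the PID is used here.
            folder = os.path.join(self.output_folder, f"pid{os.getpid()}")
            os.makedirs(folder, exist_ok=True)
            torch.save(self.captured, os.path.join(folder, f"Pass{self.pass_count:05d}.pt"))
        self.captured = {}
        self.pass_count += 1  # counted once per forward pass, not once per hook

# gpu_model_runner.py (call site inside load_model(); assumes the loaded module is self.model)
import os
from vllm.debug.tensor_dump import TensorDumper

def load_model(self):
    # ...
    output_folder = os.environ.get('VLLM_DEBUG_TENSOR_DUMP_OUTPUT_FOLDER')
    if output_folder:
        layers = os.environ.get('VLLM_DEBUG_TENSOR_DUMP_LAYERS')
        if layers is not None:
            layers = [int(i) for i in layers.split(',')]
        skip_passes = int(os.environ.get('VLLM_DEBUG_TENSOR_DUMP_SKIP_PASSES', '0'))
        self.tensor_dumper = TensorDumper(output_folder, layers, skip_passes)
        self.tensor_dumper.register_hooks(self.model)
        # torch.compile and CUDA graphs must also be turned off through vLLM's
        # own configuration. torch.compile(False) does not do that, so the
        # original call is dropped and that step is left out of this sketch.
    # ...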

Verification

To verify that the fix worked, set the environment variable VLLM_DEBUG_TENSOR_DUMP_OUTPUT_FOLDER to a directory and run the model. The intermediate activations should be dumped to the specified directory.
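
One way to check the result is to load the dumped files and list what they contain. The sketch below assumes the layout described earlier (one subdirectory per worker, holding Pass*.pt files that each store a dict of module names to CPU tensors); summarize_dump is a hypothetical helper name.

```python
import glob
import os

import torch


def summarize_dump(root):
    """Return {relative file path: {module name: shape}} for every Pass*.pt."""
    summary = {}
    for path in sorted(glob.glob(os.path.join(root, "*", "Pass*.pt"))):
        tensors = torch.load(path, map_location="cpu")
        summary[os.path.relpath(path, root)] = {
            name: tuple(t.shape) for name, t in tensors.items()
        }
    return summary


if __name__ == "__main__":
    # Point this at the folder given in VLLM_DEBUG_TENSOR_DUMP_OUTPUT_FOLDER.
    for rel_path, shapes in summarize_dump("./dump").items():
        print(rel_path, f"{len(shapes)} tensors")
```

An empty result means no dump files were found under the folder, which usually points to the variable not being set in the worker's environment or to all passes having been skipped.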

Extra Tips

  • Make sure
