vllm - [Feature]: Built-in debug tensor dump for intermediate activations [1 pull request (open, not merged), 1 participant]

MOIPA · 2026-03-09T13:31:55Z




GitHub stats

vllm-project/vllm#36502•Fetched 2026-04-08 00:36:29


Fix Action

Fix / Workaround

When debugging model accuracy issues (e.g. comparing outputs across different frameworks like SGLang vs vLLM, or before/after quantization), it is extremely useful to dump intermediate activations from every layer. Currently there is no built-in way to do this in vLLM — users have to manually patch gpu_model_runner.py to inject forward hooks, which is fragile and breaks across versions.

SGLang's tensor_dump_forward_hook.py provides similar functionality.
This was previously done via external monkey-patching scripts, which break across vLLM versions.

PR fix notes

PR #36576: [Feature] Add debug tensor dump for intermediate activations

Repository: vllm-project/vllm
Author: MOIPA
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/36576

Description (problem / solution / changelog)

Summary

Add built-in support for dumping intermediate activations during inference, gated by environment variables
When VLLM_DEBUG_TENSOR_DUMP_OUTPUT_FOLDER is set, forward hooks are registered on every leaf module to capture activations into .pt files
torch.compile and CUDAGraph are automatically disabled when enabled; zero overhead when disabled

Closes #36502

Changes

vllm/debug/tensor_dump.py: TensorDumper class and maybe_setup_tensor_dump() entry point
vllm/envs.py: Add VLLM_DEBUG_TENSOR_DUMP_OUTPUT_FOLDER, VLLM_DEBUG_TENSOR_DUMP_LAYERS, VLLM_DEBUG_TENSOR_DUMP_SKIP_PASSES
vllm/v1/worker/gpu_model_runner.py: Call maybe_setup_tensor_dump() in load_model()
tests/debug/test_tensor_dump.py: Unit tests (13 cases)
docs/features/tensor_dump.md: Usage documentation

Test plan

pre-commit run --files all passed
pytest tests/debug/test_tensor_dump.py -v 13/13 passed on GPU server
End-to-end: VLLM_DEBUG_TENSOR_DUMP_OUTPUT_FOLDER=./dump python -m vllm.entrypoints.openai.api_server --model TinyLlama/TinyLlama-1.1B-Chat-v1.0

Inspired by SGLang's tensor_dump_forward_hook.py
Useful for cross-framework activation comparison and post-quantization debugging

Changed files

docs/features/tensor_dump.md (added, +66/-0)
tests/debug/__init__.py (added, +2/-0)
tests/debug/test_tensor_dump.py (added, +154/-0)
vllm/debug/__init__.py (added, +2/-0)
vllm/debug/tensor_dump.py (added, +330/-0)
vllm/envs.py (modified, +28/-0)
vllm/v1/worker/gpu_model_runner.py (modified, +5/-0)

Code Example

VLLM_DEBUG_TENSOR_DUMP_OUTPUT_FOLDER=./dump \
      python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B

---

dump/
    TP0_PP0_Rank0_pid12345/
      Pass00000.pt   # dict mapping module names -> cpu tensors
      Pass00001.pt


🚀 The feature, motivation and pitch

Motivation

SGLang has a similar feature (tensor_dump_forward_hook.py) that has proven very useful in practice. I'd like to propose adding native support in vLLM.

Proposed Design

Environment Variables (opt-in, zero overhead when disabled)

Variable	Type	Default	Description
`VLLM_DEBUG_TENSOR_DUMP_OUTPUT_FOLDER`	`str`	(unset)	Output directory. Feature is disabled when unset.
`VLLM_DEBUG_TENSOR_DUMP_LAYERS`	`str`	(unset)	Comma-separated layer indices to dump (e.g. `"0,1,31"`). All layers when unset.
`VLLM_DEBUG_TENSOR_DUMP_SKIP_PASSES`	`int`	`0`	Number of initial forward passes to skip (useful for skipping warmup).
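The table above can be read as a small parsing routine. The following is a minimal sketch, not the code added to `vllm/envs.py` in the PR: `TensorDumpConfig` and `read_tensor_dump_config` are hypothetical names introduced here for illustration, and treating an empty string the same as "unset" is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class TensorDumpConfig:
    """Hypothetical container for the three settings; not a vLLM class."""
    output_folder: str
    layers: Optional[frozenset]  # None means "dump all layers"
    skip_passes: int


def read_tensor_dump_config(
    env: Mapping[str, str] = os.environ,
) -> Optional[TensorDumpConfig]:
    """Return None when the feature is disabled (output folder unset or empty)."""
    folder = env.get("VLLM_DEBUG_TENSOR_DUMP_OUTPUT_FOLDER")
    if not folder:
        return None
    raw_layers = env.get("VLLM_DEBUG_TENSOR_DUMP_LAYERS")
    layers = None
    if raw_layers:
        # "0,1,31" -> frozenset({0, 1, 31}); blank entries are ignored.
        layers = frozenset(
            int(part) for part in raw_layers.split(",") if part.strip()
        )
    skip_passes = int(env.get("VLLM_DEBUG_TENSOR_DUMP_SKIP_PASSES", "0"))
    return TensorDumpConfig(folder, layers, skip_passes)
```

Passing the environment as a mapping keeps the parsing testable without touching the real process environment.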

Usage

VLLM_DEBUG_TENSOR_DUMP_OUTPUT_FOLDER=./dump \
    python -m vllm.entrypoints.openai.api_server --model meta-llama/Llama-3.1-8B

Output structure:

  dump/
    TP0_PP0_Rank0_pid12345/
      Pass00000.pt   # dict mapping module names -> cpu tensors
      Pass00001.pt
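The naming scheme in the listing above could be produced and parsed roughly as follows. This is a sketch under assumptions: the five-digit zero padding is inferred from the `Pass00000.pt` example, and `dump_file_path` and `pass_index_of` are hypothetical helpers, not functions from the PR.

```python
import os
import re

# Matches file names such as "Pass00001.pt" (five digits assumed).
_PASS_FILE = re.compile(r"^Pass(\d{5})\.pt$")


def dump_file_path(root, tp_rank, pp_rank, rank, pid, pass_index):
    """Build <root>/TP{tp}_PP{pp}_Rank{rank}_pid{pid}/Pass{index:05d}.pt."""
    rank_dir = f"TP{tp_rank}_PP{pp_rank}_Rank{rank}_pid{pid}"
    return os.path.join(root, rank_dir, f"Pass{pass_index:05d}.pt")


def pass_index_of(path):
    """Return the pass index encoded in a dump file name, or None."""
    match = _PASS_FILE.match(os.path.basename(path))
    return int(match.group(1)) if match else None
```

Keeping the process ID in the directory name means two runs into the same folder do not overwrite each other's files.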

Implementation Overview

New module vllm/debug/tensor_dump.py with a TensorDumper class that registers register_forward_hook on leaf modules.
A single call site in gpu_model_runner.py → load_model(), gated by the env var check.
When enabled, torch.compile and CUDAGraph are automatically disabled (they are incompatible with .cpu() / torch.save inside hooks).
The custom __call__ injected by @support_torch_compile is removed so that nn.Module.__call__() properly fires hooks.
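The last point can be reproduced with a toy module. The sketch below is not vLLM code: `DirectCallLinear` is a hypothetical stand-in for a class whose `__call__` has been overridden to invoke `forward()` directly, which is assumed here to be the relevant effect of `@support_torch_compile`. It shows that forward hooks are skipped while the override exists and fire again once it is deleted.

```python
import torch
from torch import nn


class DirectCallLinear(nn.Linear):
    """Toy stand-in: __call__ goes straight to forward(), bypassing hooks."""

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)


def count_hook_calls(module, x):
    """Run one call and report how many times a forward hook fired."""
    calls = []
    handle = module.register_forward_hook(
        lambda mod, inputs, output: calls.append(tuple(output.shape))
    )
    module(x)
    handle.remove()
    return len(calls)


layer = DirectCallLinear(4, 2)
x = torch.zeros(1, 4)

hooks_fired_with_override = count_hook_calls(layer, x)

# Removing the class-level override restores nn.Module.__call__,
# which is the code path that dispatches forward hooks.
del DirectCallLinear.__call__
hooks_fired_after_removal = count_hook_calls(layer, x)
```

Hooks are dispatched by `nn.Module.__call__`, not by `forward()`, so any wrapper that calls `forward()` directly silently disables them.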

Key Considerations

Zero cost when disabled: the only added code in the hot path is a single env var check during load_model().
Performance when enabled: tensors are copied to CPU inside hooks, so this is strictly a debug tool, not for production.
Multi-GPU: output is organized by TP/PP rank and PID.

Draft Implementation

I have a working implementation ready. Happy to open a PR once the design direction is confirmed.

Related Work

SGLang's tensor_dump_forward_hook.py provides similar functionality.
This was previously done via external monkey-patching scripts, which break across vLLM versions.


extent analysis

Fix Plan

To implement the proposed design, follow these steps:

Create a new module vllm/debug/tensor_dump.py with a TensorDumper class that registers register_forward_hook on leaf modules.
In gpu_model_runner.py, add a call site in load_model() to instantiate TensorDumper when the environment variable VLLM_DEBUG_TENSOR_DUMP_OUTPUT_FOLDER is set.
Disable torch.compile and CUDAGraph when the debug feature is enabled.

Example code:

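# vllm/debug/tensor_dump.py
# Illustrative sketch only; this is not the implementation from PR #36576.
import os
import re

import torch
from torch import nn

# Assumption: decoder layers appear in module names as "...layers.<index>...".
_LAYER_RE = re.compile(r"\blayers\.(\d+)\b")


class TensorDumper:
    def __init__(self, output_folder, layers=None, skip_passes=0,
                 tp_rank=0, pp_rank=0, rank=0):
        # One subdirectory per process, e.g. TP0_PP0_Rank0_pid12345.
        # The caller supplies the ranks; how vLLM obtains them is not shown here.
        self.output_folder = os.path.join(
            output_folder,
            f"TP{tp_rank}_PP{pp_rank}_Rank{rank}_pid{os.getpid()}")
        self.layers = None if layers is None else set(layers)
        self.skip_passes = skip_passes
        self.pass_count = 0
        self.captured = {}  # module name -> CPU tensor for the current pass

    def _wanted(self, name):
        if self.layers is None:
            return True
        match = _LAYER_RE.search(name)
        return match is not None and int(match.group(1)) in self.layers

    def register_hooks(self, model):
        for name, module in model.named_modules():
            # Leaf modules only: modules that have no children.
            is_leaf = next(module.children(), None) is None
            if name and is_leaf and self._wanted(name):
                module.register_forward_hook(self._make_hook(name))
        # The root module's hook fires once per pass, after all leaf hooks.
        model.register_forward_hook(self._end_of_pass)

    def _make_hook(self, name):
        def hook(module, inputs, output):
            # Non-tensor outputs (e.g. tuples) are skipped in this sketch.
            if (self.pass_count >= self.skip_passes
                    and isinstance(output, torch.Tensor)):
                self.captured[name] = output.detach().cpu()
        return hook

    def _end_of_pass(self, module, inputs, output):
        if self.pass_count >= self.skip_passes:
            os.makedirs(self.output_folder, exist_ok=True)
            torch.save(self.captured, os.path.join(
                self.output_folder, f"Pass{self.pass_count:05d}.pt"))
        self.captured = {}
        self.pass_count += 1


# gpu_model_runner.py (sketch of the call site; `model` is the loaded nn.Module)
import os
from vllm.debug.tensor_dump import TensorDumper

def load_model(model):
    # ...
    output_folder = os.environ.get('VLLM_DEBUG_TENSOR_DUMP_OUTPUT_FOLDER')
    if output_folder:
        layers = os.environ.get('VLLM_DEBUG_TENSOR_DUMP_LAYERS')
        if layers:
            layers = [int(i) for i in layers.split(',')]
        skip_passes = int(os.environ.get('VLLM_DEBUG_TENSOR_DUMP_SKIP_PASSES', '0'))
        tensor_dumper = TensorDumper(output_folder, layers or None, skip_passes)
        tensor_dumper.register_hooks(model)
        # torch.compile and CUDAGraph must also be disabled here. Calling
        # torch.compile(False) does not do that; the mechanism used by the
        # PR is not shown in this issue.
        # ...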

Verification

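To verify the feature, set the environment variable VLLM_DEBUG_TENSOR_DUMP_OUTPUT_FOLDER to a writable directory and run the model. Following the proposed output structure, that directory should then contain one subdirectory per rank (for example TP0_PP0_Rank0_pid12345), each holding one PassNNNNN.pt file per dumped forward pass.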
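A quick way to run this check is to list what was written. The helper below is a sketch that relies only on the directory layout described in the issue; `summarize_dump` is a hypothetical name, and it inspects file names without loading the tensors.

```python
from pathlib import Path


def summarize_dump(root):
    """Map each per-rank directory name to its sorted Pass*.pt file names."""
    root_path = Path(root)
    if not root_path.is_dir():
        return {}  # nothing was dumped, or the folder name is wrong
    summary = {}
    for rank_dir in sorted(root_path.iterdir()):
        if rank_dir.is_dir():
            summary[rank_dir.name] = sorted(
                p.name for p in rank_dir.glob("Pass*.pt")
            )
    return summary
```

An empty result after a run suggests that the environment variable did not reach the worker process, or that every forward pass so far fell within VLLM_DEBUG_TENSOR_DUMP_SKIP_PASSES.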

