vllm - ✅ (Solved) Fix [Bug]: record_metadata_for_reloading causes ~3x host memory regression during torch.compile on XLA backends [1 pull request, 1 participant]

GitHub stats
vllm-project/vllm#36537 (fetched 2026-04-08 00:36:20)
Comments: 0 · Participants: 1 · Timeline events: 6 · Reactions: 0
Timeline (top): cross-referenced ×2, labeled ×1, mentioned ×1, referenced ×1

Root Cause

This was bisected across ~471 commits between v0.15.1 and v0.16.0 to identify the regression:

  • Last good: 74898a701 [BugFix][LoRA] TritonExperts (3.7 GB peak RSS)
  • First bad: f857a03f6 [QeRL] Layerwise Reloading (#32133) (8.5 GB peak RSS)

record_metadata_for_reloading() is called from initialize_model() in model_executor/model_loader/utils.py. It iterates over every module and calls capture_layer_to_meta(), which:

  1. Calls tensor.data.to("meta") on every parameter
  2. Copies tensor.__dict__ (containing vLLM parameter attributes like weight_loader, output_dim, etc.) to the meta tensor
  3. Stores these meta tensor copies in LAYERWISE_INFO (a WeakKeyDictionary)
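
The three steps above can be pictured with a minimal plain-Python sketch. It uses neither torch nor vLLM: FakeParam, FakeLayer, to_meta, record_metadata and LAYER_INFO are hypothetical stand-ins, and the "meta" copy here is simply an object that keeps the shape and attributes but no data.

```python
from weakref import WeakKeyDictionary


class FakeParam:
    """Stand-in for a parameter: holds data plus extra attributes in __dict__."""

    def __init__(self, shape, data=None, **attrs):
        self.shape = shape
        self.data = data
        self.__dict__.update(attrs)  # e.g. weight_loader, output_dim


class FakeLayer:
    """Stand-in for a module that owns named parameters."""

    def __init__(self, **params):
        self.params = params


# Hypothetical registry, keyed weakly by layer like the LAYERWISE_INFO described above.
LAYER_INFO = WeakKeyDictionary()


def to_meta(param):
    """Return a data-free copy that keeps the shape and every extra attribute."""
    meta = FakeParam(param.shape)
    extra = {k: v for k, v in param.__dict__.items() if k != "data"}
    meta.__dict__.update(extra)
    return meta


def record_metadata(layers):
    """Eagerly capture a meta copy of every parameter of every layer."""
    for layer in layers:
        LAYER_INFO[layer] = {name: to_meta(p) for name, p in layer.params.items()}


layers = [
    FakeLayer(weight=FakeParam((4, 4), data=[0.0] * 16, output_dim=0)),
    FakeLayer(weight=FakeParam((4, 2), data=[0.0] * 8, output_dim=1)),
]
record_metadata(layers)
```

In this sketch every layer gains one extra object per parameter at init time, whether or not a reload ever happens, which is the eager cost the issue describes.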

On torch_xla, these additional tensor references and __dict__ copies cause the XLA dynamo bridge to create significantly more tensor copies during graph tracing. The effect scales linearly with model size.

On GPU with eager or aot_eager backends, we did not observe a measurable memory difference during compilation — the severe memory impact appears specific to torch_xla's graph capture mechanism. However, the unconditional metadata capture is still unnecessary overhead for all non-QeRL users.

Fix Action

Fix / Workaround

The regression can be verified by running vLLM inference on any XLA device and comparing peak host RSS with and without the following workaround:

# Monkey-patch to disable record_metadata_for_reloading
import vllm.model_executor.model_loader.utils as loader_utils
loader_utils.record_metadata_for_reloading = lambda model: None
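
The workaround relies on ordinary Python name lookup: a function resolves module-level names when it is called, so rebinding the attribute on the module changes what later calls see. Below is a minimal sketch with a hypothetical stand-in module (fake_loader_utils is not vLLM code), assuming the caller looks the hook up as a module global, which is what the patch expects.

```python
import types

# Hypothetical stand-in for a loader module; none of this is vLLM code.
loader = types.ModuleType("fake_loader_utils")
exec(
    """
calls = []

def record_metadata_for_reloading(model):
    calls.append(model)

def initialize_model(model):
    # The hook name is resolved in the module globals at call time.
    record_metadata_for_reloading(model)
    return model
""",
    loader.__dict__,
)

loader.initialize_model("m1")  # original hook runs and records "m1"

# Same shape as the workaround: rebind the module attribute to a no-op.
loader.record_metadata_for_reloading = lambda model: None

loader.initialize_model("m2")  # the no-op runs instead, nothing is recorded
print(loader.calls)
```

For the same reason, the patch only has an effect if it is applied before the model is initialized and in the process that performs the initialization.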

PR fix notes

PR #36543: [Bug Fix] Defer record_metadata_for_reloading to avoid ~3x host memory regression

Description (problem / solution / changelog)

Purpose

Fix https://github.com/vllm-project/vllm/issues/36537

record_metadata_for_reloading() runs unconditionally during initialize_model() for all users, even though it only benefits QeRL layerwise weight reloading. On torch_xla backends, this causes a ~3x host memory regression during torch.compile tracing (e.g. 7.5 GB → 22 GB for Qwen3-1.7B, up to ~435 GB for Qwen3-32B).

This PR moves record_metadata_for_reloading() from initialize_model() to initialize_layerwise_reload(), so metadata is captured on-demand when reload_weights() is first called rather than at model init.
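
The move from eager to on-demand capture can be sketched as follows. This is an illustration of the pattern only, not the PR's code: Model, record_metadata and the reload_metadata attribute are hypothetical, the function signatures are simplified, and the "only capture once" guard is an assumption.

```python
class Model:
    """Stand-in model; not a vLLM class."""

    def __init__(self, layers):
        self.layers = layers
        self.reload_metadata = None  # filled lazily, on first reload


def record_metadata(model):
    # Placeholder for the per-layer capture described in the issue.
    model.reload_metadata = {name: f"meta:{name}" for name in model.layers}


def initialize_model(layers):
    # No metadata capture here any more: users who never reload pay nothing.
    return Model(layers)


def initialize_layerwise_reload(model):
    # Capture on demand; the guard (an assumption) avoids repeating the work.
    if model.reload_metadata is None:
        record_metadata(model)


def reload_weights(model):
    initialize_layerwise_reload(model)
    return model.reload_metadata


model = initialize_model(["q_proj", "k_proj"])
print(model.reload_metadata)  # still None after init
print(reload_weights(model))  # populated on the first reload
```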

Test Plan

Tested on Tenstorrent hardware (torch_xla PJRT backend) by running vLLM inference with Qwen3 models and measuring peak host RSS with /usr/bin/time -v.
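
For scripted comparisons, a similar peak-RSS number can be read from Python's standard library. The sketch below is not what the PR used (it used /usr/bin/time -v); peak_child_rss_gib is a hypothetical helper, and it assumes Linux, where ru_maxrss is reported in KiB.

```python
import resource
import subprocess
import sys


def peak_child_rss_gib(cmd):
    """Run cmd to completion and return the peak RSS of waited-for children in GiB."""
    subprocess.run(cmd, check=True)
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    # Linux reports ru_maxrss in KiB (macOS reports bytes instead).
    return usage.ru_maxrss / (1024 ** 2)


if __name__ == "__main__":
    # Placeholder workload; substitute the real inference command.
    cmd = [sys.executable, "-c", "buf = bytearray(50_000_000)"]
    print(f"peak child RSS: {peak_child_rss_gib(cmd):.3f} GiB")
```

RUSAGE_CHILDREN reports the largest peak among all children the current process has waited for so far, so each measurement is best taken from a fresh interpreter.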

Note: the severe memory impact is specific to torch_xla's graph capture mechanism. We were unable to reproduce on GPU with eager/aot_eager backends. QeRL reload tests should be verified by the QeRL team (@kylesayrs).

Test Result

Peak host RSS (lower is better):

| Model | Before (v0.16.0) | After (this PR) |
|---|---|---|
| Qwen3-0.6B | 8.5 GB | 3.9 GB |
| Qwen3-1.7B | 22 GB | 8.0 GB |
| Qwen3-4B | 49.6 GB | ~17 GB |
| Qwen3-32B | ~435 GB | ~128 GB |


Changed files

  • vllm/model_executor/model_loader/reload/layerwise.py (modified, +10/-4)
  • vllm/model_executor/model_loader/utils.py (modified, +0/-3)

Code Example

Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04.3) 11.4.0
Clang version                : 17.0.6 (++20231209124227+6009708b4367-1~exp1~20231209124336.77)
CMake version                : version 4.2.1
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.9.1+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0] (64-bit runtime)
Python platform              : Linux-5.15.0-141-generic-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : False
CUDA runtime version         : No CUDA
CUDA_MODULE_LOADING set to   : N/A
GPU models and configuration : No CUDA
Nvidia driver version        : No CUDA
cuDNN version                : No CUDA
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
Versions of relevant libraries
==============================
[pip3] numpy==2.2.6
[pip3] torch==2.9.1
[pip3] torch-xla==2.9.0+git8ee513e
[pip3] transformers==4.57.6
[pip3] triton==3.5.1

==============================
         vLLM Info
==============================
vLLM Version                 : 0.16.0
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled

==============================
     Environment Variables
==============================
VLLM_TARGET_DEVICE=empty
TORCHINDUCTOR_COMPILE_THREADS=1


Your current environment

  • vLLM Version: 0.16.0 (also present in 0.17.0)
  • PyTorch version: 2.9.1
  • torch-xla: 2.9.0+git8ee513e
  • Python version: 3.12.12
  • OS: Ubuntu 22.04.5 LTS (x86_64)
  • Hardware: Tenstorrent Wormhole (n300) via torch_xla PJRT backend

🐛 Describe the bug

record_metadata_for_reloading() (introduced in PR #32133, FYI @kylesayrs) runs unconditionally during initialize_model() for all users, even though it only benefits the QeRL layerwise weight reloading use case. On torch_xla backends, this causes a 2-3x host memory regression during torch.compile tracing. The regression scales with model size and causes OOM on machines that do not have 500+ GB of RAM.

Even outside of the XLA-specific memory impact, record_metadata_for_reloading does unnecessary work at model initialization for the vast majority of users who never call reload_weights(). It iterates every module, creates meta tensor copies via tensor.data.to("meta"), and copies __dict__ on every parameter — all eagerly, with no way to opt out.

Impact

Measured with Qwen3 models via torch_xla + PJRT backend (peak host RSS):

| Model | v0.15.1 (before PR #32133) | v0.16.0 (after) | v0.16.0 + fix |
|---|---|---|---|
| Qwen3-0.6B | 3.7 GB | 8.5 GB (2.3x) | 3.7 GB |
| Qwen3-1.7B | 7.5 GB | 22 GB (2.9x) | 8.0 GB |
| Qwen3-4B | 16.5 GB | 49.6 GB (3.0x) | ~17 GB |
| Qwen3-32B | ~150 GB | ~435 GB (2.9x) | ~128 GB |
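
The multipliers in parentheses are the v0.16.0 peak divided by the v0.15.1 peak. A quick check of that arithmetic, using only the figures from the table above (the Qwen3-32B values are approximate in the source):

```python
# Peak host RSS in GB, copied from the table above: (v0.15.1, v0.16.0).
peak_rss_gb = {
    "Qwen3-0.6B": (3.7, 8.5),
    "Qwen3-1.7B": (7.5, 22.0),
    "Qwen3-4B": (16.5, 49.6),
    "Qwen3-32B": (150.0, 435.0),  # both values are approximate in the source
}

# Regression factor = peak after the change / peak before it, to one decimal.
ratios = {
    model: round(after / before, 1)
    for model, (before, after) in peak_rss_gb.items()
}

for model, ratio in ratios.items():
    print(f"{model}: {ratio}x")
```

The factor stays between roughly 2.3x and 3.0x across model sizes, consistent with the statement that the overhead grows in proportion to the model.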

Reproduction

The regression requires vLLM's full model loading path (which creates BasevLLMParameter subclasses with populated __dict__) running on a torch_xla backend. We were unable to create a standalone GPU/CPU repro because the severe memory impact is specific to torch_xla's graph capture.

Possible/Suggested fix (tested, confirmed working)

Move record_metadata_for_reloading(model) from initialize_model() to initialize_layerwise_reload(), so metadata is captured on demand when reload_weights() is first called rather than unconditionally at model init. We tested this change on our side and it resolves the regression.

Going to see about putting up a PR for this.

extent analysis

Fix Plan

To resolve the memory regression caused by record_metadata_for_reloading() on torch_xla backends, follow these steps:

  1. Identify the problematic function call: Locate the record_metadata_for_reloading(model) call in initialize_model().
  2. Move the function call: Move record_metadata_for_reloading(model) from initialize_model() to initialize_layerwise_reload(), so metadata is captured on-demand when reload_weights() is first called.
  3. Verify the fix: Run vLLM inference on an XLA device and compare peak host RSS with and without the fix to ensure the memory regression is resolved.

Example code changes:

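# Simplified sketch: arguments and surrounding code are elided.

# Before: metadata is captured eagerly for every user at model init
def initialize_model(model):
    ...
    record_metadata_for_reloading(model)
    ...

# After: initialize_model() no longer captures metadata
def initialize_model(model):
    ...

# After: metadata is captured on demand when a layerwise reload is set up
def initialize_layerwise_reload(model):
    record_metadata_for_reloading(model)
    ...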

Alternatively, you can use a monkey patch as a temporary workaround:

import vllm.model_executor.model_loader.utils as loader_utils
loader_utils.record_metadata_for_reloading = lambda model: None

Verification

To verify the fix, run vLLM inference on an XLA device and compare peak host RSS with and without the fix. The PR measured peak RSS with /usr/bin/time -v; tools such as psutil or memory_profiler can also be used to monitor memory usage.

Extra Tips

  • Make sure to test the fix on different model sizes and XLA devices to ensure the memory regression is resolved in all cases.
  • Consider adding a flag or configuration option to enable/disable metadata capture for users who may not need it.
  • Keep in mind that the severe memory regression was only observed on torch_xla backends. On GPU with eager or aot_eager backends no measurable memory difference was seen, so there the fix mainly removes unnecessary work at model init.
