vllm - ✅(Solved) Fix [Bug]: record_metadata_for_reloading causes ~3x host memory regression during torch.compile on XLA backends [1 pull requests, 1 participants]

vllm · 2026-03-09 20:31:06


GitHub stats

vllm-project/vllm#36537 • Fetched 2026-04-08 00:36:20


Author

kmabeeTT

Participants

kmabeeTT

Timeline (top)

cross-referenced ×2, labeled ×1, mentioned ×1, referenced ×1

Root Cause

This was bisected across ~471 commits between v0.15.1 and v0.16.0 to identify the regression:

Last good: 74898a701 — [BugFix][LoRA] TritonExperts (3.7 GB peak RSS)
First bad: f857a03f6 — [QeRL] Layerwise Reloading (#32133) (8.5 GB peak RSS)

record_metadata_for_reloading() is called from initialize_model() in model_executor/model_loader/utils.py. It iterates over every module and calls capture_layer_to_meta(), which:

Calls tensor.data.to("meta") on every parameter
Copies tensor.__dict__ (containing vLLM parameter attributes like weight_loader, output_dim, etc.) to the meta tensor
Stores these meta tensor copies in LAYERWISE_INFO (a WeakKeyDictionary)

On torch_xla, these additional tensor references and __dict__ copies cause the XLA dynamo bridge to create significantly more tensor copies during graph tracing. The effect scales linearly with model size.

On GPU with eager or aot_eager backends, we did not observe a measurable memory difference during compilation — the severe memory impact appears specific to torch_xla's graph capture mechanism. However, the unconditional metadata capture is still unnecessary overhead for all non-QeRL users.
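The capture steps listed above can be pictured with a small stand-in that uses plain Python objects instead of torch tensors. `FakeParam`, `MetaCopy`, `Layer`, `LAYER_INFO`, and `capture_layer` are hypothetical names for illustration only; this is a sketch of the data structure, not vLLM's implementation.

```python
import gc
import weakref


class FakeParam:
    """Stand-in for a vLLM parameter: a payload plus loader attributes."""

    def __init__(self, shape, **attrs):
        self.shape = shape
        self.payload = bytearray(shape[0] * shape[1])  # pretend weight data
        self.__dict__.update(attrs)


class MetaCopy:
    """Stand-in for a meta tensor: shape and attributes, but no payload."""

    def __init__(self, param):
        attrs = dict(param.__dict__)  # shallow copy of the attribute dict
        attrs.pop("payload")          # a meta tensor carries no data
        self.__dict__.update(attrs)


class Layer:
    def __init__(self, **params):
        self.params = params


# Keyed weakly by layer, so an entry disappears once its layer is collected.
LAYER_INFO = weakref.WeakKeyDictionary()


def capture_layer(layer):
    LAYER_INFO[layer] = {name: MetaCopy(p) for name, p in layer.params.items()}


layer = Layer(weight=FakeParam((4, 8), output_dim=0, weight_loader=print))
capture_layer(layer)
captured = LAYER_INFO[layer]["weight"]

entries_while_alive = len(LAYER_INFO)
del layer
gc.collect()
entries_after_collect = len(LAYER_INFO)
```

While the layer is alive, every parameter has a second object holding the same attribute values. The issue associates this kind of extra reference with the additional tensor copies seen during XLA tracing.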

Fix Action

Fix / Workaround

The regression can be verified by running vLLM inference on any XLA device and comparing peak host RSS with and without the following workaround:

# Monkey-patch to disable record_metadata_for_reloading
import vllm.model_executor.model_loader.utils as loader_utils
loader_utils.record_metadata_for_reloading = lambda model: None
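
If the patch should apply to only one code path, a reversible variant may be useful. The sketch below is a generic helper (`disabled_attr` is a hypothetical name, not part of vLLM) demonstrated on a stand-in namespace. It assumes, as the workaround above does, that callers look the function up on the module at call time.

```python
import contextlib
import types


@contextlib.contextmanager
def disabled_attr(module, name):
    """Temporarily replace module.<name> with a no-op, then restore it."""
    original = getattr(module, name)
    setattr(module, name, lambda *args, **kwargs: None)
    try:
        yield
    finally:
        setattr(module, name, original)


# Stand-in for vllm.model_executor.model_loader.utils (illustration only).
calls = []
fake_utils = types.SimpleNamespace(
    record_metadata_for_reloading=lambda model: calls.append(model)
)

with disabled_attr(fake_utils, "record_metadata_for_reloading"):
    fake_utils.record_metadata_for_reloading("model-a")  # no-op inside the block

fake_utils.record_metadata_for_reloading("model-b")      # original restored
```

With a real installation, the `with` block would wrap model construction. Either form of the patch has to be in effect in the process that actually loads the model.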

PR fix notes

PR #36543: [Bug Fix] Defer record_metadata_for_reloading to avoid ~3x host memory regression

Repository: vllm-project/vllm
Author: kmabeeTT
State: open | merged: False
Link: https://github.com/vllm-project/vllm/pull/36543

Description (problem / solution / changelog)

Purpose

Fix https://github.com/vllm-project/vllm/issues/36537

record_metadata_for_reloading() runs unconditionally during initialize_model() for all users, even though it only benefits QeRL layerwise weight reloading. On torch_xla backends, this causes a ~3x host memory regression during torch.compile tracing (e.g. 7.5 GB → 22 GB for Qwen3-1.7B, up to ~435 GB for Qwen3-32B).

This PR moves record_metadata_for_reloading() from initialize_model() to initialize_layerwise_reload(), so metadata is captured on-demand when reload_weights() is first called rather than at model init.
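
The move from eager to on-demand capture can be sketched as follows. This is a simplified stand-in, not the PR's diff: the `Model` class, the `_RECORDED` registry, the once-per-model guard, and the function bodies are all assumptions for illustration. Only the function names mirror those mentioned above.

```python
import weakref

capture_count = 0
_RECORDED = weakref.WeakSet()  # models whose metadata has been captured


class Model:
    """Placeholder for a loaded model."""


def record_metadata_for_reloading(model):
    """Stand-in for the expensive per-parameter metadata capture."""
    global capture_count
    capture_count += 1
    _RECORDED.add(model)


def initialize_model():
    # After the change: no metadata capture at init.
    return Model()


def initialize_layerwise_reload(model):
    # Capture on demand, and (by assumption) only once per model.
    if model not in _RECORDED:
        record_metadata_for_reloading(model)


model = initialize_model()
count_after_init = capture_count      # nothing captured yet

initialize_layerwise_reload(model)
initialize_layerwise_reload(model)    # second reload reuses the capture
count_after_reloads = capture_count
```

Users who never reload weights never pay for the capture, which is the behavior the PR description aims for.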

Test Plan

Tested on Tenstorrent hardware (torch_xla PJRT backend) by running vLLM inference with Qwen3 models and measuring peak host RSS with /usr/bin/time -v.

Note: the severe memory impact is specific to torch_xla's graph capture mechanism. We were unable to reproduce on GPU with eager/aot_eager backends. QeRL reload tests should be verified by the QeRL team (@kylesayrs).
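
GNU `time -v` writes a `Maximum resident set size (kbytes)` line to stderr. A small helper like the one below could pull that figure out of a captured log. `peak_rss_gb` is a hypothetical name and the sample text is made up for illustration, not a measurement from this PR. Whether the figures quoted here are GB or GiB is not stated, so the conversion is an assumption.

```python
import re


def peak_rss_gb(time_v_output: str) -> float:
    """Return peak RSS in GB (10^9 bytes) from `/usr/bin/time -v` output."""
    match = re.search(
        r"Maximum resident set size \(kbytes\):\s*(\d+)", time_v_output
    )
    if match is None:
        raise ValueError("no 'Maximum resident set size' line found")
    kbytes = int(match.group(1))  # units of 1024 bytes
    return kbytes * 1024 / 1e9


# Made-up sample, not a measurement from the issue.
sample = """
    Elapsed (wall clock) time (h:mm:ss or m:ss): 1:02.34
    Maximum resident set size (kbytes): 2000000
    Exit status: 0
"""
value = peak_rss_gb(sample)
```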

Test Result

Peak host RSS (lower is better):

Model	Before (v0.16.0)	After (this PR)
Qwen3-0.6B	8.5 GB	3.9 GB
Qwen3-1.7B	22 GB	8.0 GB
Qwen3-4B	49.6 GB	~17 GB
Qwen3-32B	~435 GB	~128 GB


Changed files

vllm/model_executor/model_loader/reload/layerwise.py (modified, +10/-4)
vllm/model_executor/model_loader/utils.py (modified, +0/-3)

Code Example

Collecting environment information...
==============================
        System Info
==============================
OS                           : Ubuntu 22.04.5 LTS (x86_64)
GCC version                  : (Ubuntu 11.4.0-1ubuntu1~22.04.3) 11.4.0
Clang version                : 17.0.6 (++20231209124227+6009708b4367-1~exp1~20231209124336.77)
CMake version                : version 4.2.1
Libc version                 : glibc-2.35

==============================
       PyTorch Info
==============================
PyTorch version              : 2.9.1+cu128
Is debug build               : False
CUDA used to build PyTorch   : 12.8
ROCM used to build PyTorch   : N/A

==============================
      Python Environment
==============================
Python version               : 3.12.12 (main, Oct 10 2025, 08:52:57) [GCC 11.4.0] (64-bit runtime)
Python platform              : Linux-5.15.0-141-generic-x86_64-with-glibc2.35

==============================
       CUDA / GPU Info
==============================
Is CUDA available            : False
CUDA runtime version         : No CUDA
CUDA_MODULE_LOADING set to   : N/A
GPU models and configuration : No CUDA
Nvidia driver version        : No CUDA
cuDNN version                : No CUDA
HIP runtime version          : N/A
MIOpen runtime version       : N/A
Is XNNPACK available         : True

==============================
Versions of relevant libraries
==============================
[pip3] numpy==2.2.6
[pip3] torch==2.9.1
[pip3] torch-xla==2.9.0+git8ee513e
[pip3] transformers==4.57.6
[pip3] triton==3.5.1

==============================
         vLLM Info
==============================
vLLM Version                 : 0.16.0
vLLM Build Flags:
  CUDA Archs: Not Set; ROCm: Disabled

==============================
     Environment Variables
==============================
VLLM_TARGET_DEVICE=empty
TORCHINDUCTOR_COMPILE_THREADS=1



Your current environment

vLLM Version: 0.16.0 (also present in 0.17.0)
PyTorch version: 2.9.1
torch-xla: 2.9.0+git8ee513e
Python version: 3.12.12
OS: Ubuntu 22.04.5 LTS (x86_64)
Hardware: Tenstorrent Wormhole (n300) via torch_xla PJRT backend


🐛 Describe the bug

record_metadata_for_reloading() (introduced in PR #32133, FYI @kylesayrs) runs unconditionally during initialize_model() for all users, even though it only benefits the QeRL layerwise weight reloading use case. On torch_xla backends, this causes a 2-3x host memory regression during torch.compile tracing. The regression scales with model size, causing OOM in some cases on machines that do not have 500+ GB of RAM.

Even outside of the XLA-specific memory impact, record_metadata_for_reloading does unnecessary work at model initialization for the vast majority of users who never call reload_weights(). It iterates every module, creates meta tensor copies via tensor.data.to("meta"), and copies __dict__ on every parameter — all eagerly, with no way to opt out.

Impact

Measured with Qwen3 models via torch_xla + PJRT backend (peak host RSS):

Model	v0.15.1 (before PR #32133)	v0.16.0 (after)	v0.16.0 + fix
Qwen3-0.6B	3.7 GB	8.5 GB (2.3x)	3.7 GB
Qwen3-1.7B	7.5 GB	22 GB (2.9x)	8.0 GB
Qwen3-4B	16.5 GB	49.6 GB (3.0x)	~17 GB
Qwen3-32B	~150 GB	~435 GB (2.9x)	~128 GB
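
The multipliers in the table follow directly from the measured values. A quick check in Python, using only the numbers above (approximate entries are taken at face value):

```python
# Peak host RSS in GB: (v0.15.1, v0.16.0), copied from the table above.
measurements = {
    "Qwen3-0.6B": (3.7, 8.5),
    "Qwen3-1.7B": (7.5, 22.0),
    "Qwen3-4B": (16.5, 49.6),
    "Qwen3-32B": (150.0, 435.0),
}

ratios = {
    model: round(after / before, 1)
    for model, (before, after) in measurements.items()
}

for model, ratio in ratios.items():
    print(f"{model}: {ratio}x")
```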

Root cause

This was bisected across ~471 commits between v0.15.1 and v0.16.0 to identify the regression:

Last good: 74898a701 — [BugFix][LoRA] TritonExperts (3.7 GB peak RSS)
First bad: f857a03f6 — [QeRL] Layerwise Reloading (#32133) (8.5 GB peak RSS)

record_metadata_for_reloading() is called from initialize_model() in model_executor/model_loader/utils.py. It iterates over every module and calls capture_layer_to_meta(), which:

Calls tensor.data.to("meta") on every parameter
Copies tensor.__dict__ (containing vLLM parameter attributes like weight_loader, output_dim, etc.) to the meta tensor
Stores these meta tensor copies in LAYERWISE_INFO (a WeakKeyDictionary)

Reproduction

The regression requires vLLM's full model loading path (which creates BasevLLMParameter subclasses with populated __dict__) running on a torch_xla backend. We were unable to create a standalone GPU/CPU repro because the severe memory impact is specific to torch_xla's graph capture.

The regression can be verified by running vLLM inference on any XLA device and comparing peak host RSS with and without the following workaround:

# Monkey-patch to disable record_metadata_for_reloading
import vllm.model_executor.model_loader.utils as loader_utils
loader_utils.record_metadata_for_reloading = lambda model: None

Possible/Suggested fix (tested, confirmed working)

Move record_metadata_for_reloading(model) from initialize_model() to initialize_layerwise_reload(), so metadata is captured on-demand when reload_weights() is first called rather than unconditionally at model init. We tested this on our side and it resolves the regression.

Going to see about putting up a PR for this.


extent analysis

Fix Plan

To resolve the memory regression caused by record_metadata_for_reloading() on torch_xla backends, follow these steps:

1. Identify the problematic function call: Locate the record_metadata_for_reloading(model) call in initialize_model().
2. Move the function call: Move record_metadata_for_reloading(model) from initialize_model() to initialize_layerwise_reload(), so metadata is captured on-demand when reload_weights() is first called.
3. Verify the fix: Run vLLM inference on an XLA device and compare peak host RSS with and without the fix to ensure the memory regression is resolved.

Example code changes:

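# Before (schematic; real signatures and bodies are elided)
def initialize_model(model):
    # ...
    record_metadata_for_reloading(model)  # runs for every user at model init
    # ...

# After (schematic)
def initialize_model(model):
    # ... the metadata capture call is removed from here
    ...

def initialize_layerwise_reload(model):
    record_metadata_for_reloading(model)  # captured on demand, at first reload
    # ...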

Alternatively, you can use a monkey patch as a temporary workaround:

import vllm.model_executor.model_loader.utils as loader_utils
loader_utils.record_metadata_for_reloading = lambda model: None

Verification

To verify the fix, run vLLM inference on an XLA device and compare peak host RSS with and without the fix. The PR measured peak RSS with /usr/bin/time -v; tools like psutil or memory_profiler can also monitor memory usage.
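
For an in-process reading, the standard library's `resource` module exposes the process's peak-RSS counter on Unix. A minimal sketch, assuming Linux or macOS (`peak_rss_bytes` is a hypothetical helper name):

```python
import resource
import sys


def peak_rss_bytes() -> int:
    """Peak resident set size of the calling process, in bytes."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    return peak if sys.platform == "darwin" else peak * 1024


peak = peak_rss_bytes()
print(f"peak RSS so far: {peak / 1e9:.2f} GB")
```

This covers only the calling process. If the model is loaded in a child process, `resource.RUSAGE_CHILDREN` or an external wrapper would be needed instead.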

Extra Tips

Make sure to test the fix on different model sizes and XLA devices to ensure the memory regression is resolved in all cases.
Consider adding a flag or configuration option to enable/disable metadata capture for users who may not need it.
Keep in mind that the severe memory regression was only observed on torch_xla backends. On GPU with eager or aot_eager backends no measurable difference was seen, so there the fix mainly removes unneeded work at model init.
