vllm - ✅(Solved) Fix [Bug]: torch.opcheck fails for `_C.rms_norm_per_block_quant` [2 pull requests, 3 comments, 2 participants]

GitHub stats

vllm-project/vllm#36688 · Fetched 2026-04-08 00:35:25
Comments: 3
Participants: 2
Timeline events: 93 (top event types: referenced ×80, commented ×3, cross-referenced ×2, labeled ×2)
Reactions: 0

Error Message

torch.testing._internal.optests.generate_tests.OpCheckError: opcheck(op, ...): test_schema failed with Argument weight is not defined as mutable but was mutated (scroll up for stack trace)

Root Cause

In the unit test for the torch.ops._C.rms_norm_per_block_quant custom kernel, opcheck fails because it believes the weight tensor was mutated. On closer inspection, the cloned weight argument is the one that gets modified, while the original weight argument stays intact. I could not find a memory issue; I manually confirmed that the original weight stays intact when opcheck is not used, and E2E evals look good.

Fix Action

Fixed

PR fix notes

PR #36766: fix(test): skip test_schema in opcheck for rms_norm_per_block_quant (…

Description (problem / solution / changelog)

…#36688)

opcheck's test_schema falsely reports the immutable weight tensor as mutated due to CUDA memory-allocator reuse when it internally clones the arguments. The kernel only reads weight through const pointers, and the original tensor stays intact.

Fix:

  • Add opcheck call for rms_norm_per_block_quant (block-quant path) with correctly shaped scales tensor.
  • Exclude test_schema from that opcheck to work around the false positive; test_autograd_registration and test_faketensor still run.
  • Move the existing rms_norm_dynamic_per_token_quant opcheck into the else branch so it only runs for the per-token path.
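The exclusion described in the bullets above can be sketched as a small selection step. This is a minimal illustration, not the PR's code: `select_test_utils` and the `base` tuple are hypothetical, and the four names in `ALL_TEST_UTILS` are the check names documented for the `test_utils` argument of `torch.library.opcheck`.

```python
# Check names accepted by the `test_utils` argument of torch.library.opcheck.
ALL_TEST_UTILS = (
    "test_schema",
    "test_autograd_registration",
    "test_faketensor",
    "test_aot_dispatch_dynamic",
)


def select_test_utils(exclude, base=ALL_TEST_UTILS):
    """Return the checks in `base` that are not in `exclude`, keeping order."""
    unknown = set(exclude) - set(base)
    if unknown:
        # Fail loudly on typos so a check is never skipped by accident.
        raise ValueError(f"unknown opcheck test utils: {sorted(unknown)}")
    return tuple(name for name in base if name not in exclude)


# Hypothetical base set: the PR states that test_autograd_registration and
# test_faketensor still run once test_schema is excluded.
base = ("test_schema", "test_autograd_registration", "test_faketensor")
selected = select_test_utils({"test_schema"}, base=base)
print(selected)
```

The resulting tuple would then be passed as `test_utils=selected` to the opcheck call for the block-quant path, leaving the per-token path on its existing checks.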

Changed files

  • tests/kernels/core/test_fused_quant_layernorm.py (modified, +59/-8)

PR #36779: [Bugfix] opcheck false mutation error in rms_norm_per_block_quant (#36688)

Description (problem / solution / changelog)

Fixes #36688.

The opcheck call in test_rms_norm was allocating scales with shape (num_tokens, 1) and passing it to rms_norm_per_block_quant. The blockwise kernel writes hidden_size / group_size scale values per token, so with 8 groups the buffer was 8× too small. The out-of-bounds writes landed in the adjacent cloned weight tensor under opcheck's memory layout, which is why opcheck reported weight as mutated even though nothing in the kernel intentionally touches it.
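The mechanism described above can be illustrated with a toy model in plain Python. This is only a sketch under stated assumptions: a flat list stands in for device memory, the pre-call snapshot of weight is assumed to sit directly after the scales buffer, and `run_case` and its sizes are illustrative names, not vLLM or PyTorch code.

```python
def run_case(num_tokens, hidden_size, group_size, scales_cols):
    """Simulate a blockwise scale write next to a snapshot of `weight`."""
    num_groups = hidden_size // group_size
    scales_len = num_tokens * scales_cols      # what the caller allocated
    weight = [1.0] * hidden_size               # original weight, outside the arena

    # Assumed layout: [ scales | snapshot of weight ], placed back to back.
    arena = [0.0] * scales_len + list(weight)

    # The "kernel" writes one scale per (token, group) and trusts the caller
    # to have sized the buffer; it never touches `weight` on purpose.
    for i in range(num_tokens * num_groups):
        if i < len(arena):                     # keep the toy model in range
            arena[i] = 0.5

    snapshot = arena[scales_len:scales_len + hidden_size]
    return {
        "weight_intact": weight == [1.0] * hidden_size,
        # A schema-style check compares the snapshot with the argument.
        "reported_mutated": snapshot != weight,
    }


undersized = run_case(num_tokens=4, hidden_size=64, group_size=8, scales_cols=1)
correct = run_case(num_tokens=4, hidden_size=64, group_size=8, scales_cols=8)
print(undersized)
print(correct)
```

With `scales_cols=1` the writes spill into the snapshot, so the comparison reports a mutation even though the original weight is untouched. With `scales_cols` equal to the number of groups, the snapshot and the original agree.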

The schema and C++ declaration are correct - weight is and should remain immutable.

Changes:

  • Allocate block_scales with the correct shape (num_tokens, hidden_size // group_size) in the test, and add the missing opcheck call for the blockwise path
  • Add a TORCH_CHECK in rms_norm_per_block_quant validating scales.numel() >= num_tokens * num_groups so callers get a clear error instead of silent OOB writes
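The size condition in the second bullet can be mirrored in Python for use in a test helper. This is a hedged sketch of the same inequality, not the C++ check itself: `check_scales_numel` is a hypothetical name, and `num_groups` is taken to be `hidden_size // group_size` as in the PR description.

```python
def check_scales_numel(scales_numel, num_tokens, hidden_size, group_size):
    """Raise if the scales buffer cannot hold one scale per (token, group)."""
    num_groups = hidden_size // group_size
    required = num_tokens * num_groups
    if scales_numel < required:
        raise ValueError(
            f"scales has {scales_numel} elements but the blockwise kernel "
            f"writes {required} ({num_tokens} tokens x {num_groups} groups)"
        )
    return required


# Shape (num_tokens, hidden_size // group_size) satisfies the check.
required = check_scales_numel(2048 * 8, num_tokens=2048, hidden_size=1024, group_size=128)
print(required)

# Shape (num_tokens, 1) is rejected instead of being silently overrun.
try:
    check_scales_numel(2048 * 1, num_tokens=2048, hidden_size=1024, group_size=128)
except ValueError as err:
    message = str(err)
    print(message)
```

Checking at the call boundary turns the silent overrun into an immediate, descriptive error, which is the stated goal of the added TORCH_CHECK.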

Testing: Ran 16 blockwise int8 opcheck cases (group_size 64 and 128, with and without residual, 1 and 2048 tokens); all pass. The pre-existing FP8 failures in this file reproduce identically on upstream main before any edits (unrelated binary/source API mismatch).

Changed files

  • csrc/quantization/fused_kernels/fused_layernorm_dynamic_per_token_quant.cu (modified, +9/-0)
  • tests/kernels/core/test_fused_quant_layernorm.py (modified, +10/-9)


Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Your output of `python collect_env.py` here
</details>

🐛 Describe the bug

In the unit test for torch.ops._C.rms_norm_per_block_quant custom kernel, for some reason opcheck fails because it thinks the weight tensor got mutated. A closer look reveals a weird issue: the cloned weight arg is the one that gets modified, and the original weight arg stays intact. I could not find a memory issue, I manually confirmed the original weight stays intact when not using opcheck, and E2E evals look good.

torch.testing._internal.optests.generate_tests.OpCheckError: opcheck(op, ...): test_schema failed with Argument weight is not defined as mutable but was mutated (scroll up for stack trace)

https://github.com/vllm-project/vllm/blob/0ebf4e969b43d99c240fd085703ea1ed97897499/tests/kernels/core/test_fused_quant_layernorm.py#L291-L304


extent analysis

Fix Plan

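Both PRs agree that the kernel never intentionally writes to weight, so cloning weight before the call does not address the failure. Per PR #36779, the test allocated scales with shape (num_tokens, 1), while the blockwise kernel writes hidden_size / group_size scale values per token; the out-of-bounds writes landed in the weight tensor that opcheck had cloned.

  • Allocate scales with shape (num_tokens, hidden_size // group_size) before calling the blockwise kernel.
  • Optionally validate scales.numel() >= num_tokens * num_groups inside the op, as PR #36779 does with a TORCH_CHECK.

Example code (sizes are illustrative; float32 scales are an assumption):

import torch

# Illustrative sizes, not taken from the test
num_tokens, hidden_size, group_size = 16, 1024, 128

# One scale per (token, group): the blockwise kernel writes
# hidden_size // group_size values per token
block_scales = torch.empty(
    (num_tokens, hidden_size // group_size), dtype=torch.float32
)

# Pass block_scales as the scales argument; other arguments are elided
# torch.ops._C.rms_norm_per_block_quant(..., block_scales, ...)

# The buffer now holds one slot for every value the kernel writes
assert block_scales.numel() == num_tokens * (hidden_size // group_size)

Alternatively, PR #36766 excludes test_schema from the opcheck call for this op, treating the report as a false positive; test_autograd_registration and test_faketensor still run.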

Verification

To verify the fix, rerun the unit test and check that the OpCheckError is no longer raised. PR #36779 reports 16 blockwise int8 opcheck cases passing after the change.

Extra Tips

  • When working with custom kernels, make sure every output buffer is sized for what the kernel actually writes; an undersized buffer can silently corrupt neighboring allocations.
  • A mutation report on a read-only argument can point to an out-of-bounds write elsewhere rather than to a wrong schema.
  • Validate buffer sizes at the op boundary (for example with TORCH_CHECK) so callers get a clear error instead of silent OOB writes.
