pytorch - ✅(Solved) Fix [Discussion] Disable size_asserts by default in inductor for vllm x compile inference [2 pull requests, 5 comments, 4 participants]

GitHub stats
pytorch/pytorch#177719 · Fetched 2026-04-08 00:57:09
Comments: 5
Participants: 4
Timeline: 121
Reactions: 2
Timeline (top): subscribed ×30, mentioned ×29, referenced ×21, unsubscribed ×21

Error Message

"This error most often comes from a incorrect fake (aka meta) kernel for a custom op."

Root Cause

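The assertion catches bugs in custom op meta kernels where the fake tensor implementation returns the wrong output shape. The issue proposes disabling it during inference for vLLM models for two reasons: the overhead on large models, and the precedent of alignment_asserts already being disabled by default in fbcode.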

Fix Action

Fix / Workaround

Each assert_size_stride call incurs the following costs:

  1. Python function dispatch overhead
  2. THPVariable_Unpack(item) — unwrap Python tensor
  3. tensor.size(i) + tensor.stride(i) — read metadata for each dimension
  4. THPUtils_unpackLong() — unpack expected values from Python tuples
  5. String formatting for error messages (allocated but never used)

PR fix notes

PR #37479: [Perf] Disable inductor size_asserts by default for serving performance

Description (problem / solution / changelog)

Inductor generates assert_size_stride() calls in compiled output code to validate tensor shapes and strides at runtime. For large models like DeepSeek-R1 671B with piecewise compilation, this results in ~340 C++ assert calls per forward pass, adding ~2ms overhead (~2.6% of TPOT at request rate 15).

These assertions are useful during development but unnecessary during production serving where tensor shapes are validated during the first compilation. This change disables size_asserts by default in vLLM's inductor compile config.

Users can re-enable assertions for debugging via: --compilation-config '{"inductor_compile_config": {"size_asserts": true}}'

PyTorch is also working on an assert-once solution upstream: https://github.com/pytorch/pytorch/issues/177719


Changed files

  • vllm/config/compilation.py (modified, +10/-0)

PR #37485: [Perf] Disable inductor runtime asserts by default for serving perfor…

Description (problem / solution / changelog)

Inductor generates assert_size_stride() and assert_alignment() calls in compiled output code to validate tensor shapes, strides, and memory alignment at runtime. For large models like DeepSeek-R1 671B with piecewise compilation, this results in ~340 assert_size_stride + ~60 assert_alignment calls per forward pass, adding ~2ms overhead (~2.6% of TPOT at request rate 15).

These assertions are useful during development but unnecessary during production serving where tensor shapes are validated during the first compilation. This change disables size_asserts, alignment_asserts, and scalar_asserts by default in vLLM's inductor compile config.

  • size_asserts: validates tensor shape/stride on every call (~340 calls/fwd)
  • alignment_asserts: validates memory alignment (~60 calls/fwd, already disabled by default in fbcode)
  • scalar_asserts: validates dynamic shape constraints (no-op for vLLM since dynamic=False)

Users can re-enable assertions for debugging via:

--compilation-config '{"inductor_compile_config": {"size_asserts": true, "alignment_asserts": true, "scalar_asserts": true}}'
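The flag value above is a JSON object, so it can be generated rather than typed by hand. A minimal sketch, assuming only the three key names quoted in the PR description; `compilation_config_arg` is a hypothetical helper, not part of vLLM:

```python
import json
import shlex

# Flag names taken from the PR description above.
ASSERT_FLAGS = ("size_asserts", "alignment_asserts", "scalar_asserts")


def compilation_config_arg(enabled: bool) -> str:
    """Return the JSON payload for --compilation-config (hypothetical helper)."""
    inductor_cfg = {name: enabled for name in ASSERT_FLAGS}
    return json.dumps({"inductor_compile_config": inductor_cfg})


if __name__ == "__main__":
    payload = compilation_config_arg(True)
    # shlex.quote wraps the JSON so a POSIX shell passes it as one argument.
    print("--compilation-config", shlex.quote(payload))
```

Building the payload with `json.dumps` avoids the quoting mistakes that are easy to make when nesting JSON inside shell quotes.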

PyTorch is also working on an assert-once solution upstream: https://github.com/pytorch/pytorch/issues/177719

Purpose

Disable size_asserts, alignment_asserts, and scalar_asserts in inductor_compile_config to avoid ~2ms overhead per forward pass on large models.

Test Plan

DeepSeek-R1 671B, TP=8, B200 GPU, vLLM 0.16.0, request rate=15.

Test Result

| Config | TPOT (ms) | Δ |
| --- | --- | --- |
| inductor (default) | 81.21 | baseline |
| inductor (asserts off) | 79.13 | −2.08 ms (−2.6%) |

Changed files

  • docs/design/debug_vllm_compile.md (modified, +20/-0)
  • tests/compile/test_config.py (modified, +56/-0)
  • vllm/config/compilation.py (modified, +19/-0)

Code Example

size_asserts = os.environ.get("TORCHINDUCTOR_SIZE_ASSERTS",
    "0" if torch.is_inference_mode_enabled() else "1") == "1"

---

# put correctness assertions in generated code
size_asserts = os.environ.get("TORCHINDUCTOR_SIZE_ASSERTS", "1") == "1"

---

def call(args):
    arg0_1, arg1_1, arg2_1, ... = args
    assert_size_stride(arg0_1, (s0, 7168), (7168, 1))   # ← debug check
    assert_size_stride(arg1_1, (s0, 128), (128, 1))      # ← debug check
    ...

---

static PyObject* assert_size_stride(PyObject* dummy, PyObject* args) {
  // "Assert that a given tensor has a given size/stride,
  //  but ignore strides of size==1 dimensions.
  //  Implemented in C++ as this is on the hot path."
  ...
  // Error message:
  msg << "This error most often comes from a incorrect fake (aka meta) "
      << "kernel for a custom op.";
}

---

# torch/_inductor/config.py:234
   # Disable by default in fbcode
   alignment_asserts = (
       os.environ.get("TORCHINDUCTOR_ALIGNMENT_ASSERTS",
                       "0" if is_fbcode() else "1") == "1"
   )

---

if config.size_asserts:
    def run(new_inputs):
        # includes: assert dst.data_ptr() == src.data_ptr()
        ...
else:
    # optimized path: pre-computed copy_indices, no asserts
    copy_indices = [idx for idx in range(...) if idx not in static_input_idxs]
    def run(new_inputs):
        for idx in copy_indices:
            index_expanded_dims_and_copy_(static_inputs[idx], src, expanded_dims)
        ...
RAW_BUFFER

Disabling inductor's runtime correctness assertions (size_asserts, scalar_asserts, alignment_asserts) improved mean TPOT by ~3ms (71.50 → 68.35ms, −4.4%) on DeepSeek-R1 671B (MoE, TP=8) in serve mode at rate=15. The improvement comes from eliminating ~86K assert_size_stride calls on the non-CUDAGraph code path, as well as a more optimized CUDAGraph replay path when size_asserts is disabled.

These asserts are correctness checks that verify the inductor's compile-time shape analysis matches runtime tensor shapes. They never fire when the compilation is correct, and add pure overhead in production.

We should consider disabling size_asserts, alignment_asserts, and scalar_asserts for vLLM inference, similar to how alignment_asserts is already disabled by default in internal fbcode. Alternatively, the checks could run only once at the beginning instead of on every Inductor call.

Proposal

Turn the asserts off for inference only. Alternatively, it might be better to assert on first use instead of asserting everything at the beginning of output_code.py.

size_asserts = os.environ.get("TORCHINDUCTOR_SIZE_ASSERTS",
    "0" if torch.is_inference_mode_enabled() else "1") == "1"

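The "assert on first use" alternative can be illustrated in plain Python. This is only a sketch of the idea, not PyTorch's implementation: `FakeTensorMeta` and `AssertOnce` are hypothetical names, and the check is simplified to an exact size/stride comparison.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class FakeTensorMeta:
    """Stand-in for a tensor's metadata (hypothetical, not a torch type)."""
    size: Tuple[int, ...]
    stride: Tuple[int, ...]


class AssertOnce:
    """Check size/stride on the first call only, then become a no-op."""

    def __init__(self, want_size, want_stride):
        self.want_size = tuple(want_size)
        self.want_stride = tuple(want_stride)
        self.checked = False
        self.checks_run = 0

    def __call__(self, t: FakeTensorMeta) -> None:
        if self.checked:
            return  # steady state: one attribute read and a branch
        self.checks_run += 1
        if t.size != self.want_size or t.stride != self.want_stride:
            raise AssertionError(
                f"expected size={self.want_size}, stride={self.want_stride}; "
                f"got size={t.size}, stride={t.stride}"
            )
        self.checked = True


check_arg0 = AssertOnce((4, 7168), (7168, 1))
x = FakeTensorMeta(size=(4, 7168), stride=(7168, 1))
for _ in range(1000):
    check_arg0(x)  # only the first iteration performs the comparison
```

The trade-off in this sketch is that a mismatch appearing after the first successful call goes undetected, which is the price of removing the per-call cost.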
Benchmark Data

Model: DeepSeek-R1 671B (MoE, FP8, TP=8 on 8× NVIDIA B200)
Workload: vLLM 0.16.0 serving, 450 prompts @ rate=15, fuse_norm_quant=false

| Configuration | Mean TPOT (ms) | vs Inductor Baseline |
| --- | --- | --- |
| Inductor (default, asserts ON) | 71.50 | |
| Inductor + SIZE_ASSERTS=0 | 68.35 | −3.15 ms (−4.4%) |
| Eager baseline | 69.58 | −1.92 ms |

With SIZE_ASSERTS=0, inductor becomes 1.23ms faster than eager (p < 0.05, Cohen's d = 3.20).
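The deltas quoted above follow directly from the mean TPOT values. A small sketch that recomputes them; `tpot_delta` is a hypothetical helper, and the inputs are the three means reported in the benchmark table:

```python
def tpot_delta(baseline_ms: float, candidate_ms: float):
    """Return (absolute delta in ms, relative delta in percent) vs. baseline."""
    delta = candidate_ms - baseline_ms
    return round(delta, 2), round(100.0 * delta / baseline_ms, 1)


# Mean TPOT values quoted in the benchmark data.
INDUCTOR_ASSERTS_ON = 71.50
INDUCTOR_ASSERTS_OFF = 68.35
EAGER = 69.58

# Asserts off vs. Inductor baseline.
print(tpot_delta(INDUCTOR_ASSERTS_ON, INDUCTOR_ASSERTS_OFF))
# Eager vs. Inductor baseline.
print(tpot_delta(INDUCTOR_ASSERTS_ON, EAGER))
# Asserts off vs. eager: the 1.23 ms gap mentioned in the text.
print(tpot_delta(EAGER, INDUCTOR_ASSERTS_OFF))
```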

Perfetto Trace Verification

| Trace | assert_size_stride search hits |
| --- | --- |
| Inductor (asserts ON) | 86,284 |
| Inductor + SIZE_ASSERTS=0 | 0 |
  • Screenshot 1: inductor trace showing assert_size_stride 86,284 hits <img width="1419" height="686" alt="Image" src="https://github.com/user-attachments/assets/f833ba8d-7391-4803-b419-f7a278b5615a" />
  • Screenshot 2: noassert trace showing 0 hits <img width="1106" height="594" alt="Image" src="https://github.com/user-attachments/assets/7a6921a0-6064-499d-b03c-2f29279b7fbf" />

The duration dropped from 94 µs to 51 µs. The end-to-end benchmark shows that Inductor becomes faster than eager on vLLM DeepSeek-R1.

An environment variable controls this setting. From torch/_inductor/config.py:225:

# put correctness assertions in generated code
size_asserts = os.environ.get("TORCHINDUCTOR_SIZE_ASSERTS", "1") == "1"
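The parsing pattern in that line has a detail worth spelling out: the comparison is against the exact string "1". The sketch below mirrors the pattern with a hypothetical helper, `read_bool_flag`, that takes the environment as a plain mapping so it can be exercised without touching the real process environment:

```python
import os
from typing import Mapping


def read_bool_flag(name: str, default: str, env: Mapping[str, str] = os.environ) -> bool:
    """Mirror the `os.environ.get(name, default) == "1"` pattern quoted above."""
    return env.get(name, default) == "1"


FLAG = "TORCHINDUCTOR_SIZE_ASSERTS"

unset = read_bool_flag(FLAG, "1", env={})              # falls back to the default
disabled = read_bool_flag(FLAG, "1", env={FLAG: "0"})  # explicit opt-out
# Only the exact string "1" enables the flag under this pattern.
spelled_out = read_bool_flag(FLAG, "1", env={FLAG: "true"})
```

Because the quoted assignment sits at module level, the variable presumably has to be set before `torch._inductor.config` is first imported for it to take effect.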

The generated code looks like:

def call(args):
    arg0_1, arg1_1, arg2_1, ... = args
    assert_size_stride(arg0_1, (s0, 7168), (7168, 1))   # ← debug check
    assert_size_stride(arg1_1, (s0, 128), (128, 1))      # ← debug check
    ...

The C++ implementation (torch/csrc/dynamo/guards.cpp:926) validates that each tensor's runtime size/stride matches what inductor predicted at compile time:

static PyObject* assert_size_stride(PyObject* dummy, PyObject* args) {
  // "Assert that a given tensor has a given size/stride,
  //  but ignore strides of size==1 dimensions.
  //  Implemented in C++ as this is on the hot path."
  ...
  // Error message:
  msg << "This error most often comes from a incorrect fake (aka meta) "
      << "kernel for a custom op.";
}

It catches bugs in custom op meta kernels where the fake tensor implementation returns the wrong output shape. We can consider disabling it during inference for vLLM models because:

  1. Overhead on large models: vLLM DeepSeek-R1 has 122 piecewise CUDA graph partitions × ~700 assert calls per step ≈ 86K calls in total. The overhead scales with model complexity while providing zero value in production.

  2. Precedent in PyTorch: alignment_asserts is already disabled by default in fbcode:

    # torch/_inductor/config.py:234
    # Disable by default in fbcode
    alignment_asserts = (
        os.environ.get("TORCHINDUCTOR_ALIGNMENT_ASSERTS",
                        "0" if is_fbcode() else "1") == "1"
    )
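As a quick sanity check, the "86K" estimate in the first point can be reconciled with the Perfetto hit count reported earlier. The constants below are the figures quoted in the issue; the variable names are illustrative only:

```python
PARTITIONS = 122            # piecewise CUDA graph partitions (from the issue)
ASSERT_CALLS_FACTOR = 700   # the "~700 assert calls" figure quoted in the issue
TRACE_HITS = 86_284         # Perfetto search hits reported in the trace table

# Product of the two rounded figures from the issue text.
estimated_calls = PARTITIONS * ASSERT_CALLS_FACTOR

# Average number of hits per partition implied by the trace count.
implied_per_partition = TRACE_HITS / PARTITIONS

print(estimated_calls, round(implied_per_partition))
```

The product of the rounded figures lands slightly below the trace count, which is consistent with "~700" being an approximation.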

Each assert_size_stride call incurs the following costs:

  1. Python function dispatch overhead
  2. THPVariable_Unpack(item) — unwrap Python tensor
  3. tensor.size(i) + tensor.stride(i) — read metadata for each dimension
  4. THPUtils_unpackLong() — unpack expected values from Python tuples
  5. String formatting for error messages (allocated but never used)
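The per-call work listed above boils down to a per-dimension comparison. The following pure-Python model is a sketch of the semantics described in the C++ comment (size and stride must match, except that strides of size==1 dimensions are ignored). It is not the torch implementation, and `assert_size_stride_py` is a hypothetical name:

```python
from typing import Sequence


def assert_size_stride_py(
    actual_size: Sequence[int],
    actual_stride: Sequence[int],
    want_size: Sequence[int],
    want_stride: Sequence[int],
) -> None:
    """Pure-Python model of the check; NOT the torch implementation."""
    if len(actual_size) != len(want_size):
        raise AssertionError(
            f"wrong number of dimensions: {len(actual_size)} != {len(want_size)}"
        )
    for i, (a_sz, a_st, w_sz, w_st) in enumerate(
        zip(actual_size, actual_stride, want_size, want_stride)
    ):
        # Strides of size==1 dimensions are ignored, per the C++ comment.
        if a_sz != w_sz or (w_sz != 1 and a_st != w_st):
            raise AssertionError(
                f"dim {i}: expected size {w_sz}, stride {w_st}; "
                f"got size {a_sz}, stride {a_st}"
            )


# Matching metadata passes silently.
assert_size_stride_py((4, 7168), (7168, 1), (4, 7168), (7168, 1))
# A size-1 dimension may carry any stride.
assert_size_stride_py((1, 128), (999, 1), (1, 128), (128, 1))
```

Note that the error string in this sketch is only built on failure, whereas the list above describes formatting work that is allocated on every call.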

Additionally, compile_fx.py:1954 shows that size_asserts also controls extra checks in the CUDA graph replay path:

if config.size_asserts:
    def run(new_inputs):
        # includes: assert dst.data_ptr() == src.data_ptr()
        ...
else:
    # optimized path: pre-computed copy_indices, no asserts
    copy_indices = [idx for idx in range(...) if idx not in static_input_idxs]
    def run(new_inputs):
        for idx in copy_indices:
            index_expanded_dims_and_copy_(static_inputs[idx], src, expanded_dims)
        ...
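The optimized branch precomputes which inputs need copying so that each replay only loops over those indices. A toy sketch of that pattern, with Python lists standing in for tensors and a slice assignment standing in for the copy helper; `make_replay_copy` and the use of `range(len(static_inputs))` are assumptions for illustration, since the source elides the real range:

```python
from typing import Callable, List, Sequence


def make_replay_copy(
    static_inputs: List[List[float]],
    static_input_idxs: Sequence[int],
) -> Callable[[List[List[float]]], None]:
    """Build a run() that copies only non-static inputs (lists stand in for tensors)."""
    static = set(static_input_idxs)
    # Computed once at build time, not on every replay.
    copy_indices = [i for i in range(len(static_inputs)) if i not in static]

    def run(new_inputs: List[List[float]]) -> None:
        for idx in copy_indices:
            # In-place copy into the placeholder buffer.
            static_inputs[idx][:] = new_inputs[idx]

    return run


placeholders = [[0.0, 0.0], [9.0, 9.0], [0.0, 0.0]]
run = make_replay_copy(placeholders, static_input_idxs=[1])
run([[1.0, 2.0], [7.0, 7.0], [3.0, 4.0]])
print(placeholders)
```

Index 1 is treated as static here, so its placeholder is left untouched while the other two are overwritten in place.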

Environment

  • PyTorch: 2.12.0a0+gitb05b2d3 (nightly)
  • Model: DeepSeek-R1 671B (MoE, FP8 quantized)
  • Hardware: 8× NVIDIA B200 (TP=8)
  • vLLM: 0.16.0rc2
  • Inductor config: fuse_norm_quant=false, piecewise CUDA graphs

cc @chauhang @penguinwu @zou3519 @pytorchbot

extent analysis

Fix Plan

To disable Inductor's runtime asserts for vLLM inference, follow these steps (they cover size_asserts and alignment_asserts):

1. **Set the environment variable**: Before running your PyTorch application, set `TORCHINDUCTOR_SIZE_ASSERTS` to `0` to disable size assertions.

       export TORCHINDUCTOR_SIZE_ASSERTS=0

2. **Modify the PyTorch configuration**: Alternatively, apply the change proposed in the issue so that size assertions default to off in inference mode. This is a proposal, not merged behavior; it would update `torch/_inductor/config.py` as follows:
   ```python
   size_asserts = os.environ.get("TORCHINDUCTOR_SIZE_ASSERTS",
       "0" if torch.is_inference_mode_enabled() else "1") == "1"
   ```
3. **Disable alignment assertions**: If not already disabled, set `TORCHINDUCTOR_ALIGNMENT_ASSERTS` to `0` to disable alignment assertions.

       export TORCHINDUCTOR_ALIGNMENT_ASSERTS=0

4. **Verify the fix**: Run your application with the updated configuration and confirm that the assertions are gone, for example by comparing performance metrics or by searching a trace for `assert_size_stride` calls.

### Verification
To verify that the fix worked, you can:

1. **Check performance metrics**: Compare the mean TPOT (ms) before and after applying the fix to confirm that performance has improved.
2. **Use a Perfetto trace**: Search the trace for `assert_size_stride`; the issue reports 86,284 hits with asserts on and 0 with them off.
3. **Watch for shape errors**: With the assertions off, a wrong meta kernel no longer raises the size/stride error, so re-enable them if outputs look wrong.
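Besides searching a trace, the generated code itself can be inspected. The sketch below counts `assert_size_stride(` call sites in a string of generated source; `count_size_asserts` is a hypothetical helper, and how the generated code is obtained is left as an assumption. This is a static count of call sites, not the runtime hit count a Perfetto search reports.

```python
import re

# Matches call sites such as `assert_size_stride(arg0_1, ...)`.
ASSERT_CALL = re.compile(r"\bassert_size_stride\s*\(")


def count_size_asserts(generated_source: str) -> int:
    """Count assert_size_stride call sites in generated code (hypothetical helper)."""
    return len(ASSERT_CALL.findall(generated_source))


# Shaped like the generated code quoted in the issue, with concrete sizes.
sample = '''
def call(args):
    arg0_1, arg1_1 = args
    assert_size_stride(arg0_1, (4, 7168), (7168, 1))
    assert_size_stride(arg1_1, (4, 128), (128, 1))
'''

print(count_size_asserts(sample))
```

With size assertions disabled, the same count over the regenerated code would be expected to drop to zero.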

### Extra Tips
* Make sure to test your application thoroughly after applying the fix to ensure that it is working as expected.
* Consider setting up a benchmarking framework to regularly test the performance of your application and detect any regressions.
* If you encounter any issues or errors after applying the fix, you can try re-enabling the size assertions to help diagnose the problem.
