[vllm] [2.12 regression][Inductor] assert_size_stride (8192 vs 512) in BERT-with-rope AOT-compiled path for Nomic v2 MoE pooling model

On torch==2.12.0 + triton==3.7.0, the Extended-Pooling pytest job for the nomic-ai/nomic-embed-text-v2-moe model fails inside an Inductor AOT-compiled forward with a runtime stride/shape assertion:

File "/root/.cache/vllm/torch_compile_cache/torch_aot_compile/<hash>/inductor_cache/ra/cra7eslwqvpfcdh65siik6ag5goi4wdpxi6ivjnoyjbr4zci2oje.py", line 664
    assert_size_stride(arg9_1, (512, 64), (64, 1))
AssertionError: expected size 8192==512, stride 64==64 at dim=0
This error most often comes from a incorrect fake (aka meta) kernel for a custom op.

The compiled function's entrypoint is vllm/model_executor/models/bert_with_rope.py:476::forward. The path is the flashinfer_autotune → runner._dummy_run kernel warmup that runs as part of _initialize_kv_caches.

The compiled artifact expects a dim-0 size of 512 (the configured max_model_len), but the runtime input has a dim-0 size of 8192, exactly 16× larger. That factor would be consistent with an MoE expert broadcast across 16 experts, where the fake kernel for the MoE custom op declared the post-MoE shape as (seq_len, hidden) while the runtime allocation is (seq_len * num_experts, hidden).
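To make the assertion message easier to read, here is a minimal pure-Python sketch of the kind of check that assert_size_stride performs. It is a simplified stand-in, not the real implementation: the function name check_size_stride is hypothetical, and the rule that strides of size-1 dimensions are ignored is an assumption about the real check. The point it illustrates is the message layout: the runtime value is printed on the left of "==" and the value baked into the compiled artifact on the right.

```python
from typing import Sequence


def check_size_stride(
    actual_size: Sequence[int],
    actual_stride: Sequence[int],
    expected_size: Sequence[int],
    expected_stride: Sequence[int],
) -> None:
    """Simplified stand-in for the guard Inductor emits (hypothetical helper)."""
    if len(actual_size) != len(expected_size):
        raise AssertionError(
            f"wrong number of dimensions {len(actual_size)}=={len(expected_size)}"
        )
    for dim, (size, stride, want_size, want_stride) in enumerate(
        zip(actual_size, actual_stride, expected_size, expected_stride)
    ):
        # Assumption: the stride of a size-1 dimension is not compared.
        stride_ok = want_size == 1 or stride == want_stride
        if size != want_size or not stride_ok:
            # Runtime value on the left of "==", compiled-in value on the right.
            raise AssertionError(
                f"expected size {size}=={want_size}, "
                f"stride {stride}=={want_stride} at dim={dim}"
            )


# Reproduce the reported mismatch: runtime (8192, 64) vs compiled-in (512, 64).
try:
    check_size_stride((8192, 64), (64, 1), (512, 64), (64, 1))
    message = ""
except AssertionError as exc:
    message = str(exc)

print(message)
```

Read this way, "expected size 8192==512" means the tensor passed at runtime had 8192 rows while the compiled graph was specialized for 512.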

Inductor's hint about a wrong fake kernel matches the diagnosis in vllm-project/vllm#40960 / pytorch/pytorch#182328 (the gpt-oss moe_forward case) — likely a similar vLLM-side register_fake / direct_register_custom_op shape mismatch, specific to the Nomic v2 MoE path. Root cause is probably in vLLM, not torch.
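The suspected failure mode can be modelled without torch. The sketch below is only an illustration of a fake (shape-only) function disagreeing with the real computation. All names (real_moe_forward, buggy_fake, fixed_fake, fake_matches_real) are hypothetical and do not correspond to actual vLLM ops, and the per-expert row expansion is the hypothesis under discussion, not confirmed behaviour.

```python
from typing import Callable, List, Tuple

Shape = Tuple[int, int]


def real_moe_forward(tokens: List[List[float]], num_experts: int) -> List[List[float]]:
    """Hypothetical 'real kernel': emits one output row per (token, expert) pair."""
    return [row for row in tokens for _ in range(num_experts)]


def buggy_fake(shape: Shape, num_experts: int) -> Shape:
    # Declares (seq_len, hidden) and ignores the expansion across experts.
    return shape


def fixed_fake(shape: Shape, num_experts: int) -> Shape:
    # Declares (seq_len * num_experts, hidden), matching the real kernel above.
    seq_len, hidden = shape
    return (seq_len * num_experts, hidden)


def fake_matches_real(
    fake: Callable[[Shape, int], Shape], seq_len: int, hidden: int, num_experts: int
) -> bool:
    """Run the real function once and compare its shape with the declared one."""
    tokens = [[0.0] * hidden for _ in range(seq_len)]
    out = real_moe_forward(tokens, num_experts)
    real_shape = (len(out), len(out[0]))
    return fake((seq_len, hidden), num_experts) == real_shape


buggy_ok = fake_matches_real(buggy_fake, 512, 64, 16)
fixed_ok = fake_matches_real(fixed_fake, 512, 64, 16)
print(buggy_ok, fixed_ok)
```

A compiler that trusts the buggy declaration would specialize downstream code for 512 rows and then fail a size check when 8192 rows arrive, which is the pattern the Inductor hint describes.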

Environment

  • torch==2.12.0+cu130
  • triton==3.7.0
  • torchvision==0.27.0
  • CUDA 13.0
  • GPU: H200 (Buildkite h200-ci-6-3)
  • Python 3.12
  • vLLM commit d2792bf2088c (PR vllm-project/vllm#42848)
  • Model: nomic-ai/nomic-embed-text-v2-moe, runner=pooling, max_model_len=512, dtype=auto (downcast to fp16)

Reproduction

pytest -x tests/models/language/pooling/test_nomic_max_model_len.py::test_use_rope_scaling_legal[model_info1]

model_info1 selects nomic-ai/nomic-embed-text-v2-moe. The test instantiates VllmRunner with hf_overrides={"rope_parameters": {"rope_theta": ..., "rope_type": "yarn", "factor": ...}} and max_model_len=512. The crash happens during the flashinfer_autotune warmup.

Traceback (trimmed)

vllm/v1/engine/core.py:283 _initialize_kv_caches
  → vllm/v1/executor/abstract.py:124 initialize_from_config
    → vllm/v1/worker/gpu_worker.py:608 compile_or_warm_up_model
      → vllm/model_executor/warmup/kernel_warmup.py:133 flashinfer_autotune
        → runner._dummy_run
          → vllm/model_executor/models/bert_with_rope.py:476 forward
            → AOT compiled fn → Inductor output_code
              → assert_size_stride(arg9_1, (512, 64), (64, 1))
              AssertionError: expected size 8192==512, stride 64==64 at dim=0
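When triaging several failures like the one in the traceback above, it can help to pull the numbers out of the assertion text mechanically. The following is a small sketch using only the standard library; the helper name parse_size_stride_error and the dictionary keys are hypothetical, and the pattern assumes the message layout shown in this report.

```python
import re

# Layout assumed from the message in this report:
# "expected size <actual>==<expected>, stride <actual>==<expected> at dim=<n>"
_PATTERN = re.compile(
    r"expected size (?P<actual_size>\d+)==(?P<expected_size>\d+), "
    r"stride (?P<actual_stride>\d+)==(?P<expected_stride>\d+) "
    r"at dim=(?P<dim>\d+)"
)


def parse_size_stride_error(message: str) -> dict:
    """Extract the integer fields from an assert_size_stride failure message."""
    match = _PATTERN.search(message)
    if match is None:
        raise ValueError("not an assert_size_stride message")
    return {key: int(value) for key, value in match.groupdict().items()}


info = parse_size_stride_error(
    "AssertionError: expected size 8192==512, stride 64==64 at dim=0"
)
ratio = info["actual_size"] / info["expected_size"]
print(info, ratio)
```

For the failure here the ratio is 16, and the stride fields agree, so only the dim-0 size differs between the compiled artifact and the runtime tensor.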

Diagnosis question

Is the assert_size_stride insertion behavior new in 2.12 for this code path (e.g. a Tag.needs_fixed_stride_order change that started routing the MoE expert-broadcast through the assertion)? If yes, vLLM should fix the fake kernel; this can be tracked vLLM-side and closed not_planned here. If the assertion existed before but the fake kernel was bypassed under 2.11, please confirm so we can prioritize the vLLM-side fake-impl fix.
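One way to gather evidence for this question is to compare the Inductor output code generated under both torch versions. The sketch below assumes you have kept two copies of the compile cache (for example, copies of /root/.cache/vllm/torch_compile_cache taken after a 2.11 run and a 2.12 run); the helper names collect_asserts and new_asserts are hypothetical. It only scans generated .py files for assert_size_stride calls and reports those present in the newer cache but not in the older one. The demo at the end uses temporary directories with made-up file contents so the block runs anywhere.

```python
import re
import tempfile
from pathlib import Path
from typing import Set, Tuple

# Matches e.g. assert_size_stride(arg9_1, (512, 64), (64, 1) ...
ASSERT_RE = re.compile(
    r"assert_size_stride\((\w+),\s*(\([^)]*\)),\s*(\([^)]*\))"
)

Assert = Tuple[str, str, str]  # (argument name, size tuple text, stride tuple text)


def collect_asserts(cache_dir: str) -> Set[Assert]:
    """Collect every assert_size_stride call found in generated .py files."""
    found: Set[Assert] = set()
    for path in Path(cache_dir).rglob("*.py"):
        text = path.read_text(errors="ignore")
        found.update(ASSERT_RE.findall(text))
    return found


def new_asserts(old_cache: str, new_cache: str) -> Set[Assert]:
    """Asserts that appear in new_cache but not in old_cache."""
    return collect_asserts(new_cache) - collect_asserts(old_cache)


# Demo with synthetic files standing in for the two caches.
with tempfile.TemporaryDirectory() as old_dir, tempfile.TemporaryDirectory() as new_dir:
    Path(old_dir, "graph.py").write_text("buf0 = make_buffer((512, 64), (64, 1))\n")
    Path(new_dir, "graph.py").write_text(
        "assert_size_stride(arg9_1, (512, 64), (64, 1))\n"
    )
    added = new_asserts(old_dir, new_dir)

print(added)
```

Argument names such as arg9_1 are not guaranteed to be stable across torch versions, so a difference reported by this script is a starting point for manual inspection rather than proof that the assertion is new.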

Links

cc @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @aakhundov @coconutruben @jataylo @avikchaudhuri @zhxchen17 @tugsbayasgalan @angelayi @ydwu4 @bdhirsh @bobrenjc93 @aorenste @desertfire @yushangdi @iupaikov-amd
