pytorch - 💡(How to fix) Fix Inductor failure with `view_as_complex` in fused Multi-Head Attention subgraph [1 participants]

GitHub stats
pytorch/pytorch#176986 · Fetched 2026-04-08 00:23:18
Comments: 0 · Participants: 1 · Timeline events: 145 · Reactions: 0
Timeline (top): subscribed ×70, mentioned ×69, labeled ×5, cross-referenced ×1

torch.compile with the Inductor backend raises a RuntimeError: Tensor must have a stride divisible by 2 for all but last dimension when attempting to fuse a subgraph containing Multi-Head Rotary Positional Embeddings (RoPE) and a subsequent matmul.

The error specifically occurs when two parallel complex-valued operation chains (representing Q and K rotations) are fused into a consumer matmul. Eager mode handles the same inputs correctly.
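
For context, the stride rule behind this message can be seen in eager mode, without torch.compile. The sketch below is not taken from the issue and its tensor shapes are arbitrary: it builds a float tensor whose row stride is odd, which torch.view_as_complex rejects, and then shows that a contiguous copy is accepted.

```python
import torch

# Slicing a (4, 3) tensor down to (4, 2) keeps the row stride at 3, which is odd.
base = torch.randn(4, 3)
odd = base[:, :2]
print(odd.shape, odd.stride())

try:
    torch.view_as_complex(odd)
    odd_error = None
except RuntimeError as exc:
    odd_error = str(exc)
print("odd-stride input:", odd_error)

# A contiguous copy has strides (2, 1), so the same call succeeds.
ok = torch.view_as_complex(odd.contiguous())
print(ok.shape, ok.dtype)
```

In the reported failure the same check fires during compilation even though the script calls .contiguous() first, which is why eager and compiled runs diverge.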

🐛 Describe the bug

Minimal Reproduction Script

import torch
import torch.nn as nn

class InductorBugRepro(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.w_q = nn.Parameter(torch.randn(d_model, d_model))
        self.w_k = nn.Parameter(torch.randn(d_model, d_model))

    def forward(self, x, rope):
        B, L, _ = x.shape
        H, D = 16, 64
        
        # 1. Linear projection + Transpose (Standard MHSA pattern)
        q = (x @ self.w_q.T).view(B, L, H, D).transpose(1, 2)
        k = (x @ self.w_k.T).view(B, L, H, D).transpose(1, 2)
        
        # 2. Parallel complex op chains (RoPE pattern)
        # Each chain uses .contiguous() before view_as_complex
        q_rot = torch.view_as_real(torch.view_as_complex(q.view(*q.shape[:-1], -1, 2).contiguous()) * rope)
        k_rot = torch.view_as_real(torch.view_as_complex(k.view(*k.shape[:-1], -1, 2).contiguous()) * rope)
        
        # 3. Matmul consumer forces fusion of the RoPE outputs
        return torch.matmul(q_rot.view(*q.shape), k_rot.view(*k.shape).transpose(-1, -2))

def run():
    device = "cuda"
    B, H, L, D = 1, 16, 1, 64
    d_model = H * D
    
    model = InductorBugRepro(d_model).to(device)
    compiled_model = torch.compile(model)
    
    x = torch.randn(B, L, d_model, device=device)
    rope = torch.randn(L, D // 2, dtype=torch.complex64, device=device)
    
    print(f"Running reproduction on {device}...")
    
    # Passes
    _ = model(x, rope)
    print("Eager mode: Success")
    
    # Fails
    try:
        _ = compiled_model(x, rope)
        print("Compiled mode: Success")
    except Exception as e:
        print(f"Compiled mode: FAILED\n\nError:\n{e}")

if __name__ == "__main__":
    run()

Output Log

Running reproduction on cuda...
Eager mode: Success
E0310 03:16:58.968000 463 torch/_subclasses/fake_tensor.py:2827] [1/0] failed while attempting to run meta for aten.view_as_complex.default
E0310 03:16:58.968000 463 torch/_subclasses/fake_tensor.py:2827] [1/0] Traceback (most recent call last):
E0310 03:16:58.968000 463 torch/_subclasses/fake_tensor.py:2827] [1/0]   File "/usr/local/lib/python3.12/dist-packages/torch/_subclasses/fake_tensor.py", line 2823, in _dispatch_impl
E0310 03:16:58.968000 463 torch/_subclasses/fake_tensor.py:2827] [1/0]     r = func(*args, **kwargs)
E0310 03:16:58.968000 463 torch/_subclasses/fake_tensor.py:2827] [1/0]         ^^^^^^^^^^^^^^^^^^^^^
E0310 03:16:58.968000 463 torch/_subclasses/fake_tensor.py:2827] [1/0]   File "/usr/local/lib/python3.12/dist-packages/torch/_ops.py", line 819, in __call__
E0310 03:16:58.968000 463 torch/_subclasses/fake_tensor.py:2827] [1/0]     return self._op(*args, **kwargs)
E0310 03:16:58.968000 463 torch/_subclasses/fake_tensor.py:2827] [1/0]            ^^^^^^^^^^^^^^^^^^^^^^^^^
E0310 03:16:58.968000 463 torch/_subclasses/fake_tensor.py:2827] [1/0] RuntimeError: Tensor must have a stride divisible by 2 for all but last dimension
Compiled mode: FAILED

Error:
backend='inductor' raised:
RuntimeError: Tensor must have a stride divisible by 2 for all but last dimension

Set TORCHDYNAMO_VERBOSE=1 for the internal stack trace (please do this especially if you're reporting a bug to PyTorch). For even more developer context, set TORCH_LOGS="+dynamo"

/usr/local/lib/python3.12/dist-packages/torch/autograd/graph.py:865: UserWarning: Error detected in ViewAsRealBackward0. Traceback of forward call that caused the error:
  File "/tmp/ipykernel_463/4071400647.py", line 21, in forward
    k_rot = torch.view_as_real(torch.view_as_complex(k.view(*k.shape[:-1], -1, 2).contiguous()) * rope)
 (Triggered internally at /pytorch/torch/csrc/autograd/python_anomaly_mode.cpp:122.)
  return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass

Expected Behavior

The compiled model should execute correctly and yield results consistent with eager mode. Inductor should respect the .contiguous() call, which ensures that the stride requirements of view_as_complex are met, even when fusing across the transpose and matmul operations.

Actual Behavior

The compiler fails during the fake_tensor propagation or kernel lowering phase with: RuntimeError: Tensor must have a stride divisible by 2 for all but last dimension

This error indicates that, despite the explicit .contiguous() call, the layout that view_as_complex sees during compilation has at least one non-last stride that is odd, so the tensor cannot be reinterpreted as consecutive (real, imaginary) pairs of floats.

Versions

Reproduced environment:

  • PyTorch Version: 2.10.0
  • Backend: Inductor
  • Device: CPU, CUDA, MPS

cc @ezyang @anjali411 @dylanbespalko @mruberry @nikitaved @amjames @chauhang @penguinwu @voznesenskym @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @aakhundov @coconutruben @jataylo

extent analysis

Fix Plan

1. Update PyTorch to the latest version

The issue was reported against PyTorch 2.10.0. Check whether a newer release or nightly build still reproduces the error, since it might already be resolved there.

2. Exclude the RoPE chain from compilation

torch.compile does not take a per-operator exclude list. To keep the rest of the model compiled, move the view_as_complex chain into a helper decorated with torch.compiler.disable, so that Dynamo runs the helper eagerly at the cost of a graph break. This may avoid the failing stride check, but it has not been verified against this issue.

@torch.compiler.disable
def apply_rope(t, rope):
    return torch.view_as_real(torch.view_as_complex(t.view(*t.shape[:-1], -1, 2).contiguous()) * rope)

In forward, compute q_rot = apply_rope(q, rope) and k_rot = apply_rope(k, rope).

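3. Use a different backend

If the failure is specific to Inductor, compiling with another torch.compile backend may sidestep it. "cuda" and "cpu" are device names, not backends; valid choices include the debugging backends "aot_eager" and "eager", which skip Inductor code generation and therefore also give up its speedups. torch._dynamo.list_backends() lists backends registered in your installation.

compiled_model = torch.compile(model, backend="aot_eager")

or

compiled_model = torch.compile(model, backend="eager")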
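
Trying backends in order can also narrow down which compilation stage fails. The helper below is a hypothetical sketch (the name first_failing_backend and its return format are assumptions, not part of PyTorch): it compiles a callable with each backend in turn and reports the first one that raises.

```python
import torch

def first_failing_backend(fn, inputs, backends=("eager", "aot_eager", "inductor")):
    """Hypothetical helper: return (backend_name, message) for the first
    backend that raises, or None if every backend runs successfully."""
    for name in backends:
        # Clear compilation caches so each backend compiles from scratch.
        torch.compiler.reset()
        try:
            torch.compile(fn, backend=name)(*inputs)
        except Exception as exc:
            return name, str(exc)
    return None
```

With the reproduction script, first_failing_backend(model, (x, rope)) would be expected to report "inductor" if the earlier stages pass, though that outcome is not confirmed in the issue.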

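4. Modify the code to avoid the issue

If none of the above works, rewrite the rotation so that view_as_complex is never called. A complex product (a + ib)(c + id) equals (ac - bd) + i(ad + bc), so the same result can be computed on real tensors. This is an untested workaround for this issue.

q_re, q_im = q.reshape(*q.shape[:-1], -1, 2).unbind(-1)
q_rot = torch.stack((q_re * rope.real - q_im * rope.imag, q_re * rope.imag + q_im * rope.real), dim=-1)
# Repeat the same two lines for k to obtain k_rot.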
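
Before swapping out the complex path, it helps to confirm in eager mode that a real-arithmetic rotation gives the same numbers. The sketch below assumes two hypothetical helpers, rope_complex (the pattern from the reproduction script) and rope_real (the rotation written with real multiplies), and compares them on the shapes used in the issue. Whether the real-valued version compiles cleanly under Inductor is not confirmed here.

```python
import torch

def rope_complex(t, rope):
    # Pattern from the reproduction script: pair up the last dim, multiply as complex.
    pairs = t.reshape(*t.shape[:-1], -1, 2).contiguous()
    return torch.view_as_real(torch.view_as_complex(pairs) * rope).reshape(t.shape)

def rope_real(t, rope_re, rope_im):
    # Hypothetical replacement: the same rotation without complex views.
    # (a + ib) * (c + id) = (ac - bd) + i(ad + bc)
    a, b = t.reshape(*t.shape[:-1], -1, 2).unbind(-1)
    out = torch.stack((a * rope_re - b * rope_im, a * rope_im + b * rope_re), dim=-1)
    return out.reshape(t.shape)

B, H, L, D = 1, 16, 1, 64
q = torch.randn(B, L, H, D).transpose(1, 2)
rope = torch.randn(L, D // 2, dtype=torch.complex64)

# Split the complex table once, outside any compiled region.
rope_re, rope_im = rope.real.contiguous(), rope.imag.contiguous()

expected = rope_complex(q, rope)
actual = rope_real(q, rope_re, rope_im)
torch.testing.assert_close(actual, expected)
```

Passing rope_re and rope_im as separate float tensors keeps complex dtypes out of the compiled graph entirely, which is a design choice of this sketch rather than something the issue prescribes.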

Verification

To verify that the fix worked, run the reproduction script again and check if the error is resolved.

if __name__ == "__main__":
    run()

If the error is resolved, the script prints "Compiled mode: Success" instead of "Compiled mode: FAILED".
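
A run that finishes without an exception does not by itself show that the compiled output is numerically right. As a minimal sketch, the hypothetical helper below (assert_matches_eager is an assumed name, and the tolerances are arbitrary) runs an eager callable and a compiled callable on the same inputs and compares the results.

```python
import torch

def assert_matches_eager(eager_fn, compiled_fn, *inputs, rtol=1e-4, atol=1e-4):
    """Hypothetical helper: raise AssertionError if the two callables disagree."""
    with torch.no_grad():
        expected = eager_fn(*inputs)
        actual = compiled_fn(*inputs)
    torch.testing.assert_close(actual, expected, rtol=rtol, atol=atol)
    return actual
```

In the reproduction script this would be called as assert_matches_eager(model, compiled_model, x, rope) after the compiled call succeeds.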
