pytorch: [cuDNN][SDPA] head_dim limit stuck at 128 on sm_120 (RTX 5090), blocking head_dim=256 models [1 comment, 2 participants]

pytorch/pytorch#181379 · Fetched 2026-04-25 06:02:39


Summary

On consumer Blackwell (sm_120, RTX 5090), check_cudnn_tensor_shapes caps head_dim_limit at 128 even though check_cudnn_hardware_support already whitelists the hardware in the [sm_80, sm_121] range. This silently rejects cuDNN attention for any head_dim=256 model (for example, Gemma 3 variants configured with head_dim=256; Qwen3 and Llama 3.1 70B ship with head_dim=128 and are not affected). When no other fused backend is enabled or eligible, the call falls through to the MATH backend, whose O(seq²) fp32 softmax can OOM at long context.

Adjacent prior art: PR #172621 (Jan 2026, merged) did the same kind of lift for datacenter Blackwell (sm_100/sm_103, DeepSeek head_dim=192). The workstation/consumer sm_120 line was simply not included.
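
To put the long-context OOM risk in numbers, here is a rough sketch of the memory one attention score matrix would need. It assumes, as the summary states, that the MATH path materializes a full (batch, heads, seq, seq) matrix in fp32; the function name and the example shapes are illustrative and not taken from PyTorch.

```python
def math_backend_score_bytes(batch, heads, seq_len, bytes_per_element=4):
    """Bytes for one (batch, heads, seq, seq) score matrix (fp32 by default).

    Assumption: the full matrix is materialized at once, per the issue text.
    """
    return batch * heads * seq_len * seq_len * bytes_per_element


def to_gib(num_bytes):
    """Convert a byte count to GiB."""
    return num_bytes / 2**30


# Same batch and head count as the reproducer, with longer sequences.
for seq_len in (4096, 32768, 131072):
    size = math_backend_score_bytes(batch=1, heads=8, seq_len=seq_len)
    print(f"seq={seq_len:>6}: {to_gib(size):8.1f} GiB")
```

The quadratic term dominates quickly: each doubling of seq_len quadruples the estimate, before counting any other activations.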

Empirical reproducer (RTX 5090, torch 2.8+cu129, cuDNN 9.10.02)

import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

def try_backend(backend, head_dim, seq=256, heads=8, bs=1,
                dtype=torch.bfloat16):
    q = torch.randn(bs, heads, seq, head_dim, device="cuda", dtype=dtype)
    k = torch.randn(bs, heads, seq, head_dim, device="cuda", dtype=dtype)
    v = torch.randn(bs, heads, seq, head_dim, device="cuda", dtype=dtype)
    try:
        with sdpa_kernel([backend]):
            F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return "ACCEPT"
    except RuntimeError:
        return "REJECT"

for be in [SDPBackend.CUDNN_ATTENTION, SDPBackend.FLASH_ATTENTION,
           SDPBackend.EFFICIENT_ATTENTION, SDPBackend.MATH]:
    print(be, {hd: try_backend(be, hd) for hd in [128, 192, 256]})

Backend              hd=128  hd=192  hd=256
CUDNN_ATTENTION      ACCEPT  REJECT  REJECT
FLASH_ATTENTION      ACCEPT  ACCEPT  ACCEPT
EFFICIENT_ATTENTION  ACCEPT  ACCEPT  ACCEPT
MATH                 ACCEPT  ACCEPT  ACCEPT

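Message at rejection for the cuDNN backend: "head_dim should be no more than 128". This is the warning emitted from sdp_utils.cpp:522, so the rejection comes from PyTorch's shape check rather than from cuDNN itself.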
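
The results above can also be read programmatically when deciding which backends to allow for a given head_dim. The sketch below transcribes the reported acceptance results into a dictionary and returns the first backend that accepts a shape. The priority order and the helper name are hypothetical, chosen for illustration; they do not reflect PyTorch's internal backend ordering.

```python
# Acceptance results transcribed from the reported RTX 5090 run.
ACCEPTED = {
    "CUDNN_ATTENTION": {128},
    "FLASH_ATTENTION": {128, 192, 256},
    "EFFICIENT_ATTENTION": {128, 192, 256},
    "MATH": {128, 192, 256},
}

# Hypothetical preference order, for illustration only.
PRIORITY = ["CUDNN_ATTENTION", "FLASH_ATTENTION", "EFFICIENT_ATTENTION", "MATH"]


def first_accepting_backend(head_dim, priority=PRIORITY):
    """Return the first backend name in `priority` that accepted `head_dim`."""
    for name in priority:
        if head_dim in ACCEPTED.get(name, set()):
            return name
    return None


for hd in (128, 192, 256):
    print(hd, first_accepting_backend(hd))
```

With only CUDNN_ATTENTION in the priority list, head_dim=256 has no accepting backend, which corresponds to the RuntimeError the reproducer catches when it restricts sdpa_kernel to a single backend.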

Source analysis

sdp_utils.cpp:504-525 (release/2.11):

auto head_dim_limit = 128;
// Hopper: head_dim<=256 support with cuDNN >= 9.10.0
if (cudnn_version >= 91000) {
  auto dprops = at::cuda::getCurrentDeviceProperties();
  if (dprops->major == 9 && !dprops->minor) {       // only sm_90
    head_dim_limit = 256;
  }
}
// (sm_10.x DeepSeek head_dim=192 case — irrelevant for sm_120)

sm_120 never enters the head_dim_limit = 256 branch, so it falls through with head_dim_limit = 128. Since check_cudnn_hardware_support (lines 635-656) already accepts sm_120 (it is inside the [sm_80, sm_121] range), only the shape check is out of sync.
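
The `cudnn_version >= 91000` comparison in the snippet relies on cuDNN's packed integer version, which the code comment equates with 9.10.0. A minimal sketch of that encoding follows, assuming the cuDNN 9.x scheme of major * 10000 + minor * 100 + patch; the helper names are illustrative, and older cuDNN releases packed the version differently.

```python
def decode_cudnn_version(packed):
    """Split a packed cuDNN 9.x version integer into (major, minor, patch).

    Assumption: major * 10000 + minor * 100 + patch, as used by cuDNN 9.x.
    """
    major, rest = divmod(packed, 10000)
    minor, patch = divmod(rest, 100)
    return major, minor, patch


def meets_minimum(packed, minimum=91000):
    """Mirror the `cudnn_version >= 91000` check from the quoted snippet."""
    return packed >= minimum


print(decode_cudnn_version(91000))  # (9, 10, 0)
print(decode_cudnn_version(91002))  # (9, 10, 2)
print(meets_minimum(91002))         # True
```

On a live install, `torch.backends.cudnn.version()` should return the packed integer for the loaded library, so the cuDNN build from the reproducer would pass the version threshold and only the architecture condition keeps sm_120 at 128.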

Proposed fix

   auto head_dim_limit = 128;
-  // Hopper: head_dim<=256 support with cuDNN >= 9.10.0
+  // Hopper (sm_90) and Blackwell consumer (sm_120, RTX 5090):
+  // head_dim<=256 support with cuDNN >= 9.10.0.
   if (cudnn_version >= 91000) {
     auto dprops = at::cuda::getCurrentDeviceProperties();
-    if (dprops->major == 9 && !dprops->minor) {
+    if ((dprops->major == 9 && !dprops->minor) ||
+        (dprops->major == 12 && dprops->minor == 0)) {
       head_dim_limit = 256;
     }
   }
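
To see exactly which architectures the diff affects, the gate can be mirrored in a few lines of Python. This is a sketch of the logic as quoted in this issue, not the actual PyTorch implementation: it leaves out the sm_10.x head_dim=192 case that the quoted snippet elides, and the `include_sm120` flag is a hypothetical switch between the current and the patched condition.

```python
def head_dim_limit(major, minor, cudnn_version, include_sm120=False):
    """Python mirror of the quoted head_dim_limit gate.

    include_sm120=False follows the current code; True follows the proposed diff.
    The sm_10.x head_dim=192 case is intentionally not modeled here.
    """
    limit = 128
    if cudnn_version >= 91000:
        is_sm90 = major == 9 and minor == 0
        is_sm120 = major == 12 and minor == 0
        if is_sm90 or (include_sm120 and is_sm120):
            limit = 256
    return limit


# Compare current and patched limits at the cuDNN version from the reproducer.
for major, minor in [(8, 0), (9, 0), (12, 0), (12, 1)]:
    before = head_dim_limit(major, minor, 91002)
    after = head_dim_limit(major, minor, 91002, include_sm120=True)
    print(f"sm_{major}{minor}: {before} -> {after}")
```

One thing this makes visible: because the diff tests `minor == 0`, sm_121 stays at 128 under the patch even though the hardware check described above accepts the [sm_80, sm_121] range.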

Open questions for maintainers

  1. Does cuDNN 9.10+ actually handle head_dim=256 kernels on sm_120? By my reading, the PyTorch gate is currently the only blocker. If cuDNN independently rejects these shapes on sm_120, however, lifting the PyTorch gate would turn a readable warning into an opaque kernel-launch failure.

  2. Is this a one-line patch you'd accept a PR for, or is there a broader Blackwell-shape-support effort in flight that we should wait on?

  3. Any preferred test location? test_transformers.py seems to be where similar arch-gated checks live.

First-time pytorch contributor disclosure

This is my first issue on pytorch core. Apologies in advance for any style deviations. Happy to open a PR if the direction is right.


🤖 Drafted with Claude Code (Claude Opus 4.7), reviewed and posted by me.

cc @csarofeen @ptrblck @eqy @nWEIdia @msaroufim @jerryzh168 @tinglvv @drisspg @liangel-02 @howardzhang-cv

extent analysis

TL;DR

The proposed fix involves updating the head_dim_limit check in sdp_utils.cpp to include support for sm_120 architecture with cuDNN version 9.10 or higher.

Guidance

  • Review the proposed fix and verify that it correctly updates the head_dim_limit for sm_120 architecture.
  • Test the updated code with the provided empirical reproducer to ensure that it resolves the issue.
  • Consider adding additional tests to cover this scenario, potentially in test_transformers.py.
  • Before merging the fix, confirm with maintainers that cuDNN 9.10+ actually handles head_dim=256 kernels on sm_120.

Example

The code with the proposed fix applied:

auto head_dim_limit = 128;
// Hopper (sm_90) and Blackwell consumer (sm_120, RTX 5090):
// head_dim<=256 support with cuDNN >= 9.10.0.
if (cudnn_version >= 91000) {
  auto dprops = at::cuda::getCurrentDeviceProperties();
  if ((dprops->major == 9 && !dprops->minor) ||
      (dprops->major == 12 && dprops->minor == 0)) {
    head_dim_limit = 256;
  }
}

Notes

The fix assumes that cuDNN 9.10+ supports head_dim=256 kernels on sm_120. If this is not the case, the fix may need to be revised.

Recommendation

Apply the proposed fix by updating the head_dim_limit check in sdp_utils.cpp to include sm_120 with cuDNN 9.10 or higher, after confirming with maintainers that cuDNN 9.10+ actually handles head_dim=256 kernels on that architecture.
