pytorch - 💡(How to fix) Fix [cuDNN][SDPA] head_dim limit stuck at 128 on sm_120 (RTX 5090), blocking head_dim=256 models [1 comments, 2 participants]

dvdimitrov13 · 2026-04-24T14:56:22Z



GitHub stats

pytorch/pytorch#181379 • Fetched 2026-04-25 06:02:39

Author: dvdimitrov13

Participants: dvdimitrov13, eqy

Timeline (top): mentioned ×20, subscribed ×20, labeled ×7, commented ×1

Summary

On consumer Blackwell (sm_120, RTX 5090), `check_cudnn_tensor_shapes` caps `head_dim_limit` at 128 even though `check_cudnn_hardware_support` already whitelists the hardware in the `[sm_80, sm_121]` range. This silently rejects cuDNN attention for any head_dim=256 model (Gemma 3, Qwen3 ≥14B, Llama 3.1 70B, etc.) and forces a fall-through to the MATH backend with its `O(seq²)` fp32 softmax, which OOMs at long context.

Adjacent prior art: PR #172621 (Jan 2026, merged) did the same kind of lift for datacenter Blackwell (sm_10.0/10.3, DeepSeek head_dim=192). The workstation/consumer sm_120 line was simply not included.
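
To make the long-context OOM concrete, here is a rough back-of-the-envelope estimate of the attention-score tensor an unfused path has to materialize. It is only a sketch: the helper name and the example shapes are illustrative, and it assumes a single fp32 `(batch, heads, seq, seq)` tensor while ignoring every other buffer.

```python
def math_backend_scores_bytes(batch: int, heads: int, seq: int,
                              bytes_per_elem: int = 4) -> int:
    """Bytes for one full (batch, heads, seq, seq) attention-score tensor.

    Assumption: the unfused path materializes the whole score matrix in
    fp32 (4 bytes per element). Real peak usage is higher, because the
    softmax output and other temporaries are not counted here.
    """
    return batch * heads * seq * seq * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical shapes: batch 1, 8 heads, growing sequence length.
    for seq in (4_096, 32_768, 131_072):
        gib = math_backend_scores_bytes(batch=1, heads=8, seq=seq) / 2**30
        print(f"seq={seq:>7}: {gib:8.1f} GiB for the fp32 score matrix")
```

Under these assumptions the score tensor alone grows quadratically, from 0.5 GiB at 4K tokens to 32 GiB at 32K tokens, which is why a silent fall-through to an unfused backend matters at long context.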


Empirical reproducer (RTX 5090, torch 2.8+cu129, cuDNN 9.10.02)

import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

def try_backend(backend, head_dim, seq=256, heads=8, bs=1,
                dtype=torch.bfloat16):
    q = torch.randn(bs, heads, seq, head_dim, device="cuda", dtype=dtype)
    k = torch.randn(bs, heads, seq, head_dim, device="cuda", dtype=dtype)
    v = torch.randn(bs, heads, seq, head_dim, device="cuda", dtype=dtype)
    try:
        with sdpa_kernel([backend]):
            F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return "ACCEPT"
    except RuntimeError:
        return "REJECT"

for be in [SDPBackend.CUDNN_ATTENTION, SDPBackend.FLASH_ATTENTION,
           SDPBackend.EFFICIENT_ATTENTION, SDPBackend.MATH]:
    print(be, {hd: try_backend(be, hd) for hd in [128, 192, 256]})

| Backend | hd=128 | hd=192 | hd=256 |
|---|---|---|---|
| CUDNN_ATTENTION | ACCEPT | **REJECT** | **REJECT** |
| FLASH_ATTENTION | ACCEPT | ACCEPT | ACCEPT |
| EFFICIENT_ATTENTION | ACCEPT | ACCEPT | ACCEPT |
| MATH | ACCEPT | ACCEPT | ACCEPT |

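cuDNN error at rejection: `"head_dim should be no more than 128"`, which is exactly the warning from [`sdp_utils.cpp:522`](https://github.com/pytorch/pytorch/blob/release/2.11/aten/src/ATen/native/transformers/cuda/sdp_utils.cpp#L522).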
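
Since the table shows FLASH_ATTENTION and EFFICIENT_ATTENTION accepting head_dim 192 and 256 on this setup, one possible stopgap is to restrict SDPA to fused backends so that it never drops to MATH unnoticed. The sketch below is an assumption-laden illustration: `fused_backend_names` and `fused_sdpa_context` are hypothetical helpers, and the 128 threshold simply mirrors the cap observed above.

```python
from contextlib import nullcontext


def fused_backend_names(head_dim: int, cudnn_limit: int = 128) -> list[str]:
    """Names of the fused SDPA backends to allow for a given head_dim.

    Assumption: cudnn_limit mirrors the 128 cap observed in the table
    above, so cuDNN is only listed when head_dim fits under it.
    """
    names = ["FLASH_ATTENTION", "EFFICIENT_ATTENTION"]
    if head_dim <= cudnn_limit:
        names.append("CUDNN_ATTENTION")
    return names


def fused_sdpa_context(head_dim: int):
    """Context manager that allows only fused backends (never MATH)."""
    try:
        from torch.nn.attention import SDPBackend, sdpa_kernel
    except ImportError:  # torch is not installed: nothing to restrict
        return nullcontext()
    allowed = [getattr(SDPBackend, name) for name in fused_backend_names(head_dim)]
    return sdpa_kernel(allowed)
```

Wrapping the attention call in `with fused_sdpa_context(256):` should then surface a `RuntimeError` when no fused backend accepts the inputs, the same behavior the reproducer relies on, instead of quietly running the MATH path.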

Source analysis

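[`sdp_utils.cpp:504-525` (release/2.11)](https://github.com/pytorch/pytorch/blob/release/2.11/aten/src/ATen/native/transformers/cuda/sdp_utils.cpp#L504-L525):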

auto head_dim_limit = 128;
// Hopper: head_dim<=256 support with cuDNN >= 9.10.0
if (cudnn_version >= 91000) {
  auto dprops = at::cuda::getCurrentDeviceProperties();
  if (dprops->major == 9 && !dprops->minor) {       // only sm_90
    head_dim_limit = 256;
  }
}
// (sm_10.x DeepSeek head_dim=192 case — irrelevant for sm_120)

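sm_120 never enters the `head_dim_limit = 256` branch and falls through with the default of 128. Since `check_cudnn_hardware_support` at [lines 635-656](https://github.com/pytorch/pytorch/blob/release/2.11/aten/src/ATen/native/transformers/cuda/sdp_utils.cpp#L635-L656) already accepts sm_120 (it is inside the `[sm_80, sm_121]` range), only the shape check is out of sync.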
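
The mismatch between the two checks can be modeled in a few lines of Python. This is a simplified sketch, not the actual PyTorch code: the function names are made up, the hardware range is assumed to be an inclusive comparison on `(major, minor)`, and the sm_10.x head_dim=192 branch elided in the snippet above is left out.

```python
def hardware_supported(major: int, minor: int) -> bool:
    """Model of the [sm_80, sm_121] range described in the issue.

    Assumption: the range is an inclusive comparison on (major, minor).
    """
    return (8, 0) <= (major, minor) <= (12, 1)


def head_dim_limit(major: int, minor: int, cudnn_version: int) -> int:
    """Model of the quoted shape gate (elided sm_10.x branch omitted)."""
    limit = 128
    if cudnn_version >= 91000 and major == 9 and minor == 0:
        limit = 256  # only sm_90 is lifted
    return limit


def cudnn_accepts(major: int, minor: int, cudnn_version: int,
                  head_dim: int) -> bool:
    """Both gates must pass for the cuDNN backend to be eligible."""
    return (hardware_supported(major, minor)
            and head_dim <= head_dim_limit(major, minor, cudnn_version))


if __name__ == "__main__":
    cudnn = 91002  # cuDNN 9.10.2, as in the reproducer
    for arch in [(9, 0), (12, 0)]:
        for hd in (128, 256):
            print(f"sm_{arch[0]}{arch[1]} head_dim={hd}:",
                  cudnn_accepts(*arch, cudnn, hd))
```

In this model sm_120 passes the hardware check but is rejected at head_dim=256 purely because of the shape limit, which is the behavior reported above.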

Proposed fix

   auto head_dim_limit = 128;
-  // Hopper: head_dim<=256 support with cuDNN >= 9.10.0
+  // Hopper (sm_90) and Blackwell consumer (sm_120, RTX 5090):
+  // head_dim<=256 support with cuDNN >= 9.10.0.
   if (cudnn_version >= 91000) {
     auto dprops = at::cuda::getCurrentDeviceProperties();
-    if (dprops->major == 9 && !dprops->minor) {
+    if ((dprops->major == 9 && !dprops->minor) ||
+        (dprops->major == 12 && dprops->minor == 0)) {
       head_dim_limit = 256;
     }
   }
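
To see which architectures the diff would affect, the gate can be mirrored before and after the change in plain Python. This is only an illustrative model of the quoted condition, with hypothetical function names and the sm_10.x branch again omitted; it says nothing about whether cuDNN itself supports the shapes.

```python
def limit_before(major: int, minor: int, cudnn_version: int) -> int:
    """Gate as quoted from release/2.11 (sm_10.x branch omitted)."""
    if cudnn_version >= 91000 and major == 9 and minor == 0:
        return 256
    return 128


def limit_after(major: int, minor: int, cudnn_version: int) -> int:
    """Gate with the proposed diff applied."""
    if cudnn_version >= 91000 and (
        (major == 9 and minor == 0) or (major == 12 and minor == 0)
    ):
        return 256
    return 128


if __name__ == "__main__":
    cudnn = 91002  # cuDNN 9.10.2, as in the reproducer
    for arch in [(8, 0), (9, 0), (12, 0), (12, 1)]:
        print(arch, limit_before(*arch, cudnn), "->", limit_after(*arch, cudnn))
```

One thing this makes visible: with the diff as written, sm_121 stays at 128 even though it sits inside the hardware range. Whether that is intended is not stated in the issue.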

Open questions for maintainers

1. **Does cuDNN 9.10+ actually handle head_dim=256 kernels on sm_120?** The PyTorch gate is currently the only blocker per my reading, but if cuDNN independently rejects the shape on sm_120, lifting the PyTorch gate would turn a readable warning into an opaque kernel-launch failure.
2. Is this a one-line patch you'd accept a PR for, or is there a broader Blackwell-shape-support effort in flight that we should wait on?
3. Any preferred test location? `test_transformers.py` seems to be where similar arch-gated checks live.

First-time pytorch contributor disclosure

This is my first issue on pytorch core. Apologies in advance for any style deviations. Happy to open a PR if the direction is right.

🤖 Drafted with Claude Code (Claude Opus 4.7), reviewed and posted by me.

cc @csarofeen @ptrblck @eqy @nWEIdia @msaroufim @jerryzh168 @tinglvv @drisspg @liangel-02 @howardzhang-cv

extent analysis

TL;DR

The proposed fix involves updating the head_dim_limit check in sdp_utils.cpp to include support for sm_120 architecture with cuDNN version 9.10 or higher.

Guidance

Review the proposed fix and verify that it correctly updates the head_dim_limit for sm_120 architecture.
Test the updated code with the provided empirical reproducer to ensure that it resolves the issue.
Consider adding additional tests to cover this scenario, potentially in test_transformers.py.
Before merging the fix, confirm with maintainers that cuDNN 9.10+ actually handles head_dim=256 kernels on sm_120.

Example

With the proposed fix applied, the gate would read:

auto head_dim_limit = 128;
// Hopper (sm_90) and Blackwell consumer (sm_120, RTX 5090):
// head_dim<=256 support with cuDNN >= 9.10.0.
if (cudnn_version >= 91000) {
  auto dprops = at::cuda::getCurrentDeviceProperties();
  if ((dprops->major == 9 && !dprops->minor) ||
      (dprops->major == 12 && dprops->minor == 0)) {
    head_dim_limit = 256;
  }
}

Notes

The fix assumes that cuDNN 9.10+ supports head_dim=256 kernels on sm_120. If this is not the case, the fix may need to be revised.

Recommendation

Apply the proposed workaround by updating the head_dim_limit check in sdp_utils.cpp to include support for sm_120 architecture with cuDNN version 9.10 or higher, after confirming with maintainers that cuDNN 9.10+ actually handles head_dim=256 kernels on `sm_
