Qwen3VLVisionPatchEmbed.forward should run in ~0.3 ms (the time of the equivalent nn.Linear), not ~16 s.

Proposed fix

 class Qwen3VLVisionPatchEmbed(nn.Module):
     def __init__(self, config) -> None:
         super().__init__()
         self.patch_size = config.patch_size
         self.temporal_patch_size = config.temporal_patch_size
         self.in_channels = config.in_channels
         self.embed_dim = config.hidden_size
-        kernel_size = [self.temporal_patch_size, self.patch_size, self.patch_size]
-        self.proj = nn.Conv3d(
-            self.in_channels, self.embed_dim,
-            kernel_size=kernel_size, stride=kernel_size, bias=True,
-        )
+        in_dim = (self.in_channels * self.temporal_patch_size
+                  * self.patch_size * self.patch_size)
+        self.proj = nn.Linear(in_dim, self.embed_dim, bias=True)

     def forward(self, hidden_states):
         target_dtype = self.proj.weight.dtype
-        hidden_states = hidden_states.view(
-            -1, self.in_channels, self.temporal_patch_size,
-            self.patch_size, self.patch_size,
-        )
-        hidden_states = self.proj(hidden_states.to(dtype=target_dtype)).view(-1, self.embed_dim)
+        hidden_states = hidden_states.reshape(-1, self.proj.in_features)
+        hidden_states = self.proj(hidden_states.to(dtype=target_dtype))
         return hidden_states

Backward compatibility for existing checkpoints

Pretrained Qwen3-VL-*-Instruct checkpoints save proj.weight in the 5-D Conv3d shape (out, in, k_t, k_h, k_w). To load them into the new Linear layer, whose weight has shape (out, in·k_t·k_h·k_w), add a _load_from_state_dict hook:

def _load_from_state_dict(self, state_dict, prefix, *args, **kwargs):
    key = prefix + "proj.weight"
    if key in state_dict and state_dict[key].dim() == 5:
        out_dim = state_dict[key].shape[0]
        state_dict[key] = state_dict[key].reshape(out_dim, -1).contiguous()
    super()._load_from_state_dict(state_dict, prefix, *args, **kwargs)

This makes the change transparent to existing public checkpoints.

Numerical equivalence verified

check | tolerance | observed
-- | -- | --
fp32 max abs diff (proj output) | … | 4.77e-07
bf16 cosine similarity (proj output) | > 0.999 | 0.9995
bf16 cosine similarity (full 24-layer vision tower) | > 0.99 | > 0.999 per sample

Same fix applies to Qwen2-VL and Qwen2.5-VL

The same Conv3d-with-kernel_size == stride pattern exists in Qwen2VLVisionPatchEmbed and Qwen2_5_VLVisionPatchEmbed. Both should be patched identically.

Why this matters

Anyone running Qwen-VL inference on a Blackwell GPU in bf16 silently pays a ~50,000× cost on the patch projection. For 30,000-sample feature extraction, this is the difference between 6 days and ~2 hours.

Happy to send a PR with the rewrite, the backward-compat _load_from_state_dict hook, a unit test, and a benchmark script.
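
The backward-compatibility hook proposed in this issue can be exercised without downloading a checkpoint. The following is a minimal sketch under stated assumptions: `LinearPatchEmbed` and `load_legacy_checkpoint` are hypothetical names, the sizes are toy values, and only plain `torch` `load_state_dict` is exercised (whether `from_pretrained` routes through the same hook is not verified here).

```python
import torch
import torch.nn as nn


class LinearPatchEmbed(nn.Module):
    """Hypothetical stand-in for the Linear-based patch embed (toy sizes)."""

    def __init__(self, in_channels=3, temporal_patch_size=2, patch_size=4, embed_dim=8):
        super().__init__()
        in_dim = in_channels * temporal_patch_size * patch_size * patch_size
        self.proj = nn.Linear(in_dim, embed_dim, bias=True)

    def _load_from_state_dict(self, state_dict, prefix, *args, **kwargs):
        # Flatten a legacy 5-D Conv3d kernel to the 2-D Linear layout before
        # the child `proj` module copies it in.
        key = prefix + "proj.weight"
        if key in state_dict and state_dict[key].dim() == 5:
            out_dim = state_dict[key].shape[0]
            state_dict[key] = state_dict[key].reshape(out_dim, -1).contiguous()
        super()._load_from_state_dict(state_dict, prefix, *args, **kwargs)

    def forward(self, hidden_states):
        hidden_states = hidden_states.reshape(-1, self.proj.in_features)
        return self.proj(hidden_states.to(self.proj.weight.dtype))


def load_legacy_checkpoint():
    """Load Conv3d-shaped weights into the Linear module and compare outputs."""
    torch.manual_seed(0)
    conv = nn.Conv3d(3, 8, kernel_size=(2, 4, 4), stride=(2, 4, 4), bias=True)
    legacy = {
        "proj.weight": conv.weight.detach().clone(),  # shape (8, 3, 2, 4, 4)
        "proj.bias": conv.bias.detach().clone(),
    }
    embed = LinearPatchEmbed()
    embed.load_state_dict(legacy)  # strict load; the hook reshapes the 5-D weight
    x = torch.randn(5, 3, 2, 4, 4)
    with torch.no_grad():
        return conv(x).view(5, -1), embed(x)


o_conv, o_lin = load_legacy_checkpoint()
print("max abs diff:", (o_conv - o_lin).abs().max().item())
```

The hook runs on the parent module before its `proj` child is loaded, which is why rewriting the entry in `state_dict` in place is enough.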

transformers - ✅(Solved) Fix `Qwen3VLVisionPatchEmbed.proj` (`nn.Conv3d` with `stride == kernel`) is ~50,000× slower than equivalent `nn.Linear` on Blackwell + bf16 [1 pull requests, 1 participants]

transformers · 2026-05-03 07:27:10

GitHub stats

huggingface/transformers#45750 • Fetched 2026-05-04 04:58:16

Author

WangYuHang-cmd

Participants

WangYuHang-cmd

Timeline (top)

mentioned ×3, subscribed ×3, cross-referenced ×1, labeled ×1

PR fix notes

PR #45771: perf(qwen3_vl): replace Conv3d with F.linear in patch embed forward

Repository: huggingface/transformers
Author: jashshah999
State: open | merged: False
Link: https://github.com/huggingface/transformers/pull/45771

Description (problem / solution / changelog)

What does this PR do?

Replaces the Conv3d forward pass in Qwen3VLVisionPatchEmbed with F.linear on the reshaped weight. When stride == kernel_size, Conv3d is mathematically equivalent to extracting non-overlapping patches and applying a linear projection. The Conv3d codepath triggers an extremely slow cuDNN kernel on some GPU/dtype combinations (~50,000x slower on Blackwell + bf16, ~62x on other configs per the issue benchmarks).

The fix reshapes the input to (batch, in_channels * t * h * w) and uses F.linear(input, weight.view(embed_dim, -1), bias). Same weight tensor, just reshaped at forward time, so existing checkpoints load without changes.
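
The approach described in the PR text can be sketched as follows. This is an illustration under assumptions, not the PR's actual diff: `ConvWeightLinearForward` is a hypothetical name and the sizes are toy values. The module keeps its `nn.Conv3d` parameter, so checkpoint keys and shapes are unchanged, and only the forward pass uses `F.linear` on a flattened view of the weight.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvWeightLinearForward(nn.Module):
    """Hypothetical sketch: Conv3d parameters, linear forward (toy sizes)."""

    def __init__(self, in_channels=3, temporal_patch_size=2, patch_size=4, embed_dim=8):
        super().__init__()
        kernel_size = (temporal_patch_size, patch_size, patch_size)
        # The parameter keeps its 5-D Conv3d shape, so existing state dicts load as-is.
        self.proj = nn.Conv3d(in_channels, embed_dim, kernel_size=kernel_size,
                              stride=kernel_size, bias=True)
        self.embed_dim = embed_dim

    def forward(self, hidden_states):
        weight = self.proj.weight
        flat_weight = weight.view(self.embed_dim, -1)  # (out, C*T*P*P), a view, no copy
        flat_input = hidden_states.reshape(-1, flat_weight.shape[1])
        return F.linear(flat_input.to(weight.dtype), flat_weight, self.proj.bias)


torch.manual_seed(0)
module = ConvWeightLinearForward()
patches = torch.randn(7, 3, 2, 4, 4)
with torch.no_grad():
    reference = module.proj(patches).view(7, -1)  # original Conv3d path
    fast = module(patches)                        # F.linear path
print("max abs diff:", (reference - fast).abs().max().item())
```

Compared with swapping the layer for `nn.Linear`, this variant needs no state-dict hook, at the cost of keeping a Conv3d module whose own forward is never called.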

Before submitting

Did you read the contributor guideline?
This PR fixes a bug (issue #45750)
Backward compatible (same weights, same outputs, just faster)

Fixes #45750.

Changed files

src/transformers/models/qwen3_vl/modeling_qwen3_vl.py (modified, +8/-3)

Code Example

import time, torch, torch.nn as nn
torch.set_grad_enabled(False)

conv = nn.Conv3d(3, 1024, kernel_size=(2,16,16),
                 stride=(2,16,16), bias=True).cuda().to(torch.bfloat16)

out_dim, in_dim = 1024, 3*2*16*16  # 1536
lin = nn.Linear(in_dim, out_dim, bias=True)
lin.weight.data.copy_(conv.weight.detach().reshape(out_dim, in_dim))
lin.bias.data.copy_(conv.bias.detach())
lin = lin.cuda().to(torch.bfloat16)

N = 6080  # patches in one 8-frame Qwen3-VL clip
x_5d  = torch.randn(N, 3, 2, 16, 16, dtype=torch.bfloat16, device="cuda")
x_flat = x_5d.reshape(N, -1).contiguous()

for _ in range(3): _ = conv(x_5d); _ = lin(x_flat)

torch.cuda.synchronize(); t0 = time.time()
for _ in range(5): y_conv = conv(x_5d).view(N, -1)
torch.cuda.synchronize(); t_conv = (time.time()-t0)/5

torch.cuda.synchronize(); t0 = time.time()
for _ in range(5): y_lin = lin(x_flat)
torch.cuda.synchronize(); t_lin = (time.time()-t0)/5

print(f"Conv3d:  {t_conv*1000:8.2f} ms")
print(f"Linear:  {t_lin*1000:8.2f} ms")
print(f"Speedup: {t_conv/t_lin:8.1f}x")
diff = (y_conv.float() - y_lin.float()).abs().max().item()
cos = torch.nn.functional.cosine_similarity(
    y_conv.float().flatten().unsqueeze(0),
    y_lin.float().flatten().unsqueeze(0)).item()
print(f"max abs diff (bf16): {diff:.2e}")
print(f"cosine similarity:   {cos:.6f}")

---

Conv3d:    16111.30 ms
Linear:        0.30 ms
Speedup:    53704.3x
max abs diff (bf16): 1.56e-02
cosine similarity:   0.999500
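
The benchmark above repeats the same pattern for each layer: warm up, synchronize, time a fixed number of iterations, synchronize again. A small helper can factor that out. This is a sketch, not part of the issue: `time_call` is a hypothetical name, it uses `time.perf_counter` instead of `time.time`, and it only synchronizes when CUDA is available so the same code also runs on CPU.

```python
import time

import torch


def time_call(fn, warmup=3, iters=5):
    """Return mean seconds per call of fn(), after warm-up, with device sync."""

    def sync():
        # CUDA kernels launch asynchronously; wait for them before reading the clock.
        if torch.cuda.is_available():
            torch.cuda.synchronize()

    for _ in range(warmup):
        fn()
    sync()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()
    return (time.perf_counter() - start) / iters


# Usage on a toy layer (CPU is fine); swap in conv(x_5d) / lin(x_flat) on a GPU.
lin_toy = torch.nn.Linear(96, 8)
x_toy = torch.randn(32, 96)
with torch.no_grad():
    t_toy = time_call(lambda: lin_toy(x_toy))
print(f"Linear (toy): {t_toy * 1000:.3f} ms")
```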

---

import torch, torch.nn as nn
torch.manual_seed(0); N = 100; C, T, P = 3, 2, 16; out_dim = 1024
conv = nn.Conv3d(C, out_dim, (T,P,P), stride=(T,P,P), bias=True)
in_dim = C*T*P*P
lin = nn.Linear(in_dim, out_dim, bias=True)
lin.weight.data.copy_(conv.weight.detach().reshape(out_dim, in_dim))
lin.bias.data.copy_(conv.bias.detach())

x_5d = torch.randn(N, C, T, P, P, dtype=torch.float32)
x_flat = x_5d.reshape(N, -1).contiguous()
with torch.no_grad():
    o_conv = conv(x_5d).view(N, -1)
    o_lin  = lin(x_flat)

abs_diff = (o_conv - o_lin).abs()
print(f"fp32 max abs diff:  {abs_diff.max().item():.2e}")
print(f"fp32 mean abs diff: {abs_diff.mean().item():.2e}")

---

fp32 max abs diff:  4.77e-07
fp32 mean abs diff: 7.61e-08
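
The fp32 check above could be turned into a regression test along these lines. This is a sketch only: the function names are hypothetical, `out_dim` is reduced to keep the test fast, and the tolerances are illustrative assumptions rather than values taken from the issue.

```python
import torch
import torch.nn as nn


def conv_and_linear_outputs(n=16, c=3, t=2, p=16, out_dim=64, seed=0):
    """Run the same patches through Conv3d and a weight-tied Linear (fp32, CPU)."""
    torch.manual_seed(seed)
    conv = nn.Conv3d(c, out_dim, (t, p, p), stride=(t, p, p), bias=True)
    lin = nn.Linear(c * t * p * p, out_dim, bias=True)
    with torch.no_grad():
        # Tie the Linear parameters to the flattened Conv3d kernel.
        lin.weight.copy_(conv.weight.reshape(out_dim, -1))
        lin.bias.copy_(conv.bias)
        x = torch.randn(n, c, t, p, p)
        return conv(x).view(n, -1), lin(x.reshape(n, -1))


def test_patch_embed_linear_matches_conv3d():
    o_conv, o_lin = conv_and_linear_outputs()
    # Assumed tolerances, chosen loosely above the ~5e-07 difference reported above.
    torch.testing.assert_close(o_lin, o_conv, rtol=1e-4, atol=1e-5)


test_patch_embed_linear_matches_conv3d()
```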

---

import time, torch, torch.nn as nn
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor
from transformers.models.qwen3_vl.modeling_qwen3_vl import (
    Qwen3VLVisionPatchEmbed,
)

def _fast_forward(self, hidden_states):
    target_dtype = self.proj.weight.dtype
    if isinstance(self.proj, nn.Conv3d):
        conv = self.proj
        out_dim = conv.out_channels
        in_dim = (conv.in_channels * conv.kernel_size[0]
                  * conv.kernel_size[1] * conv.kernel_size[2])
        w_flat = conv.weight.detach().reshape(out_dim, in_dim).contiguous()
        bias = conv.bias.detach().clone() if conv.bias is not None else None
        new_proj = nn.Linear(in_dim, out_dim, bias=bias is not None)
        new_proj.weight.data.copy_(w_flat)
        if bias is not None: new_proj.bias.data.copy_(bias)
        new_proj.to(device=conv.weight.device, dtype=conv.weight.dtype)
        self.proj = new_proj
    if hidden_states.dim() > 2 \
            or hidden_states.shape[-1] != self.proj.in_features:
        hidden_states = hidden_states.reshape(-1, self.proj.in_features)
    return self.proj(hidden_states.to(dtype=target_dtype))

Qwen3VLVisionPatchEmbed.forward = _fast_forward

# Reload model and run step-2-style timing again.

---

bs=1  rep=0: 0.27s (270 ms/sample)
bs=1  rep=1: 0.29s (290 ms/sample)
bs=8  rep=0: 2.16s (270 ms/sample)
bs=8  rep=1: 2.18s (273 ms/sample)


System Info

transformers version: 5.0.0.dev0
PyTorch:              2.9.0+cu128
CUDA:                 12.8
cuDNN:                9.10.0.2 (91002)
Python:               3.14.0
flash-attn:           2.8.3 (installed)
GPU:                  NVIDIA GeForce RTX 5090 (Blackwell, compute capability 12.0, sm_120)
OS:                   Linux 6.8.0-110-generic, glibc 2.39

Who can help?

@yonigozlan @molbap @zucchini-nlp

Reproduction

Confirm GPU is healthy. RTX 5090 should hit ~100–209 TFLOPS bf16 dense matmul.

import torch, time
for size in [4096, 8192]:
    a = torch.randn(size, size, dtype=torch.bfloat16, device="cuda")
    b = torch.randn(size, size, dtype=torch.bfloat16, device="cuda")
    for _ in range(3): _ = a @ b
    torch.cuda.synchronize(); t0 = time.time()
    for _ in range(10): c = a @ b
    torch.cuda.synchronize()
    e = time.time() - t0
    print(f"matmul {size}x{size}: {2 * size**3 * 10 / e / 1e12:.1f} TFLOPS")

Output:

matmul 4096x4096: 182.3 TFLOPS
matmul 8192x8192: 223.7 TFLOPS

Hardware is fine.

Run a full vision-tower forward at multiple batch sizes, all with default settings.

import time, torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

torch.set_grad_enabled(False)
proc = AutoProcessor.from_pretrained("Qwen/Qwen3-VL-4B-Instruct")
model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3-VL-4B-Instruct", dtype=torch.bfloat16,
).cuda().eval()

def make_clip():
    return [Image.fromarray(torch.randint(0, 256, (720, 1280, 3),
            dtype=torch.uint8).numpy()) for _ in range(8)]

def time_forward(bs):
    texts, images = [], []
    for _ in range(bs):
        frames = make_clip()
        msgs = [{"role": "user", "content":
                 [{"type": "image", "image": img} for img in frames]
                 + [{"type": "text", "text": "Describe."}]}]
        texts.append(proc.apply_chat_template(
            msgs, tokenize=False, add_generation_prompt=True))
        images.append(frames)
    inputs = proc(text=texts, images=images,
                  return_tensors="pt", padding=True)
    inputs = {k: (v.cuda() if isinstance(v, torch.Tensor) else v)
              for k, v in inputs.items()}
    inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)
    keys = ("input_ids","attention_mask","pixel_values","image_grid_thw")
    args = {k: inputs[k] for k in keys if k in inputs}
    for rep in range(2):
        torch.cuda.synchronize(); t0 = time.time()
        with torch.amp.autocast("cuda", dtype=torch.bfloat16):
            _ = model.model(**args, use_cache=False, return_dict=True)
        torch.cuda.synchronize()
        e = time.time() - t0
        print(f"  bs={bs} rep={rep}: {e:.2f}s ({e/bs*1000:.0f} ms/sample)")

for bs in [1, 4, 8, 16]: time_forward(bs)

Output:

bs=1  rep=0: 16.70s (16700 ms/sample)
bs=1  rep=1: 16.46s (16458 ms/sample)
bs=4  rep=0: 65.30s (16325 ms/sample)
bs=8  rep=0: 148.05s (18506 ms/sample)
bs=8  rep=1: 148.01s (18501 ms/sample)
bs=16 rep=0: 148.78s (9299 ms/sample)
bs=16 rep=1: 148.34s (9271 ms/sample)

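Per-sample time stays between roughly 9 s and 19 s at every batch size (about 16 s at bs=1 and bs=4), so batching does not amortize the cost. This rules out DataLoader, collate, and padding bugs.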

Eliminate attn_implementation as the cause. Test all three.

for impl in ["sdpa", "flash_attention_2", "eager"]:
    model = AutoModelForImageTextToText.from_pretrained(
        "Qwen/Qwen3-VL-4B-Instruct", dtype=torch.bfloat16,
        attn_implementation=impl).cuda().eval()
    # ... same time_forward(bs=8) ...

Output:

sdpa              bs=8: 148.05s (18506 ms/sample)
flash_attention_2 bs=8: 147.64s (18455 ms/sample)
eager             bs=8: 148.20s (18525 ms/sample)

All three implementations are identically slow → attention is not the cause.

Per-component timing of Qwen3VLVisionModel.forward (bs=1, 8 frames).

import torch.nn.functional as F
pv = inputs["pixel_values"]; grid_thw = inputs["image_grid_thw"]
vt = model.visual

torch.cuda.synchronize(); t0 = time.time()
h = vt.patch_embed(pv); torch.cuda.synchronize()
print(f"patch_embed:  {(time.time()-t0)*1000:.1f} ms, shape={tuple(h.shape)}")

t0 = time.time(); pos = vt.fast_pos_embed_interpolate(grid_thw)
torch.cuda.synchronize(); print(f"pos_embed:    {(time.time()-t0)*1000:.1f} ms")
h = h + pos

t0 = time.time(); rope = vt.rot_pos_emb(grid_thw)
torch.cuda.synchronize(); print(f"rot_pos_emb:  {(time.time()-t0)*1000:.1f} ms")

seq_len = h.size(0); h = h.reshape(seq_len, -1)
rope = rope.reshape(seq_len, -1)
emb = torch.cat((rope, rope), dim=-1)
pos_emb = (emb.cos(), emb.sin())
cu = torch.repeat_interleave(grid_thw[:,1]*grid_thw[:,2],
     grid_thw[:,0]).cumsum(0, dtype=torch.int32)
cu = F.pad(cu, (1,0), value=0)

times = []
for i, blk in enumerate(vt.blocks):
    torch.cuda.synchronize(); t0 = time.time()
    h = blk(h, cu_seqlens=cu, position_embeddings=pos_emb)
    torch.cuda.synchronize()
    times.append((time.time()-t0)*1000)
print(f"24 blocks total: {sum(times):.1f} ms (mean {sum(times)/24:.1f} ms)")

t0 = time.time(); _ = vt.merger(h)
torch.cuda.synchronize(); print(f"merger:       {(time.time()-t0)*1000:.1f} ms")

Output:

patch_embed:  16111.3 ms, shape=(6080, 1024)   ← 96% of total forward
pos_embed:    22.8 ms
rot_pos_emb:  20.7 ms
24 blocks total: 56.4 ms (mean 2.3 ms)
merger:       0.5 ms

The 24-layer ViT runs in 56 ms total. The single patch_embed call takes 16,111 ms, roughly 286× the 24 blocks and about 160× every other vision-tower component combined (100.4 ms).
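
The ratios can be recomputed directly from the per-component timings printed above. A short sketch (the dictionary keys are labels chosen here):

```python
# Per-component timings (ms) copied from the output above.
timings_ms = {
    "patch_embed": 16111.3,
    "pos_embed": 22.8,
    "rot_pos_emb": 20.7,
    "blocks": 56.4,
    "merger": 0.5,
}

patch = timings_ms["patch_embed"]
rest = sum(v for k, v in timings_ms.items() if k != "patch_embed")
share = patch / (patch + rest)

print(f"everything except patch_embed: {rest:.1f} ms")
print(f"patch_embed share of vision tower: {share:.1%}")
print(f"patch_embed vs 24 blocks: {patch / timings_ms['blocks']:.0f}x")
print(f"patch_embed vs rest combined: {patch / rest:.0f}x")
```

Within the vision tower alone the share is above 99%; the 96% annotation in the output is consistent with measuring against the ~16.7 s full-model forward from the earlier timing run instead.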

Inspect Qwen3VLVisionPatchEmbed (file: transformers/models/qwen3_vl/modeling_qwen3_vl.py, lines 59–76).

class Qwen3VLVisionPatchEmbed(nn.Module):
    def __init__(self, config) -> None:
        super().__init__()
        self.patch_size = config.patch_size                  # 16
        self.temporal_patch_size = config.temporal_patch_size  # 2
        self.in_channels = config.in_channels                 # 3
        self.embed_dim = config.hidden_size                   # 1024
        kernel_size = [self.temporal_patch_size, self.patch_size, self.patch_size]
        self.proj = nn.Conv3d(
            self.in_channels, self.embed_dim,
            kernel_size=kernel_size, stride=kernel_size, bias=True,
        )

    def forward(self, hidden_states):
        target_dtype = self.proj.weight.dtype
        hidden_states = hidden_states.view(
            -1, self.in_channels, self.temporal_patch_size,
            self.patch_size, self.patch_size,
        )
        hidden_states = self.proj(
            hidden_states.to(dtype=target_dtype)
        ).view(-1, self.embed_dim)
        return hidden_states

kernel_size == stride, no padding, no dilation → output windows are disjoint → mathematically equivalent to flatten + nn.Linear.
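Written out, the equivalence is a re-indexing of one sum. The notation below is introduced here for illustration (C input channels, T temporal patch size, P spatial patch size, n the patch index, o the output channel). Because each input patch has exactly the kernel's extent, the convolution produces a single output position per patch, and PyTorch's convolution is a cross-correlation, so no kernel flip is involved:

```latex
y_{n,o} \;=\; b_o + \sum_{c=0}^{C-1}\sum_{t=0}^{T-1}\sum_{h=0}^{P-1}\sum_{w=0}^{P-1}
              W_{o,c,t,h,w}\, x_{n,c,t,h,w}

% Row-major flattening of (c, t, h, w) into one index k:
k \;=\; \bigl((c\,T + t)\,P + h\bigr)\,P + w, \qquad
\tilde W_{o,k} = W_{o,c,t,h,w}, \qquad \tilde x_{n,k} = x_{n,c,t,h,w}

y_{n,o} \;=\; b_o + \sum_{k=0}^{C T P^2 - 1} \tilde W_{o,k}\, \tilde x_{n,k}
```

The last line is the definition of a linear layer with C·T·P² = 3·2·16·16 = 1536 input features, and the row-major index k is what reshape(out_dim, in_dim) on the weight and reshape(N, -1) on the input both produce.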

Isolated benchmark: Conv3d vs equivalent Linear (no checkpoint needed).

import time, torch, torch.nn as nn
torch.set_grad_enabled(False)

conv = nn.Conv3d(3, 1024, kernel_size=(2,16,16),
                 stride=(2,16,16), bias=True).cuda().to(torch.bfloat16)

out_dim, in_dim = 1024, 3*2*16*16  # 1536
lin = nn.Linear(in_dim, out_dim, bias=True)
lin.weight.data.copy_(conv.weight.detach().reshape(out_dim, in_dim))
lin.bias.data.copy_(conv.bias.detach())
lin = lin.cuda().to(torch.bfloat16)

N = 6080  # patches in one 8-frame Qwen3-VL clip
x_5d  = torch.randn(N, 3, 2, 16, 16, dtype=torch.bfloat16, device="cuda")
x_flat = x_5d.reshape(N, -1).contiguous()

for _ in range(3): _ = conv(x_5d); _ = lin(x_flat)

torch.cuda.synchronize(); t0 = time.time()
for _ in range(5): y_conv = conv(x_5d).view(N, -1)
torch.cuda.synchronize(); t_conv = (time.time()-t0)/5

torch.cuda.synchronize(); t0 = time.time()
for _ in range(5): y_lin = lin(x_flat)
torch.cuda.synchronize(); t_lin = (time.time()-t0)/5

print(f"Conv3d:  {t_conv*1000:8.2f} ms")
print(f"Linear:  {t_lin*1000:8.2f} ms")
print(f"Speedup: {t_conv/t_lin:8.1f}x")
diff = (y_conv.float() - y_lin.float()).abs().max().item()
cos = torch.nn.functional.cosine_similarity(
    y_conv.float().flatten().unsqueeze(0),
    y_lin.float().flatten().unsqueeze(0)).item()
print(f"max abs diff (bf16): {diff:.2e}")
print(f"cosine similarity:   {cos:.6f}")

Output:

Conv3d:    16111.30 ms
Linear:        0.30 ms
Speedup:    53704.3x
max abs diff (bf16): 1.56e-02
cosine similarity:   0.999500

Verify mathematical equivalence in fp32 (rules out numerical accident).

import torch, torch.nn as nn
torch.manual_seed(0); N = 100; C, T, P = 3, 2, 16; out_dim = 1024
conv = nn.Conv3d(C, out_dim, (T,P,P), stride=(T,P,P), bias=True)
in_dim = C*T*P*P
lin = nn.Linear(in_dim, out_dim, bias=True)
lin.weight.data.copy_(conv.weight.detach().reshape(out_dim, in_dim))
lin.bias.data.copy_(conv.bias.detach())

x_5d = torch.randn(N, C, T, P, P, dtype=torch.float32)
x_flat = x_5d.reshape(N, -1).contiguous()
with torch.no_grad():
    o_conv = conv(x_5d).view(N, -1)
    o_lin  = lin(x_flat)

abs_diff = (o_conv - o_lin).abs()
print(f"fp32 max abs diff:  {abs_diff.max().item():.2e}")
print(f"fp32 mean abs diff: {abs_diff.mean().item():.2e}")

Output:

fp32 max abs diff:  4.77e-07
fp32 mean abs diff: 7.61e-08

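Conv3d with kernel == stride (and no padding or dilation) is mathematically equivalent to Linear over the reshaped weights. The remaining fp32 difference (~5e-7) is at the level of floating-point round-off, as expected when two kernels accumulate the same 1536 products in different orders.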
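As a rough illustration of what round-off at this level looks like, the sketch below sums the same 1536 products in two different orders using only the standard library. It is an analogy, not a reproduction: Python floats are float64, so the gap is far smaller than in fp32, and the random inputs are made up for the demo. It also prints 2^-21, which matches the reported fp32 max abs diff to the printed precision, i.e. a few units in the last place for fp32 values of order 1.

```python
import random

random.seed(0)
in_dim = 3 * 2 * 16 * 16  # 1536 products per output element

# Made-up weights and inputs; only the summation order matters here.
w = [random.gauss(0.0, 1.0) for _ in range(in_dim)]
x = [random.gauss(0.0, 1.0) for _ in range(in_dim)]
products = [wi * xi for wi, xi in zip(w, x)]

forward = 0.0
for p in products:            # accumulate first-to-last
    forward += p

backward = 0.0
for p in reversed(products):  # accumulate last-to-first
    backward += p

gap = abs(forward - backward)
print(f"forward sum:  {forward!r}")
print(f"backward sum: {backward!r}")
print(f"gap (float64): {gap:.2e}")

two_pow_minus_21 = 2.0 ** -21
print(f"2**-21 = {two_pow_minus_21:.2e}")
```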

Apply the fix (lazy in-place Conv3d → Linear via monkey-patch on Qwen3VLVisionPatchEmbed.forward) and re-benchmark.

import time, torch, torch.nn as nn
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor
from transformers.models.qwen3_vl.modeling_qwen3_vl import (
    Qwen3VLVisionPatchEmbed,
)

def _fast_forward(self, hidden_states):
    target_dtype = self.proj.weight.dtype
    if isinstance(self.proj, nn.Conv3d):
        conv = self.proj
        out_dim = conv.out_channels
        in_dim = (conv.in_channels * conv.kernel_size[0]
                  * conv.kernel_size[1] * conv.kernel_size[2])
        w_flat = conv.weight.detach().reshape(out_dim, in_dim).contiguous()
        bias = conv.bias.detach().clone() if conv.bias is not None else None
        new_proj = nn.Linear(in_dim, out_dim, bias=bias is not None)
        new_proj.weight.data.copy_(w_flat)
        if bias is not None: new_proj.bias.data.copy_(bias)
        new_proj.to(device=conv.weight.device, dtype=conv.weight.dtype)
        self.proj = new_proj
    if hidden_states.dim() > 2 \
            or hidden_states.shape[-1] != self.proj.in_features:
        hidden_states = hidden_states.reshape(-1, self.proj.in_features)
    return self.proj(hidden_states.to(dtype=target_dtype))

Qwen3VLVisionPatchEmbed.forward = _fast_forward

# Reload model and run step-2-style timing again.

Output:

bs=1  rep=0: 0.27s (270 ms/sample)
bs=1  rep=1: 0.29s (290 ms/sample)
bs=8  rep=0: 2.16s (270 ms/sample)
bs=8  rep=1: 2.18s (273 ms/sample)

Speedup vs step 2: 62× at bs=1, 68× at bs=8. VRAM unchanged. Patch embedding goes from 96% of total forward time to <1%.

Expected behavior

Qwen3VLVisionPatchEmbed.forward should run in ~0.3 ms (the time of the equivalent nn.Linear), not ~16 s.

Proposed fix

 class Qwen3VLVisionPatchEmbed(nn.Module):
     def __init__(self, config) -> None:
         super().__init__()
         self.patch_size = config.patch_size
         self.temporal_patch_size = config.temporal_patch_size
         self.in_channels = config.in_channels
         self.embed_dim = config.hidden_size
-        kernel_size = [self.temporal_patch_size, self.patch_size, self.patch_size]
-        self.proj = nn.Conv3d(
-            self.in_channels, self.embed_dim,
-            kernel_size=kernel_size, stride=kernel_size, bias=True,
-        )
+        in_dim = (self.in_channels * self.temporal_patch_size
+                  * self.patch_size * self.patch_size)
+        self.proj = nn.Linear(in_dim, self.embed_dim, bias=True)

     def forward(self, hidden_states):
         target_dtype = self.proj.weight.dtype
-        hidden_states = hidden_states.view(
-            -1, self.in_channels, self.temporal_patch_size,
-            self.patch_size, self.patch_size,
-        )
-        hidden_states = self.proj(hidden_states.to(dtype=target_dtype)).view(-1, self.embed_dim)
+        hidden_states = hidden_states.reshape(-1, self.proj.in_features)
+        hidden_states = self.proj(hidden_states.to(dtype=target_dtype))
         return hidden_states

Backward compatibility for existing checkpoints

Pretrained Qwen3-VL--Instruct checkpoints save proj.weight in 5-D Conv3d shape (out, in, k_t, k_h, k_w). To load them into the new Linear layer ((out, in·k_t·k_h·k_w)), add a _load_from_state_dict hook:

def _load_from_state_dict(self, state_dict, prefix, *args, **kwargs):
    key = prefix + "proj.weight"
    if key in state_dict and state_dict[key].dim() == 5:
        out_dim = state_dict[key].shape[0]
        state_dict[key] = state_dict[key].reshape(out_dim, -1).contiguous()
    super()._load_from_state_dict(state_dict, prefix, *args, **kwargs)

This makes the change transparent to existing public checkpoints.

Numerical equivalence verified

check	tolerance	observed
fp32 max abs diff (proj output)	< 1e-5	< 1e-6 (4.77e-07 in the fp32 check above)
bf16 cosine similarity (proj output)	> 0.999	0.9995
bf16 cosine similarity (full 24-layer vision tower)	> 0.99	> 0.999 per sample

Same fix applies to Qwen2-VL and Qwen2.5-VL

The same Conv3d-with-kernel_size == stride pattern exists in Qwen2VLVisionPatchEmbed and Qwen2_5_VLVisionPatchEmbed. Both should be patched identically.

Why this matters

Anyone running Qwen-VL inference on a Blackwell GPU in bf16 silently pays a ~50,000× cost on the patch projection. For 30,000-sample feature extraction, this is the difference between 6 days and ~2 hours.

Happy to send a PR with the rewrite, the backward-compat _load_from_state_dict hook, a unit test, and a benchmark script.
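The 6-days-versus-2-hours estimate can be checked against the timings reported earlier. The sketch below assumes bs=1 throughput: 0.27 s/sample after the fix and the reported 62× speedup, from which the pre-fix per-sample time is inferred (it is not a separately measured number in this listing).

```python
samples = 30_000

fast_s_per_sample = 0.27   # bs=1 timing after the fix, from the output above
speedup = 62               # reported bs=1 speedup vs the unpatched run

# Inferred, not measured here: per-sample time before the fix.
slow_s_per_sample = fast_s_per_sample * speedup

slow_days = samples * slow_s_per_sample / 86_400   # seconds per day
fast_hours = samples * fast_s_per_sample / 3_600   # seconds per hour

print(f"before fix: {slow_s_per_sample:.1f} s/sample -> {slow_days:.1f} days")
print(f"after fix:  {fast_s_per_sample:.2f} s/sample -> {fast_hours:.2f} hours")
```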

extent analysis

TL;DR

Replace the Conv3d projection in Qwen3VLVisionPatchEmbed with an equivalent Linear layer. In the reported setup (Blackwell GPU, bf16) this cuts the patch projection from ~16 s to ~0.3 ms.

Guidance

Identify the bottleneck: The Qwen3VLVisionPatchEmbed.forward method is the main contributor to the slowdown, specifically the Conv3d layer.
Replace Conv3d with Linear: Modify the Qwen3VLVisionPatchEmbed class to use a Linear layer instead of Conv3d for the patch projection.
Add backward compatibility: Implement the _load_from_state_dict hook to ensure compatibility with existing checkpoints.
Verify numerical equivalence: Check that the modified implementation produces the same results as the original Conv3d layer.
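The last two steps can be exercised together on CPU with a small stand-in module. This is a minimal sketch under stated assumptions: PatchEmbedLinear is a hypothetical class with tiny sizes chosen for the demo, and it is loaded through plain torch load_state_dict. Whether the transformers from_pretrained loading path invokes this hook in the same way is not shown in the report.

```python
import torch
import torch.nn as nn


class PatchEmbedLinear(nn.Module):
    """Hypothetical stand-in for the rewritten patch embed (tiny sizes for a CPU demo)."""

    def __init__(self, in_channels=3, temporal_patch_size=2, patch_size=4, embed_dim=8):
        super().__init__()
        in_dim = in_channels * temporal_patch_size * patch_size * patch_size
        self.proj = nn.Linear(in_dim, embed_dim, bias=True)

    def _load_from_state_dict(self, state_dict, prefix, *args, **kwargs):
        # Old checkpoints store proj.weight as (out, in, k_t, k_h, k_w);
        # flatten it to (out, in*k_t*k_h*k_w) before the child Linear loads it.
        key = prefix + "proj.weight"
        if key in state_dict and state_dict[key].dim() == 5:
            out_dim = state_dict[key].shape[0]
            state_dict[key] = state_dict[key].reshape(out_dim, -1).contiguous()
        super()._load_from_state_dict(state_dict, prefix, *args, **kwargs)

    def forward(self, hidden_states):
        hidden_states = hidden_states.reshape(-1, self.proj.in_features)
        return self.proj(hidden_states.to(dtype=self.proj.weight.dtype))


torch.manual_seed(0)

# "Old" layer and a state dict in the 5-D Conv3d layout.
conv = nn.Conv3d(3, 8, kernel_size=(2, 4, 4), stride=(2, 4, 4), bias=True)
old_state = {
    "proj.weight": conv.weight.detach().clone(),
    "proj.bias": conv.bias.detach().clone(),
}

model = PatchEmbedLinear()
model.load_state_dict(old_state)  # the hook reshapes the 5-D weight on the fly

x = torch.randn(5, 3, 2, 4, 4)
with torch.no_grad():
    ref = conv(x).view(5, -1)
    out = model(x)

max_diff = (ref - out).abs().max().item()
print(f"loaded weight shape: {tuple(model.proj.weight.shape)}")
print(f"fp32 max abs diff:   {max_diff:.2e}")
```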

Example

class Qwen3VLVisionPatchEmbed(nn.Module):
    def __init__(self, config) -> None:
        super().__init__()
        #...
        in_dim = (self.in_channels * self.temporal_patch_size
                  * self.patch_size * self.patch_size)
        self.proj = nn.Linear(in_dim, self.embed_dim, bias=True)

    def forward(self, hidden_states):
        #...
        hidden_states = hidden_states.reshape(-1, self.proj.in_features)
        hidden_states = self.proj(hidden_states.to(dtype=self.proj.weight.dtype))
        return hidden_states

Notes

This fix applies to Qwen2-VL and Qwen2.5-VL models as well.
The modified implementation should be thoroughly tested to ensure correctness and numerical equivalence.
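Since the same pattern appears in several model files, one option is a shared conversion helper that refuses to convert when the equivalence conditions do not hold. The function below is a hypothetical sketch, not part of transformers; its name and the layer sizes in the demo are made up for illustration. The demo also assumes the input's spatial and temporal extent equals the kernel, as in the patch-embed use case.

```python
import torch
import torch.nn as nn


def conv3d_to_linear(conv: nn.Conv3d) -> nn.Linear:
    """Hypothetical helper: build an nn.Linear equivalent to a non-overlapping Conv3d."""
    if tuple(conv.stride) != tuple(conv.kernel_size):
        raise ValueError("stride must equal kernel_size (disjoint windows)")
    # String paddings such as "same" are rejected here as well (conservative).
    if any(p != 0 for p in conv.padding) or any(d != 1 for d in conv.dilation):
        raise ValueError("padding must be 0 and dilation must be 1")
    if conv.groups != 1:
        raise ValueError("groups must be 1")

    out_dim = conv.out_channels
    k_t, k_h, k_w = conv.kernel_size
    in_dim = conv.in_channels * k_t * k_h * k_w

    lin = nn.Linear(in_dim, out_dim, bias=conv.bias is not None)
    with torch.no_grad():
        # Row-major flatten of (in, k_t, k_h, k_w) matches x.reshape(N, -1).
        lin.weight.copy_(conv.weight.reshape(out_dim, in_dim))
        if conv.bias is not None:
            lin.bias.copy_(conv.bias)
    return lin.to(device=conv.weight.device, dtype=conv.weight.dtype)


torch.manual_seed(0)

# Illustrative sizes only; bias=False exercises the no-bias path.
conv = nn.Conv3d(3, 16, kernel_size=(2, 14, 14), stride=(2, 14, 14), bias=False)
lin = conv3d_to_linear(conv)

x = torch.randn(7, 3, 2, 14, 14)
with torch.no_grad():
    max_diff = (conv(x).view(7, -1) - lin(x.reshape(7, -1))).abs().max().item()
print(f"fp32 max abs diff: {max_diff:.2e}")

# A Conv3d with overlapping windows must not be converted.
overlapping = nn.Conv3d(3, 16, kernel_size=(2, 14, 14), stride=(1, 14, 14))
try:
    conv3d_to_linear(overlapping)
    rejected = False
except ValueError:
    rejected = True
print(f"overlapping conv rejected: {rejected}")
```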

Recommendation

Replace the Conv3d layer in Qwen3VLVisionPatchEmbed with an equivalent Linear layer. In the measurements above this removes the patch-embedding bottleneck while matching the original outputs to within round-off (fp32 max abs diff 4.77e-07, bf16 cosine similarity 0.9995).




transformers - ✅(Solved) Fix `Qwen3VLVisionPatchEmbed.proj` (`nn.Conv3d` with `stride == kernel`) is ~50,000× slower than equivalent `nn.Linear` on Blackwell + bf16 [1 pull requests, 1 participants]


PR #45771: perf(qwen3_vl): replace Conv3d with F.linear in patch embed forward
