transformers - ✅(Solved) Fix [BUG] Perceiver image classification (non-default res) fails even with interpolate_pos_encoding=True [2 pull requests, 1 participants]

harshaljanjani · 2026-03-20T19:58:09Z



huggingface/transformers#44898•Fetched 2026-04-08 01:07:55


Author

harshaljanjani

Participants

harshaljanjani

Timeline (top)

closed ×1, cross-referenced ×1, labeled ×1


Fix Action

Fixed

Fixed by PR: fix(models): Fix Perceiver interpolate_pos_encoding interpolating to the source size (https://github.com/huggingface/transformers/pull/44899)

PR fix notes

PR #44899: fix(models): Fix Perceiver interpolate_pos_encoding interpolating to the source size

Repository: huggingface/transformers
Author: harshaljanjani
State: open | merged: False
Link: https://github.com/huggingface/transformers/pull/44899

Description (problem / solution / changelog)

What does this PR do?

The following failing Perceiver use case was identified and fixed in this PR:

→ c6d2848a23 (🚨 Fix torch.jit.trace for interpolate_pos_encoding in all vision models, https://github.com/huggingface/transformers/pull/33226) refactored the interpolate_pos_encoding methods of all vision models for torch.jit.trace. The canonical pattern used across other vision models (e.g. modeling_vit.py, https://github.com/huggingface/transformers/blob/main/src/transformers/models/vit/modeling_vit.py#L82-L91, and modeling_deit.py, https://github.com/huggingface/transformers/blob/main/src/transformers/models/deit/modeling_deit.py#L82-L91) is to pass the target (height, width) to nn.functional.interpolate, but the Perceiver diff passed the source grid dimensions, which makes the interpolation practically a no-op. This PR should fix that.

→ I also checked whether other models have the exact same issue, and they don't: they compute new_height = height // self.patch_size (the target patch grid) and pass that.
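The difference between interpolating to the source grid and to the target grid can be sketched as follows. This is a minimal illustration, not the code in modeling_perceiver.py: the helper name `resize_pos_emb`, the embedding width of 8, and the bicubic mode are assumptions chosen for the example.

```python
import torch
import torch.nn.functional as F


def resize_pos_emb(pos_emb, src_hw, dst_hw):
    """Resize a (num_positions, dim) table from a src_hw grid to a dst_hw grid.

    Hypothetical helper for illustration only.
    """
    src_h, src_w = src_hw
    dim = pos_emb.shape[-1]
    # (num_positions, dim) -> (1, dim, src_h, src_w), the layout interpolate expects
    grid = pos_emb.reshape(1, src_h, src_w, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=dst_hw, mode="bicubic", align_corners=False)
    # back to (num_positions, dim) for the new grid
    return grid.permute(0, 2, 3, 1).reshape(-1, dim)


pos_emb = torch.randn(224 * 224, 8)  # assumed small dim to keep the example light

# Passing the source size as the target leaves the number of positions unchanged
same_size = resize_pos_emb(pos_emb, (224, 224), (224, 224))
# Passing the target size produces one embedding per position of the larger input
resized = resize_pos_emb(pos_emb, (224, 224), (384, 384))

print(same_size.shape)  # torch.Size([50176, 8])
print(resized.shape)    # torch.Size([147456, 8])
```

With the source size as the target, the table keeps its 50176 rows, which is why a 384×384 input later fails with a shape mismatch.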

Fixes #44898

Before the fix (feel free to cross-check; these errors are reproducible):

After the fix (feel free to cross-check):

Before submitting

- [ ] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
- [x] Did you read the contributor guideline, Pull Request section?
- [x] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
- [ ] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
- [x] Did you fix any necessary existing tests?

Changed files

src/transformers/models/perceiver/modeling_perceiver.py (modified, +1/-1)

Code Example

import torch
from transformers import PerceiverForImageClassificationLearned, PerceiverImageProcessorPil
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
image_processor = PerceiverImageProcessorPil(size={"height": 384, "width": 384})
model = PerceiverForImageClassificationLearned.from_pretrained("deepmind/vision-perceiver-learned")
model.eval()
inputs = image_processor(image, return_tensors="pt").pixel_values
try:
    with torch.no_grad():
        outputs = model(inputs=inputs, interpolate_pos_encoding=True)
    print("Logits shape:", outputs.logits.shape)
    predicted_class = outputs.logits.argmax(-1).item()
    print("Predicted class:", predicted_class)
except Exception as e:
    print(e)


System Info

transformers version: 5.0.0.dev0
Platform: Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.39
Python version: 3.12.3
huggingface_hub version: 1.3.2
safetensors version: 0.7.0
accelerate version: 1.12.0
Accelerate config: not installed
DeepSpeed version: not installed
PyTorch version (accelerator?): 2.9.1+cu128 (CUDA)
GPU type: NVIDIA L4
NVIDIA driver version: 550.90.07
CUDA version: 12.4

Information

- [x] The official example scripts
- [ ] My own modified scripts

Tasks

- [x] An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)

Reproduction

import torch
from transformers import PerceiverForImageClassificationLearned, PerceiverImageProcessorPil
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
image_processor = PerceiverImageProcessorPil(size={"height": 384, "width": 384})
model = PerceiverForImageClassificationLearned.from_pretrained("deepmind/vision-perceiver-learned")
model.eval()
inputs = image_processor(image, return_tensors="pt").pixel_values
try:
    with torch.no_grad():
        outputs = model(inputs=inputs, interpolate_pos_encoding=True)
    print("Logits shape:", outputs.logits.shape)
    predicted_class = outputs.logits.argmax(-1).item()
    print("Predicted class:", predicted_class)
except Exception as e:
    print(e)

→ Running image classification on a 384×384 image (the pretrained default is 224×224) crashes with a RuntimeError, even with interpolate_pos_encoding=True set in the expectation that the model would handle the resolution difference.

→ From the screenshot, 384×384 = 147456 and 224×224 = 50176, so the position encoding was never actually resized (see the reproduction output).
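The two numbers quoted above are the per-pixel position counts at each resolution. A quick check of the arithmetic, with `num_positions` as a hypothetical helper name:

```python
def num_positions(height, width):
    """Number of positions when there is one position per pixel."""
    return height * width


native = num_positions(224, 224)     # pretrained resolution
requested = num_positions(384, 384)  # resolution used in the reproduction

print(native)               # 50176
print(requested)            # 147456
print(requested == native)  # False: the encoding must be resized to match
```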

Current Repro Output:

Expected behavior

→ Inference should complete successfully (logits of shape torch.Size([1, 1000])) when interpolate_pos_encoding=True is passed with a non-native input resolution.

extent analysis

Fix Plan

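The reproduction already resizes the input to 384×384 through the image processor's size parameter, so preprocessing is not the cause. Per PR #44899, the failure is inside the model: Perceiver's interpolate_pos_encoding passed the source grid dimensions to nn.functional.interpolate, so the position encoding kept its 224×224 size. The fix is the one-line change to src/transformers/models/perceiver/modeling_perceiver.py in that PR.

Step-by-Step Solution

Use a transformers build that contains the change from PR #44899 (listed above as open and not yet merged), or apply the one-line change locally.
Keep the image processor's size at the desired resolution (384×384) and call the model with interpolate_pos_encoding=True.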

Example Code

import torch
from transformers import PerceiverForImageClassificationLearned, PerceiverImageProcessorPil
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Preprocess at the non-native 384x384 resolution (the checkpoint was pretrained at 224x224)
image_processor = PerceiverImageProcessorPil(
    size={"height": 384, "width": 384},
    resample=Image.BICUBIC,  # bicubic resampling when resizing the input image
)

model = PerceiverForImageClassificationLearned.from_pretrained("deepmind/vision-perceiver-learned")
model.eval()

# Preprocess the image
inputs = image_processor(image, return_tensors="pt").pixel_values

try:
    with torch.no_grad():
        outputs = model(inputs=inputs, interpolate_pos_encoding=True)
    print("Logits shape:", outputs.logits.shape)
    predicted_class = outputs.logits.argmax(-1).item()
    print("Predicted class:", predicted_class)
except Exception as e:
    print(e)

Verification

To verify that the fix worked, run the updated code and check if the inference completes successfully. The output should be a tensor with shape (1, 1000), representing the logits for the 1000 classes.
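Because the reproduction script catches every exception and only prints it, a failed run can be easy to miss. An explicit shape check makes the result unambiguous; the sketch below uses a hypothetical helper, `check_logits_shape`, and a stand-in output object so it runs without downloading the checkpoint.

```python
from types import SimpleNamespace


def check_logits_shape(outputs, batch_size=1, num_labels=1000):
    """Return True if outputs.logits has shape (batch_size, num_labels)."""
    return tuple(outputs.logits.shape) == (batch_size, num_labels)


# Stand-in for a model output, used only to exercise the helper here;
# in practice, pass the object returned by the model call.
fake_outputs = SimpleNamespace(logits=SimpleNamespace(shape=(1, 1000)))
print(check_logits_shape(fake_outputs))
```

In the real script, the helper would be called on `outputs` right after the forward pass.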

Extra Tips

Make sure to use the correct resampling filter when resizing images to avoid losing important details.
If you're working with large images, consider using a more efficient resizing method or a library like OpenCV for better performance.
Always verify the input shape and size before passing it to the model to avoid any potential issues.
