transformers - ✅(Solved) Fix [BUG] Perceiver image classification (non-default res) fails even with interpolate_pos_encoding=True [2 pull requests, 1 participant]

GitHub stats
huggingface/transformers#44898 (fetched 2026-04-08 01:07:55)
Comments: 0 · Participants: 1 · Timeline events: 3 · Reactions: 0
Timeline (top): closed ×1, cross-referenced ×1, labeled ×1


Fix Action

Fixed

PR fix notes

PR #44899: fix(models): Fix Perceiver interpolate_pos_encoding interpolating to the source size

Description (problem / solution / changelog)

What does this PR do?

The following failing Perceiver use case was identified and fixed in this PR:

→ Commit c6d2848a23 (🚨 Fix torch.jit.trace for interpolate_pos_encoding in all vision models) refactored the interpolate_pos_encoding methods of all vision models for torch.jit.trace. The canonical pattern used across other vision models (e.g. modeling_vit.py, modeling_deit.py) passes the target (height, width) to nn.functional.interpolate, but the Perceiver diff passed the source grid dims, which made the interpolation effectively a no-op. This PR passes the target size instead.
→ I also checked whether other models have the same issue, and they don't: they compute new_height = height // self.patch_size (the target patch grid) and pass that.
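The difference between interpolating to the source size and to the target size can be shown with a small sketch. This is not the Perceiver code: the helper name interpolate_grid, the toy channel count of 8, and the (1, C, H, W) layout are assumptions made for illustration. Only nn.functional.interpolate and the 224 and 384 sizes come from the issue.

```python
import torch
import torch.nn.functional as F


def interpolate_grid(pos_emb: torch.Tensor, size: tuple) -> torch.Tensor:
    """Resize a (1, C, H, W) grid of position embeddings to `size` (hypothetical helper)."""
    return F.interpolate(pos_emb, size=size, mode="bicubic", align_corners=False)


source_hw = (224, 224)  # resolution the embeddings were trained at
target_hw = (384, 384)  # resolution of the incoming image

# Toy embeddings: 8 channels is an arbitrary choice for this sketch.
pos_emb = torch.randn(1, 8, *source_hw)

# Buggy pattern: the source size is passed, so the grid keeps its shape.
buggy = interpolate_grid(pos_emb, source_hw)

# Fixed pattern: the target size is passed, so the grid matches the input image.
fixed = interpolate_grid(pos_emb, target_hw)

print("buggy:", tuple(buggy.shape[-2:]))
print("fixed:", tuple(fixed.shape[-2:]))
```

With the source size, the output still has 224×224 positions and cannot line up with a 384×384 input, which is consistent with the shape mismatch reported in the issue.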

Fixes #44898

Before the fix (feel free to cross-check; these errors are reproducible):

[Screenshot 1: error output before the fix, https://github.com/user-attachments/assets/5a478c28-2fdd-443d-bb75-612816b87e5f]

After the fix (feel free to cross-check):

[Screenshot 2: output after the fix, https://github.com/user-attachments/assets/fc5eb0c7-09a4-44f6-9e3a-e6e18ec1860f]

Changed files

  • src/transformers/models/perceiver/modeling_perceiver.py (modified, +1/-1)

System Info

  • transformers version: 5.0.0.dev0
  • Platform: Linux-5.15.167.4-microsoft-standard-WSL2-x86_64-with-glibc2.39
  • Python version: 3.12.3
  • huggingface_hub version: 1.3.2
  • safetensors version: 0.7.0
  • accelerate version: 1.12.0
  • Accelerate config: not installed
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.9.1+cu128 (CUDA)
  • GPU type: NVIDIA L4
  • NVIDIA driver version: 550.90.07
  • CUDA version: 12.4

Reproduction

import torch
from transformers import PerceiverForImageClassificationLearned, PerceiverImageProcessorPil
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
image_processor = PerceiverImageProcessorPil(size={"height": 384, "width": 384})
model = PerceiverForImageClassificationLearned.from_pretrained("deepmind/vision-perceiver-learned")
model.eval()
inputs = image_processor(image, return_tensors="pt").pixel_values
try:
    with torch.no_grad():
        outputs = model(inputs=inputs, interpolate_pos_encoding=True)
    print("Logits shape:", outputs.logits.shape)
    predicted_class = outputs.logits.argmax(-1).item()
    print("Predicted class:", predicted_class)
except Exception as e:
    print(e)

→ Running image classification on a 384×384 image (the pretrained default is 224×224) crashes with a RuntimeError, even with interpolate_pos_encoding=True, which is expected to make the model handle the resolution difference.
→ From the screenshot, 384×384 = 147456 and 224×224 = 50176, so the position encoding was never actually resized (see the reproduction output).
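The two position counts can be checked with a few lines of arithmetic. The helper name num_positions is hypothetical, and the sketch assumes one position per pixel, which matches the numbers quoted from the screenshot.

```python
def num_positions(height: int, width: int) -> int:
    """Number of positions for an image, assuming one position per pixel."""
    return height * width


native = num_positions(224, 224)     # pretrained default resolution
requested = num_positions(384, 384)  # resolution used in the reproduction

print(native)     # 50176
print(requested)  # 147456

# A mismatch means the position encodings must be interpolated before use.
needs_interpolation = native != requested
print(needs_interpolation)  # True
```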

Current Repro Output:

[Screenshot: current reproduction output, https://github.com/user-attachments/assets/3f1ac00d-5f36-4d3b-be2a-21f46accd0bb]

Expected behavior

→ Inference should complete successfully (logits of shape torch.Size([1, 1000])) when interpolate_pos_encoding=True is passed with a non-native input resolution.

extent analysis

Fix Plan

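The reproduction already resizes the image to 384×384 through the image_processor, so preprocessing is not the cause. The failure comes from interpolate_pos_encoding in modeling_perceiver.py interpolating to the source size. The fix is the one-line change in PR #44899, so the script needs a transformers build that includes that change.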

Step-by-Step Solution

  • Install a transformers build that includes the change from PR #44899.
  • Keep the image_processor size at 384x384 and pass interpolate_pos_encoding=True to the model, as in the reproduction.

Example Code

import torch
from transformers import PerceiverForImageClassificationLearned, PerceiverImageProcessorPil
from PIL import Image
import requests

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Configure the processor to resize to 384x384 (same size as in the reproduction)
image_processor = PerceiverImageProcessorPil(
    size={"height": 384, "width": 384},
    resample=Image.BICUBIC  # bicubic resampling, set explicitly
)

model = PerceiverForImageClassificationLearned.from_pretrained("deepmind/vision-perceiver-learned")
model.eval()

# Preprocess the image
inputs = image_processor(image, return_tensors="pt").pixel_values

try:
    with torch.no_grad():
        outputs = model(inputs=inputs, interpolate_pos_encoding=True)
    print("Logits shape:", outputs.logits.shape)
    predicted_class = outputs.logits.argmax(-1).item()
    print("Predicted class:", predicted_class)
except Exception as e:
    print(e)

Verification

To verify that the fix worked, run the updated code and check if the inference completes successfully. The output should be a tensor with shape (1, 1000), representing the logits for the 1000 classes.
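The shape check described above could be turned into an explicit assertion. The sketch below is only illustrative: check_logits is a hypothetical helper, and a dummy tensor stands in for outputs.logits so the snippet runs without downloading the model.

```python
import torch


def check_logits(logits: torch.Tensor, num_classes: int = 1000) -> int:
    """Validate the logits shape and return the predicted class index (hypothetical helper)."""
    expected = torch.Size([1, num_classes])
    if logits.shape != expected:
        raise ValueError(f"expected logits of shape {expected}, got {logits.shape}")
    return int(logits.argmax(-1).item())


# Dummy stand-in for outputs.logits; index 285 is an arbitrary choice.
dummy_logits = torch.zeros(1, 1000)
dummy_logits[0, 285] = 1.0

predicted = check_logits(dummy_logits)
print("Predicted class:", predicted)
```

In the real script, outputs.logits would be passed in place of dummy_logits.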

Extra Tips

  • Make sure to use the correct resampling filter when resizing images to avoid losing important details.
  • If you're working with large images, consider using a more efficient resizing method or a library like OpenCV for better performance.
  • Always verify the input shape and size before passing it to the model to avoid any potential issues.
