transformers - 💡(How to fix) Fix Incorrect LLM output when using pipeline parallelism [1 participants]

transformers • 2026-03-23 11:26:59

huggingface/transformers#44945•Fetched 2026-04-08 01:16:48

Author

tasinislam21

Participants

tasinislam21

Root Cause

Couldn't use torch.float32 because the T4 GPU does not have enough memory.

System Info

transformers==4.57.1 Python==3.12.12 Kaggle env

Who can help?

@CyrilVallez @3outeille

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

I am developing a notebook that runs the Molmo2 - action recognition and video understanding LLM model - on Kaggle. This setup will allow users with limited computational resources to run a demo on Kaggle's GPU for free. Kaggle provides an environment with 2 NVIDIA T4 GPUs. I have manually mapped the layers across each GPU to ensure that they fit within the VRAM constraints. However, I am experiencing extremely poor model performance, as it seems to operate as if the checkpoints were not loaded correctly.

On a single GPU or CPU, the model functions properly and produces expected results. Could someone please review my notebook and suggest a solution to this issue? Your help would be greatly appreciated.

Link to my notebook.

What I have already tried:

- Used the load_in_8bit parameter, but when I called the generate function, I encountered a NotImplementedError, so I reverted to using torch.float16.
- Couldn't use torch.float32 because the T4 GPU does not have enough memory.
- Tried using the argument device_map="auto", but the mapping was problematic: half of a block stayed on one device while the other half ended up on the other. This is an issue when residuals are involved.
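
One way to confirm the split-block observation is to inspect the placement that was actually applied. The sketch below is a minimal, pure-Python check, assuming the placement is available as a flat dictionary of module name to device (transformers exposes this as model.hf_device_map after loading with a device_map). The module names in the example are hypothetical and are not taken from the Molmo2 checkpoint.

```python
from collections import defaultdict


def find_split_blocks(device_map, depth):
    """Return blocks whose sub-modules sit on more than one device.

    A "block" is identified by the first `depth` dotted components of a
    module name, e.g. depth=3 turns "model.layers.1.mlp" into "model.layers.1".
    """
    devices_by_block = defaultdict(set)
    for name, device in device_map.items():
        block = ".".join(name.split(".")[:depth])
        devices_by_block[block].add(device)
    # Keep only the blocks that span several devices.
    return {
        block: sorted(devices, key=str)
        for block, devices in devices_by_block.items()
        if len(devices) > 1
    }


# Hypothetical placement in which block 1 is cut in half.
example_map = {
    "model.layers.0": 0,
    "model.layers.1.attn": 0,
    "model.layers.1.mlp": 1,
    "model.layers.2": 1,
}
split = find_split_blocks(example_map, depth=3)
print(split)
```

An empty result means every block at the chosen depth lives on a single device. The right depth depends on how the real model names its blocks.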

Expected behavior

The model should say that there are penguins in the video.

extent analysis

Fix Plan

To resolve the poor model performance issue with the Molmo2 model on multiple GPUs, we'll focus on proper model loading and device mapping.

Step 1: Model Loading
- Load the model with an explicit device_map so that the placement of every layer is known.
- Instead of using device_map="auto", pass a custom device map that keeps each block whole on a single GPU.

Step 2: Custom Device Mapping
- Write the map as a flat dictionary from module name to device.
- Cover every module of the model, and assign whole blocks rather than their sub-modules, so that a residual connection never crosses devices.

Example code snippet:

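import torch
from transformers import AutoModelForCausalLM

# device_map must be a flat dict of module name -> device; nested dicts are not accepted.
# The module names below are placeholders, not the real Molmo2 names. List the real
# ones with model.named_modules() and make sure the map covers every module.
device_map = {
    "model.embeddings": 0,
    "model.encoder": 0,
    "model.decoder.layer.0": 0,
    "model.decoder.layer.1": 1,
}

# Load the model with the custom device map. float16 is used because float32
# does not fit on the T4 GPUs. Use the model class named on the model card if it differs.
model = AutoModelForCausalLM.from_pretrained(
    "allenai/Molmo2-8B",
    device_map=device_map,
    torch_dtype=torch.float16,
)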
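
For a model with many decoder blocks, typing the map by hand is error-prone. The following is a minimal sketch of generating a flat map that assigns contiguous runs of whole blocks to each GPU. The function name, the layer prefix, and the names of the modules before and after the blocks are all hypothetical; replace them with the names reported by model.named_modules() for the real checkpoint.

```python
def build_layer_device_map(num_layers, num_gpus, layer_prefix,
                           first_modules, last_modules):
    """Build a flat module-name -> GPU-index map with whole blocks per GPU."""
    # Modules that run before the blocks (e.g. embeddings) go on the first GPU.
    device_map = {name: 0 for name in first_modules}
    # Ceiling division: number of blocks assigned to each GPU.
    per_gpu = -(-num_layers // num_gpus)
    for i in range(num_layers):
        device_map[f"{layer_prefix}.{i}"] = i // per_gpu
    # Modules that run after the blocks (e.g. final norm, head) go on the last GPU.
    for name in last_modules:
        device_map[name] = num_gpus - 1
    return device_map


# Hypothetical 4-block model spread over 2 GPUs.
example = build_layer_device_map(
    num_layers=4,
    num_gpus=2,
    layer_prefix="model.layers",
    first_modules=["model.embed_tokens"],
    last_modules=["model.norm", "lm_head"],
)
for name, device in example.items():
    print(name, "->", device)
```

An even split by block count is only a starting point. If one GPU also holds a vision encoder or other large modules, it may need fewer blocks than the other.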

Step 3: Model Inference
- Perform model inference using the loaded model.
- Ensure that the input data is properly moved to the correct device (GPU) before passing it to the model.

Example code snippet:

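# `inputs` is assumed to be the dict of tensors returned by the model's processor.
# Move the inputs to the device that holds the first layer (here cuda:0).
inputs = {name: tensor.to("cuda:0") for name, tensor in inputs.items()}

# Perform model inference
output = model.generate(**inputs, max_new_tokens=64)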

Verification

To verify that the fix worked, check the model's performance on the target task. In this case, the model should correctly identify the presence of penguins in the video.

Run the model inference code with the custom device map and verify that the output is correct.
Compare the model's performance with the expected behavior to ensure that the issue is resolved.
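
Because the single-GPU or CPU run already produces the expected answer, it can serve as a reference. The sketch below compares two sequences of generated token ids and reports where they first differ. It assumes both runs use the same prompt and greedy decoding (do_sample=False), and that the ids have been converted to plain Python lists. The function name and the example ids are hypothetical.

```python
def first_divergence(reference_ids, candidate_ids):
    """Return the index of the first differing token, or None if identical."""
    for i, (ref, cand) in enumerate(zip(reference_ids, candidate_ids)):
        if ref != cand:
            return i
    # One sequence is a strict prefix of the other.
    if len(reference_ids) != len(candidate_ids):
        return min(len(reference_ids), len(candidate_ids))
    return None


# Hypothetical token ids from a single-device run and a two-GPU run.
single_device = [101, 7, 42, 42, 9]
two_gpus = [101, 7, 13, 55, 9]
position = first_divergence(single_device, two_gpus)
print(position)
```

A divergence within the first few tokens may point to a loading or placement problem, while a late divergence can also come from ordinary float16 rounding differences.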

Extra Tips

When working with large models like Molmo2, it's essential to carefully manage memory usage to avoid running out of VRAM.
Experiment with different device mappings to find the optimal configuration for your specific use case.
Consider pruning or quantization to reduce the model's memory footprint. Note that load_in_8bit raised a NotImplementedError in generate in this setup, so a different quantization route may be needed.
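
A rough estimate of weight memory helps explain why float32 did not fit. The sketch below multiplies a parameter count by the bytes per parameter for each dtype. The figure of 8 billion parameters is an assumption read from the "8B" in the model name, and the estimate covers weights only, not activations or the KV cache. A T4 has 16 GB of memory.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_gib(num_params, dtype):
    """Approximate size of the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


NUM_PARAMS = 8e9  # assumption: taken from "8B" in the model name

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_gib(NUM_PARAMS, dtype):.1f} GiB")
```

Under this assumption, float32 weights alone come close to the combined memory of two T4s, while float16 weights need roughly half of it, which is consistent with needing both GPUs for float16.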
