transformers - 💡(How to fix) Fix Incorrect LLM output when using pipeline parallelism [1 participants]

GitHub stats
huggingface/transformers#44945 · Fetched 2026-04-08 01:16:48
Comments: 0 · Participants: 1 · Timeline events: 5 · Reactions: 0
Timeline (top): mentioned ×2, subscribed ×2, labeled ×1

Root Cause

  • Couldn't use torch.float32 because the T4 GPU does not have enough memory.

System Info

transformers==4.57.1, Python==3.12.12, Kaggle environment

Who can help?

@CyrilVallez @3outeille

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I am developing a notebook that runs Molmo2, an action recognition and video understanding LLM, on Kaggle, so that users with limited computational resources can run a demo on Kaggle's free GPUs. Kaggle provides an environment with 2 NVIDIA T4 GPUs. I manually mapped the layers to each GPU so that they fit within the VRAM constraints. However, the output quality is extremely poor: the model behaves as if the checkpoint weights had not been loaded correctly.

On a single GPU or CPU, the model functions properly and produces expected results. Could someone please review my notebook and suggest a solution to this issue? Your help would be greatly appreciated.

Link to my notebook.

What I have already tried:

  • Used the load_in_8bit parameter, but when I called the generate function, I encountered a NotImplementedError, so I reverted back to using torch.float16.

  • Couldn't use torch.float32 because the T4 GPU does not have enough memory.

  • Tried using the argument device_map="auto", but the mapping was problematic, as half of a block stayed on one device while the other half ended up elsewhere. This is an issue when residuals are involved.
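One way to confirm the block-splitting problem described in the last bullet is to inspect the resolved device map and list every block whose submodules ended up on more than one device. The sketch below is a minimal, pure-Python check. It assumes the map is a flat dictionary of module names to devices (for example the `hf_device_map` attribute that transformers sets on a model loaded with a `device_map`), and that blocks are named like `...layers.<index>` or `...blocks.<index>`. The module names in `example_map` are hypothetical and are not taken from Molmo2.

```python
import re
from collections import defaultdict

# Assumption: transformer blocks are named "<prefix>.layers.<i>" or "<prefix>.blocks.<i>".
BLOCK_PATTERN = r"((?:^|.*\.)(?:layers|blocks)\.\d+)(?:\.|$)"


def find_split_blocks(device_map, block_pattern=BLOCK_PATTERN):
    """Return {block_name: sorted devices} for blocks spread over several devices."""
    devices_per_block = defaultdict(set)
    for module_name, device in device_map.items():
        match = re.match(block_pattern, module_name)
        if match:
            devices_per_block[match.group(1)].add(str(device))
    return {
        block: sorted(devices)
        for block, devices in devices_per_block.items()
        if len(devices) > 1
    }


# Hypothetical map: block 15 has its attention on GPU 0 and its MLP on GPU 1.
example_map = {
    "model.layers.14": 0,
    "model.layers.15.self_attn": 0,
    "model.layers.15.mlp": 1,
    "model.layers.16": 1,
    "lm_head": 1,
}

print(find_split_blocks(example_map))
```

An empty result means no block matching the pattern is split. If the real model uses different block names, adjust `block_pattern` accordingly.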

Expected behavior

The model should say that there are penguins in the video.

extent analysis

Fix Plan

To address the degraded output of the Molmo2 model when it is split across two GPUs, focus on how the model is loaded and how its modules are mapped to devices.

  • Step 1: Model Loading

    • Pass an explicit device_map to from_pretrained so that every module is assigned to a known GPU.
    • Avoid device_map="auto" here, because (as reported above) it placed the two halves of one block on different devices.
  • Step 2: Custom Device Mapping

    • Build the map as a flat dictionary from module names to devices, keeping each block whole on a single GPU.
    • Make sure the map covers every module that holds weights, and keep consecutive layers on the same GPU so that activations cross devices only once.

Example code snippet:

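import torch
from transformers import AutoModelForCausalLM

# device_map must be a FLAT dict of module name -> device; nested dicts are not accepted.
# The module names below are placeholders, not the real Molmo2 names. List the
# real ones with `for name, _ in model.named_modules(): print(name)`.
# The map has to cover every module that holds weights.
device_map = {
    "model.embeddings": "cuda:0",
    "model.encoder": "cuda:0",
    "model.decoder": "cuda:1",
}

# Load the model with the custom device map, in float16 because float32
# does not fit on the T4 GPUs. Use the Auto class named on the model card if it
# differs, and add trust_remote_code=True if the checkpoint ships custom code.
model = AutoModelForCausalLM.from_pretrained(
    "allenai/Molmo2-8B",
    device_map=device_map,
    torch_dtype=torch.float16,
)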
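Writing a flat device map by hand is error-prone once a model has dozens of layers, so it can help to generate it. The sketch below builds a map that puts the first modules and a contiguous run of layers on GPU 0 and everything else on GPU 1, so that no block is split and activations cross devices once. All module names (`model.embed_tokens`, `model.layers`, `model.norm`, `lm_head`) and the layer count are hypothetical defaults in a common decoder-only naming style. They are assumptions, not the actual Molmo2 module names, which also include vision components that would need entries.

```python
def build_device_map(
    num_layers,
    layers_on_first_gpu,
    layer_prefix="model.layers",               # hypothetical block prefix
    first_modules=("model.embed_tokens",),     # hypothetical input-side modules
    last_modules=("model.norm", "lm_head"),    # hypothetical output-side modules
):
    """Build a flat {module_name: gpu_index} map with one contiguous split point."""
    if not 0 <= layers_on_first_gpu <= num_layers:
        raise ValueError("layers_on_first_gpu must be between 0 and num_layers")

    # Input-side modules live with the first layers on GPU 0.
    device_map = {name: 0 for name in first_modules}

    # Each block is assigned as a whole, so attention and MLP stay together.
    for index in range(num_layers):
        device_map[f"{layer_prefix}.{index}"] = 0 if index < layers_on_first_gpu else 1

    # Output-side modules live with the last layers on GPU 1.
    device_map.update({name: 1 for name in last_modules})
    return device_map


# Example with made-up sizes: 4 layers, the first 2 on GPU 0.
for module_name, gpu in build_device_map(4, 2).items():
    print(f"{module_name} -> cuda:{gpu}")
```

The resulting dictionary can be passed as `device_map=` to `from_pretrained` once the names are replaced with the ones reported by `named_modules()` for the actual checkpoint.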
  • Step 3: Model Inference
    • Run inference with the loaded model.
    • Move the input tensors to the device that holds the model's first module (the embedding layer) before calling the model.

Example code snippet:

# Move the input data to the device that holds the first module in the
# device map (cuda:0 in the example above)
input_ids = input_ids.to("cuda:0")

# Perform model inference
output = model.generate(input_ids)

Verification

To verify that the fix worked, check the model's output on the target task. In this case, the model should correctly identify the presence of penguins in the video.

  • Run inference with the custom device map and check that the answer mentions the penguins.
  • Compare the result with the single-GPU or CPU output, which the reporter states is correct, using the same prompt and video.
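When the final text differs between the working run and the two-GPU run, it can be more informative to find where the two runs start to disagree. The sketch below compares per-layer values from a reference run and a candidate run and returns the first layer index that differs beyond a tolerance. It assumes the values have already been extracted as plain lists of floats, for instance a few entries from each tensor returned when the model is called with `output_hidden_states=True`. The numbers shown are invented for illustration.

```python
import math


def first_divergent_layer(reference, candidate, rel_tol=1e-2, abs_tol=1e-3):
    """Return the index of the first layer whose values differ, or None if all match."""
    if len(reference) != len(candidate):
        raise ValueError("both runs must report the same number of layers")

    for layer_index, (ref_values, cand_values) in enumerate(zip(reference, candidate)):
        if len(ref_values) != len(cand_values):
            return layer_index
        for ref_value, cand_value in zip(ref_values, cand_values):
            # math.isclose is False when either value is NaN, so NaNs count as divergence.
            if not math.isclose(ref_value, cand_value, rel_tol=rel_tol, abs_tol=abs_tol):
                return layer_index
    return None


# Invented values: the two runs agree for layers 0 and 1 and diverge at layer 2.
reference_run = [[0.10, -0.20], [0.31, 0.05], [1.20, -0.70]]
two_gpu_run = [[0.10, -0.20], [0.31, 0.05], [9.80, 4.10]]

print(first_divergent_layer(reference_run, two_gpu_run))
```

If the first divergent layer coincides with the point where the device map switches GPUs, the cross-device transfer is a likely suspect. A divergence at layer 0 points instead at input handling or weight loading. The tolerances are loose on purpose, since float16 runs on different devices are not expected to match exactly.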

Extra Tips

  • When working with large models like Molmo2, it's essential to carefully manage memory usage to avoid running out of VRAM.
  • Experiment with different device mappings to find the optimal configuration for your specific use case.
  • Consider quantization or pruning to reduce the model's memory footprint. Note that the reporter's attempt with load_in_8bit raised a NotImplementedError in generate, so a different quantization setup may be needed.
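The memory constraint behind these tips can be checked with simple arithmetic: weight memory is roughly the parameter count times the bytes per parameter. The sketch below assumes about 8 billion parameters (inferred only from the "8B" in the checkpoint name) and counts weights alone. Activations, the KV cache, and framework overhead come on top, so the real requirement is higher.

```python
# Bytes needed to store one parameter in each dtype.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gib(num_params, dtype):
    """Approximate memory for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


NUM_PARAMS = 8e9  # assumption based on the "8B" in the model name

for dtype in ("float32", "float16", "int8"):
    print(f"{dtype}: {weight_memory_gib(NUM_PARAMS, dtype):.1f} GiB")
```

Under this assumption, float32 weights need about 29.8 GiB and float16 weights about 14.9 GiB. With two T4 cards of nominally 16 GB each, this is consistent with the report that float32 does not fit and that float16 only fits when the layers are spread across both GPUs.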
