transformers - Only TP not working with GPT-OSS MoE model

huggingface/transformers#45161 (fetched 2026-04-08 02:22:17)

System Info

  • python version: 3.10
  • transformers version: v5.1-release branch
  • torch version: 2.9.1+cu128

Who can help?

While running my training script to fine-tune the gpt-oss-20b model with Tensor Parallelism (TP), the code breaks at the following line in moe.py.

Error:

[rank0]: num_tokens_per_expert = torch.histc(histc_input, bins=self.num_experts, min=0, max=self.num_experts - 1)
[rank0]: torch.AcceleratorError: CUDA error: device-side assert triggered
[rank0]: Search for `cudaErrorAssert' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.

Complete Traceback is attached below.

CC: @SunMarc @ArthurZucker

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

  1. Set up a venv using Python 3.10, install torch 2.9.1, and git clone the v5.1-release branch of transformers.
  2. Run the command: TP_SIZE=2 torchrun --nproc_per_node=4 train_tp_trainer.py
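One detail worth ruling out, although the issue does not establish it as the cause: the command starts 4 processes, while TP_SIZE=2 gives the script below a one-dimensional mesh of only 2 ranks. The following is a minimal sketch of a guard, assuming a TP-only run where the mesh is meant to span every launched process. check_tp_matches_world is a hypothetical helper name, not part of the issue or of transformers.

```python
def check_tp_matches_world(world_size: int, tp_size: int) -> None:
    """Fail early if a 1-D TP mesh would not cover every launched process.

    Assumption (not confirmed by the issue): in a TP-only run, the mesh
    size is expected to equal the number of processes started by torchrun.
    """
    if tp_size != world_size:
        raise ValueError(
            f"TP_SIZE={tp_size} but {world_size} processes were launched; "
            "a 1-D TP mesh would leave some ranks outside the mesh."
        )


# In the training script this would be called right after
# dist.get_world_size(), e.g. check_tp_matches_world(world_size, TP_SIZE).
check_tp_matches_world(world_size=4, tp_size=4)  # sizes agree, no error
```

If the guard fires, launching with matching values (for example TP_SIZE=4 with --nproc_per_node=4) removes this variable from the investigation.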

train_tp_trainer.py code:

import os  # needed for the os.environ lookups below

import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling,
)

# -- Config -------------------------------------------------------------------
MODEL_NAME  = "openai/gpt-oss-20b"
TP_SIZE     = int(os.environ.get("TP_SIZE", "4"))
MAX_SEQ_LEN = 512
OUTPUT_DIR  = "./gpt_oss_20b_tp_output"

# -- Distributed Init ---------------------------------------------------------
dist.init_process_group(backend="nccl")

rank       = dist.get_rank()
local_rank = int(os.environ["LOCAL_RANK"])
world_size = dist.get_world_size()

torch.cuda.set_device(local_rank)

# -- Device Mesh --------------------------------------------------------------

tp_mesh = init_device_mesh(
    device_type="cuda",
    mesh_shape=(TP_SIZE,),
    mesh_dim_names=("tp",),
)

# -- Tokenizer ----------------------------------------------------------------
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

# -- Model with Native TP -----------------------------------------------------

model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME,
    tp_plan="auto",
    device_mesh=tp_mesh,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# -- Dataset ------------------------------------------------------------------
raw_dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train[:1%]")

def tokenize(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        max_length=MAX_SEQ_LEN,
        padding="max_length",
    )

tokenized_dataset = raw_dataset.map(
    tokenize,
    batched=True,
    remove_columns=["text"]
)

data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

# -- Training Arguments -------------------------------------------------------
training_args = TrainingArguments(
    output_dir=OUTPUT_DIR,
    num_train_epochs=1,
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    bf16=True,
    logging_steps=10,
    save_steps=100,
    remove_unused_columns=False,
    dataloader_num_workers=2,
    ddp_find_unused_parameters=False,
)

# -- Trainer ------------------------------------------------------------------
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
    processing_class=tokenizer,
)

# -- Train ---------------------------------------------------------------------
trainer.train()

# -- Save (rank 0 only) --------------------------------------------------------
if rank == 0:
    trainer.save_model(OUTPUT_DIR)
    tokenizer.save_pretrained(OUTPUT_DIR)
    print(f"Model saved to {OUTPUT_DIR}")

dist.destroy_process_group()

Traceback:

/pytorch/aten/src/ATen/native/cuda/IndexKernelUtils.cu:16: vectorized_gather_kernel: block: [679,0,0], thread: [190,0,0] Assertion `ind >=0 && ind < ind_dim_size && "vectorized gather kernel index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernelUtils.cu:16: vectorized_gather_kernel: block: [679,0,0], thread: [191,0,0] Assertion `ind >=0 && ind < ind_dim_size && "vectorized gather kernel index out of bounds"` failed.
[rank0]: Traceback (most recent call last):
[rank0]:   File "/local/mnt/workspace/OG_TP/train_tp_trainer.py", line 104, in <module>
[rank0]:     trainer.train()
[rank0]:   File "/local/mnt/workspace/OG_TP/transformers/src/transformers/trainer.py", line 2129, in train
[rank0]:     return inner_training_loop(
[rank0]:   File "/local/mnt/workspace/OG_TP/transformers/src/transformers/trainer.py", line 2496, in _inner_training_loop
[rank0]:     tr_loss_step = self.training_step(model, inputs, num_items_in_batch)
[rank0]:   File "/local/mnt/workspace/OG_TP/transformers/src/transformers/trainer.py", line 3776, in training_step
[rank0]:     loss = self.compute_loss(model, inputs, num_items_in_batch=num_items_in_batch)
[rank0]:   File "/local/mnt/workspace/OG_TP/transformers/src/transformers/trainer.py", line 3847, in compute_loss
[rank0]:     outputs = model(**inputs)
[rank0]:   File "/local/mnt/workspace/.og_tp_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/local/mnt/workspace/.og_tp_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/local/mnt/workspace/OG_TP/accelerate/src/accelerate/utils/operations.py", line 823, in forward
[rank0]:     return model_forward(*args, **kwargs)
[rank0]:   File "/local/mnt/workspace/OG_TP/accelerate/src/accelerate/utils/operations.py", line 811, in __call__
[rank0]:     return convert_to_fp32(self.model_forward(*args, **kwargs))
[rank0]:   File "/local/mnt/workspace/.og_tp_env/lib/python3.10/site-packages/torch/amp/autocast_mode.py", line 44, in decorate_autocast
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/local/mnt/workspace/OG_TP/transformers/src/transformers/utils/generic.py", line 837, in wrapper
[rank0]:     output = func(self, *args, **kwargs)
[rank0]:   File "/local/mnt/workspace/OG_TP/transformers/src/transformers/models/gpt_oss/modeling_gpt_oss.py", line 659, in forward
[rank0]:     outputs: MoeModelOutputWithPast = self.model(
[rank0]:   File "/local/mnt/workspace/.og_tp_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/local/mnt/workspace/.og_tp_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/local/mnt/workspace/OG_TP/transformers/src/transformers/utils/generic.py", line 1017, in wrapper
[rank0]:     outputs = func(self, *args, **kwargs)
[rank0]:   File "/local/mnt/workspace/OG_TP/transformers/src/transformers/models/gpt_oss/modeling_gpt_oss.py", line 498, in forward
[rank0]:     hidden_states = decoder_layer(
[rank0]:   File "/local/mnt/workspace/OG_TP/transformers/src/transformers/modeling_layers.py", line 93, in __call__
[rank0]:     return super().__call__(*args, **kwargs)
[rank0]:   File "/local/mnt/workspace/.og_tp_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/local/mnt/workspace/.og_tp_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/local/mnt/workspace/OG_TP/transformers/src/transformers/models/gpt_oss/modeling_gpt_oss.py", line 389, in forward
[rank0]:     hidden_states, _ = self.mlp(hidden_states)  # diff with llama: router scores
[rank0]:   File "/local/mnt/workspace/.og_tp_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/local/mnt/workspace/.og_tp_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/local/mnt/workspace/OG_TP/transformers/src/transformers/models/gpt_oss/modeling_gpt_oss.py", line 143, in forward
[rank0]:     hidden_states = self.experts(hidden_states, router_indices, router_scores)
[rank0]:   File "/local/mnt/workspace/.og_tp_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/local/mnt/workspace/.og_tp_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1881, in _call_impl
[rank0]:     return inner()
[rank0]:   File "/local/mnt/workspace/.og_tp_env/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1829, in inner
[rank0]:     result = forward_call(*args, **kwargs)
[rank0]:   File "/local/mnt/workspace/OG_TP/transformers/src/transformers/integrations/moe.py", line 350, in forward
[rank0]:     return experts_forward(self, *args, **kwargs)
[rank0]:   File "/local/mnt/workspace/OG_TP/transformers/src/transformers/integrations/moe.py", line 248, in grouped_mm_experts_forward
[rank0]:     num_tokens_per_expert = torch.histc(histc_input, bins=self.num_experts, min=0, max=self.num_experts - 1)
[rank0]: torch.AcceleratorError: CUDA error: device-side assert triggered
[rank0]: Search for `cudaErrorAssert' in https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html for more information.
[rank0]: CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
[rank0]: For debugging consider passing CUDA_LAUNCH_BLOCKING=1
[rank0]: Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Expected behavior

Since TP is supported, the script should run successfully.

extent analysis

TL;DR

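The assertion printed at the top of the traceback is an out-of-bounds index in vectorized_gather_kernel. Because CUDA errors are reported asynchronously, the torch.histc line in the Python traceback may only be where the failure surfaced. The first step is to find which index tensor leaves its valid range, then make sure the expert indices reaching the experts module stay within [0, self.num_experts - 1].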

Guidance

  • Rerun with CUDA_LAUNCH_BLOCKING=1, as the log suggests, so that the Python traceback points at the kernel launch that actually failed rather than at a later call.
  • Verify that histc_input values are within [0, self.num_experts - 1] before torch.histc is called, and check the index tensors used by the indexing operations just before it in grouped_mm_experts_forward.
  • Check how router_indices and router_scores are computed in modeling_gpt_oss.py, and confirm that they are valid inputs for the experts module once the model is sharded with TP.
  • Add explicit range checks while debugging. Clipping can silence the assertion, but it changes which expert a token is sent to, so use it only to confirm the diagnosis.
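The range check in the guidance above can be sketched as follows. This is a minimal illustration, not code from transformers: assert_expert_ids_in_range is a hypothetical helper, and the assumption is that it would be called on the index tensor just before the failing region in moe.py.

```python
import torch


def assert_expert_ids_in_range(expert_ids: torch.Tensor, num_experts: int) -> None:
    """Raise a readable Python error if any expert id is outside [0, num_experts - 1].

    Note: reading min/max back to the host forces a device synchronisation,
    so this is meant for debugging runs only.
    """
    if expert_ids.numel() == 0:
        return
    lowest = int(expert_ids.min())
    highest = int(expert_ids.max())
    if lowest < 0 or highest >= num_experts:
        raise ValueError(
            f"expert ids span [{lowest}, {highest}], "
            f"expected values in [0, {num_experts - 1}]"
        )


# Example with made-up values: 4 experts, all ids valid, so nothing is raised.
assert_expert_ids_in_range(torch.tensor([0, 3, 2, 1]), num_experts=4)
```

Raising in Python, before the kernel launch, gives an error message with the offending range instead of a device-side assert that invalidates the CUDA context.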

Example

The issue does not include a smaller failing example. The failure depends on the index values produced at runtime in modeling_gpt_oss.py and integrations/moe.py, so it has to be reproduced with the training script from the report.

Notes

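The error is a device-side assertion: a CUDA kernel detected an invalid value while running on the GPU. The two assertion lines at the top of the traceback name vectorized_gather_kernel with the message "vectorized gather kernel index out of bounds", which means an indexing operation received an index outside its tensor. torch.histc appears in the Python traceback, but the log itself warns that CUDA errors may be reported at a later API call, so torch.histc is not necessarily the operation that failed.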
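The log notes that CUDA kernel errors may be reported asynchronously and suggests CUDA_LAUNCH_BLOCKING=1. A minimal sketch of enabling it from inside the script, assuming these lines are placed at the very top of train_tp_trainer.py, before torch is imported or any CUDA work starts:

```python
import os

# Make kernel launches synchronous so an error is raised at the call that
# caused it. Set this before any CUDA work begins; placing it ahead of
# "import torch" is the safest position.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"
```

Setting the variable on the command line in front of the torchrun invocation is an alternative. Either way, expect the run to be slower, so this is for debugging only.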

Recommendation

As a diagnostic step, add input validation ahead of the torch.histc call to confirm whether out-of-range values are present and where they originate. Clipping the values may stop the assertion, but it sends the affected tokens to a different expert than the router selected, so it should not be treated as a fix for training.
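To see why clipping is not a neutral workaround, the sketch below counts tokens per expert with torch.histc on made-up ids, once as-is and once clamped. tokens_per_expert is a hypothetical helper that mirrors the call shown in the traceback; the cast to float is an assumption made so the example also runs on CPU.

```python
import torch


def tokens_per_expert(expert_ids: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Count how many routed slots each expert receives."""
    return torch.histc(
        expert_ids.float(), bins=num_experts, min=0, max=num_experts - 1
    )


num_experts = 4
ids = torch.tensor([0, 1, 1, 3, 5])  # 5 is not a valid id for 4 experts

# torch.histc ignores values outside [min, max], so the counts no longer
# add up to the number of routed slots.
raw_counts = tokens_per_expert(ids, num_experts)

# Clamping restores the total, but the invalid id is silently counted
# for the last expert.
clamped_counts = tokens_per_expert(ids.clamp(0, num_experts - 1), num_experts)

print(raw_counts.sum().item(), clamped_counts.sum().item())
```

Comparing the two totals against ids.numel() is a quick way to tell whether any id fell outside the expected range.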
