transformers - ✅(Solved) Fix gpt-oss-20b will not properly load with MXFP4 quantization and falls back to bf16 [1 pull request, 5 comments, 4 participants]

transformers · 2026-03-21 17:38:02

GitHub stats

huggingface/transformers#44912 • Fetched 2026-04-08 01:12:34

Timeline (top): mentioned ×7, subscribed ×7, commented ×5, referenced ×3

Fix Action

Fixed

Fixed by PR: fix: split MXFP4 dependency checks for specific error messages (https://github.com/huggingface/transformers/pull/44930)

PR fix notes

PR #44930: fix: split MXFP4 dependency checks for specific error messages

Repository: huggingface/transformers
Author: javierdejesusda
State: closed | merged: True
Link: https://github.com/huggingface/transformers/pull/44930

Description (problem / solution / changelog)

Summary

Fixes #44912: the MXFP4 quantization error messages combine is_triton_available() and is_kernels_available() into a single kernels_available boolean, making it impossible to identify which dependency is missing.
Split the combined check into separate triton_available and kernels_installed variables with individual, actionable error messages that include install instructions.
Hardened is_triton_available() and is_kernels_available() in import_utils.py to handle "N/A" version strings gracefully (they return False instead of crashing with InvalidVersion).
Follows the established pattern in quantizer_higgs.py, where each dependency is checked individually.
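
To illustrate the idea behind the split, here is a minimal sketch that reports each missing dependency separately. It is not the code from quantizer_mxfp4.py: the function name mxfp4_dependency_problems and the message wording are hypothetical, and only the two boolean names and the version thresholds are taken from this page.

```python
def mxfp4_dependency_problems(triton_available: bool, kernels_installed: bool) -> list[str]:
    """Return one actionable message per missing dependency (hypothetical helper)."""
    problems = []
    if not triton_available:
        # Thresholds are quoted from the warning text in the issue.
        problems.append(
            "Triton is missing or too old "
            "(CUDA requires Triton >= 3.4.0, XPU requires Triton >= 3.5.0)."
        )
    if not kernels_installed:
        problems.append("The `kernels` package is not installed; try `pip install kernels`.")
    return problems


if __name__ == "__main__":
    # Mirrors the reported environment: Triton present, kernels not listed.
    for message in mxfp4_dependency_problems(triton_available=True, kernels_installed=False):
        print(message)
```

With a single combined boolean, both cases collapse into the same message; keeping the flags separate lets the caller name the one package that actually needs installing.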

Test plan

Added test_warning_missing_kernels_only — verifies warning specifically mentions kernels when only kernels is missing
Added test_warning_missing_triton_only — verifies warning specifically mentions triton when only triton is missing
Added test_error_missing_kernels_not_prequantized — verifies ValueError mentions kernels
Added test_error_missing_triton_not_prequantized — verifies ValueError mentions triton
Added test_is_kernels_available_with_na_version — handles "N/A" version without crash
Added test_is_triton_available_with_na_version — handles "N/A" version without crash
All existing MXFP4 tests pass (23 passed, 8 skipped)
ruff format --check and ruff check clean on modified files (3 pre-existing warnings in untouched code)
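
The "N/A" cases in the test plan can be pictured with a small standard-library sketch. This is an assumption-laden illustration, not the logic in import_utils.py: parse_version_or_none and meets_minimum are hypothetical names, and the parsing only looks at the leading numeric components of the version string.

```python
import re


def parse_version_or_none(raw):
    """Parse '3.6.0' into (3, 6, 0); return None for 'N/A' or other unparsable input."""
    match = re.match(r"^(\d+)\.(\d+)(?:\.(\d+))?", raw or "")
    if match is None:
        return None
    # A missing patch component is treated as 0.
    return tuple(int(part) for part in match.groups(default="0"))


def meets_minimum(raw_version, minimum):
    """Return False, rather than raising, when the version string cannot be parsed."""
    parsed = parse_version_or_none(raw_version)
    return parsed is not None and parsed >= minimum


if __name__ == "__main__":
    print(meets_minimum("3.6.0", (3, 4, 0)))
    print(meets_minimum("N/A", (3, 4, 0)))
```

The design choice is the same one the PR describes: an unparsable version is treated as "not available" so the caller can emit a normal warning instead of crashing.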

Changed files

src/transformers/quantizers/quantizer_mxfp4.py (modified, +30/-13)
tests/quantization/mxfp4/test_mxfp4.py (modified, +73/-0)

System Info

Hi, probably this is related to https://github.com/huggingface/transformers/issues/42723

I get: MXFP4 quantization requires Triton and kernels installed: CUDA requires Triton >= 3.4.0, XPU requires Triton >= 3.5.0, we will default to dequantizing the model to bf16

When executing the following code:

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    token=HF_TOKEN,
    # model_kwargs={"dtype": torch.bfloat16},
    device_map="auto",
)

$ hf env

Copy-and-paste the text below in your GitHub issue.

huggingface_hub version: 1.7.2
Platform: Linux-6.8.0-106-generic-x86_64-with-glibc2.39
Python version: 3.12.3
Running in iPython ?: No
Running in notebook ?: No
Running in Google Colab ?: No
Running in Google Colab Enterprise ?: No
Token path ?: /scratch/work/ml_ki/llama/huggingface/token
Has saved token ?: True
Who am I ?: jottbe
Configured git credential helpers:
Installation method: unknown
httpx: 0.28.1
hf_xet: 1.4.2
gradio: N/A
tensorboard: N/A
ENDPOINT: https://huggingface.co
HF_HUB_CACHE: /scratch/work/ml_ki/llama/huggingface/hub
HF_ASSETS_CACHE: /scratch/work/ml_ki/llama/huggingface/assets
HF_TOKEN_PATH: /scratch/work/ml_ki/llama/huggingface/token
HF_STORED_TOKENS_PATH: /scratch/work/ml_ki/llama/huggingface/stored_tokens
HF_HUB_OFFLINE: False
HF_HUB_DISABLE_TELEMETRY: False
HF_HUB_DISABLE_PROGRESS_BARS: None
HF_HUB_DISABLE_SYMLINKS_WARNING: False
HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
HF_HUB_DISABLE_IMPLICIT_TOKEN: False
HF_HUB_DISABLE_XET: False
HF_HUB_ETAG_TIMEOUT: 10
HF_HUB_DOWNLOAD_TIMEOUT: 10
HF_XET_HIGH_PERFORMANCE: False

Package versions:
transformers 5.3.0
triton 3.6.0
nvidia-cublas-cu12 12.8.4.1
nvidia-cuda-cupti-cu12 12.8.90
nvidia-cuda-nvrtc-cu12 12.8.93
nvidia-cuda-runtime-cu12 12.8.90
nvidia-cudnn-cu12 9.10.2.21
nvidia-cufft-cu12 11.3.3.83
nvidia-cufile-cu12 1.13.1.3
nvidia-curand-cu12 10.3.9.90
nvidia-cusolver-cu12 11.7.3.90
nvidia-cusparse-cu12 12.5.8.93
nvidia-cusparselt-cu12 0.7.1
nvidia-nccl-cu12 2.27.5
nvidia-nvjitlink-cu12 12.8.93
nvidia-nvshmem-cu12 3.4.5
nvidia-nvtx-cu12 12.8.90
torch 2.10.0
torchaudio 2.10.0
torchvision 0.25.0

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0

Who can help?

@ArthurZucker @Cyrilvallez

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    token=HF_TOKEN,
    # model_kwargs={"dtype": torch.bfloat16},
    device_map="auto",
)

Expected behavior

Should load with 4bit quantization.

extent analysis

Fix Plan

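The warning says MXFP4 needs both Triton and the kernels package, so the first thing to establish is which of the two is missing or too old. Nothing in the warning points to the device map.

Here are the steps to follow:

Check the Triton version against the thresholds in the warning. The reported environment has Triton 3.6.0, which satisfies the CUDA requirement of Triton >= 3.4.0. Note that the installed PyTorch wheels ship CUDA 12.8 runtime libraries; the CUDA 12.0 reported by nvcc is the separately installed system toolkit.
Check that the kernels package is installed. It does not appear in the reported package list, so it may be the missing dependency; pip install kernels adds it.
Optionally, pin the model to a single device, such as device_map="cuda:0", to rule out placement effects.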
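
Because the warning names two dependencies, Triton and kernels, it helps to see which of them the current Python environment actually provides. The following is a minimal sketch using only the standard library; installed_version and report are hypothetical helper names, and it assumes the packages are published under the distribution names "triton" and "kernels" (some platforms install Triton under a different distribution name).

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version string for a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def report(names=("transformers", "torch", "triton", "kernels")):
    """Map each distribution name to its installed version (None when not installed)."""
    return {name: installed_version(name) for name in names}


if __name__ == "__main__":
    for name, version in report().items():
        print(f"{name}: {version if version is not None else 'NOT INSTALLED'}")
```

Run it with the same interpreter that runs the pipeline, since a package installed in a different virtual environment will not be visible to transformers.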

Code Changes

You can modify your pipeline code as follows:

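import os

import torch
import transformers

# Assumed values: the issue does not show how model_id and HF_TOKEN are defined
model_id = "openai/gpt-oss-20b"
HF_TOKEN = os.environ.get("HF_TOKEN")

# Set the device map to a specific device
device_map = "cuda:0"

# Create the pipeline with the specified device map
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    token=HF_TOKEN,
    device_map=device_map,
)

# Alternatively, you can pass the dtype explicitly. Note that bf16 is also
# what the fallback produces, so this does not by itself enable MXFP4.
# pipeline = transformers.pipeline(
#     "text-generation",
#     model=model_id,
#     token=HF_TOKEN,
#     model_kwargs={"dtype": torch.bfloat16},
#     device_map=device_map,
# )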

Verification

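To verify the result, inspect the model's device placement, dtype, and memory footprint:

print(getattr(pipeline.model, "hf_device_map", None))
print(pipeline.model.dtype)
print(pipeline.model.get_memory_footprint())

A dtype of torch.bfloat16 does not by itself confirm the fix, because bf16 is exactly what the fallback produces. The clearest sign of success is that the fallback warning no longer appears when the model loads; a memory footprint far below what bf16 weights would need points the same way.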
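
Since the reported failure mode is a silent fallback to bf16, a rough size estimate gives a reference point for the value returned by the model's get_memory_footprint() method. The sketch below is back-of-the-envelope arithmetic under stated assumptions: the parameter count is the nominal 20 billion from the model name, and the MXFP4 figure assumes every weight is stored as a 4-bit element with one 8-bit scale shared per block of 32 elements. Real checkpoints keep some layers in higher precision, so the actual quantized footprint will be larger than this lower bound.

```python
def weight_gigabytes(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in GB (10**9 bytes) for a given precision."""
    return num_params * bits_per_param / 8 / 1e9


NOMINAL_PARAMS = 20e9      # assumption: "20b" in the model name taken at face value
BF16_BITS = 16
MXFP4_BITS = 4 + 8 / 32    # 4-bit element plus an 8-bit scale shared by 32 elements

bf16_gb = weight_gigabytes(NOMINAL_PARAMS, BF16_BITS)
mxfp4_gb = weight_gigabytes(NOMINAL_PARAMS, MXFP4_BITS)

if __name__ == "__main__":
    print(f"bf16 weights:  ~{bf16_gb:.1f} GB")
    print(f"MXFP4 weights: ~{mxfp4_gb:.1f} GB (if every weight were quantized)")
```

A measured footprint close to the bf16 estimate suggests the model was dequantized; one much closer to the MXFP4 estimate suggests the quantized weights were kept.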

Extra Tips

Make sure you have the latest versions of transformers, Triton, and kernels installed. Releases that include PR #44930 report which of the two dependencies is missing.
If you are still having issues, try setting the HF_HUB_CACHE environment variable to a different location to rule out any caching issues.

#installation #runtime error #dependency conflict #environment setup #docker error
