transformers - ✅(Solved) Fix gpt-oss-20b will not properly load with MXFP4 quantization and falls back to bf16 [1 pull requests, 5 comments, 4 participants]

GitHub stats

huggingface/transformers#44912 (fetched 2026-04-08 01:12:34)

  • Comments: 5
  • Participants: 4
  • Timeline: 25
  • Reactions: 0
  • Timeline (top): mentioned ×7, subscribed ×7, commented ×5, referenced ×3

Fix Action

Fixed

PR fix notes

PR #44930: fix: split MXFP4 dependency checks for specific error messages

Description (problem / solution / changelog)

Summary

  • Fixes #44912 — MXFP4 quantization error messages combine is_triton_available() and is_kernels_available() into a single kernels_available boolean, making it impossible to identify which dependency is missing
  • Split the combined check into separate triton_available and kernels_installed variables with individual, actionable error messages including install instructions
  • Hardened is_triton_available() and is_kernels_available() in import_utils.py to handle "N/A" version strings gracefully (returns False instead of crashing with InvalidVersion)
  • Follows the established pattern in quantizer_higgs.py where each dependency is checked individually
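
The idea behind the split can be illustrated with a minimal sketch. This is not the code from the PR: the helper name `missing_mxfp4_dependencies`, the message wording, and the install hints are assumptions made for illustration, and the sketch only checks whether each package can be found, not its version.

```python
import importlib.util


def missing_mxfp4_dependencies(find_spec=importlib.util.find_spec):
    """Return one actionable message per missing dependency (hypothetical helper)."""
    # Hypothetical install hints; the real messages live in quantizer_mxfp4.py.
    hints = {
        "triton": "pip install 'triton>=3.4.0'",
        "kernels": "pip install kernels",
    }
    messages = []
    for name, hint in hints.items():
        # Check each package on its own instead of AND-ing the results together,
        # so the caller learns exactly which one is absent.
        if find_spec(name) is None:
            messages.append(f"MXFP4 requires `{name}`, which was not found. Try: {hint}")
    return messages


if __name__ == "__main__":
    for message in missing_mxfp4_dependencies():
        print(message)
```

The `find_spec` parameter exists only so the lookup can be replaced in a test; calling the function with no arguments inspects the current environment.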

Test plan

  • Added test_warning_missing_kernels_only — verifies warning specifically mentions kernels when only kernels is missing
  • Added test_warning_missing_triton_only — verifies warning specifically mentions triton when only triton is missing
  • Added test_error_missing_kernels_not_prequantized — verifies ValueError mentions kernels
  • Added test_error_missing_triton_not_prequantized — verifies ValueError mentions triton
  • Added test_is_kernels_available_with_na_version — handles "N/A" version without crash
  • Added test_is_triton_available_with_na_version — handles "N/A" version without crash
  • All existing MXFP4 tests pass (23 passed, 8 skipped)
  • ruff format --check and ruff check clean on modified files (3 pre-existing warnings in untouched code)
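
The two "N/A" tests target version strings that cannot be parsed. A minimal sketch of that defensive pattern follows. It is an assumption-laden stand-in, not the code in import_utils.py: `version_at_least` is a hypothetical helper, and it uses a small regex-based parser rather than whatever version parser the library actually uses.

```python
import re


def version_at_least(installed, minimum):
    """Return True if `installed` >= `minimum`; False if either string is unparseable."""

    def parse(text):
        # Accept only a leading dotted-number release such as "3.6.0" or "3.4.0+cu128".
        match = re.match(r"^(\d+(?:\.\d+)*)", text.strip())
        if match is None:
            raise ValueError(f"unparseable version: {text!r}")
        return tuple(int(part) for part in match.group(1).split("."))

    try:
        have, need = parse(installed), parse(minimum)
    except ValueError:
        # A placeholder like "N/A" is treated as "package unavailable" instead of crashing.
        return False

    # Pad with zeros so that "3.4" compares equal to "3.4.0".
    width = max(len(have), len(need))
    have += (0,) * (width - len(have))
    need += (0,) * (width - len(need))
    return have >= need
```

Returning False on an unparseable version keeps the caller on the "dependency missing" path, where it can emit a readable message.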

Changed files

  • src/transformers/quantizers/quantizer_mxfp4.py (modified, +30/-13)
  • tests/quantization/mxfp4/test_mxfp4.py (modified, +73/-0)
Original issue

System Info

Hi, probably this is related to https://github.com/huggingface/transformers/issues/42723

I get:

MXFP4 quantization requires Triton and kernels installed: CUDA requires Triton >= 3.4.0, XPU requires Triton >= 3.5.0, we will default to dequantizing the model to bf16

When executing the following code:

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    token=HF_TOKEN,
    #model_kwargs={"dtype": torch.bfloat16},
    device_map="auto",
)

$ hf env

Copy-and-paste the text below in your GitHub issue.

  • huggingface_hub version: 1.7.2
  • Platform: Linux-6.8.0-106-generic-x86_64-with-glibc2.39
  • Python version: 3.12.3
  • Running in iPython ?: No
  • Running in notebook ?: No
  • Running in Google Colab ?: No
  • Running in Google Colab Enterprise ?: No
  • Token path ?: /scratch/work/ml_ki/llama/huggingface/token
  • Has saved token ?: True
  • Who am I ?: jottbe
  • Configured git credential helpers:
  • Installation method: unknown
  • httpx: 0.28.1
  • hf_xet: 1.4.2
  • gradio: N/A
  • tensorboard: N/A
  • ENDPOINT: https://huggingface.co
  • HF_HUB_CACHE: /scratch/work/ml_ki/llama/huggingface/hub
  • HF_ASSETS_CACHE: /scratch/work/ml_ki/llama/huggingface/assets
  • HF_TOKEN_PATH: /scratch/work/ml_ki/llama/huggingface/token
  • HF_STORED_TOKENS_PATH: /scratch/work/ml_ki/llama/huggingface/stored_tokens
  • HF_HUB_OFFLINE: False
  • HF_HUB_DISABLE_TELEMETRY: False
  • HF_HUB_DISABLE_PROGRESS_BARS: None
  • HF_HUB_DISABLE_SYMLINKS_WARNING: False
  • HF_HUB_DISABLE_EXPERIMENTAL_WARNING: False
  • HF_HUB_DISABLE_IMPLICIT_TOKEN: False
  • HF_HUB_DISABLE_XET: False
  • HF_HUB_ETAG_TIMEOUT: 10
  • HF_HUB_DOWNLOAD_TIMEOUT: 10
  • HF_XET_HIGH_PERFORMANCE: False

Package versions:

  • transformers 5.3.0
  • triton 3.6.0
  • nvidia-cublas-cu12 12.8.4.1
  • nvidia-cuda-cupti-cu12 12.8.90
  • nvidia-cuda-nvrtc-cu12 12.8.93
  • nvidia-cuda-runtime-cu12 12.8.90
  • nvidia-cudnn-cu12 9.10.2.21
  • nvidia-cufft-cu12 11.3.3.83
  • nvidia-cufile-cu12 1.13.1.3
  • nvidia-curand-cu12 10.3.9.90
  • nvidia-cusolver-cu12 11.7.3.90
  • nvidia-cusparse-cu12 12.5.8.93
  • nvidia-cusparselt-cu12 0.7.1
  • nvidia-nccl-cu12 2.27.5
  • nvidia-nvjitlink-cu12 12.8.93
  • nvidia-nvshmem-cu12 3.4.5
  • nvidia-nvtx-cu12 12.8.90
  • torch 2.10.0
  • torchaudio 2.10.0
  • torchvision 0.25.0

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Fri_Jan__6_16:45:21_PST_2023
Cuda compilation tools, release 12.0, V12.0.140
Build cuda_12.0.r12.0/compiler.32267302_0

Who can help?

@ArthurZucker @Cyrilvallez

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    token=HF_TOKEN,
    #model_kwargs={"dtype": torch.bfloat16},
    device_map="auto",
)

Expected behavior

The model should load with 4-bit (MXFP4) quantization instead of falling back to bf16.

extent analysis

Fix Plan

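The warning is raised when the MXFP4 dependency check fails. Per PR #44930, that check combined is_triton_available() and is_kernels_available() into a single boolean, so the message does not say which of the two is missing. The plan is therefore to identify the missing dependency and install it.

Here are the steps to follow:

  • Check Triton. The warning asks for Triton >= 3.4.0 on CUDA. The reported Triton 3.6.0 meets that minimum, so Triton is unlikely to be the problem here.
  • Check the kernels package. It does not appear in the reported package list (which may be incomplete), so it is the most likely missing dependency. If it is absent, install it with pip install kernels and reload the model.
  • Changing the device map (for example to device_map="cuda:0") is optional. Nothing in the PR suggests that it affects the dependency check.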
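
To see which dependency is actually missing, the installed versions can be listed from the running interpreter. The sketch below uses only the standard library; the helper name `installed_versions` is hypothetical, and the distribution names are assumed to match the ones shown in the report (some platforms ship Triton under a different distribution name).

```python
from importlib import metadata


def installed_versions(packages=("triton", "kernels", "transformers", "torch")):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed in the environment this interpreter runs in.
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Run it with the same Python environment that launches the pipeline, since a package installed in a different environment will not be visible to transformers.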

Code Changes

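The pipeline call itself does not need to change to satisfy the dependency check. The version below only adds the missing import and, optionally, pins the model to a single GPU:

import torch
import transformers

# model_id and HF_TOKEN are defined as in the original report

# Optional: pin the model to a single GPU instead of device_map="auto"
device_map = "cuda:0"

# Create the pipeline with the specified device map
pipeline = transformers.pipeline(
    "text-generation",
    model=model_id,
    token=HF_TOKEN,
    device_map=device_map,
)

# Setting the dtype explicitly is not expected to change the outcome of the
# MXFP4 dependency check; it is shown here only for completeness
# pipeline = transformers.pipeline(
#     "text-generation",
#     model=model_id,
#     token=HF_TOKEN,
#     model_kwargs={"dtype": torch.bfloat16},
#     device_map=device_map,
# )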

Verification

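To verify that the fix worked, reload the model and confirm that the "we will default to dequantizing the model to bf16" warning no longer appears. You can also inspect the device placement, dtype, and memory footprint of the loaded model:

print(pipeline.model.hf_device_map)  # set when the model was loaded with a device_map
print(pipeline.model.dtype)
print(pipeline.model.get_memory_footprint())

A dtype of torch.bfloat16 does not by itself confirm the fix, because bf16 is also what the fallback produces. A memory footprint well below that of a full bf16 load is a better sign that the MXFP4 weights were kept.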
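
A rough size estimate can help interpret the number returned by the model's get_memory_footprint() method. The sketch below is back-of-the-envelope arithmetic under stated assumptions: it takes "20b" at face value as 20 billion parameters and treats every weight as quantized, which is an upper bound on the savings if only part of the model is stored in MXFP4. The bits-per-parameter figure follows from the MXFP4 format, which stores 4-bit elements with one shared 8-bit scale per block of 32 values.

```python
def estimated_weight_gb(num_params, bits_per_param):
    """Approximate weight storage in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


# MXFP4: 4-bit elements plus one shared 8-bit scale per block of 32 values.
MXFP4_BITS = 4 + 8 / 32  # 4.25 bits per quantized parameter
BF16_BITS = 16

NUM_PARAMS = 20e9  # assumption: "20b" taken at face value

if __name__ == "__main__":
    bf16_gb = estimated_weight_gb(NUM_PARAMS, BF16_BITS)
    mxfp4_gb = estimated_weight_gb(NUM_PARAMS, MXFP4_BITS)
    print(f"bf16 : ~{bf16_gb:.1f} GB")
    print(f"MXFP4: ~{mxfp4_gb:.1f} GB (if every weight were quantized)")
```

If the measured footprint sits near the bf16 estimate, the model was most likely dequantized on load.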

Extra Tips

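  • Make sure that transformers, Triton, and kernels are installed in the same Python environment that runs the pipeline. A transformers version that includes PR #44930 should report which of the two dependencies is missing.
  • If the warning persists with both packages installed, you can point the HF_HUB_CACHE environment variable at a different location to rule out caching issues, although the dependency check does not depend on the cache.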
