transformers: Model quantized via sinq broken after save_pretrained and from_pretrained [3 comments, 3 participants]

Q: Expected behavior

Expect a chat message to be printed. Example: `{'role': 'assistant', 'content': "Understood. I'm ready when you are! How can I help you with this test?"}`

transformers · 2026-05-19 03:51:53

huggingface/transformers#46050 • Fetched 2026-05-20 03:39:22

Timeline: commented ×3, cross-referenced ×1, labeled ×1

System Info

transformers version: 5.8.1
Platform: Linux-7.0.5-arch1-1-x86_64-with-glibc2.43
Python version: 3.12.13
Huggingface_hub version: 1.14.0
Safetensors version: 0.7.0
Accelerate version: 1.13.0
Accelerate config: not found
DeepSpeed version: not installed
PyTorch version (accelerator?): 2.11.0+cu130 (CUDA)
Using distributed or parallel set-up in script?: <fill in>
Using GPU in script?: <fill in>
GPU type: NVIDIA GeForce GTX 1650

Who can help?

No response

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

The following code segment fails with `KeyError: 'packing'`.

If the model is used right away after the first from_pretrained, without the save_pretrained and from_pretrained round trip, it behaves as expected.

NOTE: I have only tried this with google/gemma-4-E4B-it, so this may be specific to that model rather than to sinq.

from transformers import AutoProcessor, AutoModelForCausalLM, SinqConfig

def quantize_model(model_id: str, save_dst: str):
    # Quantize and save resulting model
    quant_cfg = SinqConfig(
        nbits=8,
        group_size=64,
        tiling_mode='2D',
        method='sinq',
        modules_to_not_convert=["lm_head", "model.audio_tower"],
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map = 'cpu',
        quantization_config=quant_cfg,
    )

    model.save_pretrained(save_dst)



def test_quantized(model_id: str, save_dst: str):
    # Loading quantized model and appropriate processor 
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(save_dst, device_map='cpu')

    # Testing quantized model
    chat = []
    chat.append({ 'role': 'user', 'content': 'this is a test.' })

    model_input = processor.apply_chat_template(chat, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device)
    skip_length = model_input['input_ids'].shape[-1]

    new_tokens = model.generate(**model_input, max_new_tokens=2048)[0, skip_length:]
    
    message = processor.parse_response(processor.decode(new_tokens))

    print(message)


model_id = 'google/gemma-4-E4B-it'
save_dst = './gemma-4-E4B-it-sinq/'

quantize_model(model_id, save_dst)
test_quantized(model_id, save_dst)
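
Since the model works before the save_pretrained / from_pretrained round trip and fails after it, one way to narrow the problem down is to compare the parameter and buffer names present before saving with those found after reloading. The helper below is only a debugging sketch: `diff_keys` is a hypothetical name, not part of transformers, and the assumption that the `KeyError` comes from something missing in the saved checkpoint is not confirmed by the issue.

```python
from typing import Iterable


def diff_keys(before: Iterable[str], after: Iterable[str]) -> dict[str, list[str]]:
    """Report names that exist on only one side of a save/load round trip."""
    before_set, after_set = set(before), set(after)
    return {
        # Present in the freshly quantized model but absent after reloading.
        "lost_on_reload": sorted(before_set - after_set),
        # Absent in the freshly quantized model but present after reloading.
        "new_on_reload": sorted(after_set - before_set),
    }


# Hypothetical usage with the script above (not run here, needs the model):
#
#   in_memory_keys = model.state_dict().keys()      # right after quantizing
#
#   import glob
#   from safetensors import safe_open
#   on_disk_keys = []
#   for path in glob.glob(save_dst + "*.safetensors"):
#       with safe_open(path, framework="pt") as f:
#           on_disk_keys.extend(f.keys())
#
#   print(diff_keys(in_memory_keys, on_disk_keys))
```

If `lost_on_reload` is non-empty, the listed names are a reasonable starting point when reading the traceback. If both lists are empty, the problem more likely lies in how the saved quantization config is interpreted at load time.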

Expected behavior

Expect a chat message to be printed. Example: {'role': 'assistant', 'content': "Understood. I'm ready when you are! How can I help you with this test?"}
