transformers - 💡(How to fix) Fix Model quantized via sinq broken after save_pretrained and from_pretrained [3 comments, 3 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
huggingface/transformers#46050Fetched 2026-05-20 03:39:22
View on GitHub
Comments
3
Participants
3
Timeline
5
Reactions
0
Timeline (top)
commented ×3cross-referenced ×1labeled ×1

Code Example

from transformers import AutoProcessor, AutoModelForCausalLM, SinqConfig

def quantize_model(model_id: str, save_dst: str):
    # Quantize and save resulting model
    quant_cfg = SinqConfig(
        nbits=8,
        group_size=64,
        tiling_mode='2D',
        method='sinq',
        modules_to_not_convert=["lm_head", "model.audio_tower"],
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map = 'cpu',
        quantization_config=quant_cfg,
    )

    model.save_pretrained(save_dst)



def test_quantized(model_id: str, save_dst: str):
    # Loading quantized model and appropriate processor 
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(save_dst, device_map='cpu')

    # Testing quantized model
    chat = []
    chat.append({ 'role': 'user', 'content': 'this is a test.' })

    model_input = processor.apply_chat_template(chat, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device)
    skip_length = model_input['input_ids'].shape[-1]

    new_tokens = model.generate(**model_input, max_new_tokens=2048)[0, skip_length:]
    
    message = processor.parse_response(processor.decode(new_tokens))

    print(message)


model_id = 'google/gemma-4-E4B-it'
save_dst = './gemma-4-E4B-it-sinq/'

quantize_model(model_id, save_dst)
test_quantized(model_id, save_dst)
RAW_BUFFERClick to expand / collapse

System Info

  • transformers version: 5.8.1
  • Platform: Linux-7.0.5-arch1-1-x86_64-with-glibc2.43
  • Python version: 3.12.13
  • Huggingface_hub version: 1.14.0
  • Safetensors version: 0.7.0
  • Accelerate version: 1.13.0
  • Accelerate config: not found
  • DeepSpeed version: not installed
  • PyTorch version (accelerator?): 2.11.0+cu130 (CUDA)
  • Using distributed or parallel set-up in script?: <fill in>
  • Using GPU in script?: <fill in>
  • GPU type: NVIDIA GeForce GTX 1650

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The following code segment fail with KeyError: 'packing'

If the model is not saved and loaded (via save_pretrained and from_pretrained) and is used right away after the first from_pretrained, the model behaves as expected.

NOTE: I have only tried this with google/gemma-4-E4B-it so this may be an issue unique to the model instead of sinq.

from transformers import AutoProcessor, AutoModelForCausalLM, SinqConfig

def quantize_model(model_id: str, save_dst: str):
    # Quantize and save resulting model
    quant_cfg = SinqConfig(
        nbits=8,
        group_size=64,
        tiling_mode='2D',
        method='sinq',
        modules_to_not_convert=["lm_head", "model.audio_tower"],
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map = 'cpu',
        quantization_config=quant_cfg,
    )

    model.save_pretrained(save_dst)



def test_quantized(model_id: str, save_dst: str):
    # Loading quantized model and appropriate processor 
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(save_dst, device_map='cpu')

    # Testing quantized model
    chat = []
    chat.append({ 'role': 'user', 'content': 'this is a test.' })

    model_input = processor.apply_chat_template(chat, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device)
    skip_length = model_input['input_ids'].shape[-1]

    new_tokens = model.generate(**model_input, max_new_tokens=2048)[0, skip_length:]
    
    message = processor.parse_response(processor.decode(new_tokens))

    print(message)


model_id = 'google/gemma-4-E4B-it'
save_dst = './gemma-4-E4B-it-sinq/'

quantize_model(model_id, save_dst)
test_quantized(model_id, save_dst)

Expected behavior

Expect a chat message to be printed. Example: {'role': 'assistant', 'content': "Understood. I'm ready when you are! How can I help you with this test?"}

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

FAQ

Expected behavior

Expect a chat message to be printed. Example: {'role': 'assistant', 'content': "Understood. I'm ready when you are! How can I help you with this test?"}

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING