vllm - ✅(Solved) Fix [Bug]: Gemma 4: Unsloth LoRA adapters are ignored during inference despite successful loading [1 pull requests, 3 comments, 3 participants]

GitHub stats
vllm-project/vllm#41754 · Fetched 2026-05-06 06:15:04
Comments: 3 · Participants: 3 · Timeline: 5 · Reactions: 0
Timeline (top): commented ×3, cross-referenced ×1, labeled ×1

PR fix notes

PR #39816: [Bugfix]: gemma4 fix lora online serving

Description (problem / solution / changelog)

Fixes: #39815

Purpose

This PR fixes Gemma4 LoRA online serving when aliased module paths point to the same physical module.

Gemma4’s fast-prefill / YOCO split added self_decoder and cross_decoder wrappers. Those wrappers reference existing decoder layers, so the same module can show up under multiple paths:

model.layers.0...
model.self_decoder.decoder_layers.0...

Before this PR, LoRA activation could load weights through one path and then reset the same wrapper through the alias path:

set_lora(...)   for model.layers.0...
reset_lora(...) for model.self_decoder.decoder_layers.0...

Since both names point to the same live wrapper, the reset could wipe out the weights that were just loaded. This made the adapter behave like the base model.
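
The failure mode can be reproduced without vLLM. The following is a minimal pure-Python sketch, not vLLM's code: `ToyLoRAWrapper`, `registry`, and `activate_naive` are hypothetical names, the module paths are shortened toy keys, and it assumes (the PR does not spell this out) that a registered name with no matching adapter weights gets a reset.

```python
class ToyLoRAWrapper:
    """Hypothetical stand-in for a LoRA-wrapped layer (not a vLLM class)."""

    def __init__(self):
        self.lora_weights = None

    def set_lora(self, weights):
        self.lora_weights = weights

    def reset_lora(self):
        self.lora_weights = None


def activate_naive(registry, adapter_weights):
    """Visit every registered name; reset names the adapter has no weights for."""
    for name, module in registry.items():
        if name in adapter_weights:
            module.set_lora(adapter_weights[name])
        else:
            module.reset_lora()


wrapper = ToyLoRAWrapper()
# Two names, one physical object, mirroring the aliasing described above.
registry = {
    "model.layers.0": wrapper,
    "model.self_decoder.decoder_layers.0": wrapper,
}
adapter_weights = {"model.layers.0": [0.1, 0.2]}

activate_naive(registry, adapter_weights)
print(wrapper.lora_weights)  # None: the reset via the alias wiped the loaded weights
```

Nothing raises and nothing is logged in this toy, which matches the reported symptom: loading appears to succeed, yet the layer ends up in its base-model state.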

Fix

Keep remove_duplicate=False during module discovery so all valid LoRA paths stay registered:

self.model.named_modules(remove_duplicate=False)

Then dedupe only during activation by physical module identity. This keeps the module names LoRA needs for adapter matching, while making sure set_lora() / reset_lora() only run once per actual wrapper.
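
One way to picture "dedupe by physical module identity" is to group registered names by `id()` and touch each object once. This is a sketch under assumptions, not the six lines the PR adds to `vllm/lora/model_manager.py`: `ToyLoRAWrapper`, `activate_deduped`, and the grouping strategy are all hypothetical.

```python
class ToyLoRAWrapper:
    """Hypothetical stand-in for a LoRA-wrapped layer (not a vLLM class)."""

    def __init__(self):
        self.lora_weights = None
        self.calls = []  # records which methods ran, for inspection

    def set_lora(self, weights):
        self.lora_weights = weights
        self.calls.append("set")

    def reset_lora(self):
        self.lora_weights = None
        self.calls.append("reset")


def activate_deduped(registry, adapter_weights):
    """Group names by physical object, then touch each object exactly once."""
    groups = {}  # id(module) -> (module, [names that point at it])
    for name, module in registry.items():
        groups.setdefault(id(module), (module, []))[1].append(name)

    for module, names in groups.values():
        # Use the first registered name that the adapter has weights for.
        match = next((n for n in names if n in adapter_weights), None)
        if match is not None:
            module.set_lora(adapter_weights[match])
        else:
            module.reset_lora()


wrapper = ToyLoRAWrapper()
other = ToyLoRAWrapper()
registry = {
    "model.layers.0": wrapper,
    "model.self_decoder.decoder_layers.0": wrapper,  # alias of the same object
    "model.layers.1": other,  # the adapter has no weights for this one
}
activate_deduped(registry, {"model.layers.0": [0.1, 0.2]})
print(wrapper.lora_weights, wrapper.calls)  # [0.1, 0.2] ['set']
print(other.lora_weights, other.calls)      # None ['reset']
```

All three names stay in the registry, so adapter matching can still use either path; only the side effects are limited to one call per object.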

This matters for Gemma4ForConditionalGeneration because adapter paths can map to prefixed vLLM paths like:

language_model.model.layers...
language_model.lm_head...

Deduping during _create_lora_modules() can drop names that are still valid for LoRA, so the dedupe needs to happen at activation time instead.
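
To see why deduping at discovery time can lose a valid name, here is a toy traversal that imitates the `remove_duplicate` behaviour of PyTorch's `named_modules()`. `ToyModule` is a hypothetical stand-in, and the child registration order (wrapper first) is an assumption chosen to make the effect visible, not a claim about Gemma4's actual layout.

```python
class ToyModule:
    """Hypothetical minimal module tree (not torch.nn.Module)."""

    def __init__(self, **children):
        self.children = children

    def named_modules(self, prefix="", remove_duplicate=True, _memo=None):
        if _memo is None:
            _memo = set()
        if remove_duplicate:
            if id(self) in _memo:
                return  # already yielded under another name
            _memo.add(id(self))
        yield prefix, self
        for name, child in self.children.items():
            child_prefix = f"{prefix}.{name}" if prefix else name
            yield from child.named_modules(child_prefix, remove_duplicate, _memo)


layer0 = ToyModule()
model = ToyModule(
    # Assumed order: the wrapper is registered before the canonical list.
    self_decoder=ToyModule(decoder_layers=ToyModule(**{"0": layer0})),
    layers=ToyModule(**{"0": layer0}),
)

deduped = [name for name, _ in model.named_modules()]
all_names = [name for name, _ in model.named_modules(remove_duplicate=False)]

print("layers.0" in deduped)    # False: only the alias name survived
print("layers.0" in all_names)  # True: both names are kept
```

In this toy, an adapter keyed on `layers.0` would find nothing to attach to after a deduped discovery pass, even though the layer exists. Keeping every name and deduping later avoids depending on traversal order.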

Test Plan

.venv/bin/python -m pytest tests/lora/test_model_manager_aliasing.py -q

Test Result

PASSED

The regression test checks that:

  • the toy model exposes both canonical and aliased module paths
  • both paths resolve to the same physical LoRA wrapper
  • activation keeps nonzero LoRA weights loaded instead of zeroing them through the alias path

This bug surfaced after aliasing was introduced in the Gemma4 implementation in PR #38879.

Changed files

  • tests/lora/test_model_manager_aliasing.py (added, +181/-0)
  • vllm/lora/model_manager.py (modified, +6/-0)

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Your output of `python collect_env.py` here
</details>

🐛 Describe the bug

Issue: Unsloth-trained Gemma 4 Adapters Ineffective in vLLM

While the bitsandbytes compatibility for Gemma 4 has been addressed (see PR #40321), LoRA adapters trained via Unsloth (e.g., for the 31B model) currently fail to apply correctly in vLLM.

Observations:

  • Tested Hardware: RTX 4090 and A6000 Pro.
  • Behavior: The engine loads without crashing, but the adapter is effectively ignored and has no impact on model output.
  • Attempted Fixes: Manual re-mapping of layers was attempted but did not resolve the issue, suggesting a deeper incompatibility in how the adapter weights are being addressed or integrated.

Unsloth:

    model.save_pretrained("model_temp")
    tokenizer.save_pretrained("model_temp")

vLLM:

    outputs = llm.generate(
        prompts=conversations,
        sampling_params=sampling_params,
        lora_request=LoRARequest("adapter", 1, lora_path)
    )

Adapter loading works fine for Gemma 3 / Mistral / Qwen. I'm pretty sure it's a vLLM bug, since the same adapter works fine with FastModel from Unsloth.

extent analysis

TL;DR

Unsloth-trained Gemma 4 LoRA adapters load in vLLM without error but do not change the model output. PR #39816, listed on this page as the fix, attributes this to aliased module paths: activation sets LoRA weights through one path and then resets the same wrapper through its alias.

Guidance

  • Verify the adapter loading process in vLLM by checking the lora_path and ensuring it points to the correct location of the saved adapter weights.
  • Compare the adapter integration code in vLLM with the working implementation in FastModel from Unsloth to identify potential differences or incompatibilities.
  • Test the adapter with a different model or configuration to isolate the issue and determine if it's specific to the 31B model or Gemma 4.
  • Review the PR #40321 for bitsandbytes compatibility to see if there are any relevant changes or insights that could be applied to the LoRA adapters.
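
The first check above can be scripted. This is a rough sketch that assumes the adapter directory follows the usual PEFT layout (`adapter_config.json` plus `adapter_model.safetensors` or `adapter_model.bin`); the issue does not confirm which files Unsloth wrote, and `describe_adapter` is a hypothetical helper name.

```python
import json
from pathlib import Path


def describe_adapter(lora_path):
    """Return basic facts about a PEFT-style adapter directory (assumed layout)."""
    root = Path(lora_path)
    config_file = root / "adapter_config.json"
    # Collect adapter weight files in either common serialization format.
    weight_files = sorted(
        p.name
        for p in root.glob("adapter_model.*")
        if p.suffix in {".safetensors", ".bin"}
    )
    config = json.loads(config_file.read_text()) if config_file.is_file() else {}
    return {
        "has_config": config_file.is_file(),
        "weight_files": weight_files,
        "target_modules": config.get("target_modules"),
        "base_model": config.get("base_model_name_or_path"),
    }
```

Calling `describe_adapter("model_temp")` on the directory from the issue would show whether the config and weights are present and which module names the adapter targets. A complete directory does not rule out the aliasing bug; it only removes a wrong `lora_path` as a cause.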

Example

The issue does not include enough implementation detail for a specific code example.

Notes

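The reporter states that adapter loading works for Gemma 3, Mistral, and Qwen, which points to Gemma 4 handling in vLLM rather than to the adapter itself. The issue does not include a `collect_env.py` output, so the affected vLLM version is not known from this page.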

Recommendation

Check whether your vLLM build includes PR #39816, which this page lists as the fix. The problem appears specific to vLLM: the reporter does not see it with FastModel from Unsloth.
