vllm - ✅(Solved) Fix [Feature]: Add LoRA support for Qwen3ASRForConditionalGeneration [1 pull requests, 1 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
vllm-project/vllm#37223Fetched 2026-04-08 00:48:39
View on GitHub
Comments
1
Participants
2
Timeline
10
Reactions
0
Participants
Assignees
Timeline (top)
cross-referenced ×3referenced ×3labeled ×2assigned ×1

Error Message

File ".../vllm/v1/worker/gpu_model_runner.py", line 4301, in load_model self.model = self.load_lora_model( File ".../vllm/v1/worker/lora_model_runner_mixin.py", line 38, in load_lora_model raise ValueError(f"{model.class.name} does not support LoRA yet.") ValueError: Qwen3ASRForConditionalGeneration does not support LoRA yet.

Fix Action

Fixed

PR fix notes

PR #37247: [Model] Implement LoRA support for Qwen3ASRForConditionalGeneration

Description (problem / solution / changelog)

Purpose

This PR adds LoRA support for Qwen3ASRForConditionalGeneration model.

For this to work for the audio tower, I had to make a few additional changes:

  • Implement get_num_mm_encoder_tokens()
  • Replaced some nn.Linears with ReplicatedLinear along the audio tower path.
  • Qwen3ASR seems to be our first model with a tower, but no connector. In gpu_model_runner.py, I found that the hasattr(self.model, "get_num_mm_connector_tokens") was improperly evaluating to True due to inheritance, despite the model not implementing get_num_mm_connector_tokens(). This was leading us to incorrectly go down that path and encounter an error. I've modified the condition to check if connector actually exists in the mapping.

Fixes #37223

Test Plan

I tested on the following public adaptor available on HuggingFace: ha0yuan/Qwen3-ASR-LoRa-ChineseAviation-Tiny.

vllm serve Qwen/Qwen3-ASR-1.7B \
  --enable-lora \
  --enable-tower-connector-lora \
  --lora-modules aviation=ha0yuan/Qwen3-ASR-LoRa-ChineseAviation-Tiny \
  --port 8000

I also double-checked that the adapters are properly shown when querying the /v1/models endpoint:

curl localhost:8000/v1/models | jq .
{
  "object": "list",
  "data": [
    {
      "id": "Qwen/Qwen3-ASR-1.7B",
      "object": "model",
      ...
      "root": "Qwen/Qwen3-ASR-1.7B",
      ...
    },
    {
      "id": "medpl",
       ...
      "root": "AleksanderObuchowski/Qwen3-ASR-1.7B-med-pl-lora-decoder-only",
      ...
    }
}

Then I used a Python script to load in a .wav file and query the /v1/chat/completions endpoint. Specifically, I used this audio file as input.

Test Result

Before this PR, the server would error with ValueError: Qwen3ASRForConditionalGeneration does not support LoRA yet.

After this change, the server starts up properly, and I successfully queried the /v1/chat/completions endpoint.

Querying both the raw model and the adaptor, I verified that the output differs when the LoRa adaptor is enabled. The outputs are below. (Notice, the raw model transcribes to numbers (e.g 9 and 10), while the adaptor transcribes the numbers to words ("nine" and "ten).

== no LoRA ==
language English<asr_text>November the 10th, Wednesday, 9 p.m. I'm standing in a dark alley. After waiting several hours, the time has come. A woman with long dark hair approaches. I have to act, and fast, before she realizes what has happened. I must find out.

== with LoRA ==
language English<asr_text>November the tenth, Wednesday, nine p.m. I'm standing in a dark alley. After waiting several hours, the time has come. A woman with long dark hair approaches. I have to act, and fast, before she realizes what has happened. I must find out.

<details> <summary> Essential Elements of an Effective PR Description Checklist </summary>
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.
</details>

Changed files

  • docs/models/supported_models.md (modified, +1/-1)
  • vllm/model_executor/models/qwen3_asr.py (modified, +26/-0)
  • vllm/model_executor/models/qwen3_omni_moe_thinker.py (modified, +22/-3)
  • vllm/v1/worker/gpu_model_runner.py (modified, +14/-1)

Code Example

vllm serve Qwen/Qwen3-ASR-0.6B \
    --enable-lora \
    --lora-modules my-adapter=./my-lora-adapter/ \
    --max-lora-rank 256 \
    --gpu-memory-utilization 0.5

---

File ".../vllm/v1/worker/gpu_model_runner.py", line 4301, in load_model
    self.model = self.load_lora_model(
  File ".../vllm/v1/worker/lora_model_runner_mixin.py", line 38, in load_lora_model
    raise ValueError(f"{model.__class__.__name__} does not support LoRA yet.")
ValueError: Qwen3ASRForConditionalGeneration does not support LoRA yet.
RAW_BUFFERClick to expand / collapse

🚀 The feature, motivation and pitch

When attempting to serve Qwen/Qwen3-ASR-0.6B with LoRA enabled using the following command:

vllm serve Qwen/Qwen3-ASR-0.6B \
    --enable-lora \
    --lora-modules my-adapter=./my-lora-adapter/ \
    --max-lora-rank 256 \
    --gpu-memory-utilization 0.5

The server fails at model load time with the following error:

File ".../vllm/v1/worker/gpu_model_runner.py", line 4301, in load_model
    self.model = self.load_lora_model(
  File ".../vllm/v1/worker/lora_model_runner_mixin.py", line 38, in load_lora_model
    raise ValueError(f"{model.__class__.__name__} does not support LoRA yet.")
ValueError: Qwen3ASRForConditionalGeneration does not support LoRA yet.

It would be great to get LoRA support for Qwen3ASRForConditionalGeneration in vllm/model_executor/models/ (similar to how other conditional generation models have been wired up), so that --enable-lora works end-to-end for this architecture.

Alternatives

Additional context

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

To add LoRA support for Qwen3ASRForConditionalGeneration, follow these steps:

  • Modify the Qwen3ASRForConditionalGeneration class in vllm/model_executor/models/ to support LoRA.
  • Add a supports_lora method to the class and set it to True.
  • Implement the load_lora_model method to load the LoRA model.

Example code:

class Qwen3ASRForConditionalGeneration(Model):
    # ... existing code ...

    def supports_lora(self):
        return True

    def load_lora_model(self, lora_modules):
        # Load the LoRA model using the provided lora_modules
        # ... implementation details ...
        pass
  • Register the Qwen3ASRForConditionalGeneration class in the lora_model_runner_mixin.py file to recognize it as a LoRA-supported model.

Verification

To verify the fix, run the original command with LoRA enabled:

vllm serve Qwen/Qwen3-ASR-0.6B \
    --enable-lora \
    --lora-modules my-adapter=./my-lora-adapter/ \
    --max-lora-rank 256 \
    --gpu-memory-utilization

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING