vllm - [Bug]: LoRA on AWQ-quantized Llama-3.1-8B / Llama-3.2-3B produces degenerate output (same LoRA infra works on AWQ Mistral and on FP8 / BF16 Llama)

vllm-project/vllm#42488 (fetched 2026-05-14 03:29:46)
Your current environment

Environment

vLLM: 0.20.0
torch: 2.11.0+cu130
transformers: 4.57.6
Python: 3.12.9
GPU: NVIDIA RTX PRO 6000 Blackwell Server Edition (sm_120), 97 GB
CUDA: 13.0
OS: Debian Linux 6.1

🐛 Describe the bug

Summary

When serving an AWQ-quantized Llama-3 base model with a LoRA adapter, output is degenerate:

  • ~45% of requests have neither expected label token in top_logprobs=20 — the output distribution is shifted off the target subset entirely.
  • The remaining ~55% produce a near-constant prediction (recall ≈ 1.0, precision ≈ 0.08 on a binary toxicity classification, or the inverse: recall ≈ 0, precision ≈ 0 depending on which Llama variant).
  • AUROC ≈ 0.58, vs ≈ 0.98 on FP8 / BF16 Llama with the same LoRA.

The same Punica + LoRA wrapper code paths handle Mistral-7B-Instruct-v0.3 AWQ + LoRA correctly (AUROC 0.97), and the Llama AWQ base without LoRA produces sensible zero-shot output. So the failure is specific to the intersection of Llama-family architecture × AWQ-quantized base × LoRA.

Reproducer

# Broken: Llama-3.1-8B AWQ + toxicity LoRA
python -m vllm.entrypoints.openai.api_server \
  --model hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --enable-lora \
  --lora-modules toxicity=rungalileo/llama-3.1-8b-toxicity-finetuned-lora-weights \
  --max-model-len 8192 --enforce-eager

# Working with same LoRA infra (Mistral)
python -m vllm.entrypoints.openai.api_server \
  --model solidrust/Mistral-7B-Instruct-v0.3-AWQ \
  --enable-lora \
  --lora-modules toxicity=rungalileo/mistral7B-toxicity-lora-weights \
  --max-model-len 8192 --enforce-eager

Then send a chat-completions request with the standard binary toxicity prompt and max_tokens=1, logprobs=true, top_logprobs=20, temperature=0. Compute P(toxic) as Σ exp(logprob) over the returned tokens that normalize to "true", divided by Σ exp(logprob) over the returned tokens that normalize to either "true" or "false".
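
The scoring step can be sketched as below. This is a minimal illustration, not the reporter's evaluation script: the normalization rule (strip whitespace and common tokenizer prefix markers, then lowercase) and the choice to return NaN when neither label appears are assumptions. The input is the list found at choices[0].logprobs.content[0].top_logprobs in an OpenAI-compatible chat-completions response, converted to plain dicts.

```python
import math

def normalize_label(token: str) -> str:
    # Assumed normalization: drop whitespace and common tokenizer
    # prefix markers, then lowercase, so " True" and "true" match.
    return token.strip(" \t\nĠ▁").lower()

def p_toxic(top_logprobs):
    """Return P(toxic) from one position's top_logprobs entries.

    top_logprobs: list of {"token": str, "logprob": float}.
    Returns NaN when neither label is among the returned tokens
    (an assumed convention for this sketch).
    """
    mass = {"true": 0.0, "false": 0.0}
    for entry in top_logprobs:
        label = normalize_label(entry["token"])
        if label in mass:
            mass[label] += math.exp(entry["logprob"])
    total = mass["true"] + mass["false"]
    if total == 0.0:
        return float("nan")
    return mass["true"] / total
```

Renormalizing over only the two label tokens makes the score insensitive to probability mass on unrelated tokens, which is also why a request with neither label in the top 20 has no defined score.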

Note: both LoRA repos and the evaluation dataset (rungalileo/automated-ft-luna-toxicity) are gated on the rungalileo HF org. A maintainer reproducing without that access can swap in any other Llama-3.1-8B PEFT-format LoRA — the failure mode is independent of the specific LoRA file (verified across two LoRA target sets — see "ruled out" table below).

Failure scope

Dataset: 467 binary-labeled toxicity samples. Inference: chat/completions with max_tokens=1, logprobs=true, top_logprobs=20, temperature=0.

| Base | Quant | LoRA | AUROC | Recall | Precision | N_NaN/467 | Result |
|---|---|---|---|---|---|---|---|
| Llama-3.1-8B-Instruct (meta-llama, BF16) | none | toxicity | 0.975 | 0.82 | 0.91 | 0 | ✅ baseline |
| Llama-3.1-8B-Instruct (RedHatAI FP8) | FP8 W8A8 | toxicity | 0.987 | 0.92 | 0.90 | 0 | |
| Llama-3.1-8B-Instruct (nvidia NVFP4) | NVFP4 + FP8 KV | toxicity | 0.988 | 0.92 | 0.91 | 0 | |
| Llama-3.1-8B-Instruct (hugging-quants AWQ) | AWQ W4A16 | toxicity | 0.58 | 1.00 | 0.08 | 259 | |
| Llama-3.1-8B-Instruct (hugging-quants AWQ) | AWQ W4A16 | none | n/a | n/a | n/a | 0 | ✅ zero-shot |
| Llama-3.2-3B-Instruct (RedHatAI FP8) | FP8 W8A8 | toxicity | 0.982 | 0.88 | 0.87 | 0 | |
| Llama-3.2-3B-Instruct (AMead10 AWQ) | AWQ W4A16 | toxicity | nan | 0.00 | 0.00 | varies | ❌ (inverse direction) |
| Mistral-7B-Instruct-v0.3 (RedHatAI FP8) | FP8 W8A8 | toxicity (Mistral) | 0.96 | 0.83 | 0.90 | 0 | |
| Mistral-7B-Instruct-v0.3 (solidrust AWQ) | AWQ W4A16 | toxicity (Mistral) | 0.97 | 0.79 | 0.94 | 0 | |
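
For readers who want to recompute the columns above from per-sample scores, here is a rough sketch. It is not the reporter's script, and several choices are assumptions: a 0.5 decision threshold, NaN scores counted as N_NaN and excluded from the other metrics, and AUROC computed as the pairwise ranking probability with ties counted as one half.

```python
import math

def evaluate(scores, labels, threshold=0.5):
    """Compute N_NaN, recall, precision, and AUROC for binary labels (1 = toxic)."""
    n_nan = sum(1 for s in scores if math.isnan(s))
    # Assumed convention: samples without a defined score are left out.
    pairs = [(s, y) for s, y in zip(scores, labels) if not math.isnan(s)]

    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    recall = tp / (tp + fn) if tp + fn else float("nan")
    precision = tp / (tp + fp) if tp + fp else float("nan")

    # AUROC as P(score of a positive > score of a negative), ties = 0.5.
    pos = [s for s, y in pairs if y == 1]
    neg = [s for s, y in pairs if y == 0]
    if pos and neg:
        wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
        auroc = wins / (len(pos) * len(neg))
    else:
        auroc = float("nan")

    return {"n_nan": n_nan, "recall": recall, "precision": precision, "auroc": auroc}
```

The pairwise AUROC is quadratic in the number of samples, which is acceptable for a 467-sample set.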

Variables independently ruled out

Each of these was tested in isolation on the broken (Llama-3.1-8B AWQ + LoRA) configuration; all reproduced the same degenerate behavior:

| Variable | Values tested | All same? |
|---|---|---|
| --kv-cache-dtype | auto, fp8, fp8_per_token_head, int8_per_token_head | Yes, all degenerate |
| AWQ kernel | awq_marlin (the default auto-conversion) and --quantization awq (forces non-marlin; log confirms forcing awq) | Yes |
| --enforce-eager (disables CUDA Graph + torch.compile) | on / off | Yes |
| Llama AWQ checkpoint provider | hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 (509K HF downloads), AMead10/Llama-3.2-3B-Instruct-AWQ (3.5K) | Both broken (one biased to all-positive, the other to all-negative) |
| Llama base size | 3B and 8B | Both broken |
| Equivalent setup on Mistral-7B-v0.3 AWQ | solidrust/Mistral-7B-Instruct-v0.3-AWQ | Works correctly |
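
Anyone re-running this sweep could generate the server command lines programmatically. The sketch below enumerates the full cross product of the three server-side variables from the table (16 combinations), which is a superset of what was actually run for this report. The flag names and values are taken from the issue text; the helper name build_commands is made up for illustration.

```python
from itertools import product

BASE_CMD = [
    "python", "-m", "vllm.entrypoints.openai.api_server",
    "--model", "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4",
    "--enable-lora",
    "--lora-modules",
    "toxicity=rungalileo/llama-3.1-8b-toxicity-finetuned-lora-weights",
    "--max-model-len", "8192",
]

KV_CACHE_DTYPES = ["auto", "fp8", "fp8_per_token_head", "int8_per_token_head"]
QUANTIZATION = [None, "awq"]  # None leaves the default awq_marlin auto-conversion
ENFORCE_EAGER = [True, False]

def build_commands():
    """Return one argv list per (kv dtype, AWQ kernel, eager) combination."""
    commands = []
    for kv, quant, eager in product(KV_CACHE_DTYPES, QUANTIZATION, ENFORCE_EAGER):
        cmd = BASE_CMD + ["--kv-cache-dtype", kv]
        if quant is not None:
            cmd += ["--quantization", quant]
        if eager:
            cmd.append("--enforce-eager")
        commands.append(cmd)
    return commands

if __name__ == "__main__":
    for cmd in build_commands():
        print(" ".join(cmd))
```

Each command starts a server that must be queried and shut down before the next one is launched; that orchestration is left out here.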

What the LoRA looks like

All three toxicity LoRAs tested have byte-identical PEFT config:

{
  "r": 16,
  "lora_alpha": 16,
  "bias": "none",
  "peft_type": "LORA",
  "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
}

MistralForCausalLM inherits LlamaForCausalLM in vllm/model_executor/models/mistral.py, so both architectures share the same packed_modules_mapping (qkv_proj: [q,k,v], gate_up_proj: [gate, up]) and the same LoRA wrapper layer types:

  • qkv_proj, gate_up_proj → MergedColumnParallelLinearWithLoRA.apply → _mcp_apply (calls Punica.add_shrink / add_expand directly)
  • o_proj, down_proj → RowParallelLinearWithLoRA._apply_sync → Punica.add_lora_linear
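
To make the mapping concrete, the sketch below groups the PEFT target modules from the config above into the serving-side layers they land in, using the packed_modules_mapping described in this section. It also computes the LoRA scaling as lora_alpha / r, which is the standard PEFT convention and assumes rank-stabilized LoRA is not in use. The helper names are illustrative and not part of vLLM.

```python
PEFT_CONFIG = {
    "r": 16,
    "lora_alpha": 16,
    "bias": "none",
    "peft_type": "LORA",
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                       "gate_proj", "up_proj", "down_proj"],
}

# Packed layers as described in the issue text.
PACKED_MODULES_MAPPING = {
    "qkv_proj": ["q_proj", "k_proj", "v_proj"],
    "gate_up_proj": ["gate_proj", "up_proj"],
}

def lora_scaling(cfg):
    # Standard LoRA scaling factor applied to the B @ A update.
    return cfg["lora_alpha"] / cfg["r"]

def group_targets(cfg):
    """Map each PEFT target module to the serving-side layer that holds it."""
    targets = cfg["target_modules"]
    grouped = {}
    for packed, parts in PACKED_MODULES_MAPPING.items():
        hits = [p for p in parts if p in targets]
        if hits:
            grouped[packed] = hits
    packed_parts = {p for parts in PACKED_MODULES_MAPPING.values() for p in parts}
    for name in targets:
        if name not in packed_parts:
            grouped[name] = [name]  # not packed: one adapter per layer
    return grouped
```

With this config, five of the seven adapted projections go through the two merged layers, so a fault confined to the merged path would touch most of the adapter.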

What this implicates

Given the failure is independent of AWQ kernel variant, KV cache, and CUDA Graph, and the same code paths handle Mistral AWQ + LoRA correctly, the bug appears specific to how AWQ-quantized Llama weights interact with the merged-column-parallel LoRA path (_mcp_apply → add_shrink/add_expand). One plausible mechanism: the AWQ-Marlin (or non-Marlin AWQ) process_weights_after_loading weight repack (_convert_awq_to_standard_format on qweight/qzeros) leaves the merged qkv_proj/gate_up_proj weight in a layout that doesn't compose correctly with the LoRA expand step's output_slices accounting. This is a hypothesis only; we did not have time to fully isolate the line.
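
As a toy illustration of what "output_slices accounting" means here, the pure-Python sketch below applies a per-slice LoRA update (shrink to rank r, then expand into that slice's rows) to a merged projection. It is not vLLM's Punica kernel code and does not confirm the hypothesis; the offsets override is a hypothetical knob added only to show how a slice-offset mismatch would write correct values into the wrong output rows.

```python
def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def merged_lora_delta(x, lora_a, lora_b, output_slices, scaling=1.0, offsets=None):
    """LoRA delta for a merged projection such as qkv_proj.

    lora_a[i]: r x in_features matrix for slice i (shrink).
    lora_b[i]: output_slices[i] x r matrix for slice i (expand).
    offsets:   start row of each slice in the merged output; defaults to
               the running sum of output_slices (the correct layout).
    """
    if offsets is None:
        offsets, start = [], 0
        for size in output_slices:
            offsets.append(start)
            start += size
    delta = [0.0] * sum(output_slices)
    for a, b, size, off in zip(lora_a, lora_b, output_slices, offsets):
        hidden = matvec(a, x)        # shrink: in_features -> r
        update = matvec(b, hidden)   # expand: r -> slice size
        for j in range(size):
            delta[off + j] += scaling * update[j]
    return delta

# Toy sizes: in_features=2, r=1, slices standing in for q, k, v.
x = [1.0, 2.0]
lora_a = [[[1, 0]], [[0, 1]], [[1, 1]]]
lora_b = [[[1], [1]], [[1]], [[1]]]
slices = [2, 1, 1]

correct = merged_lora_delta(x, lora_a, lora_b, slices)
shifted = merged_lora_delta(x, lora_a, lora_b, slices, offsets=[0, 1, 2])
```

In the shifted case every per-slice update is numerically right but lands on neighbouring rows, the kind of error that would leave the base model's zero-shot output intact while corrupting LoRA-adapted output.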

Workaround in production

On Llama bases, use FP8 or NVFP4 instead of AWQ when LoRA is required. FP8 (RedHatAI/...FP8) and NVFP4 (nvidia/...NVFP4) both work correctly with the same LoRA adapters on vLLM 0.20. AWQ on Mistral is unaffected.

Steps already taken

  1. Reproduced across 5 independent configurations on Llama-3.1-8B AWQ + LoRA (3 KV variants × 2 AWQ kernel variants × eager toggle) — all consistently degenerate
  2. Confirmed Llama AWQ base alone (no LoRA) produces correct zero-shot output
  3. Diffed vllm/model_executor/models/llama.py vs mistral.py (Mistral inherits LlamaForCausalLM)
  4. Confirmed identical PEFT target modules across all three LoRA adapters
  5. Searched existing issues; the closest are #10798 (Llama LoRA mismatch fixed in #6909, BF16-only), #41754 (Gemma 4 + Unsloth LoRA silently ignored, open), #21471 (LoRA + TP corrupted) — none cover this case

cc anyone touching vllm/model_executor/layers/quantization/awq_marlin.py, vllm/model_executor/layers/quantization/awq.py, or vllm/lora/punica_wrapper/punica_gpu.py since 0.19.
