vllm - 💡(How to fix) Fix [Bug]: LoRA on AWQ-quantized Llama-3.1-8B / Llama-3.2-3B produces degenerate output (same LoRA infra works on AWQ Mistral and on FP8 / BF16 Llama) [1 participants]

vllm2026-05-13 06:07:55

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

vllm-project/vllm#42488•Fetched 2026-05-14 03:29:46

View on GitHub

Comments

Participants

Timeline

Reactions

Author

langzhao-netizen

Participants

langzhao-netizen

Timeline (top)

labeled ×1subscribed ×1

When serving an AWQ-quantized Llama-3 base model with a LoRA adapter, output is degenerate:

~45% of requests have neither expected label token in top_logprobs=20 — the output distribution is shifted off the target subset entirely.
The remaining ~55% produce a near-constant prediction (recall ≈ 1.0, precision ≈ 0.08 on a binary toxicity classification, or the inverse: recall ≈ 0, precision ≈ 0 depending on which Llama variant).
AUROC ≈ 0.58, vs ≈ 0.98 on FP8 / BF16 Llama with the same LoRA.

The same Punica + LoRA wrapper code paths handle Mistral-7B-Instruct-v0.3 AWQ + LoRA correctly (AUROC 0.97), and the Llama AWQ base without LoRA produces perfectly sensible zero-shot output. So the failure is specifically at the intersection Llama-family architecture × AWQ-quantized base × LoRA.

Root Cause

When serving an AWQ-quantized Llama-3 base model with a LoRA adapter, output is degenerate:

~45% of requests have neither expected label token in top_logprobs=20 — the output distribution is shifted off the target subset entirely.
The remaining ~55% produce a near-constant prediction (recall ≈ 1.0, precision ≈ 0.08 on a binary toxicity classification, or the inverse: recall ≈ 0, precision ≈ 0 depending on which Llama variant).
AUROC ≈ 0.58, vs ≈ 0.98 on FP8 / BF16 Llama with the same LoRA.

Fix Action

Fix / Workaround

Workaround in production

Code Example

vLLM: 0.20.0
torch: 2.11.0+cu130
transformers: 4.57.6
Python: 3.12.9
GPU: NVIDIA RTX PRO 6000 Blackwell Server Edition (sm_120), 97 GB
CUDA: 13.0
OS: Debian Linux 6.1

---

# Broken: Llama-3.1-8B AWQ + toxicity LoRA
python -m vllm.entrypoints.openai.api_server \
  --model hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --enable-lora \
  --lora-modules toxicity=rungalileo/llama-3.1-8b-toxicity-finetuned-lora-weights \
  --max-model-len 8192 --enforce-eager

# Working with same LoRA infra (Mistral)
python -m vllm.entrypoints.openai.api_server \
  --model solidrust/Mistral-7B-Instruct-v0.3-AWQ \
  --enable-lora \
  --lora-modules toxicity=rungalileo/mistral7B-toxicity-lora-weights \
  --max-model-len 8192 --enforce-eager

---

{
  "r": 16,
  "lora_alpha": 16,
  "bias": "none",
  "peft_type": "LORA",
  "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
}

RAW_BUFFERClick to expand / collapse

Your current environment

Environment

vLLM: 0.20.0
torch: 2.11.0+cu130
transformers: 4.57.6
Python: 3.12.9
GPU: NVIDIA RTX PRO 6000 Blackwell Server Edition (sm_120), 97 GB
CUDA: 13.0
OS: Debian Linux 6.1

🐛 Describe the bug

Summary

When serving an AWQ-quantized Llama-3 base model with a LoRA adapter, output is degenerate:

~45% of requests have neither expected label token in top_logprobs=20 — the output distribution is shifted off the target subset entirely.
The remaining ~55% produce a near-constant prediction (recall ≈ 1.0, precision ≈ 0.08 on a binary toxicity classification, or the inverse: recall ≈ 0, precision ≈ 0 depending on which Llama variant).
AUROC ≈ 0.58, vs ≈ 0.98 on FP8 / BF16 Llama with the same LoRA.

Reproducer

# Broken: Llama-3.1-8B AWQ + toxicity LoRA
python -m vllm.entrypoints.openai.api_server \
  --model hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 \
  --enable-lora \
  --lora-modules toxicity=rungalileo/llama-3.1-8b-toxicity-finetuned-lora-weights \
  --max-model-len 8192 --enforce-eager

# Working with same LoRA infra (Mistral)
python -m vllm.entrypoints.openai.api_server \
  --model solidrust/Mistral-7B-Instruct-v0.3-AWQ \
  --enable-lora \
  --lora-modules toxicity=rungalileo/mistral7B-toxicity-lora-weights \
  --max-model-len 8192 --enforce-eager

Then send a chat-completions request with the standard binary toxicity prompt + max_tokens=1, logprobs=true, top_logprobs=20, temperature=0. Compute P(toxic) = Σ exp(logprob) over tokens normalizing to "true" over Σ over {"true","false"}.

Note: both LoRA repos and the evaluation dataset (rungalileo/automated-ft-luna-toxicity) are gated on the rungalileo HF org. A maintainer reproducing without that access can swap in any other Llama-3.1-8B PEFT-format LoRA — the failure mode is independent of the specific LoRA file (verified across two LoRA target sets — see "ruled out" table below).

Failure scope

Dataset: 467 binary-labeled toxicity samples. Inference: chat/completions with max_tokens=1, logprobs=true, top_logprobs=20, temperature=0.

Base	Quant	LoRA	AUROC	Recall	Precision	N_NaN/467	Result
Llama-3.1-8B-Instruct (meta-llama, BF16)	none	toxicity	0.975	0.82	0.91	0	✅ baseline
Llama-3.1-8B-Instruct (RedHatAI FP8)	FP8 W8A8	toxicity	0.987	0.92	0.90	0	✅
Llama-3.1-8B-Instruct (nvidia NVFP4)	NVFP4 + FP8 KV	toxicity	0.988	0.92	0.91	0	✅
Llama-3.1-8B-Instruct (hugging-quants AWQ)	AWQ W4A16	toxicity	0.58	1.00	0.08	259	❌
Llama-3.1-8B-Instruct (hugging-quants AWQ)	AWQ W4A16	none	n/a	n/a	n/a	0	✅ zero-shot
Llama-3.2-3B-Instruct (RedHatAI FP8)	FP8 W8A8	toxicity	0.982	0.88	0.87	0	✅
Llama-3.2-3B-Instruct (AMead10 AWQ)	AWQ W4A16	toxicity	nan	0.00	0.00	varies	❌ (inverse direction)
Mistral-7B-Instruct-v0.3 (RedHatAI FP8)	FP8 W8A8	toxicity (Mistral)	0.96	0.83	0.90	0	✅
Mistral-7B-Instruct-v0.3 (solidrust AWQ)	AWQ W4A16	toxicity (Mistral)	0.97	0.79	0.94	0	✅

Variables independently ruled out

Each of these was tested in isolation on the broken (Llama-3.1-8B AWQ + LoRA) configuration; all reproduced the same degenerate behavior:

Variable	Values tested	All same?
`--kv-cache-dtype`	`auto`, `fp8`, `fp8_per_token_head`, `int8_per_token_head`	Yes, all degenerate
AWQ kernel	`awq_marlin` (the default auto-conversion) and `--quantization awq` (forces non-marlin; log confirms `forcing awq`)	Yes
`--enforce-eager` (disables CUDA Graph + torch.compile)	on / off	Yes
Llama AWQ checkpoint provider	`hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4` (509K HF downloads), `AMead10/Llama-3.2-3B-Instruct-AWQ` (3.5K)	Both broken (one biased to all-positive, the other to all-negative)
Llama base size	3B and 8B	Both broken
Equivalent setup on Mistral-7B-v0.3 AWQ	`solidrust/Mistral-7B-Instruct-v0.3-AWQ`	Works correctly

What the LoRA looks like

All three toxicity LoRAs tested have byte-identical PEFT config:

{
  "r": 16,
  "lora_alpha": 16,
  "bias": "none",
  "peft_type": "LORA",
  "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]
}

MistralForCausalLM inherits LlamaForCausalLM in vllm/model_executor/models/mistral.py, so both architectures share the same packed_modules_mapping (qkv_proj: [q,k,v], gate_up_proj: [gate, up]) and the same LoRA wrapper layer types:

qkv_proj, gate_up_proj → MergedColumnParallelLinearWithLoRA.apply → _mcp_apply (calls Punica.add_shrink / add_expand directly)
o_proj, down_proj → RowParallelLinearWithLoRA._apply_sync → Punica.add_lora_linear

What this implicates

Given the failure is independent of AWQ kernel variant, KV cache, and CUDA Graph, and the same code paths handle Mistral AWQ + LoRA correctly, the bug appears specific to how AWQ-quantized Llama weights interact with the merged-column-parallel LoRA path (_mcp_apply → add_shrink/add_expand). One plausible mechanism: the AWQ-Marlin (or non-Marlin AWQ) process_weights_after_loading weight repack (_convert_awq_to_standard_format on qweight/qzeros) leaves the merged qkv_proj/gate_up_proj weight in a layout that doesn't compose correctly with the LoRA expand step's output_slices accounting. This is a hypothesis only; we did not have time to fully isolate the line.

Workaround in production

On Llama bases, use FP8 or NVFP4 instead of AWQ when LoRA is required. FP8 (RedHatAI/...FP8) and NVFP4 (nvidia/...NVFP4) both work correctly with the same LoRA adapters on vLLM 0.20. AWQ on Mistral is unaffected.

Steps already taken

Reproduced across 5 independent configurations on Llama-3.1-8B AWQ + LoRA (3 KV variants × 2 AWQ kernel variants × eager toggle) — all consistently degenerate
Confirmed Llama AWQ base alone (no LoRA) produces correct zero-shot output
Diffed vllm/model_executor/models/llama.py vs mistral.py (Mistral inherits LlamaForCausalLM)
Confirmed identical PEFT target modules across all three LoRA adapters
Searched existing issues; the closest are #10798 (Llama LoRA mismatch fixed in #6909, BF16-only), #41754 (Gemma 4 + Unsloth LoRA silently ignored, open), #21471 (LoRA + TP corrupted) — none cover this case

cc anyone touching vllm/model_executor/layers/quantization/awq_marlin.py, vllm/model_executor/layers/quantization/awq.py, or vllm/lora/punica_wrapper/punica_gpu.py since 0.19.

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #device allocation #model download #tokenizer error #prompt formatting

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

vllm - 💡(How to fix) Fix [Bug]: LoRA on AWQ-quantized Llama-3.1-8B / Llama-3.2-3B produces degenerate output (same LoRA infra works on AWQ Mistral and on FP8 / BF16 Llama) [1 participants]

Recommended Tools

GitHub issue graph ai analysis

Root Cause

Fix Action

Fix / Workaround

Workaround in production

Code Example

Your current environment

Environment

🐛 Describe the bug

Summary

Reproducer

Failure scope

Variables independently ruled out

What the LoRA looks like

What this implicates

Workaround in production

Steps already taken

Before submitting a new issue...

Still need to ship something?

TRENDING

vllm - 💡(How to fix) Fix [Bug]: LoRA on AWQ-quantized Llama-3.1-8B / Llama-3.2-3B produces degenerate output (same LoRA infra works on AWQ Mistral and on FP8 / BF16 Llama) [1 participants]

Recommended Tools

GitHub issue graph ai analysis

Root Cause

Fix Action

Fix / Workaround

Workaround in production

Code Example

Your current environment

Environment

🐛 Describe the bug

Summary

Reproducer

Failure scope

Variables independently ruled out

What the LoRA looks like

What this implicates

Workaround in production

Steps already taken

Before submitting a new issue...

Still need to ship something?

RELATED_DISCOVERY

TRENDING