vllm: [Bug]: [SM_120 / Blackwell] AWQ working with awq_marlin + TRITON_ATTN (field report)

vllm-project/vllm#37387 (fetched 2026-04-08 00:53:11)

Root Cause

Root cause: SM_120 is forced to bfloat16. Standard --quantization awq requires float16 → immediate crash with pydantic ValidationError.

Code Example

--quantization awq_marlin
--attention-backend TRITON_ATTN

Your current environment

Posting as a field report since I couldn't find existing documentation for this combination.

Setup:

  • GPU: NVIDIA GeForce RTX 5060 Ti (compute capability 12.0 / SM_120)
  • OS: Windows 11 + WSL2
  • PyTorch: 2.10.0+cu130
  • vLLM: 0.17.2rc1.dev45+g761e0aa7a
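To check whether a machine matches this setup, the compute capability can be read from PyTorch. The following is a minimal sketch, assuming PyTorch with CUDA support is installed; `is_sm120` and `detect_capability` are hypothetical helper names and not part of vLLM.

```python
from typing import Optional, Tuple


def is_sm120(capability: Tuple[int, int]) -> bool:
    """Return True for compute capability 12.0 (SM_120)."""
    return tuple(capability) == (12, 0)


def detect_capability(device: int = 0) -> Optional[Tuple[int, int]]:
    """Read (major, minor) from PyTorch, or None if CUDA is unavailable."""
    import torch  # imported lazily so is_sm120 works without PyTorch

    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(device)
```

On a CUDA machine, `is_sm120(detect_capability())` should be true for the RTX 5060 Ti described above, provided `detect_capability()` did not return `None`.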

Root cause: SM_120 is forced to bfloat16. Standard --quantization awq requires float16 → immediate crash with pydantic ValidationError.
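The mismatch can be pictured as a lookup of allowed activation dtypes per quantization method. This is a toy model of the failure, not vLLM's actual validation code; the table and the `check_dtype` helper are assumptions made for illustration, and the real error surfaces as a pydantic ValidationError rather than the ValueError used here.

```python
# Toy model of the dtype mismatch described in the report; NOT vLLM's code.
# The table encodes the report's claims: standard awq requires float16,
# while awq_marlin ran on SM_120, where the dtype is bfloat16.
SUPPORTED_DTYPES = {
    "awq": {"float16"},
    "awq_marlin": {"float16", "bfloat16"},
}


def check_dtype(quantization: str, dtype: str) -> None:
    """Raise ValueError if the dtype is not allowed for the method."""
    allowed = SUPPORTED_DTYPES[quantization]
    if dtype not in allowed:
        raise ValueError(
            f"{dtype} is not supported for quantization method "
            f"{quantization}; supported: {sorted(allowed)}"
        )


# SM_120 is forced to bfloat16 according to the report.
check_dtype("awq_marlin", "bfloat16")  # passes
try:
    check_dtype("awq", "bfloat16")  # fails, mirroring the reported crash
except ValueError as err:
    print(err)
```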

Working fix:

--quantization awq_marlin
--attention-backend TRITON_ATTN

Confirmed working — three architectures:

  • hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 (8B) — 338ms
  • casperhansen/mistral-nemo-instruct-2407-awq (12B) — 437ms
  • Qwen/Qwen2.5-14B-Instruct-AWQ (14B) — 520ms

Confirmed NOT working on SM_120: standard awq, gptq, bitsandbytes, FlashAttention.

Hope this is useful for SM_120 support going forward.

🐛 Describe the bug

As above. This is not a bug but a fix: [SM_120 / Blackwell] AWQ working with awq_marlin + TRITON_ATTN (field report).


extent analysis

Fix Plan

To run AWQ models on SM_120 with vLLM, change two settings:

  • Use the awq_marlin quantization method (--quantization awq_marlin) instead of the standard awq.
  • Select the Triton attention backend with --attention-backend TRITON_ATTN.

Example Configuration

--quantization awq_marlin
--attention-backend TRITON_ATTN
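The two flags can be assembled into a full launch command. The sketch below assumes the server is started through the `vllm serve` entry point (the report lists only the flags, not the command); `build_serve_command` is a hypothetical helper name.

```python
import shlex
from typing import List


def build_serve_command(model: str) -> List[str]:
    """Build an argv list that applies the two flags from the report."""
    return [
        "vllm", "serve", model,          # entry point is an assumption
        "--quantization", "awq_marlin",  # instead of the standard awq
        "--attention-backend", "TRITON_ATTN",
    ]


command = build_serve_command("Qwen/Qwen2.5-14B-Instruct-AWQ")
print(shlex.join(command))
```

The argv list can be passed to `subprocess.run(command)` as-is; the printed form is for copying into a shell.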

Code Changes

No code changes are required; this is a configuration fix. The report used PyTorch 2.10.0+cu130 and vLLM 0.17.2rc1.dev45+g761e0aa7a and does not cover other versions.

Verification

Verify the fix by loading an AWQ model with the two flags and confirming that it starts and answers requests. The report confirmed these three models:

  • hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 (8B)
  • casperhansen/mistral-nemo-instruct-2407-awq (12B)
  • Qwen/Qwen2.5-14B-Instruct-AWQ (14B)

Measure the model's inference time and compare it with the reported values (338 ms, 437 ms, and 520 ms respectively). The report does not say what these timings measure, so treat them as rough reference points. If the model runs without crashing and the timings are in the same range, the fix has been applied successfully.
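One way to make the comparison repeatable is a small timing helper. This is a sketch under stated assumptions: the helper names, the repeat count, and the 50% tolerance are arbitrary choices made here, and the report does not describe its own measurement method.

```python
import statistics
import time
from typing import Callable, Dict

# Timings quoted in the report, in milliseconds.
REPORTED_MS: Dict[str, float] = {
    "hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4": 338.0,
    "casperhansen/mistral-nemo-instruct-2407-awq": 437.0,
    "Qwen/Qwen2.5-14B-Instruct-AWQ": 520.0,
}


def median_latency_ms(call: Callable[[], object], repeats: int = 5) -> float:
    """Run `call` several times and return the median wall time in ms."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def is_similar(measured_ms: float, reported_ms: float,
               tolerance: float = 0.5) -> bool:
    """True if the measurement is within `tolerance` (a fraction) of the
    reported value. The default tolerance is an arbitrary assumption."""
    return abs(measured_ms - reported_ms) <= tolerance * reported_ms
```

Pass `median_latency_ms` a callable that sends one request to your running server, then compare the result against the matching `REPORTED_MS` entry with `is_similar`.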
