vllm - 💡(How to fix) Fix [Bug]: [SM_120 / Blackwell] AWQ working with awq_marlin + TRITON_ATTN — field report [1 participants]

ALKdevas · 2026-03-18T04:51:37Z

### Your current environment

Posting as a field report since I couldn't find existing documentation for this combination.

**Setup:**
- GPU: NVIDIA GeForce RTX 5060 Ti (compute capability 12.0 / SM_120)
- OS: Windows 11 + WSL2
- PyTorch: 2.10.0+cu130
- vLLM: 0.17.2rc1.dev45+g761e0aa7a

**Root cause:** SM_120 is forced to bfloat16. Standard `--quantization awq` requires float16, so startup crashes immediately with a pydantic ValidationError.

**Working fix:**
```bash
--quantization awq_marlin
--attention-backend TRITON_ATTN
```

**Confirmed working (three architectures):**
- hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 (8B): 338 ms
- casperhansen/mistral-nemo-instruct-2407-awq (12B): 437 ms
- Qwen/Qwen2.5-14B-Instruct-AWQ (14B): 520 ms

**Confirmed NOT working on SM_120:** standard awq, gptq, bitsandbytes, FlashAttention.

Hope this is useful for SM_120 support going forward.

### 🐛 Describe the bug

As above. This is not a bug report but a working fix, posted as a field report.

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the [documentation page](https://docs.vllm.ai/en/latest/), which can answer lots of frequently asked questions.


GitHub stats

vllm-project/vllm#37387 · Fetched 2026-04-08 00:53:11


Timeline (top)

closed ×1 · labeled ×1



extent analysis

Fix Plan

To run AWQ models on SM_120 with vLLM, change two launch options:

Use the awq_marlin quantization method instead of the standard awq by passing --quantization awq_marlin.
Select the Triton attention backend by passing --attention-backend TRITON_ATTN.

Example Configuration

--quantization awq_marlin
--attention-backend TRITON_ATTN
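
The two flags are passed to the vLLM launcher together with the model name. As a minimal sketch, assuming the standard `vllm serve` entry point, the full command line could be assembled as follows. `build_serve_command` is a hypothetical helper for illustration, not part of vLLM, and the flag names are copied from the report as-is.

```python
import shlex
from typing import List


def build_serve_command(model: str) -> List[str]:
    """Return an argv list for launching vLLM with the flags from the report.

    Hypothetical helper for illustration only; not part of vLLM.
    """
    return [
        "vllm", "serve", model,
        # Use the Marlin AWQ path instead of the standard awq method.
        "--quantization", "awq_marlin",
        # Attention backend flag, copied verbatim from the report.
        "--attention-backend", "TRITON_ATTN",
    ]


if __name__ == "__main__":
    # One of the models the report lists as confirmed working.
    cmd = build_serve_command("Qwen/Qwen2.5-14B-Instruct-AWQ")
    # Print a shell-safe command line instead of executing it.
    print(shlex.join(cmd))
```

Printing the command rather than running it keeps the sketch usable on machines without a GPU; the list can be handed to `subprocess.run` once it looks right.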

Code Changes

No code changes are required; this is a launch-configuration fix. The report was confirmed with PyTorch 2.10.0+cu130 and vLLM 0.17.2rc1.dev45+g761e0aa7a on an RTX 5060 Ti, so other versions may behave differently.
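
To compare a machine against the reported setup, a small check along these lines can print the installed versions and the GPU's compute capability. This is a sketch under assumptions, not part of vLLM: `installed_version`, `detect_capability`, and `is_sm120` are hypothetical helper names, and the GPU query only works where PyTorch with CUDA support is importable.

```python
from importlib import metadata
from typing import Optional, Tuple


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def detect_capability() -> Optional[Tuple[int, int]]:
    """Return (major, minor) compute capability of GPU 0, or None if unavailable."""
    try:
        import torch  # imported lazily so the script still runs without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return tuple(torch.cuda.get_device_capability(0))


def is_sm120(capability: Tuple[int, int]) -> bool:
    """True for compute capability 12.0, the SM_120 target named in the report."""
    return tuple(capability) == (12, 0)


if __name__ == "__main__":
    print("torch:", installed_version("torch"))
    print("vllm:", installed_version("vllm"))
    cap = detect_capability()
    print("compute capability:", cap)
    if cap is not None and is_sm120(cap):
        print("SM_120 detected: the report's flags apply to this setup.")
```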

Verification

Verify the fix by starting vLLM with both flags and confirming that the model loads without the pydantic ValidationError. The report lists these models as confirmed working:

hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4 (8B)
casperhansen/mistral-nemo-instruct-2407-awq (12B)
Qwen/Qwen2.5-14B-Instruct-AWQ (14B)

The reported timings are 338 ms, 437 ms, and 520 ms respectively. The report does not say what was measured (prompt, output length, or warm-up), so treat these numbers as a rough reference rather than a benchmark target. If the model loads and answers requests without crashing, the fix has been applied.
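
One rough way to take a timing of your own is sketched below, assuming the model was started with `vllm serve` and its OpenAI-compatible API is reachable at the default `http://localhost:8000`. `time_call` and `completion_request` are hypothetical helpers, and the prompt and `max_tokens` values are arbitrary, because the report does not describe its workload.

```python
import json
import time
import urllib.request
from typing import Callable, List


def time_call(fn: Callable[[], object], repeats: int = 3) -> List[float]:
    """Call fn `repeats` times and return each wall-clock duration in milliseconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append((time.perf_counter() - start) * 1000.0)
    return durations


def completion_request(model: str, prompt: str = "Hello", max_tokens: int = 16,
                       base_url: str = "http://localhost:8000") -> dict:
    """Send one request to the server's /v1/completions endpoint.

    Assumes a locally running server; prompt and max_tokens are arbitrary choices.
    """
    body = json.dumps(
        {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    ).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)


# Usage, once the server is up (not executed here because it needs a live server):
#   model = "Qwen/Qwen2.5-14B-Instruct-AWQ"
#   print(time_call(lambda: completion_request(model), repeats=5))
```

The first call usually includes warm-up cost, so it is worth looking at the later durations separately before comparing against the reported figures.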

