vllm - 💡(How to fix) Fix [Feature]: W6A16 Support [1 participants]

GitHub stats
vllm-project/vllm#36916 (fetched 2026-04-08 00:43:40)
Comments: 0 · Participants: 1 · Timeline events: 1 (labeled ×1) · Reactions: 0

🚀 The feature, motivation and pitch

Please consider adding W6A16 support to vLLM/LLM-Compressor.

I'm aware it may be as slow as W8A16. My priorities are VRAM and accuracy. W4A16 is good, but not accurate enough for me. I am VRAM-constrained even with four GPUs because I use long contexts.

I have multiple 3090s. Last I checked, I cannot quantize the KV cache on Ampere because vLLM does not support it there, which makes this difficult for me.
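
The VRAM trade-off behind this request can be estimated with simple arithmetic. The sketch below is only a back-of-the-envelope illustration: the 70-billion-parameter count is a hypothetical figure that does not come from the issue, and the estimate ignores quantization scales, zero points, activations, and the KV cache.

```python
def weight_gib(num_params: int, bits_per_weight: int) -> float:
    """Approximate weight storage in GiB, ignoring scales and zero points."""
    return num_params * bits_per_weight / 8 / 2**30


# Hypothetical 70-billion-parameter model, chosen only for illustration.
NUM_PARAMS = 70_000_000_000

for bits in (4, 6, 8, 16):
    print(f"{bits:>2}-bit weights: {weight_gib(NUM_PARAMS, bits):6.1f} GiB")
```

Whatever the model size, 6-bit weights take 1.5 times the space of 4-bit weights and three quarters of the space of 8-bit weights, which is why W6A16 is attractive as a middle ground.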

Alternatives

None

Additional context

N/A

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

Fix Plan

To add W6A16 support to vLLM/LLM-Compressor, the existing quantization code needs to handle a 6-bit weight format alongside the 4-bit and 8-bit ones.

  • Add W6A16 to the enumeration of supported data types
  • Extend the quantization function to produce 6-bit weights
  • Update the cache storage to hold W6A16 data
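
One practical difficulty with the storage step is that 6-bit values do not align to byte boundaries the way 4-bit and 8-bit values do. The following is a minimal sketch of one possible layout, packing four 6-bit codes into three bytes. The function names `pack_6bit` and `unpack_6bit` are hypothetical and do not correspond to any vLLM or LLM-Compressor API; real GPU kernels would likely use a layout tuned to the hardware.

```python
def pack_6bit(codes):
    """Pack unsigned 6-bit codes (0..63) into bytes, four codes per three bytes."""
    if len(codes) % 4 != 0:
        raise ValueError("number of codes must be a multiple of 4")
    out = bytearray()
    for i in range(0, len(codes), 4):
        word = 0
        for j, code in enumerate(codes[i:i + 4]):
            if not 0 <= code <= 63:
                raise ValueError(f"code out of 6-bit range: {code}")
            word |= code << (6 * j)  # code j occupies bits 6j .. 6j+5
        out += word.to_bytes(3, "little")
    return bytes(out)


def unpack_6bit(packed):
    """Inverse of pack_6bit: recover the list of 6-bit codes."""
    if len(packed) % 3 != 0:
        raise ValueError("packed length must be a multiple of 3")
    codes = []
    for i in range(0, len(packed), 3):
        word = int.from_bytes(packed[i:i + 3], "little")
        codes.extend((word >> (6 * j)) & 0x3F for j in range(4))
    return codes


codes = [0, 63, 21, 42, 7, 8, 9, 10]
packed = pack_6bit(codes)
assert unpack_6bit(packed) == codes
print(len(codes), "codes ->", len(packed), "bytes")
```

Signed weights can be handled by adding an offset before packing and subtracting it after unpacking.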

Example Code

# Illustrative placeholder only; these names are not vLLM or LLM-Compressor APIs.
from enum import Enum


# Data type enumeration; each value is the weight bit width
class DataType(Enum):
    W4A16 = 4
    W6A16 = 6
    W8A16 = 8


# Placeholder quantization: truncates an 8-bit integer code to the target
# width by dropping low bits. Real weight quantization uses scales and rounding.
def quantize(data, data_type):
    if not isinstance(data_type, DataType):
        raise ValueError(f"unsupported data type: {data_type!r}")
    # W8A16 shifts by 0, W6A16 by 2, W4A16 by 4
    return data >> (8 - data_type.value)


# Cache storage keyed by name; the stored value needs no per-type branching
class Cache:
    def __init__(self, data_type):
        self.data_type = data_type
        self.cache = {}

    def store(self, key, value):
        self.cache[key] = value
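
For comparison, weight-only quantization is usually done by scaling and rounding floating-point weights, not by shifting integers. The sketch below shows symmetric round-to-nearest quantization with a single scale per group, in plain Python. It is an assumption-laden illustration: the function names and the sample weights are hypothetical, and it does not reflect how vLLM or LLM-Compressor implement their schemes.

```python
def quantize_symmetric(weights, bits):
    """Round-to-nearest symmetric quantization with one scale for the group."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 31 for 6-bit, 127 for 8-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def max_abs_error(weights, bits):
    """Largest reconstruction error after a quantize/dequantize round trip."""
    codes, scale = quantize_symmetric(weights, bits)
    restored = dequantize(codes, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))


# Hypothetical weight group, for illustration only.
group = [0.82, -0.41, 0.07, -0.93, 0.55, 0.12, -0.68, 0.30]
for bits in (4, 6, 8):
    print(f"{bits}-bit max abs error: {max_abs_error(group, bits):.5f}")
```

The round-trip error is bounded by half the scale, and the scale shrinks as the bit width grows. That is the accuracy argument for preferring 6-bit over 4-bit weights.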

Verification

To verify the fix, run the updated code on W6A16-quantized weights and compare accuracy and VRAM usage against the W4A16 and W8A16 baselines.

Extra Tips

  • Make sure to update the documentation to reflect the new data type support.
  • Test the updated code thoroughly to ensure it works as expected.
  • Consider adding support for other data types in the future.
