vllm - 💡(How to fix) Fix [Bug]: Turboquant attention crashes on A100 when serving BF16 models with FP8 KV cache [1 participants]

vllm2026-04-16 08:59:33

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

vllm-project/vllm#39992•Fetched 2026-04-17 08:27:55

View on GitHub

Comments

Participants

Timeline

Reactions

Author

Zoofission

Participants

Zoofission

Timeline (top)

labeled ×1renamed ×1

Error Message

File "triton/language/extra/cuda/utils.py", line 104, in convert_custom_float8_sm80 return convert_custom_float8(arg, dst_ty, fp_downcast_rounding, has_minx2=True, ...)

File "triton/language/extra/cuda/utils.py", line 94, in convert_custom_float8 assert arg.type.scalar.is_fp16() or arg.type.scalar.is_fp32() AssertionError

triton.compiler.errors.CompilationError: at 45:12: k_fp8 = k_vals.to(tl.float8e4b15) if FP8_E4B15 else k_vals.to(tl.float8e4nv) ^

Root Cause

The Triton kernel _tq_fused_store_fp8 in vllm/v1/attention/ops/triton_turboquant_store.py:189 directly casts KV values to FP8:

k_vals = tl.load(Key_ptr + base + d_offs, mask=d_mask, other=0.0)
k_fp8 = k_vals.to(tl.float8e4b15) if FP8_E4B15 else k_vals.to(tl.float8e4nv)

On SM80 (A100), Triton uses a software FP8 conversion path (convert_custom_float8_sm80) which only accepts FP16 or FP32 inputs:

# triton/language/extra/cuda/utils.py:94
assert arg.type.scalar.is_fp16() or arg.type.scalar.is_fp32()

When the model uses BF16, k_vals is loaded as BF16, and this assertion fails. This is SM80-specific — on SM89+ (H100), Triton has native FP8 hardware support and may use a different code path.

Code Example

- vLLM version: main (commit f4ddaf8cf)
- GPU: A100 80GB (SM80)
- Python: 3.12
- Triton: 3.x

---

k_vals = tl.load(Key_ptr + base + d_offs, mask=d_mask, other=0.0)
k_fp8 = k_vals.to(tl.float8e4b15) if FP8_E4B15 else k_vals.to(tl.float8e4nv)

---

# triton/language/extra/cuda/utils.py:94
assert arg.type.scalar.is_fp16() or arg.type.scalar.is_fp32()

---

.venv/bin/python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-3B-Instruct \
    --kv-cache-dtype turboquant_k8v4 \
    --enforce-eager

---

File "triton/language/extra/cuda/utils.py", line 104, in convert_custom_float8_sm80
    return convert_custom_float8(arg, dst_ty, fp_downcast_rounding, has_minx2=True, ...)

File "triton/language/extra/cuda/utils.py", line 94, in convert_custom_float8
    assert arg.type.scalar.is_fp16() or arg.type.scalar.is_fp32()
AssertionError

triton.compiler.errors.CompilationError: at 45:12:
    k_fp8 = k_vals.to(tl.float8e4b15) if FP8_E4B15 else k_vals.to(tl.float8e4nv)
            ^

---

# Before (fails for BF16 on SM80):
k_vals = tl.load(Key_ptr + base + d_offs, mask=d_mask, other=0.0)

# After (works for all dtypes on all architectures):
k_vals = tl.load(Key_ptr + base + d_offs, mask=d_mask, other=0.0).to(tl.float32)

RAW_BUFFERClick to expand / collapse

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>

- vLLM version: main (commit f4ddaf8cf)
- GPU: A100 80GB (SM80)
- Python: 3.12
- Triton: 3.x

</details>

🐛 Describe the bug

When serving a BF16 model with the turboquant attention backend on A100 (SM80), the engine crashes during initialization with an AssertionError from Triton's convert_custom_float8.

Root cause

The Triton kernel _tq_fused_store_fp8 in vllm/v1/attention/ops/triton_turboquant_store.py:189 directly casts KV values to FP8:

k_vals = tl.load(Key_ptr + base + d_offs, mask=d_mask, other=0.0)
k_fp8 = k_vals.to(tl.float8e4b15) if FP8_E4B15 else k_vals.to(tl.float8e4nv)

On SM80 (A100), Triton uses a software FP8 conversion path (convert_custom_float8_sm80) which only accepts FP16 or FP32 inputs:

# triton/language/extra/cuda/utils.py:94
assert arg.type.scalar.is_fp16() or arg.type.scalar.is_fp32()

When the model uses BF16, k_vals is loaded as BF16, and this assertion fails. This is SM80-specific — on SM89+ (H100), Triton has native FP8 hardware support and may use a different code path.

How to reproduce

  .venv/bin/python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-3B-Instruct \
    --kv-cache-dtype turboquant_k8v4 \
    --enforce-eager

Any BF16 model on A100 with the turboquant attention backend will trigger this.

Error traceback

File "triton/language/extra/cuda/utils.py", line 104, in convert_custom_float8_sm80
    return convert_custom_float8(arg, dst_ty, fp_downcast_rounding, has_minx2=True, ...)

File "triton/language/extra/cuda/utils.py", line 94, in convert_custom_float8
    assert arg.type.scalar.is_fp16() or arg.type.scalar.is_fp32()
AssertionError

triton.compiler.errors.CompilationError: at 45:12:
    k_fp8 = k_vals.to(tl.float8e4b15) if FP8_E4B15 else k_vals.to(tl.float8e4nv)
            ^

Suggested fix

Cast to FP32 before the FP8 conversion in triton_turboquant_store.py:188. This is safe and lossless for all input dtypes (FP16, BF16, FP32):

# Before (fails for BF16 on SM80):
k_vals = tl.load(Key_ptr + base + d_offs, mask=d_mask, other=0.0)

# After (works for all dtypes on all architectures):
k_vals = tl.load(Key_ptr + base + d_offs, mask=d_mask, other=0.0).to(tl.float32)

PR: #39988

Before submitting a new issue...

Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

Cast the loaded k_vals to FP32 before converting to FP8 in triton_turboquant_store.py to resolve the AssertionError on A100 (SM80) with BF16 models.

Guidance

The issue arises from the direct casting of KV values to FP8, which fails when the model uses BF16 on A100 (SM80) due to Triton's software FP8 conversion path only accepting FP16 or FP32 inputs.
To verify the fix, run the provided reproduction command with the modified triton_turboquant_store.py file.
The suggested fix involves changing the line k_vals = tl.load(Key_ptr + base + d_offs, mask=d_mask, other=0.0) to k_vals = tl.load(Key_ptr + base + d_offs, mask=d_mask, other=0.0).to(tl.float32) to ensure a lossless conversion for all input dtypes.
This fix should be applied to the triton_turboquant_store.py file at line 188.

Example

The modified code snippet would look like this:

k_vals = tl.load(Key_ptr + base + d_offs, mask=d_mask, other=0.0).to(tl.float32)
k_fp8 = k_vals.to(tl.float8e4b15) if FP8_E4B15 else k_vals.to(tl.float8e4nv)

Notes

This fix is specific to the A100 (SM80) architecture and may not be necessary for other architectures like H100 (SM89+), which have native FP8 hardware support.

Recommendation

Apply the suggested workaround by casting k_vals to FP32 before converting to FP8, as this is a safe and lossless conversion for all input dtypes.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #authentication issue #prompt issue #agent setup #task chaining

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering

vllm - 💡(How to fix) Fix [Bug]: Turboquant attention crashes on A100 when serving BF16 models with FP8 KV cache [1 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Code Example

Your current environment

🐛 Describe the bug

Root cause

How to reproduce

Error traceback

Suggested fix

Before submitting a new issue...

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

TRENDING

vllm - 💡(How to fix) Fix [Bug]: Turboquant attention crashes on A100 when serving BF16 models with FP8 KV cache [1 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Code Example

Your current environment

🐛 Describe the bug

Root cause

How to reproduce

Error traceback

Suggested fix

Before submitting a new issue...

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

RELATED_DISCOVERY

TRENDING