vllm - 💡(How to fix) Fix [Bug]: Turboquant attention crashes on A100 when serving BF16 models with FP8 KV cache [1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
vllm-project/vllm#39992Fetched 2026-04-17 08:27:55
View on GitHub
Comments
0
Participants
1
Timeline
2
Reactions
0
Participants
Timeline (top)
labeled ×1renamed ×1

Error Message

File "triton/language/extra/cuda/utils.py", line 104, in convert_custom_float8_sm80 return convert_custom_float8(arg, dst_ty, fp_downcast_rounding, has_minx2=True, ...)

File "triton/language/extra/cuda/utils.py", line 94, in convert_custom_float8 assert arg.type.scalar.is_fp16() or arg.type.scalar.is_fp32() AssertionError

triton.compiler.errors.CompilationError: at 45:12: k_fp8 = k_vals.to(tl.float8e4b15) if FP8_E4B15 else k_vals.to(tl.float8e4nv) ^

Root Cause

The Triton kernel _tq_fused_store_fp8 in vllm/v1/attention/ops/triton_turboquant_store.py:189 directly casts KV values to FP8:

k_vals = tl.load(Key_ptr + base + d_offs, mask=d_mask, other=0.0)
k_fp8 = k_vals.to(tl.float8e4b15) if FP8_E4B15 else k_vals.to(tl.float8e4nv)

On SM80 (A100), Triton uses a software FP8 conversion path (convert_custom_float8_sm80) which only accepts FP16 or FP32 inputs:

# triton/language/extra/cuda/utils.py:94
assert arg.type.scalar.is_fp16() or arg.type.scalar.is_fp32()

When the model uses BF16, k_vals is loaded as BF16, and this assertion fails. This is SM80-specific — on SM89+ (H100), Triton has native FP8 hardware support and may use a different code path.

Code Example

- vLLM version: main (commit f4ddaf8cf)
- GPU: A100 80GB (SM80)
- Python: 3.12
- Triton: 3.x

---

k_vals = tl.load(Key_ptr + base + d_offs, mask=d_mask, other=0.0)
k_fp8 = k_vals.to(tl.float8e4b15) if FP8_E4B15 else k_vals.to(tl.float8e4nv)

---

# triton/language/extra/cuda/utils.py:94
assert arg.type.scalar.is_fp16() or arg.type.scalar.is_fp32()

---

.venv/bin/python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-3B-Instruct \
    --kv-cache-dtype turboquant_k8v4 \
    --enforce-eager

---

File "triton/language/extra/cuda/utils.py", line 104, in convert_custom_float8_sm80
    return convert_custom_float8(arg, dst_ty, fp_downcast_rounding, has_minx2=True, ...)

File "triton/language/extra/cuda/utils.py", line 94, in convert_custom_float8
    assert arg.type.scalar.is_fp16() or arg.type.scalar.is_fp32()
AssertionError

triton.compiler.errors.CompilationError: at 45:12:
    k_fp8 = k_vals.to(tl.float8e4b15) if FP8_E4B15 else k_vals.to(tl.float8e4nv)
            ^

---

# Before (fails for BF16 on SM80):
k_vals = tl.load(Key_ptr + base + d_offs, mask=d_mask, other=0.0)

# After (works for all dtypes on all architectures):
k_vals = tl.load(Key_ptr + base + d_offs, mask=d_mask, other=0.0).to(tl.float32)
RAW_BUFFERClick to expand / collapse

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
- vLLM version: main (commit f4ddaf8cf)
- GPU: A100 80GB (SM80)
- Python: 3.12
- Triton: 3.x
</details>

🐛 Describe the bug

When serving a BF16 model with the turboquant attention backend on A100 (SM80), the engine crashes during initialization with an AssertionError from Triton's convert_custom_float8.

Root cause

The Triton kernel _tq_fused_store_fp8 in vllm/v1/attention/ops/triton_turboquant_store.py:189 directly casts KV values to FP8:

k_vals = tl.load(Key_ptr + base + d_offs, mask=d_mask, other=0.0)
k_fp8 = k_vals.to(tl.float8e4b15) if FP8_E4B15 else k_vals.to(tl.float8e4nv)

On SM80 (A100), Triton uses a software FP8 conversion path (convert_custom_float8_sm80) which only accepts FP16 or FP32 inputs:

# triton/language/extra/cuda/utils.py:94
assert arg.type.scalar.is_fp16() or arg.type.scalar.is_fp32()

When the model uses BF16, k_vals is loaded as BF16, and this assertion fails. This is SM80-specific — on SM89+ (H100), Triton has native FP8 hardware support and may use a different code path.

How to reproduce

  .venv/bin/python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen2.5-3B-Instruct \
    --kv-cache-dtype turboquant_k8v4 \
    --enforce-eager

Any BF16 model on A100 with the turboquant attention backend will trigger this.

Error traceback

File "triton/language/extra/cuda/utils.py", line 104, in convert_custom_float8_sm80
    return convert_custom_float8(arg, dst_ty, fp_downcast_rounding, has_minx2=True, ...)

File "triton/language/extra/cuda/utils.py", line 94, in convert_custom_float8
    assert arg.type.scalar.is_fp16() or arg.type.scalar.is_fp32()
AssertionError

triton.compiler.errors.CompilationError: at 45:12:
    k_fp8 = k_vals.to(tl.float8e4b15) if FP8_E4B15 else k_vals.to(tl.float8e4nv)
            ^

Suggested fix

Cast to FP32 before the FP8 conversion in triton_turboquant_store.py:188. This is safe and lossless for all input dtypes (FP16, BF16, FP32):

# Before (fails for BF16 on SM80):
k_vals = tl.load(Key_ptr + base + d_offs, mask=d_mask, other=0.0)

# After (works for all dtypes on all architectures):
k_vals = tl.load(Key_ptr + base + d_offs, mask=d_mask, other=0.0).to(tl.float32)

PR: #39988

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.

extent analysis

TL;DR

Cast the loaded k_vals to FP32 before converting to FP8 in triton_turboquant_store.py to resolve the AssertionError on A100 (SM80) with BF16 models.

Guidance

  • The issue arises from the direct casting of KV values to FP8, which fails when the model uses BF16 on A100 (SM80) due to Triton's software FP8 conversion path only accepting FP16 or FP32 inputs.
  • To verify the fix, run the provided reproduction command with the modified triton_turboquant_store.py file.
  • The suggested fix involves changing the line k_vals = tl.load(Key_ptr + base + d_offs, mask=d_mask, other=0.0) to k_vals = tl.load(Key_ptr + base + d_offs, mask=d_mask, other=0.0).to(tl.float32) to ensure a lossless conversion for all input dtypes.
  • This fix should be applied to the triton_turboquant_store.py file at line 188.

Example

The modified code snippet would look like this:

k_vals = tl.load(Key_ptr + base + d_offs, mask=d_mask, other=0.0).to(tl.float32)
k_fp8 = k_vals.to(tl.float8e4b15) if FP8_E4B15 else k_vals.to(tl.float8e4nv)

Notes

This fix is specific to the A100 (SM80) architecture and may not be necessary for other architectures like H100 (SM89+), which have native FP8 hardware support.

Recommendation

Apply the suggested workaround by casting k_vals to FP32 before converting to FP8, as this is a safe and lossless conversion for all input dtypes.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

vllm - 💡(How to fix) Fix [Bug]: Turboquant attention crashes on A100 when serving BF16 models with FP8 KV cache [1 participants]