vllm - ✅(Solved) Fix [Feature]: Implement `TRITON_MLA_SPARSE` backend for sm80 support of Sparse MLA [1 pull requests, 9 comments, 5 participants]

vllm · 2026-03-24 13:01:10

vllm-project/vllm#38006•Fetched 2026-04-08 01:21:58

Error Message

vLLM should support DSA/sparse MLA models (GLM-5, etc.) on sm80 GPUs (A100/A800), or at minimum fail early with a clear error message.

Root Cause

GLM-5 and other models using DeepSeek Sparse Attention (DSA) cannot run on sm80 GPUs (A100/A800). There are multiple layers of sm80 incompatibility, not just a single missing fallback.

How to reproduce

python -m vllm.entrypoints.openai.api_server \
  --model ZhipuAI/GLM-5-FP8 \
  --tensor-parallel-size 8 \
  --max-model-len 8192 \
  --cpu-offload-gb 20 \
  --gpu-memory-utilization 0.95 \
  --trust-remote-code \
  --enforce-eager \
  --port 8000

Root cause analysis (updated)

Correction: The original description oversimplified the problem as only a missing DeepGemm fallback in sparse_attn_indexer.py. After thorough testing, we found there are at least 3 layers of sm80 incompatibility:

| Layer | Problem | Details |
|---|---|---|
| 1. C++ compilation | dsv3_fused_a_gemm in csrc/ops.h / csrc/torch_bindings.cpp unconditionally compiles sm90+ code | Need #ifdef ENABLE_DSV3_FUSED_A_GEMM guards to skip on sm80 |
| 2. Attention backend | No sparse MLA attention backend available for sm80 | Upstream has FLASHMLA_SPARSE (sm90+ only). sm80 needs a Triton-based sparse MLA backend (e.g. TRITON_MLA_SPARSE) with registration in cuda.py and registry.py |
| 3. Indexer fallback | sparse_attn_indexer.py calls DeepGemm's fp8_mqa_logits / fp8_paged_mqa_logits without is_deep_gemm_supported() check | Need PyTorch fallback functions when DeepGemm is unavailable |

What we tested

We built a PoC that addresses all 3 layers and confirmed GLM-5-FP8 produces correct inference results on 8xA800:

| Component | Files | What it does |
|---|---|---|
| csrc guards | csrc/ops.h, csrc/torch_bindings.cpp | #ifdef ENABLE_DSV3_FUSED_A_GEMM to skip sm90+ code on sm80 |
| Triton MLA Sparse backend | triton_mla_sparse.py + triton_sparse_decode_attention.py + fp8_mqa_logits_fallback.py (3 new files) | Full Triton-based sparse MLA attention backend for sm80 |
| Backend registration | cuda.py, registry.py | Register TRITON_MLA_SPARSE in priority list and enum |
| Indexer fallback | sparse_attn_indexer.py, deep_gemm.py | is_deep_gemm_supported() guard + fp8_mqa_logits_torch / fp8_paged_mqa_logits_torch PyTorch fallback |

Test config: TP=8, max_model_len=1024, cpu_offload=40GB, gpu_memory_utilization=0.92, max_num_seqs=16

Relationship with PR #35271

PR #35271 by @chaunceyjiang addresses Layer 3 (indexer fallback) with PyTorch fallback functions. This is a necessary piece, but not sufficient alone for sm80 — Layers 1 and 2 also need to be resolved.

Expected behavior

vLLM should support DSA/sparse MLA models (GLM-5, etc.) on sm80 GPUs (A100/A800), or at minimum fail early with a clear error message.

Additional context

A100/A800 GPUs are still widely deployed in datacenters

The non-sparse MLA path (e.g. DeepSeek-V3) works fine on sm80 via the Triton MLA backend — only the sparse variant is broken

What actually happened

After thorough testing on our 8xA800 cluster, we found that running GLM-5 (sparse MLA/DSA) on sm80 requires fixes at 3 separate layers, not just the indexer:

1. C++ compilation: dsv3_fused_a_gemm needs #ifdef guards to skip on sm80

2. Attention backend: sm80 has no sparse MLA backend; upstream only has FLASHMLA_SPARSE (sm90+). We had to write a full TRITON_MLA_SPARSE backend (3 new files: triton_mla_sparse.py, triton_sparse_decode_attention.py, fp8_mqa_logits_fallback.py)

3. Indexer fallback: sparse_attn_indexer.py DeepGemm calls need is_deep_gemm_supported() guards + PyTorch fallback

Testing PR #35271

We tested PR #35271 combined with our PoC patches (layers 1 & 2) and confirmed GLM-5-FP8 generates correct inference results on A800.

However, PR #35271 alone is not sufficient for sm80 — it only addresses layer 3. Without the attention backend and csrc fixes, the server either fails to compile or has no viable backend to select.

Also, one important note: the guard in sparse_attn_indexer.py should use is_deep_gemm_supported() (checks sm90+ arch), not has_deep_gemm() (only checks if the package is installed). We initially used has_deep_gemm() and it still called DeepGemm on sm80 because the package was installed.
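
The difference between the two checks can be illustrated with a small standalone sketch. It does not call vLLM's real helpers: `package_installed`, `arch_supported` and `pick_logits_impl` are hypothetical names, and the sm90 threshold is taken from the note above rather than from the library source.

```python
import importlib.util

# Assumed threshold: the note above says the "supported" check covers sm90+.
MIN_CAPABILITY = (9, 0)


def package_installed(name: str) -> bool:
    """Installation-only check, similar in spirit to has_deep_gemm()."""
    return importlib.util.find_spec(name) is not None


def arch_supported(installed: bool, capability: tuple[int, int]) -> bool:
    """Installation plus architecture check, similar in spirit to
    is_deep_gemm_supported()."""
    return installed and capability >= MIN_CAPABILITY


def pick_logits_impl(installed: bool, capability: tuple[int, int]) -> str:
    """Return which implementation a guarded call site would dispatch to."""
    if arch_supported(installed, capability):
        return "deep_gemm"
    return "torch_fallback"


if __name__ == "__main__":
    # sm80 (A100/A800) with the package installed: the fallback is chosen.
    print(pick_logits_impl(installed=True, capability=(8, 0)))
    # sm90 with the package installed: the DeepGemm path is chosen.
    print(pick_logits_impl(installed=True, capability=(9, 0)))
```

Guarding on installation alone would return True on an A800 whenever the package is present, which matches the failure described above.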

Test details

Server config: TP=8, max_model_len=1024, cpu_offload=40GB, gpu_memory_utilization=0.92, max_num_seqs=16

Tested: basic chat, math, code generation, multi-turn conversation — all passed

The model's 744B parameters are very tight on 8x80GB, requiring aggressive CPU offload
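
A back-of-the-envelope calculation suggests why the fit is tight. The sketch below rests on assumptions the issue does not state: roughly one byte per parameter for FP8 weights, cpu_offload counted per GPU, GB and GiB treated as interchangeable, and no allowance for tensors kept in higher precision. `memory_budget` is a hypothetical helper.

```python
def memory_budget(params_billion, bytes_per_param, num_gpus, gpu_gb,
                  utilization, offload_gb_per_gpu):
    """Rough weight-memory budget in GB (all inputs are approximations)."""
    weights_gb = params_billion * bytes_per_param      # 1e9 params * bytes / 1e9
    usable_gb = num_gpus * gpu_gb * utilization        # GPU memory vLLM may use
    offloaded_gb = num_gpus * offload_gb_per_gpu       # weights parked in CPU RAM
    resident_gb = max(weights_gb - offloaded_gb, 0.0)  # weights left on the GPUs
    return {
        "weights_gb": weights_gb,
        "usable_gb": usable_gb,
        "resident_gb": resident_gb,
        "headroom_per_gpu_gb": (usable_gb - resident_gb) / num_gpus,
    }


if __name__ == "__main__":
    # Values from the test configuration above: 744B params, 8 x 80 GB,
    # gpu_memory_utilization=0.92, cpu_offload=40GB.
    budget = memory_budget(744, 1.0, 8, 80, 0.92, 40)
    for key, value in budget.items():
        print(f"{key}: {value:.1f}")
```

Under these assumptions the weights alone (about 744 GB) exceed the usable GPU memory (about 589 GB), and offloading 320 GB leaves roughly 20.6 GB per GPU for KV cache, activations and everything else.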

I've updated the issue description to accurately reflect the full scope of the problem. Again, sorry for the initial confusion, and thank you @chaunceyjiang for the quick PR — the indexer fallback is definitely a necessary piece of the puzzle.

PR fix notes

PR #35271: [Feat] Add CUDA torch fallbacks for fp8_mqa_logits/fp8_paged_mqa_logits_torch function

Repository: vllm-project/vllm
Author: chaunceyjiang
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/35271

Description (problem / solution / changelog)

Purpose

FIX https://github.com/vllm-project/vllm/issues/35021

Test Plan

# pip uninstall deep_gemm
# vllm serve /mnt/data4/models/deepseek-ai/DeepSeek-V3___2 -tp=8  --tokenizer-mode deepseek_v32 --enable-auto-tool-choice --tool-call-parser deepseek_v32 --reasoning-parser deepseek_v3 --enforce-eager
...
...
(APIServer pid=1017794) INFO 02-25 16:22:09 [launcher.py:47] Route: /v1/messages, Methods: POST
(APIServer pid=1017794) INFO 02-25 16:22:09 [launcher.py:47] Route: /inference/v1/generate, Methods: POST
(APIServer pid=1017794) INFO 02-25 16:22:09 [launcher.py:47] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=1017794) INFO 02-25 16:22:09 [launcher.py:47] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=1017794) INFO:     Started server process [1017794]
(APIServer pid=1017794) INFO:     Waiting for application startup.
(APIServer pid=1017794) INFO:     Application startup complete

Test Result

curl -XPOST http://localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
    "messages": [
        {
            "content": "Tell me a joke",
            "role": "user"
        }
    ],
    "model": "",
    "n": 1,
    "stream": false,
    "temperature": 0.5,
    "chat_template_kwargs": {"thinking": true, "enable_thinking": true}
}'

{"id":"chatcmpl-ac2d19c323329801","object":"chat.completion","created":1772007758,"model":"/mnt/data4/models/deepseek-ai/DeepSeek-V3___2","choices":[{"index":0,"message":{"role":"assistant","content":"Why did the chicken cross the road?\n\nTo get to the other side... and finally escape existential dread. 🐔","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":"Hmm, the user asked for a joke. This is a straightforward request with no complex parameters. I should pick something lighthearted and universally understandable. \n\nI recall a classic joke structure involving a chicken crossing the road. It's simple, clean, and the punchline is well-known but still effective. The chicken's motivation being existential dread adds a subtle modern twist while keeping it family-friendly. \n\nNo need to overthink this—just deliver the joke clearly and wait for the user's reaction. If they want more, I can always follow up with another."},"logprobs":null,"finish_reason":"stop","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":8,"total_tokens":147,"completion_tokens":139,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}

Changed files

vllm/model_executor/layers/sparse_attn_indexer.py (modified, +50/-26)
vllm/utils/deep_gemm.py (modified, +121/-0)
vllm/v1/attention/backends/mla/indexer.py (modified, +5/-2)

🚀 The feature, motivation and pitch

Follow-up to https://github.com/vllm-project/vllm/issues/35021.

PRs relevant to the implementation (some changes will probably need to be re-added once TRITON_MLA_SPARSE is implemented): https://github.com/vllm-project/vllm/pull/37968, https://github.com/vllm-project/vllm/pull/35271, https://github.com/vllm-project/vllm/pull/38076, https://github.com/vllm-project/vllm/pull/36519

Required for sm80 support of Sparse MLA, such as GLM-5 and DeepSeek V3.2.

Argument in support of implementing this: More and more models are likely to be released with Sparse MLA, not just GLM-5.

One precedent for this kind of backwards compatibility is the Marlin FP8 E4M3 fallback for sm80, which allows FP8 models to run on Ampere (https://github.com/vllm-project/vllm/issues/17579, https://github.com/vllm-project/vllm/pull/18026, https://github.com/vllm-project/vllm/pull/19990, https://github.com/vllm-project/vllm/pull/24722). SGLang does not support this (https://github.com/sgl-project/sglang/issues/12887, https://github.com/sgl-project/sglang/pull/9754), so Ampere users of FP8 W8A8 MoE models are restricted to vLLM.

Thus, TRITON_MLA_SPARSE should also be implemented, as this will redirect all Ampere users of Sparse MLA models to vLLM. If vLLM doesn't do this, nobody can.

By @qjxjy123: the reproduction command, three-layer root cause analysis, PoC results and PR #35271 test notes, identical to the text under "Root Cause" above.

Alternatives

Using Hopper or Blackwell is not a legitimate solution for many users on Ampere.

extent analysis

Fix Plan

To fix the issue of running GLM-5 (sparse MLA/DSA) on sm80 GPUs, we need to address three separate layers:

C++ compilation: Add #ifdef guards to skip sm90+ code on sm80 in csrc/ops.h and csrc/torch_bindings.cpp.
Attention backend: Implement a full TRITON_MLA_SPARSE backend for sm80, including registration in cuda.py and registry.py.
Indexer fallback: Add is_deep_gemm_supported() guards and PyTorch fallback functions in sparse_attn_indexer.py.

Code Changes

In csrc/ops.h and csrc/torch_bindings.cpp, add:

#ifdef ENABLE_DSV3_FUSED_A_GEMM
// sm90+ code
#endif

Create new files triton_mla_sparse.py, triton_sparse_decode_attention.py, and fp8_mqa_logits_fallback.py to implement the TRITON_MLA_SPARSE backend.
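In cuda.py and registry.py, register the TRITON_MLA_SPARSE backend. The issue does not show the actual diff, so the lines below only outline the two changes:

# registry.py: add TRITON_MLA_SPARSE to the attention backend enum
# cuda.py: add TRITON_MLA_SPARSE to the backend priority list so it can be selected on sm80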
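
To make "register in the enum and the priority list" concrete, here is a standalone sketch of capability-based backend selection that also fails early with a clear message. It is not vLLM's code: the `Backend` enum, `MIN_CAPABILITY`, `SPARSE_MLA_PRIORITY` and `select_sparse_mla_backend` are hypothetical, and the sm80 floor for the Triton backend is an assumption.

```python
from enum import Enum


class Backend(Enum):
    """Hypothetical stand-in for an attention backend enum."""
    FLASHMLA_SPARSE = "flashmla_sparse"
    TRITON_MLA_SPARSE = "triton_mla_sparse"


# Minimum compute capability per backend. FLASHMLA_SPARSE is sm90+ according
# to the issue; the sm80 floor for the Triton backend is an assumption.
MIN_CAPABILITY = {
    Backend.FLASHMLA_SPARSE: (9, 0),
    Backend.TRITON_MLA_SPARSE: (8, 0),
}

# Priority order: try the specialised kernel first, then the Triton backend.
SPARSE_MLA_PRIORITY = [Backend.FLASHMLA_SPARSE, Backend.TRITON_MLA_SPARSE]


def select_sparse_mla_backend(capability, priority=SPARSE_MLA_PRIORITY):
    """Return the first backend in `priority` that supports `capability`.

    Raises RuntimeError naming the architecture and the backends tried, so
    an unsupported GPU is reported at startup rather than deep in a kernel.
    """
    for backend in priority:
        if capability >= MIN_CAPABILITY[backend]:
            return backend
    major, minor = capability
    tried = ", ".join(b.name for b in priority)
    raise RuntimeError(
        f"No sparse MLA attention backend supports sm{major}{minor}; "
        f"tried: {tried}"
    )


if __name__ == "__main__":
    print(select_sparse_mla_backend((9, 0)).name)
    print(select_sparse_mla_backend((8, 0)).name)
```

With only FLASHMLA_SPARSE in the priority list, the same call on sm80 raises the error instead of returning a backend, which is the "fail early" behaviour the issue asks for as a minimum.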

In sparse_attn_indexer.py, add:

if is_deep_gemm_supported():
    # DeepGemm code
else:
    # PyTorch fallback code

Verification

To verify the fix, run the following command:

python -m vllm.entrypoints.openai.api_server \
  --model ZhipuAI/GLM-5-FP8 \
  --tensor-parallel-size 8 \
  --max-model-len 8192 \
  --cpu-offload-gb 20 \
  --gpu-memory-utilization 0.95 \
  --trust-remote-code \
  --enforce-eager \
  --port 8000

Check that the model runs correctly and produces expected results.

Extra Tips

Make sure to test the fix thoroughly on different GPU architectures and models.
Consider adding additional logging and error handling to help diagnose any issues that may arise.
Keep in mind that this fix is specific to sm80 GPUs and may not be applicable to other architectures.
