vllm - Fix [Bug]: MTP DeepSeek and Eagle Flash Attention Failures in Spec Decode Unit Tests

vllm, 2026-04-14 22:27:35

GitHub stats

vllm-project/vllm#39839•Fetched 2026-04-16 06:36:19

Author

puririshi98

Participants

puririshi98

Error Message

Failure 1: test_eagle_correctness_light[FLASH_ATTN-deepseek_eagle]

Model: eagle618/deepseek-v3-random with FLASH_ATTN backend

Root cause: Flash Attention 4 (FA4) on SM100/SM110 (Blackwell) does not support head_dim=192 when head_dim_v is also 192. The validation at vllm/vllm_flash_attn/cute/interface.py:114 accepts only two patterns:

Standard range: head_dim and head_dim_v between 8 and 128, each divisible by 8
Special DeepSeek shape: (192, 128) only (asymmetric Q/V)

The eagle618/deepseek-v3-random model has (head_dim, head_dim_v) = (192, 192) which doesn't match either pattern. The actual DeepSeek-V3 production model uses (192, 128) and would work fine — this is specific to the random test model's symmetric head dims.
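
The rule quoted above can be restated as a short Python sketch. It is derived only from the wording of the error message and is not the code in vllm/vllm_flash_attn/cute/interface.py; the function name head_dims_supported_on_sm100 is hypothetical.

```python
def head_dims_supported_on_sm100(head_dim: int, head_dim_v: int) -> bool:
    """Hypothetical helper restating the rule from the AssertionError text."""

    def in_standard_range(d: int) -> bool:
        # Standard range: between 8 and 128 inclusive, divisible by 8.
        return 8 <= d <= 128 and d % 8 == 0

    if in_standard_range(head_dim) and in_standard_range(head_dim_v):
        return True
    # The only shape accepted outside the standard range is the
    # asymmetric DeepSeek shape.
    return (head_dim, head_dim_v) == (192, 128)


# The random test model's shape is rejected; the production shape passes.
print(head_dims_supported_on_sm100(192, 192))  # False
print(head_dims_supported_on_sm100(192, 128))  # True
```

Under this reading, (192, 192) fails both branches, which matches the assertion raised during CUDA graph capture.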

Error:
AssertionError: (head_dim, head_dim_v)=(192, 192) is not supported on SM100/SM110.
head_dim and head_dim_v must be between 8 and 128 and divisible by 8, or (192, 128) for DeepSeek.

Call chain: _run_eagle_correctness → LLM() → EngineCore init → CUDA graph capture → flash_attn_varlen_func → _validate_head_dims

Failure 2: test_mtp_correctness[deepseek]

Model: ZixiQi/DeepSeek-V3-4layers-MTP-FP8 with auto attention backend

Root cause: FlashInfer's TRTLLM fused MoE kernel on SM100 has a strict inequality check: top_k < (topk_group * num_experts / n_group). The DeepSeek-V3 model's routing config hits this boundary exactly with 4 == 4, causing the check to fail.
The upstream code in flashinfer/csrc/trtllm_fused_moe_kernel_launcher.cu:928 uses < instead of <=.

Error:
RuntimeError: Check failed: args->top_k < (args->topk_group * args->num_experts / args->n_group) (4 vs. 4) :
top_k must be less than total number of experts in selected groups

Call chain: test_mtp_correctness → LLM() → EngineCore init → moe_forward_shared → TRTLLM fused MoE kernel → strict top_k check fails

This is an upstream FlashInfer bug (off-by-one: should be <= not <). Workaround would be to force moe_backend="triton" for this model on SM100+ or skip the test on Blackwell.
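
A small sketch of the boundary condition may help. The issue reports only the comparison "4 vs. 4", not the individual routing parameters, so the values below are hypothetical and chosen solely to reproduce that boundary. The function names are also illustrative and do not come from FlashInfer.

```python
def strict_check(top_k: int, topk_group: int, num_experts: int, n_group: int) -> bool:
    """Mirrors the check quoted in the RuntimeError (strict '<')."""
    # Integer division, as in the C++ expression from the error message.
    return top_k < (topk_group * num_experts // n_group)


def relaxed_check(top_k: int, topk_group: int, num_experts: int, n_group: int) -> bool:
    """The '<=' variant that the issue proposes."""
    return top_k <= (topk_group * num_experts // n_group)


# Hypothetical routing values that yield the reported "4 vs. 4" boundary.
params = dict(top_k=4, topk_group=1, num_experts=4, n_group=1)

print(strict_check(**params))   # False: the check fails at the boundary
print(relaxed_check(**params))  # True: the boundary case is accepted
```

With top_k equal to the number of experts in the selected groups, every candidate expert is selected, which is why the issue argues that the boundary case should be allowed.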

Root Cause

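Failure 1: Flash Attention 4 (FA4) on SM100/SM110 (Blackwell) does not support (head_dim, head_dim_v) = (192, 192). The validation at vllm/vllm_flash_attn/cute/interface.py:114 accepts only two patterns:

Standard range: head_dim and head_dim_v between 8 and 128, each divisible by 8
Special DeepSeek shape: (192, 128) only (asymmetric Q/V)

Failure 2: FlashInfer's TRTLLM fused MoE kernel on SM100 enforces the strict check top_k < (topk_group * num_experts / n_group), and the test model's routing config hits the boundary exactly (4 == 4).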

Fix Action

Fix / Workaround

Failure 2 is reported as an upstream FlashInfer bug (off-by-one: the check should use <= rather than <). The suggested workaround is to force moe_backend="triton" for this model on SM100+, or to skip the test on Blackwell.
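
The "skip the test on Blackwell" option could be gated on the device's compute capability. The sketch below assumes that SM100 and SM110 correspond to compute-capability major versions 10 and 11; the helper name is_sm100_family is hypothetical and is not part of vLLM.

```python
def is_sm100_family(capability: tuple[int, int]) -> bool:
    """Hypothetical helper: True for SM100/SM110 devices.

    Assumes these architectures report compute-capability major
    versions 10 and 11 respectively.
    """
    major, _minor = capability
    return major in (10, 11)


# Usage sketch inside a test module (needs torch and pytest, which are
# deliberately not imported here so the helper stays dependency-free):
#
#   @pytest.mark.skipif(
#       torch.cuda.is_available()
#       and is_sm100_family(torch.cuda.get_device_capability()),
#       reason="TRTLLM fused MoE top_k check fails on SM100 (vllm-project/vllm#39839)",
#   )
#   def test_mtp_correctness(...): ...

print(is_sm100_family((10, 0)))  # True
print(is_sm100_family((9, 0)))   # False
```

Keeping the capability check in a pure function makes it testable on machines without a GPU; only the decorator touches torch.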

🐛 Describe the bug

Triggered by running the e2e spec decode unit tests on GB200, using the latest NVIDIA container with vLLM installed from the GitHub master branch.

extent analysis

TL;DR

The two failures need separate fixes. For Failure 1, adjust the test model's configuration to match the head dimensions that Flash Attention 4 (FA4) supports on SM100/SM110. For Failure 2, work around the FlashInfer check by forcing moe_backend to "triton" for the affected model.

Guidance

For Failure 1, consider changing the head_dim and head_dim_v of the eagle618/deepseek-v3-random model to match the supported patterns, such as (192, 128) for DeepSeek.
For Failure 2, a potential workaround is to force moe_backend="triton" for the ZixiQi/DeepSeek-V3-4layers-MTP-FP8 model on SM100+ to bypass the FlashInfer bug.
Verify the changes by re-running the e2e specdecode unit tests on gb200 with the updated model configurations or workarounds.
If the issue persists, consider skipping the test on Blackwell for the affected models as a temporary measure.

Example

The issue itself includes no code snippet. Both remedies are configuration adjustments or test-level workarounds rather than changes to vLLM source code.

Notes

The provided solutions are based on the information given in the issue and may not be applicable in all scenarios. The FlashInfer bug is noted as an upstream issue, and the suggested workaround may not be a permanent fix.

Recommendation

Apply the workaround by forcing moe_backend="triton" for the affected model on SM100+, as this is a more immediate solution to the FlashInfer bug, while also considering adjustments to the model configurations to match supported patterns for long-term compatibility.
