vllm - ✅(Solved) Fix [Performance]: Flashinfer TRTLLM MoE for Qwen3.5 [2 pull requests, 1 comment, 2 participants]

GitHub stats
vllm-project/vllm#36922 (fetched 2026-04-08 00:43:35)
Comments: 1, Participants: 2, Timeline: 10, Reactions: 0
Timeline (top): subscribed ×5, added_to_project_v2 ×1, commented ×1, labeled ×1

PR fix notes

PR #2640: fix: autotuner cache key mismatch for trtllm-gen FP8 block scale MoE and FP8 routed MoE

Description (problem / solution / changelog)


📌 Description

This PR:

  • fixes input shape mismatches so that the inputs seen at run time match the autotuner cache key for FP8 MoE
  • enables the autotuner for FP8 block scale routed MoE

Issue1: Could not find tuned tactic for trtllm_fp8_block_scale_moe

2026-02-26 09:26:35,204 - INFO - autotuner.py:444 - flashinfer.jit: [AutoTunner]: Using fallback tactic for flashinfer::trtllm_fp8_block_scale_moe with input shapes (torch.Size([1024, 4096]), torch.Size([1024, 512]), torch.Size([0]), torch.Size([0]), torch.Size([1024, 4096]), torch.Size([32, 1024]))

Tuned with incorrect input: op=flashinfer::trtllm_fp8_block_scale_moe, profile=((1024, 4096), (1024, 512), (1024,), (1024,), (1024, 4096), (1024, 16384)) -> runner_id=0, tactic=[64, 5]
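
The mismatch between the two log lines above becomes visible when the shape tuples are compared position by position. The following is a minimal sketch, assuming the autotuner looks up tactics by an exact tuple of input shapes; the `TUNED` dictionary, `FALLBACK` value, and both helpers are hypothetical stand-ins, not FlashInfer's implementation.

```python
# Shapes copied from the two log lines above (torch.Size([0]) written as (0,)).
RUNTIME_SHAPES = ((1024, 4096), (1024, 512), (0,), (0,), (1024, 4096), (32, 1024))
TUNED_PROFILE = ((1024, 4096), (1024, 512), (1024,), (1024,), (1024, 4096), (1024, 16384))

# Hypothetical cache: exact shape tuple -> (runner_id, tactic).
TUNED = {TUNED_PROFILE: (0, [64, 5])}
FALLBACK = (0, None)  # placeholder standing in for the "fallback tactic"


def lookup(shapes):
    """Return the tuned entry for an exact shape match, else the fallback."""
    return TUNED.get(shapes, FALLBACK)


def mismatched_positions(a, b):
    """Indices of the inputs whose shapes differ between two profiles."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


print(lookup(RUNTIME_SHAPES))                               # falls back
print(mismatched_positions(RUNTIME_SHAPES, TUNED_PROFILE))  # [2, 3, 5]
```

Under this reading, inputs 2, 3, and 5 were profiled with shapes that never occur at run time, so the tuned tactic is stored under a key that is never looked up.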

Issue2: Crash when autotuning trtllm_fp8_block_scale_routed_moe

  File "/flashinfer/flashinfer/fused_moe/core.py", line 2568, in trtllm_fp8_block_scale_routed_moe
    result = get_trtllm_moe_sm100_module().trtllm_fp8_block_scale_moe(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/flashinfer/flashinfer/fused_moe/core.py", line 1711, in trtllm_fp8_block_scale_moe_op
    _, tactic = tuner.choose_one(
                ^^^^^^^^^^^^^^^^^
  File "/flashinfer/flashinfer/autotuner.py", line 470, in choose_one
    tensors = self._prepare_input_tensors(p, inputs)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/flashinfer/flashinfer/autotuner.py", line 792, in _prepare_input_tensors
    tensor = self._create_tensor_like(
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/flashinfer/flashinfer/autotuner.py", line 771, in _create_tensor_like
    dtype = origin_tensor.dtype
            ^^^^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'dtype'
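
The traceback shows `_create_tensor_like` reading `.dtype` from an input that is `None`. One plausible reading is that an optional input is absent on the routed path. A hypothetical guard is sketched below with a stand-in tensor type so that it runs without PyTorch; `FakeTensor`, `create_like_or_none`, and `prepare_inputs` are illustrative names, and this is not the change made in the PR.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class FakeTensor:
    """Stand-in for a tensor: only shape and dtype matter for this sketch."""
    shape: Tuple[int, ...]
    dtype: str


def create_like_or_none(origin: Optional[FakeTensor],
                        shape: Tuple[int, ...]) -> Optional[FakeTensor]:
    """Build a placeholder with the origin's dtype; pass None through untouched."""
    if origin is None:
        return None
    return FakeTensor(shape=shape, dtype=origin.dtype)


def prepare_inputs(inputs: Sequence[Optional[FakeTensor]],
                   shapes: Sequence[Tuple[int, ...]]) -> List[Optional[FakeTensor]]:
    """Create one profiling placeholder per input, skipping absent inputs."""
    return [create_like_or_none(t, s) for t, s in zip(inputs, shapes)]


inputs = [FakeTensor((1024, 4096), "fp8"), None]
print(prepare_inputs(inputs, [(2048, 4096), (2048, 512)]))
```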

Benchmark:

Tokens | BF16 (ms) | BF16 TFLOPS | FP8 Untuned (ms) | FP8 Untuned TFLOPS | FP8 Tuned (ms) | FP8 Tuned TFLOPS | FP8 routed Untuned (ms) | FP8 routed Untuned TFLOPS | FP8 routed Tuned (ms) | FP8 routed Tuned TFLOPS
  1024 |     1.877 |      137.32 |            1.455 |             177.07 |          1.187 |           217.07 |                   1.337 |                    192.80 |                 1.514 |                  170.27
  2048 |     1.952 |      263.99 |            1.692 |             304.65 |          1.425 |           361.77 |                   1.548 |                    333.04 |                 1.662 |                  310.09
  4096 |     2.194 |      469.85 |            2.232 |             461.79 |          2.561 |           402.43 |                   2.087 |                    493.88 |                 1.887 |                  546.16
  8192 |     3.594 |      573.57 |            3.458 |             596.15 |          3.439 |           599.49 |                   3.355 |                    614.50 |                 3.582 |                  575.53
 16384 |     5.423 |      760.37 |            6.329 |             651.47 |          5.852 |           704.53 |                   6.026 |                    684.17 |                 5.670 |                  727.18
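
To read the benchmark above as tuned-versus-untuned speedups, the latencies can be divided directly. This is a small sketch using only the numbers from the table; the helper names are illustrative.

```python
# (untuned ms, tuned ms) per token count, copied from the benchmark table above.
FP8 = {1024: (1.455, 1.187), 2048: (1.692, 1.425), 4096: (2.232, 2.561),
       8192: (3.458, 3.439), 16384: (6.329, 5.852)}
FP8_ROUTED = {1024: (1.337, 1.514), 2048: (1.548, 1.662), 4096: (2.087, 1.887),
              8192: (3.355, 3.582), 16384: (6.026, 5.670)}


def speedups(table):
    """Untuned latency divided by tuned latency; above 1 means tuning helped."""
    return {tokens: untuned / tuned for tokens, (untuned, tuned) in table.items()}


def regressions(table):
    """Token counts where the tuned run was slower than the untuned one."""
    return [tokens for tokens, s in speedups(table).items() if s < 1.0]


for name, table in (("FP8", FP8), ("FP8 routed", FP8_ROUTED)):
    rounded = {t: round(s, 2) for t, s in speedups(table).items()}
    print(name, rounded, "slower when tuned at:", regressions(table))
```

On these figures the tuned run is not faster at every token count, so per-size results are worth checking rather than assuming a uniform gain; run-to-run noise may account for some of the differences.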


Summary by CodeRabbit

  • Bug Fixes
    • Corrected a typo in autotuner debug log messages.
  • Refactor
    • Consolidated MoE tuning configuration and input preparation into a centralized setup, simplifying FP8/FP4 paths, reducing duplication, and improving runtime/shape validation and configurability.
  • Tests
    • Added tests verifying autotuner cache-key behavior across quantization modes and multiple token-count scenarios.
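
The idea behind a centralized input setup can be sketched as follows: if tuning and run time both derive the cache key from one function, the keys cannot drift apart. Everything here (`profile_key`, `TinyTuner`, `check_cache_hits`, and the chosen shapes) is hypothetical and only illustrates the pattern; it is not the code or the tests added by the PR.

```python
def profile_key(num_tokens, hidden_size=4096, num_experts=512, quant="fp8_block_scale"):
    """Hypothetical single source of truth for the shapes that form the cache key."""
    return (
        quant,
        (num_tokens, hidden_size),   # hidden states
        (num_tokens, num_experts),   # routing logits
    )


class TinyTuner:
    """Toy autotuner cache keyed by the profile tuple."""

    def __init__(self):
        self.cache = {}

    def tune(self, key, tactic):
        self.cache[key] = tactic

    def choose(self, key):
        return self.cache.get(key)   # None stands for "fallback tactic"


def check_cache_hits(token_counts, quant_modes):
    """Tune every combination, then confirm each run-time lookup hits the cache."""
    tuner = TinyTuner()
    for q in quant_modes:
        for n in token_counts:
            tuner.tune(profile_key(n, quant=q), tactic=(q, n))
    return all(
        tuner.choose(profile_key(n, quant=q)) == (q, n)
        for q in quant_modes
        for n in token_counts
    )


print(check_cache_hits([1024, 2048, 4096], ["bf16", "fp8_block_scale"]))
```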

Changed files

  • flashinfer/autotuner.py (modified, +2/-2)
  • flashinfer/fused_moe/core.py (modified, +216/-104)
  • tests/moe/test_moe_autotuner_cache_keys.py (added, +149/-0)

PR #2594: Bf16 routed moe

Description (problem / solution / changelog)


📌 Description

Adds the trtllm_bf16_routed_moe API.


🧪 Tests

pytest tests/moe/test_trtllm_gen_routed_fused_moe.py::test_trtllm_gen_bf16_routed_fused_moe


Summary by CodeRabbit

  • New Features
    • Added support for pre-computed routing in MoE operations, enabling flexible routing input strategies.
    • New routed MoE APIs now available: BF16 and FP8 variants support pre-packed top-k routing information.
    • Introduced dual-path mechanism allowing MoE operations to accept either routing logits or pre-computed routing data.
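
The dual-path behaviour described in these bullets can be pictured as an entry point that takes either routing logits or routing that was computed earlier. The sketch below is an assumption about the general shape of such an interface; `resolve_routing`, its parameter names, and the toy `top1_router` are hypothetical and do not reflect the actual FlashInfer signatures or the packed top-k format.

```python
def resolve_routing(routing_logits=None, topk_ids=None, topk_weights=None, *, router=None):
    """Return (topk_ids, topk_weights) from exactly one of the two input paths."""
    has_logits = routing_logits is not None
    has_precomputed = topk_ids is not None and topk_weights is not None
    if has_logits == has_precomputed:
        raise ValueError(
            "pass either routing_logits or precomputed (topk_ids, topk_weights), not both or neither"
        )
    if has_precomputed:
        return topk_ids, topk_weights      # pre-computed routing: use as given
    if router is None:
        raise ValueError("a router is required when routing_logits are given")
    return router(routing_logits)          # routing computed from logits


def top1_router(logits):
    """Toy router: pick the highest-scoring expert per token with weight 1.0."""
    ids = [[max(range(len(row)), key=row.__getitem__)] for row in logits]
    weights = [[1.0] for _ in logits]
    return ids, weights


print(resolve_routing(routing_logits=[[0.1, 0.9], [0.7, 0.3]], router=top1_router))
print(resolve_routing(topk_ids=[[1], [0]], topk_weights=[[1.0], [1.0]]))
```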

Changed files

  • csrc/trtllm_fused_moe_kernel_launcher.cu (modified, +50/-17)
  • flashinfer/__init__.py (modified, +3/-0)
  • flashinfer/fused_moe/__init__.py (modified, +2/-0)
  • flashinfer/fused_moe/core.py (modified, +125/-8)
  • tests/moe/test_trtllm_gen_routed_fused_moe.py (modified, +147/-3)


Proposal to improve performance

I noticed the following performance issues with Qwen3.5 MoE configurations.

Benchmark on 1x NVIDIA Blackwell B200 with Qwen3.5 configuration:

  • num_experts = 512
  • topk = 10
  • intermediate_size = 1024
  • hidden_size = 4096
  • routing_method_type = RenormalizeNaive (Softmax -> TopK -> Renormalize)
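
The routing method named in the last bullet (Softmax -> TopK -> Renormalize) can be sketched in plain Python for a single token. This is a reading of the parenthetical only; tie-breaking and numerics in the actual kernel are not specified here, and `renormalize_naive` is an illustrative name.

```python
import math


def renormalize_naive(logits, top_k):
    """Softmax over all experts, keep the top_k, renormalize the kept weights."""
    m = max(logits)                               # subtract the max for stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    kept = sum(probs[i] for i in top)
    return [(i, probs[i] / kept) for i in top]    # (expert_id, weight) pairs


routing = renormalize_naive([1.0, 3.0, 2.0, 0.5], top_k=2)
print(routing)  # experts 1 and 2, with weights that sum to 1
```

With the configuration above, each token would select 10 of 512 experts this way.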
================================================================================================================================================================
BF16 vs FP8 vs NVFP4 Comparison
================================================================================================================================================================
  EP |     Tokens |  BF16 (ms) |  BF16 TFLOPS |   FP8 (ms) |   FP8 TFLOPS |   NVFP4 (ms) |   NVFP4 TFLOPS |       Best
----------------------------------------------------------------------------------------------------------------------------------------------------------------
   1 |       1024 |      1.961 |       131.41 |      1.349 |       191.06 |        0.631 |         408.45 | NVFP4 3.11x
   1 |       2048 |      2.152 |       239.53 |      1.596 |       323.02 |        0.737 |         698.90 | NVFP4 2.92x
   1 |       4096 |      2.375 |       434.05 |      2.083 |       494.87 |        0.907 |        1135.99 | NVFP4 2.62x
   1 |       8192 |      3.353 |       614.89 |      3.496 |       589.67 |        1.220 |        1690.18 | NVFP4 2.75x
   1 |      16384 |      5.470 |       753.85 |      6.223 |       662.56 |        1.857 |        2220.73 | NVFP4 2.95x
   2 |       1024 |      0.963 |       133.83 |      0.600 |       214.68 |        0.326 |         395.48 | NVFP4 2.95x
   2 |       2048 |      0.995 |       258.89 |      0.692 |       372.17 |        0.363 |         710.77 | NVFP4 2.75x
   2 |       4096 |      1.220 |       422.52 |      1.133 |       455.07 |        0.411 |        1255.06 | NVFP4 2.97x
   2 |       8192 |      1.711 |       602.35 |      1.807 |       570.29 |        0.662 |        1557.88 | NVFP4 2.59x
   2 |      16384 |      2.858 |       721.30 |      3.267 |       631.09 |        0.998 |        2066.61 | NVFP4 2.87x
   4 |       1024 |      0.510 |       126.32 |      0.354 |       181.88 |        0.221 |         291.26 | NVFP4 2.31x
   4 |       2048 |      0.535 |       240.97 |      0.412 |       312.72 |        0.230 |         561.35 | NVFP4 2.33x
   4 |       4096 |      0.639 |       403.22 |      0.653 |       394.37 |        0.243 |        1062.55 | NVFP4 2.64x
   4 |       8192 |      0.903 |       570.57 |      1.057 |       487.62 |        0.375 |        1374.77 | NVFP4 2.41x
   4 |      16384 |      1.509 |       683.32 |      1.950 |       528.67 |        0.576 |        1790.76 | NVFP4 2.62x
   8 |       1024 |      0.289 |       111.52 |      0.227 |       141.73 |        0.216 |         149.08 | NVFP4 1.34x
   8 |       2048 |      0.304 |       211.76 |      0.270 |       238.23 |        0.216 |         297.66 | NVFP4 1.41x
   8 |       4096 |      0.357 |       360.49 |      0.425 |       302.95 |        0.212 |         607.81 | NVFP4 1.69x
   8 |       8192 |      0.506 |       509.30 |      0.707 |       364.62 |        0.248 |        1037.70 | NVFP4 2.04x
   8 |      16384 |      0.846 |       609.23 |      1.310 |       393.48 |        0.393 |        1310.56 | NVFP4 2.15x
----------------------------------------------------------------------------------------------------------------------------------------------------------------
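
The TFLOPS and Best columns appear to follow from the configuration and the measured latency. The sketch below assumes a FLOP model of three GEMMs per selected expert (gate, up, down) at 2 FLOPs per multiply-add, divided by the EP degree; this model is inferred from the numbers and is not stated in the issue.

```python
HIDDEN, INTERMEDIATE, TOPK = 4096, 1024, 10   # from the configuration above


def moe_tflops(tokens, latency_ms, ep=1):
    """Assumed FLOP model: 3 GEMMs (gate, up, down), 2 FLOPs per multiply-add."""
    flops = 2 * tokens * TOPK * 3 * HIDDEN * INTERMEDIATE / ep
    return flops / (latency_ms * 1e-3) / 1e12


def best_speedup(bf16_ms, other_ms):
    """Speedup over BF16, as reported in the Best column."""
    return bf16_ms / other_ms


print(round(moe_tflops(1024, 1.961), 2))      # 131.41, matches EP=1, 1024 tokens, BF16
print(round(best_speedup(1.961, 0.631), 2))   # 3.11, matches "NVFP4 3.11x"
```

Other rows agree to within the rounding of the reported latencies, which suggests the table counts per-GPU work when EP is greater than 1.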

CC @vadiklyutiy


extent analysis

Fix Plan

To address the performance gaps reported for Qwen3.5 MoE configurations, the plan focuses on enabling autotuning for the FlashInfer TRTLLM routed MoE FP4 and FP8 paths, and on making sure tuned results are actually found in the autotuner cache at run time.

Step 1: Enable Autotuning for FlashInfer TRTLLM Routed MoE FP4

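One option is to turn autotuning on in the vLLM layer that wraps the FlashInfer TRTLLM MoE kernels. The snippet is an illustrative sketch: the file path, class name, and autotuning_enabled attribute are placeholders and are not confirmed against the vLLM source.

# vllm/model_executor/layers/fused_moe/trtllm_moe.py  (path not verified)
import torch.nn as nn

class TRTLLMMoE(nn.Module):  # hypothetical class name
    def __init__(self, *args, **kwargs):
        super().__init__()
        ...  # existing layer setup goes here
        self.autotuning_enabled = True  # hypothetical flag: enable autotuning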

Step 2: Enable Autotuning for FlashInfer TRTLLM Routed MoE FP8

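Pick up FlashInfer PR #2640 (https://github.com/flashinfer-ai/flashinfer/pull/2640), which fixes the autotuner cache-key mismatch for FP8 block scale MoE and enables the autotuner for FP8 block scale routed MoE.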

Step 3: Improve Caching of Tuning Results

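Tuned tactics only help if the run-time lookup uses the same key that was stored during tuning. Judging by the files changed in PR #2640, this logic lives in flashinfer/autotuner.py and flashinfer/fused_moe/core.py rather than in a model layer. The sketch below only illustrates the idea; the class and method names are hypothetical.

# Illustrative only; not FlashInfer or vLLM code.
import torch.nn as nn

class TRTLLMMoE(nn.Module):  # hypothetical class name
    def __init__(self, num_experts):
        super().__init__()
        self.num_experts = num_experts
        self.cache = {}  # (input_shape, num_experts) -> tuning result

    def tuned_tactic(self, x, tune_fn):
        # Build the key the same way for storing and for lookup
        key = (tuple(x.shape), self.num_experts)
        if key not in self.cache:
            self.cache[key] = tune_fn(x)  # cache tuning results
        return self.cache[key]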

Step 4: Use MXFP8 MoE for Qwen 3.5

Consider evaluating MXFP8 MoE for Qwen3.5 configurations. The benchmarks in this issue cover only BF16, FP8, and NVFP4, so any MXFP8 gain would need to be measured.

Verification

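Verify the fixes by rerunning the benchmark with the Qwen3.5 configuration from the issue. The command below is a placeholder: benchmark.py and its flags are not a confirmed script, so substitute the benchmark actually used.

python benchmark.py --config qwen3.5 --num_experts 512 --topk 10 --intermediate_size 1024 --hidden_size 4096 --routing_method_type RenormalizeNaive

Compare latency (ms) and TFLOPS per token count against the tables above, and check that the log no longer reports "Using fallback tactic" for the MoE ops.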
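
For the measurement itself, a minimal timing harness might look like the sketch below. It assumes wall-clock timing with a median over repeated runs; `median_latency_ms` and its parameters are illustrative. CUDA kernels launch asynchronously, so for GPU work a synchronization callable such as `torch.cuda.synchronize` should be passed as `sync`, otherwise the timer mostly measures launch overhead.

```python
import statistics
import time


def median_latency_ms(fn, warmup=3, iters=20, sync=None):
    """Median wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):          # warm-up runs are not timed
        fn()
    samples = []
    for _ in range(iters):
        if sync is not None:
            sync()                   # drain pending work before starting the timer
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()                   # wait for the timed work to finish
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


# CPU-only usage example; replace the lambda with the MoE call under test.
print(median_latency_ms(lambda: sum(range(10_000))))
```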

Extra Tips

  • Monitor the caching mechanism to ensure it is working correctly and not causing performance regressions.
  • Consider implementing a fallback strategy to handle cases where autotuning is disabled or caching fails.
  • Track upcoming FlashInfer and vLLM releases and PRs to keep the two compatible and to pick up further performance work.
