vllm - 💡(How to fix) Fix [Feature]: Replace vanilla MaxSim with flash-maxsim for late-interaction scoring [4 comments, 3 participants]

GitHub stats
vllm-project/vllm#38282 · Fetched 2026-04-08 01:36:49
Comments: 4 · Participants: 3 · Timeline events: 11 · Reactions: 0
Timeline (top): commented ×4, assigned ×2, mentioned ×2, subscribed ×2

🚀 The feature, motivation and pitch

Replace vanilla MaxSim with flash-maxsim for late-interaction scoring

The current compute_maxsim_scores in vllm/v1/pool/late_interaction.py has three performance bottlenecks:

  1. Python for-loop padding: copies embeddings one by one into padded tensors (~84% of scoring time)
  2. Full similarity matrix in HBM: materializes a [batch, Lq, Ld] tensor (peaks at 549 MB, OOMs at scale)
  3. Serial mini-batching: the max_score_matrix_elements=64M cap forces up to 157 sequential batches

Proposal: Replace with flash-maxsim, a fused Triton kernel that computes MaxSim via IO-aware tiling in SRAM. Never materializes the similarity matrix — O(1) memory. No max_score_matrix_elements cap needed — handles arbitrarily large batches without OOM or mini-batching.
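The tiling idea can be illustrated in plain PyTorch. This is a minimal sketch, not the flash-maxsim Triton kernel: the function names, the `tile` parameter, and the single-pair [Lq, D] / [Ld, D] layout are all assumptions made for illustration. It shows why a running max over document tiles gives the same score without ever holding the full [Lq, Ld] matrix.

```python
import torch


def maxsim_reference(q: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """Vanilla MaxSim for one pair: builds the full [Lq, Ld] matrix."""
    sim = q @ d.T                        # [Lq, Ld]
    return sim.max(dim=1).values.sum()   # best doc token per query token, summed


def maxsim_tiled(q: torch.Tensor, d: torch.Tensor, tile: int = 128) -> torch.Tensor:
    """Same score, but only one [Lq, <=tile] block exists at a time.

    Hypothetical helper: a fused kernel would keep each block in on-chip
    memory; here the loop only demonstrates the running-max arithmetic.
    """
    running_max = torch.full(
        (q.shape[0],), float("-inf"), dtype=q.dtype, device=q.device
    )
    for start in range(0, d.shape[0], tile):
        block = q @ d[start:start + tile].T                        # [Lq, <=tile]
        running_max = torch.maximum(running_max, block.max(dim=1).values)
    return running_max.sum()
```

Because max is associative, the per-tile maxima combine into exactly the row-wise maximum of the full matrix, so the two functions agree up to floating-point rounding.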

Benchmarks (exact compute_maxsim_scores function, A100 40GB)

| Config | Vanilla | Flash-MaxSim | Speedup |
|---|---|---|---|
| Serving pairs N=64 | 0.377 ms | 0.045 ms | 8.3× |
| Rerank B=1,000 | 4.82 ms | 0.22 ms | 22× |
| Rerank B=10,000 | 48.1 ms | 2.22 ms | 22× |

| Memory | Vanilla | Flash-MaxSim |
|---|---|---|
| B=1K, Lq=32, Ld=300 | 36.6 MB | 0 MB |
| B=10K, Lq=1024, Ld=1024 | 39.1 GB → OOM | 0 MB |
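The vanilla memory figures and the 157-batch count can be reproduced with simple arithmetic. The sketch below rests on three assumptions that the issue does not state explicitly: the similarity matrix is stored in FP32 (4 bytes per element), "MB"/"GB" mean MiB/GiB, and "64M" means 64 × 2^20 elements. The helper names are hypothetical and do not mirror vLLM's code.

```python
import math

BYTES_FP32 = 4            # assumption: similarity matrix held in FP32
CAP = 64 * 1024 * 1024    # assumption: "64M" = 64 * 2**20 matrix elements


def sim_matrix_bytes(batch: int, lq: int, ld: int, elem_bytes: int = BYTES_FP32) -> int:
    """Bytes needed to materialize a [batch, Lq, Ld] similarity tensor."""
    return batch * lq * ld * elem_bytes


def num_mini_batches(batch: int, lq: int, ld: int, cap: int = CAP) -> int:
    """Sequential batches needed if each batch may hold at most `cap` elements."""
    docs_per_batch = max(1, cap // (lq * ld))
    return math.ceil(batch / docs_per_batch)


small_mib = sim_matrix_bytes(1_000, 32, 300) / 2**20       # about 36.6 MiB
large_gib = sim_matrix_bytes(10_000, 1024, 1024) / 2**30   # about 39.1 GiB
batches = num_mini_batches(10_000, 1024, 1024)             # 157
```

Under these assumptions the B=10K, Lq=1024, Ld=1024 case fits 64 documents per mini-batch, which yields the 157 sequential batches quoted in the issue.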

Numerical precision: flash-maxsim uses FP32 accumulation → 1000× more precise than the current FP16 bmm path.

Integration approach

The kernel source can be vendored directly into vLLM (no external pip dependency required). The integration replaces the inner scoring loop in compute_maxsim_scores — the rest of the late-interaction pipeline (query caching, doc scheduling, result aggregation) stays unchanged.

  • Apache 2.0 | GitHub
  • Supports: variable-length docs, query chunking, INT8, varlen packed sequences
  • Tested on H100, A100, V100

Happy to submit a PR with the kernel source integrated.

Alternatives

  1. Optimize the current Python for-loop with torch.nn.utils.rnn.pad_sequence — fixes the padding bottleneck but still materializes the full similarity matrix and needs the 64M cap.
  2. Use torch.compile on the existing bmm path — may help with fusion but doesn't eliminate the similarity matrix or the serial mini-batching.
  3. flash-maxsim (this proposal) — eliminates all three bottlenecks in one fused kernel.
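For comparison, alternative 1 might look like the following. This is a hedged sketch, assuming one query of shape [Lq, D] scored against a list of variable-length documents; `maxsim_padded` is a hypothetical name, not the vLLM function. It removes the Python padding loop but, as noted above, still allocates the full [B, Lq, Ld_max] tensor.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def maxsim_padded(q: torch.Tensor, docs: list[torch.Tensor]) -> torch.Tensor:
    """Score one query [Lq, D] against docs of shape [Ld_i, D]; returns [B]."""
    padded = pad_sequence(docs, batch_first=True)                # [B, Ld_max, D]
    lengths = torch.tensor([d.shape[0] for d in docs], device=padded.device)
    positions = torch.arange(padded.shape[1], device=padded.device)
    valid = positions[None, :] < lengths[:, None]                # [B, Ld_max]

    # The full similarity tensor is still materialized here.
    sim = torch.einsum("qd,bld->bql", q, padded)                 # [B, Lq, Ld_max]

    # Padding rows are zeros, which could beat negative similarities,
    # so mask them out before taking the max.
    sim = sim.masked_fill(~valid[:, None, :], float("-inf"))
    return sim.max(dim=2).values.sum(dim=1)
```

The mask matters for correctness: without it, a zero-padded token would win the max whenever every real document token has negative similarity to a query token.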

Additional context

Related to PR #35330 which introduced late-interaction scoring in vLLM v1. The current implementation works correctly but the scoring function becomes a bottleneck at scale (B>1000 candidates, visual queries with Lq>256).

flash-maxsim was validated against the exact vLLM function across 21 configurations — correctness verified, flash wins every config with no regressions.


extent analysis

Fix Plan

To replace the vanilla MaxSim with flash-maxsim for late-interaction scoring, follow these steps:

  • Integrate the flash-maxsim kernel source into vLLM by vendoring it directly.
  • Replace the inner scoring loop in compute_maxsim_scores with the flash-maxsim kernel.
  • Ensure the rest of the late-interaction pipeline remains unchanged.

Code Changes

Here's an example of how to integrate flash-maxsim:

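import torch
# Assumption: the package exposes a `maxsim` entry point under this name.
# Check the flash-maxsim README for the actual import path and signature.
from flash_maxsim import maxsim

def compute_maxsim_scores(embeddings, queries):
    # Replace the inner scoring loop (padding + bmm) with the fused kernel.
    # Argument order and tensor layout are illustrative, not confirmed.
    scores = maxsim(embeddings, queries)
    return scores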

Note: The maxsim function from flash-maxsim should be used to compute the MaxSim scores, eliminating the need for the Python for-loop and the full similarity matrix.

Verification

To verify the fix, run the benchmarks provided in the issue body and compare the results with the vanilla MaxSim implementation. The flash-maxsim implementation should show significant speedup and memory reduction.
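A correctness check could be structured along these lines. This is a sketch under stated assumptions: `reference_maxsim` and `check_scorer` are hypothetical helpers, the candidate is any callable taking a [Lq, D] query and a [Ld, D] document, and the tolerances are placeholders to tune for the dtype under test. It does not reproduce the 21 configurations mentioned in the issue.

```python
import torch


def reference_maxsim(q: torch.Tensor, d: torch.Tensor) -> torch.Tensor:
    """FP32 reference: sum over query tokens of the best doc-token similarity."""
    return (q.float() @ d.float().T).max(dim=1).values.sum()


def check_scorer(candidate, configs, dim=128, seed=0, rtol=1e-4, atol=1e-3):
    """Compare `candidate(q, d)` with the reference on random (Lq, Ld) shapes.

    Raises AssertionError on the first mismatch; returns the number of
    configurations checked otherwise.
    """
    gen = torch.Generator().manual_seed(seed)
    for lq, ld in configs:
        q = torch.randn(lq, dim, generator=gen)
        d = torch.randn(ld, dim, generator=gen)
        torch.testing.assert_close(
            candidate(q, d), reference_maxsim(q, d), rtol=rtol, atol=atol
        )
    return len(configs)
```

A call such as `check_scorer(my_scorer, [(32, 300), (1024, 1024)])` would then exercise both a text-sized and a visual-query-sized shape before any timing runs.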

Extra Tips

  • Ensure the flash-maxsim kernel is properly compiled and integrated into vLLM.
  • Test the implementation across various configurations to verify correctness and performance.
  • Consider submitting a PR with the kernel source integrated for further review and validation.
