vllm - 💡(How to fix) Fix [Feature]: Replace vanilla MaxSim with flash-maxsim for late-interaction scoring [4 comments, 3 participants]

vllm | 2026-03-26 20:33:11

vllm-project/vllm#38282•Fetched 2026-04-08 01:36:49

🚀 The feature, motivation and pitch

Replace vanilla MaxSim with flash-maxsim for late-interaction scoring

The current compute_maxsim_scores in vllm/v1/pool/late_interaction.py has three performance bottlenecks:

Python for-loop padding — copies embeddings one-by-one into padded tensors (~84% of scoring time)
Full similarity matrix in HBM — materializes [batch, Lq, Ld] tensor (peaks at 549 MB, OOMs at scale)
Serial mini-batching — max_score_matrix_elements=64M cap forces up to 157 sequential batches
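The first two bottlenecks can be seen in a minimal PyTorch sketch of padded MaxSim. This is not the vLLM implementation: the function name `maxsim_padded` and its single-query signature are hypothetical, chosen only to show where the per-document copy loop and the full similarity tensor appear.

```python
import torch


def maxsim_padded(query: torch.Tensor, docs: list[torch.Tensor]) -> torch.Tensor:
    """Score one query [Lq, D] against variable-length docs [Ld_i, D].

    Illustrative only; not the code in vllm/v1/pool/late_interaction.py.
    """
    max_len = max(d.shape[0] for d in docs)
    dim = query.shape[1]
    padded = torch.zeros(len(docs), max_len, dim, dtype=query.dtype)
    mask = torch.zeros(len(docs), max_len, dtype=torch.bool)

    # Bottleneck 1: one Python-level copy per document.
    for i, d in enumerate(docs):
        padded[i, : d.shape[0]] = d
        mask[i, : d.shape[0]] = True

    # Bottleneck 2: the full [B, Lq, Ld] similarity tensor is materialized.
    sim = torch.einsum("qd,bld->bql", query, padded)

    # Padded document positions must never win the max.
    sim = sim.masked_fill(~mask[:, None, :], float("-inf"))

    # MaxSim: best document token per query token, summed over query tokens.
    return sim.max(dim=2).values.sum(dim=1)
```

The third bottleneck arises when a caller splits `docs` into chunks so that `B * Lq * Ld` stays under a fixed element cap and then calls a function like this once per chunk.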

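Proposal: Replace it with flash-maxsim, a fused Triton kernel that computes MaxSim via IO-aware tiling in SRAM. The kernel never materializes the similarity matrix, so the extra memory needed for similarity scores is O(1) instead of growing with batch × Lq × Ld. The max_score_matrix_elements cap is no longer needed: per the proposal, the kernel handles arbitrarily large batches without OOM or mini-batching.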
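The tiling idea can be illustrated in plain PyTorch: keep a running maximum per query token and visit the document one tile at a time, so only an [Lq, tile] slab of similarities exists at once. This is a sketch of the general technique, not flash-maxsim's Triton kernel; `maxsim_tiled` and the `tile` parameter are hypothetical names.

```python
import torch


def maxsim_tiled(query: torch.Tensor, doc: torch.Tensor, tile: int = 128) -> torch.Tensor:
    """MaxSim for one query [Lq, D] and one doc [Ld, D], one doc tile at a time."""
    # Running best similarity for each query token, kept in FP32.
    running_max = torch.full((query.shape[0],), float("-inf"), dtype=torch.float32)
    q = query.float()

    for start in range(0, doc.shape[0], tile):
        block = doc[start : start + tile].float()
        # Only an [Lq, tile] slab of similarities is alive here.
        sim = q @ block.T
        running_max = torch.maximum(running_max, sim.max(dim=1).values)

    # Sum the per-query-token maxima to get the MaxSim score.
    return running_max.sum()
```

A fused kernel would hold the slab and the running maximum in on-chip memory; the Python loop here only shows why the full matrix is never required.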

Benchmarks (exact `compute_maxsim_scores` function, A100 40GB)

Config	Vanilla	Flash-MaxSim	Speedup
Serving pairs N=64	0.377 ms	0.045 ms	8.3×
Rerank B=1,000	4.82 ms	0.22 ms	22×
Rerank B=10,000	48.1 ms	2.22 ms	22×

Memory	Vanilla	Flash-MaxSim
B=1K, Lq=32, Ld=300	36.6 MB	0 MB
B=10K, Lq=1024, Ld=1024	39.1 GB → OOM	0 MB ✓
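The vanilla memory figures can be reproduced with simple arithmetic. The sketch below assumes 4 bytes per similarity element and binary units (MiB, GiB), and assumes the 64M cap means 64 × 2^20 elements. These assumptions happen to match the reported numbers, but the issue does not state them, and the batching rule shown is a guess rather than vLLM's code.

```python
import math


def sim_matrix_bytes(batch: int, lq: int, ld: int, bytes_per_elem: int = 4) -> int:
    """Size of a dense [batch, Lq, Ld] similarity tensor in bytes."""
    return batch * lq * ld * bytes_per_elem


def num_minibatches(batch: int, lq: int, ld: int, cap: int = 64 * 1024 * 1024) -> int:
    """Sequential batches needed if each batch may hold at most `cap` elements."""
    docs_per_batch = max(1, cap // (lq * ld))
    return math.ceil(batch / docs_per_batch)


small_mib = sim_matrix_bytes(1_000, 32, 300) / 2**20
large_gib = sim_matrix_bytes(10_000, 1024, 1024) / 2**30
batches = num_minibatches(10_000, 1024, 1024)
print(f"{small_mib:.1f} MiB, {large_gib:.1f} GiB, {batches} mini-batches")
# prints "36.6 MiB, 39.1 GiB, 157 mini-batches"
```

Under these assumptions the B=10K, Lq=1024, Ld=1024 configuration also yields the 157 sequential batches mentioned among the bottlenecks.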

Numerical precision: flash-maxsim accumulates in FP32, which the issue reports as roughly 1000× more precise than the current FP16 bmm path.

Integration approach

The kernel source can be vendored directly into vLLM (no external pip dependency required). The integration replaces the inner scoring loop in compute_maxsim_scores — the rest of the late-interaction pipeline (query caching, doc scheduling, result aggregation) stays unchanged.

License: Apache 2.0 (source on GitHub)
Supports: variable-length docs, query chunking, INT8, varlen packed sequences
Tested on H100, A100, V100

Happy to submit a PR with the kernel source integrated.

Alternatives

Optimize the current Python for-loop with torch.nn.utils.rnn.pad_sequence — fixes the padding bottleneck but still materializes the full similarity matrix and needs the 64M cap.
Use torch.compile on the existing bmm path — may help with fusion but doesn't eliminate the similarity matrix or the serial mini-batching.
flash-maxsim (this proposal) — eliminates all three bottlenecks in one fused kernel.
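For comparison, the first alternative could look like the sketch below, which swaps the per-document copy loop for `torch.nn.utils.rnn.pad_sequence` and builds the validity mask with a broadcast comparison. The helper name `pad_docs` is hypothetical. The padded output still feeds a dense similarity computation, so only the padding bottleneck goes away.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def pad_docs(docs: list[torch.Tensor]) -> tuple[torch.Tensor, torch.Tensor]:
    """Pad variable-length docs [Ld_i, D] to [B, max_Ld, D] without a Python copy loop."""
    padded = pad_sequence(docs, batch_first=True)  # zero-padded on the right
    lengths = torch.tensor([d.shape[0] for d in docs])
    # mask[b, l] is True where position l is a real token of document b.
    mask = torch.arange(padded.shape[1])[None, :] < lengths[:, None]
    return padded, mask
```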

Additional context

Related to PR #35330 which introduced late-interaction scoring in vLLM v1. The current implementation works correctly but the scoring function becomes a bottleneck at scale (B>1000 candidates, visual queries with Lq>256).

flash-maxsim was validated against the exact vLLM function across 21 configurations — correctness verified, flash wins every config with no regressions.

extent analysis

Fix Plan

To replace the vanilla MaxSim with flash-maxsim for late-interaction scoring, follow these steps:

Integrate the flash-maxsim kernel source into vLLM by vendoring it directly.
Replace the inner scoring loop in compute_maxsim_scores with the flash-maxsim kernel.
Ensure the rest of the late-interaction pipeline remains unchanged.

Code Changes

Here's an example of how to integrate flash-maxsim:

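import torch
# Assumed import path and function name; check the flash-maxsim package's actual API.
from flash_maxsim import maxsim

def compute_maxsim_scores(embeddings, queries):
    # Replace the inner scoring loop with the fused flash-maxsim kernel.
    # The argument order passed to maxsim is unverified against the library's signature.
    scores = maxsim(embeddings, queries)
    return scores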

Note: The maxsim function from flash-maxsim should be used to compute the MaxSim scores, eliminating the need for the Python for-loop and the full similarity matrix.

Verification

To verify the fix, run the benchmarks provided in the issue body and compare the results with the vanilla MaxSim implementation. The flash-maxsim implementation should show significant speedup and memory reduction.
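A correctness check can accompany the benchmarks. The sketch below compares any candidate scoring function against a direct per-pair MaxSim reference on random inputs. The names `maxsim_reference` and `check_against_reference`, the embedding width of 128, and the tolerance are all assumptions; the random cases are not the 21 configurations mentioned in the issue.

```python
import torch


def maxsim_reference(query: torch.Tensor, doc: torch.Tensor) -> torch.Tensor:
    """Direct FP32 MaxSim for one query [Lq, D] and one doc [Ld, D]."""
    return (query.float() @ doc.float().T).max(dim=1).values.sum()


def check_against_reference(candidate, n_cases: int = 10, atol: float = 1e-3, seed: int = 0) -> bool:
    """Return True if `candidate(query, doc)` matches the reference on random cases."""
    gen = torch.Generator().manual_seed(seed)
    for _ in range(n_cases):
        lq = int(torch.randint(1, 64, (1,), generator=gen))
        ld = int(torch.randint(1, 256, (1,), generator=gen))
        q = torch.randn(lq, 128, generator=gen)
        d = torch.randn(ld, 128, generator=gen)
        if not torch.allclose(candidate(q, d), maxsim_reference(q, d), atol=atol):
            return False
    return True
```

Passing a wrapper around the new kernel as `candidate` gives a quick pass/fail signal before running the timing and memory benchmarks.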

Extra Tips

Triton kernels are JIT-compiled on first call, so run a warm-up call and confirm that the vendored flash-maxsim kernel compiles on each target GPU before taking measurements.
Test the implementation across various configurations to verify correctness and performance.
Consider submitting a PR with the kernel source integrated for further review and validation.
