vllm - ✅(Solved) Fix BGE-M3 /pooling endpoint crashes with split_with_sizes error after ~50-100 requests [2 pull requests, 5 comments, 4 participants]

GitHub stats
vllm-project/vllm#37648 (fetched 2026-04-08 01:04:24)
Comments: 5 · Participants: 4 · Timeline events: 12 · Reactions: 0
Timeline (top): commented ×5, cross-referenced ×2, subscribed ×2, closed ×1

The /pooling endpoint for BGE-M3 (BgeM3EmbeddingModel) crashes after approximately 50-100 sequential requests. Both embed (dense) and token_classify (sparse) tasks are affected. The error kills the engine entirely (EngineDeadError), requiring a full container restart.

Single requests work fine. The crash occurs only after accumulating multiple requests.

Error Message

(EngineCore_DP0 pid=405)   File ".../vllm/model_executor/layers/pooler/tokwise/methods.py", line 50, in forward
(EngineCore_DP0 pid=405)     hidden_states_all = hidden_states.split(
(EngineCore_DP0 pid=405)                         ^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=405) RuntimeError: split_with_sizes expects split_sizes to sum exactly to 4192 (input tensor's size at dimension 0), but got split_sizes=[74, 43, 230, 510, 646, 348, 205, 77]
(APIServer pid=1) ERROR [async_llm.py:708] AsyncLLM output_handler failed.
(APIServer pid=1) ERROR [async_llm.py:708] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue.

Root Cause

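PR #37884, which is marked as fixing this issue, traces the crash to replace_roberta_positions(), which applied an in-place position_ids += padding_idx + 1 to the persistent GPU positions buffer. The model runner refreshes only the first num_scheduled_tokens entries on each request, so the CUDA-graph padding slots keep their stale values and grow by padding_idx + 1 per request until they exceed max_position_embeddings.

The PRs describe the resulting failure as an out-of-bounds position embedding index. This issue reported a split_with_sizes mismatch followed by EngineDeadError.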

Fix Action

Workaround

Restart the vLLM container after every ~50 requests. Deterministic point IDs allow resumable indexing.

PR fix notes

PR #37873: [Bugfix] RoBERTa position_id accumulation in CUDA graph padding region

Description (problem / solution / changelog)

Fix a crash that affected all RoBERTa-based embedding models (BAAI/bge-m3, XLM-RoBERTa, stsb-roberta, bge-reranker-v2-m3) when CUDA graphs are enabled. After approximately max_position_embeddings / 2 requests the server crashes with: Assertion 'index out of bounds: 0 <= tmp25 < 8194' failed.

  • gpu_model_runner keeps a persistent GPU buffer self.positions that is reused across every request. Each request refreshes only the first num_scheduled_tokens entries via copy_to_gpu; the remaining padding slots [num_scheduled_tokens : num_input_tokens_padded] are not reset.
  • RoBERTa models call replace_roberta_positions outside the CUDA graph (before BertModel.forward), which does an in-place position_ids += padding_idx + 1 on the full padded tensor, including the stale padding slots. Because those slots are never reset by copy_to_gpu, each request adds another +(padding_idx + 1) to them.

For BAAI/bge-m3 (max_position_embeddings = 8194, padding_idx = 1, offset = 2) with short sentences (6 tokens) padded to 8:

padding slot value after K requests  =  V_init + 2K
overflow when                         >= 8194
K                                     = (8194 - V_init) / 2  ≈  3999
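
The overflow arithmetic above can be checked with a short sketch. The function name is hypothetical, and V_init = 0 is an assumption for illustration only; the PR does not state the initial value of the padding slots.

```python
def requests_until_overflow(v_init: int, offset: int, limit: int) -> int:
    """Smallest K such that v_init + offset * K >= limit."""
    if offset <= 0:
        raise ValueError("offset must be positive")
    remaining = limit - v_init
    if remaining <= 0:
        return 0
    # Integer ceiling division avoids floating-point rounding.
    return -(-remaining // offset)


# Values quoted in the PR for BAAI/bge-m3.
padding_idx = 1
offset = padding_idx + 1          # 2
max_position_embeddings = 8194

# V_init = 0 is an assumed starting value, not taken from the PR.
k = requests_until_overflow(v_init=0, offset=offset, limit=max_position_embeddings)
print(k)
```

With a different V_init the count shifts accordingly, which is why the PR gives only an approximate figure.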

Fixes #37868

Related #37648 #37868

Purpose

Test Plan

Reproduce the bug

CUDA_LAUNCH_BLOCKING=1 vllm serve BAAI/bge-m3 --port 9001 \
  --hf-overrides '{"architectures": ["BgeM3EmbeddingModel"]}' \
  --runner pooling

# Send 10000 sequential requests
for i in $(seq 1 10000); do
  response=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:9001/pooling \
    -H "Content-Type: application/json" \
    -d '{"model": "BAAI/bge-m3", "task": "embed",
         "input": ["test sentence number '$i'"]}')
  if [ "$response" != "200" ]; then
    echo "FAILED at index $i (HTTP $response)"; break
  fi
done

Existing tests

pytest tests/models/language/pooling/test_bge_m3.py -v -s
pytest tests/models/language/pooling/test_embedding.py -v -s -k "stsb-roberta"
pytest tests/models/language/pooling/test_multi_vector_retrieval.py -v -s
pytest tests/models/language/pooling/test_scoring.py -v -s

Test Result

Tested on RTX 5090 (Blackwell, sm_120), vLLM 0.18.1rc1.dev38+ga16133a0f:

| Test | Before fix | After fix |
| --- | --- | --- |
| 10000 sequential BGE-M3 /pooling requests | Crash at index ~3999 | All pass |
| test_bge_m3.py (dense / sparse / ColBERT scores) | Pass | Pass |
| test_embedding.py[stsb-roberta-base-v2] | Pass | Pass |
| test_multi_vector_retrieval.py | Pass | Pass |
| test_scoring.py (bge-reranker-v2-m3) | Pass | Pass |


Changed files

  • vllm/v1/worker/gpu_model_runner.py (modified, +2/-0)

PR #37884: [Bugfix] Fix RoBERTa position_ids accumulation on CUDA graph padding

Description (problem / solution / changelog)

Purpose

Fix a crash in all RoBERTa-based pooling/embedding models (BGE-M3, XLM-RoBERTa, stsb-roberta, bge-reranker-v2-m3) when CUDA graphs are enabled. After ~4000 sequential requests the server dies with an out-of-bounds position embedding index.

Root cause: replace_roberta_positions() did an in-place position_ids += padding_idx + 1 on the persistent GPU positions buffer. The model runner refreshes only the first num_scheduled_tokens entries via copy_to_gpu each request — the remaining CUDA-graph padding slots keep their stale values. Every request adds another +(padding_idx + 1) to those slots, and eventually the values exceed max_position_embeddings.

For BAAI/bge-m3 (padding_idx=1, offset=2) with short inputs padded to 8 tokens, overflow happens after (8194 - V_init) / 2 ≈ 4000 requests.

Fix: move the padding_idx + 1 offset into RobertaEmbedding.forward as a non-in-place add (position_ids + offset instead of position_ids += offset). This computes the correct positions each call without mutating the persistent buffer. Also fix the same in-place += pattern in the transformers LegacyMixin.
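
The difference between the in-place and non-in-place add can be illustrated with a small simulation. This is a plain-Python sketch, not vLLM code: the persistent buffer is modeled as a list, the function name is hypothetical, and the buffer is assumed to start at zero.

```python
from typing import Optional

LIMIT = 8194        # max_position_embeddings quoted for BAAI/bge-m3
OFFSET = 2          # padding_idx + 1
PADDED_LEN = 8      # tokens after CUDA-graph padding
SCHEDULED = 6       # real tokens refreshed on every request


def first_overflow(in_place: bool, max_requests: int = 10_000) -> Optional[int]:
    """Return the 1-based request at which a position reaches LIMIT, or None."""
    buffer = [0] * PADDED_LEN  # persistent buffer (assumed to start at zero)
    for request in range(1, max_requests + 1):
        # Only the scheduled prefix is refreshed; padding slots keep old values.
        buffer[:SCHEDULED] = range(SCHEDULED)
        if in_place:
            # Pre-fix pattern: position_ids += offset mutates the shared buffer.
            for i in range(PADDED_LEN):
                buffer[i] += OFFSET
            positions = buffer
        else:
            # Post-fix pattern: position_ids + offset builds a new sequence.
            positions = [p + OFFSET for p in buffer]
        if max(positions) >= LIMIT:
            return request
    return None


print(first_overflow(in_place=True))
print(first_overflow(in_place=False))
```

In the in-place variant the two padding slots gain OFFSET on every request, while the non-mutating variant leaves the buffer untouched, so the padding positions never grow.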

Fixes #37648 Fixes #37868

Related: #37873 (alternative fix that zeroes the padding region in _preprocess)

Test Plan

Reproduce the bug (before fix)

vllm serve BAAI/bge-m3 --port 9001 \
  --hf-overrides '{"architectures": ["BgeM3EmbeddingModel"]}'

for i in $(seq 1 5000); do
  resp=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:9001/pooling \
    -H "Content-Type: application/json" \
    -d '{"model": "BAAI/bge-m3", "task": "embed", "input": ["test '$i'"]}')
  [ "$resp" != "200" ] && echo "FAIL at $i (HTTP $resp)" && break
done

Existing tests

pytest tests/models/language/pooling/test_bge_m3.py -v -s
pytest tests/models/language/pooling/test_embedding.py -v -s -k "stsb-roberta"
pytest tests/models/language/pooling/test_scoring.py -v -s


Changed files

  • vllm/model_executor/models/roberta.py (modified, +8/-24)
  • vllm/model_executor/models/transformers/legacy.py (modified, +4/-2)

Description

The /pooling endpoint for BGE-M3 (BgeM3EmbeddingModel) crashes after approximately 50-100 sequential requests. Both embed (dense) and token_classify (sparse) tasks are affected. The error kills the engine entirely (EngineDeadError), requiring a full container restart.

Single requests work fine. The crash occurs only after accumulating multiple requests.

Environment

  • vLLM versions tested: v0.15.1 and latest (v0.17.x as of 2026-03-20)
  • Model: BAAI/bge-m3 with --hf-overrides '{"architectures": ["BgeM3EmbeddingModel"]}'
  • GPU: NVIDIA B200 (183GB)
  • Docker image: vllm/vllm-openai:v0.15.1 and vllm/vllm-openai:latest

Steps to Reproduce

# Launch
docker run --gpus '"device=0"' -d --network host \
  --entrypoint vllm vllm/vllm-openai:v0.15.1 \
  serve BAAI/bge-m3 --host 0.0.0.0 --port 9001 \
  --trust-remote-code \
  --hf-overrides '{"architectures": ["BgeM3EmbeddingModel"]}'

# Wait for model to load, then send ~100 sequential requests
for i in $(seq 1 100); do
  curl -s http://localhost:9001/pooling \
    -H "Content-Type: application/json" \
    -d '{"model": "BAAI/bge-m3", "task": "embed", "input": ["test sentence number '$i'"]}' > /dev/null
done
# Server crashes around request 50-100

Error Log

(EngineCore_DP0 pid=405)   File ".../vllm/model_executor/layers/pooler/tokwise/methods.py", line 50, in forward
(EngineCore_DP0 pid=405)     hidden_states_all = hidden_states.split(
(EngineCore_DP0 pid=405)                         ^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=405) RuntimeError: split_with_sizes expects split_sizes to sum exactly to 4192 (input tensor's size at dimension 0), but got split_sizes=[74, 43, 230, 510, 646, 348, 205, 77]
(APIServer pid=1) ERROR [async_llm.py:708] AsyncLLM output_handler failed.
(APIServer pid=1) ERROR [async_llm.py:708] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue.
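
The numbers in the error log can be checked directly. The sketch below mimics, in plain Python, the precondition that a size-based split enforces; split_rows is a hypothetical helper written for this illustration, not a PyTorch or vLLM function.

```python
from typing import List, Sequence


def split_rows(rows: Sequence, sizes: Sequence[int]) -> List[Sequence]:
    """Split rows into consecutive chunks; the sizes must cover every row."""
    total = sum(sizes)
    if total != len(rows):
        raise ValueError(f"sizes sum to {total}, but there are {len(rows)} rows")
    chunks, start = [], 0
    for size in sizes:
        chunks.append(rows[start:start + size])
        start += size
    return chunks


# Values copied from the error log above.
split_sizes = [74, 43, 230, 510, 646, 348, 205, 77]
rows_in_tensor = 4192

requested = sum(split_sizes)
print(requested, rows_in_tensor - requested)  # 2133 2059
```

The split sizes account for 2133 rows, so 2059 of the 4192 rows in hidden_states are not covered by any request in the batch.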

Behavior

  • Single requests to /pooling (both embed and token_classify tasks) work correctly
  • After ~50-100 sequential single-item requests, the engine crashes with split_with_sizes mismatch
  • The crash is fatal — the engine enters EngineDeadError state and all subsequent requests return 500
  • Batch requests (multiple items in input array) crash immediately on the first request
  • Input text length does not matter — crash occurs even with very short texts (~10 tokens)

Workaround

Restart the vLLM container after every ~50 requests. Deterministic point IDs allow resumable indexing.

Related

  • BGE-M3 support PR: #14526

Extent analysis

Fix Plan

To address the split_with_sizes mismatch directly, the forward method in tokwise/methods.py could validate split_sizes before splitting. This is a defensive guard only: it does not touch the position_ids accumulation fixed in PR #37884, and padding or truncating can hide a mismatch rather than resolve it. The class name and method signature below are illustrative, not the actual vLLM code.

  • Validate split_sizes in forward before calling split.
  • Implement _handle_split_size_mismatch to truncate or zero-pad hidden_states so that its first dimension equals sum(split_sizes).

```python
import torch


class TokenwisePoolerSketch(torch.nn.Module):
    """Hypothetical stand-in for the class in tokwise/methods.py."""

    def forward(self, hidden_states, split_sizes):
        if sum(split_sizes) != hidden_states.size(0):
            # Handle the mismatch by padding or truncating hidden_states.
            hidden_states = self._handle_split_size_mismatch(hidden_states, split_sizes)
        hidden_states_all = hidden_states.split(split_sizes)
        # ... rest of the method remains the same
        return hidden_states_all

    def _handle_split_size_mismatch(self, hidden_states, split_sizes):
        target_size = sum(split_sizes)
        if target_size < hidden_states.size(0):
            # Truncate hidden_states to the rows the split sizes cover.
            hidden_states = hidden_states[:target_size]
        else:
            # Pad with zeros that match the dtype and device of hidden_states.
            pad_rows = target_size - hidden_states.size(0)
            padding = hidden_states.new_zeros((pad_rows, *hidden_states.shape[1:]))
            hidden_states = torch.cat((hidden_states, padding), dim=0)
        return hidden_states
```

Verification

To verify the fix, run the same sequence of requests as before and check if the engine crashes:

for i in $(seq 1 100); do
  curl -s http://localhost:9001/pooling \
    -H "Content-Type: application/json" \
    -d '{"model": "BAAI/bge-m3", "task": "embed", "input": ["test sentence number '$i'"]}' > /dev/null
done

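If every request succeeds and the engine stays up, the fix held for this run. The PR test plans above use 5000 to 10000 sequential requests, because the position overflow they describe appears only after about 4000 requests.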
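
The same check can be scripted so that it reports the first failing request instead of discarding the responses. This is a minimal sketch using only the Python standard library; the URL and payload come from the reproduction steps, while the function names are hypothetical.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Optional

URL = "http://localhost:9001/pooling"  # endpoint from the reproduction steps


def post_pooling(i: int) -> int:
    """Send one embed request and return the HTTP status code."""
    body = json.dumps({
        "model": "BAAI/bge-m3",
        "task": "embed",
        "input": [f"test sentence number {i}"],
    }).encode("utf-8")
    request = urllib.request.Request(
        URL, data=body, headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses (such as 500 after the engine dies) land here.
        return err.code


def first_failure(send: Callable[[int], int], count: int) -> Optional[int]:
    """Call send(1..count) in order; return the first index not answered with 200."""
    for i in range(1, count + 1):
        if send(i) != 200:
            return i
    return None


# Usage against a running server:
#   failed_at = first_failure(post_pooling, 5000)
#   print("all requests succeeded" if failed_at is None else f"failed at {failed_at}")
```

Passing the sender as a parameter keeps the loop testable without a live server, for example by substituting a stub that returns fixed status codes.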

Extra Tips

  • Consider adding logging to monitor the frequency of split_with_sizes mismatches and the effectiveness of the padding/truncation strategy.
  • If the issue persists, investigate other potential causes, such as incorrect model configuration or data corruption.
