vllm - ✅(Solved) Fix BGE-M3 /pooling endpoint crashes with split_with_sizes error after ~50-100 requests [2 pull requests, 5 comments, 4 participants]

vllm · 2026-03-20 06:54:37

vllm-project/vllm#37648 • Fetched 2026-04-08 01:04:24

The /pooling endpoint for BGE-M3 (BgeM3EmbeddingModel) crashes after approximately 50-100 sequential requests. Both embed (dense) and token_classify (sparse) tasks are affected. The error kills the engine entirely (EngineDeadError), requiring a full container restart.

Single requests work fine. The crash occurs only after accumulating multiple requests.

Error Message

(EngineCore_DP0 pid=405)   File ".../vllm/model_executor/layers/pooler/tokwise/methods.py", line 50, in forward
(EngineCore_DP0 pid=405)     hidden_states_all = hidden_states.split(
(EngineCore_DP0 pid=405)                         ^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=405) RuntimeError: split_with_sizes expects split_sizes to sum exactly to 4192 (input tensor's size at dimension 0), but got split_sizes=[74, 43, 230, 510, 646, 348, 205, 77]
(APIServer pid=1) ERROR [async_llm.py:708] AsyncLLM output_handler failed.
(APIServer pid=1) ERROR [async_llm.py:708] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue.
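The numbers in the traceback can be checked by hand. The short Python sketch below sums the reported split_sizes and compares the total with the tensor length from the message. It only restates the arithmetic in the error and makes no claim about why the two values diverged.

```python
# Values copied from the RuntimeError above.
split_sizes = [74, 43, 230, 510, 646, 348, 205, 77]
tensor_rows = 4192  # size of hidden_states at dimension 0, per the message

total = sum(split_sizes)
print(total)                # 2133
print(tensor_rows - total)  # 2059 rows not covered by any split
```

The split sizes cover 2133 rows, roughly half of the 4192 rows in the tensor, which is why split_with_sizes refuses to proceed.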

Root Cause

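Single requests work fine; the crash appears only after many requests have accumulated. PR #37884, which closes this issue, traces the failure to RoBERTa position handling when CUDA graphs are enabled: replace_roberta_positions() applied an in-place position_ids += padding_idx + 1 to the persistent GPU positions buffer. The CUDA-graph padding slots of that buffer are not refreshed between requests, so they grow by padding_idx + 1 on every request until they exceed max_position_embeddings.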

Fix Action

Workaround

Restart the vLLM container after every ~50 requests. Deterministic point IDs allow resumable indexing.

PR fix notes

PR #37873: [Bugfix] RoBERTa position_id accumulation in CUDA graph padding region

Repository: vllm-project/vllm
Author: yanghui1-arch
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/37873

Description (problem / solution / changelog)

Fix a crash that affected all RoBERTa-based embedding models (BAAI/bge-m3, XLM-RoBERTa, stsb-roberta, bge-reranker-v2-m3) when CUDA graphs are enabled. After approximately max_position_embeddings / 2 requests the server crashes with: Assertion 'index out of bounds: 0 <= tmp25 < 8194' failed.

gpu_model_runner keeps a persistent GPU buffer self.positions that is reused across every request. Each request refreshes only the first num_scheduled_tokens entries via copy_to_gpu; the remaining padding slots [num_scheduled_tokens : num_input_tokens_padded] are not reset.
RoBERTa models call replace_roberta_positions outside the CUDA graph (before BertModel.forward), which does an in-place position_ids += padding_idx + 1 on the full padded tensor, including the stale padding slots. Because those slots are never reset by copy_to_gpu, each request adds another +(padding_idx + 1) to them.
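
The accumulation described above can be reproduced with a toy model. In this sketch a plain Python list stands in for the persistent self.positions buffer. The functions copy_to_buffer and replace_positions_inplace are hypothetical stand-ins for copy_to_gpu and replace_roberta_positions, not the vLLM implementations, and the initial padding value of 0 is an assumption.

```python
PADDING_IDX = 1
OFFSET = PADDING_IDX + 1  # 2 for BAAI/bge-m3, per the PR description

def copy_to_buffer(buffer, positions):
    """Refresh only the first len(positions) slots (hypothetical copy_to_gpu)."""
    buffer[:len(positions)] = positions

def replace_positions_inplace(buffer, num_padded):
    """In-place offset over the whole padded region, stale slots included."""
    for i in range(num_padded):
        buffer[i] += OFFSET

def run_requests(num_requests, num_tokens=6, num_padded=8):
    buffer = [0] * num_padded  # persistent across requests; initial 0 is assumed
    for _ in range(num_requests):
        copy_to_buffer(buffer, list(range(num_tokens)))
        replace_positions_inplace(buffer, num_padded)
    return buffer

print(run_requests(1))  # [2, 3, 4, 5, 6, 7, 2, 2]
print(run_requests(3))  # [2, 3, 4, 5, 6, 7, 6, 6]
```

The first six slots are rewritten on every request and stay correct, while the two padding slots gain another 2 each time.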

For BAAI/bge-m3 (max_position_embeddings = 8194, padding_idx = 1, offset = 2) with short sentences (6 tokens) padded to 8:

padding slot value after K requests  =  V_init + 2K
overflow when                         >= 8194
K                                     = (8194 - V_init) / 2  ≈  3999
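
The overflow formula above is easy to evaluate for other settings. The helper below is a minimal sketch of that arithmetic. The function name is hypothetical, and because the source does not state V_init, the values passed for it here are illustrative assumptions only.

```python
import math

def requests_until_overflow(max_position_embeddings, v_init, offset):
    """Smallest K with v_init + offset * K >= max_position_embeddings."""
    return math.ceil((max_position_embeddings - v_init) / offset)

# v_init values below are assumptions; the PR does not give V_init.
print(requests_until_overflow(8194, 0, 2))  # 4097
print(requests_until_overflow(8194, 8, 2))  # 4093
```

Either way the result sits near max_position_embeddings / 2, in line with the approximate figure quoted in the PR description.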

Fixes #37868

Related #37648 #37868

Purpose

Test Plan

Reproduce the bug

CUDA_LAUNCH_BLOCKING=1 vllm serve BAAI/bge-m3 --port 9001 \
  --hf-overrides '{"architectures": ["BgeM3EmbeddingModel"]}' \
  --runner pooling

# Send 10000 sequential requests
for i in $(seq 1 10000); do
  response=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:9001/pooling \
    -H "Content-Type: application/json" \
    -d '{"model": "BAAI/bge-m3", "task": "embed",
         "input": ["test sentence number '$i'"]}')
  if [ "$response" != "200" ]; then
    echo "FAILED at index $i (HTTP $response)"; break
  fi
done

Existing tests

pytest tests/models/language/pooling/test_bge_m3.py -v -s
pytest tests/models/language/pooling/test_embedding.py -v -s -k "stsb-roberta"
pytest tests/models/language/pooling/test_multi_vector_retrieval.py -v -s
pytest tests/models/language/pooling/test_scoring.py -v -s

Test Result

Tested on RTX 5090 (Blackwell, sm_120), vLLM 0.18.1rc1.dev38+ga16133a0f:

| Test | Before fix | After fix |
| --- | --- | --- |
| 10000 sequential BGE-M3 `/pooling` requests | Crash at index ~3999 | All pass |
| `test_bge_m3.py` (dense / sparse / ColBERT scores) | Pass | Pass |
| `test_embedding.py[stsb-roberta-base-v2]` | Pass | Pass |
| `test_multi_vector_retrieval.py` | Pass | Pass |
| `test_scoring.py` (bge-reranker-v2-m3) | Pass | Pass |

Changed files

vllm/v1/worker/gpu_model_runner.py (modified, +2/-0)

PR #37884: [Bugfix] Fix RoBERTa position_ids accumulation on CUDA graph padding

Repository: vllm-project/vllm
Author: he-yufeng
State: closed | merged: True
Link: https://github.com/vllm-project/vllm/pull/37884

Description (problem / solution / changelog)

Purpose

Fix a crash in all RoBERTa-based pooling/embedding models (BGE-M3, XLM-RoBERTa, stsb-roberta, bge-reranker-v2-m3) when CUDA graphs are enabled. After ~4000 sequential requests the server dies with an out-of-bounds position embedding index.

Root cause: replace_roberta_positions() did an in-place position_ids += padding_idx + 1 on the persistent GPU positions buffer. The model runner refreshes only the first num_scheduled_tokens entries via copy_to_gpu each request — the remaining CUDA-graph padding slots keep their stale values. Every request adds another +(padding_idx + 1) to those slots, and eventually the values exceed max_position_embeddings.

For BAAI/bge-m3 (padding_idx=1, offset=2) with short inputs padded to 8 tokens, overflow happens after (8194 - V_init) / 2 ≈ 4000 requests.

Fix: move the padding_idx + 1 offset into RobertaEmbedding.forward as a non-in-place add (position_ids + offset instead of position_ids += offset). This computes the correct positions each call without mutating the persistent buffer. Also fix the same in-place += pattern in the transformers LegacyMixin.
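
The difference between the in-place and non-in-place add can be shown with a toy buffer. This is a sketch under stated assumptions rather than the code from the PR: embed_positions is a hypothetical stand-in for the offset step inside RobertaEmbedding.forward, a Python list plays the role of the persistent positions buffer, and the initial padding value of 0 is assumed.

```python
OFFSET = 2  # padding_idx + 1 for BAAI/bge-m3, per the PR description

def embed_positions(position_ids):
    """Return offset positions as a new list; the caller's buffer is untouched."""
    return [p + OFFSET for p in position_ids]

def run_requests(num_requests, num_tokens=6, num_padded=8):
    buffer = [0] * num_padded  # persistent buffer; initial 0 is an assumption
    shifted = []
    for _ in range(num_requests):
        buffer[:num_tokens] = range(num_tokens)  # only real tokens are refreshed
        shifted = embed_positions(buffer)        # non-in-place: buffer not mutated
    return buffer, shifted

buffer, shifted = run_requests(5000)
print(buffer)   # [0, 1, 2, 3, 4, 5, 0, 0]
print(shifted)  # [2, 3, 4, 5, 6, 7, 2, 2]
```

Because the offset is applied to a fresh copy on each call, the padding slots hold the same value after 5000 requests as after one.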

Fixes #37648 Fixes #37868

Related: #37873 (alternative fix that zeroes the padding region in _preprocess)
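
The alternative approach mentioned here keeps the in-place add and resets the padding slots before each request instead. The sketch below illustrates that idea only. The function preprocess is a hypothetical stand-in for _preprocess, and how the real method zeroes the region is not shown in the source.

```python
OFFSET = 2  # padding_idx + 1 for BAAI/bge-m3, per the PR description

def preprocess(buffer, positions):
    """Copy fresh positions, then zero every remaining (padding) slot."""
    n = len(positions)
    buffer[:n] = positions
    buffer[n:] = [0] * (len(buffer) - n)

def replace_positions_inplace(buffer):
    """The in-place offset, left unchanged in this variant."""
    for i in range(len(buffer)):
        buffer[i] += OFFSET

def run_requests(num_requests, num_tokens=6, num_padded=8):
    buffer = [0] * num_padded  # persistent across requests
    for _ in range(num_requests):
        preprocess(buffer, list(range(num_tokens)))
        replace_positions_inplace(buffer)
    return buffer

print(run_requests(5000))  # [2, 3, 4, 5, 6, 7, 2, 2]
```

Both approaches bound the padding values. This one does so by resetting the buffer, while the non-in-place add avoids mutating it in the first place.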

Test Plan

Reproduce the bug (before fix)

vllm serve BAAI/bge-m3 --port 9001 \
  --hf-overrides '{"architectures": ["BgeM3EmbeddingModel"]}'

for i in $(seq 1 5000); do
  resp=$(curl -s -o /dev/null -w "%{http_code}" http://localhost:9001/pooling \
    -H "Content-Type: application/json" \
    -d '{"model": "BAAI/bge-m3", "task": "embed", "input": ["test '$i'"]}')
  [ "$resp" != "200" ] && echo "FAIL at $i (HTTP $resp)" && break
done

Existing tests

pytest tests/models/language/pooling/test_bge_m3.py -v -s
pytest tests/models/language/pooling/test_embedding.py -v -s -k "stsb-roberta"
pytest tests/models/language/pooling/test_scoring.py -v -s

Changed files

vllm/model_executor/models/roberta.py (modified, +8/-24)
vllm/model_executor/models/transformers/legacy.py (modified, +4/-2)

Description

Single requests work fine. The crash occurs only after accumulating multiple requests.

Environment

vLLM versions tested: v0.15.1 and latest (v0.17.x as of 2026-03-20)
Model: BAAI/bge-m3 with --hf-overrides '{"architectures": ["BgeM3EmbeddingModel"]}'
GPU: NVIDIA B200 (183GB)
Docker image: vllm/vllm-openai:v0.15.1 and vllm/vllm-openai:latest

Steps to Reproduce

# Launch
docker run --gpus '"device=0"' -d --network host \
  --entrypoint vllm vllm/vllm-openai:v0.15.1 \
  serve BAAI/bge-m3 --host 0.0.0.0 --port 9001 \
  --trust-remote-code \
  --hf-overrides '{"architectures": ["BgeM3EmbeddingModel"]}'

# Wait for model to load, then send ~100 sequential requests
for i in $(seq 1 100); do
  curl -s http://localhost:9001/pooling \
    -H "Content-Type: application/json" \
    -d '{"model": "BAAI/bge-m3", "task": "embed", "input": ["test sentence number '$i'"]}' > /dev/null
done
# Server crashes around request 50-100
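
The same loop can be written in Python when the index of the first failing request is needed. This is a minimal sketch using only the standard library. The URL, model name, and payload fields are taken from the commands above, while build_payload, http_send, and first_failure are hypothetical helper names.

```python
import json
import urllib.error
import urllib.request

URL = "http://localhost:9001/pooling"  # host and port from the launch command

def build_payload(i):
    """Request body matching the curl example for request number i."""
    return {"model": "BAAI/bge-m3", "task": "embed",
            "input": [f"test sentence number {i}"]}

def http_send(payload, url=URL, timeout=30):
    """POST one request; return the HTTP status code, or 0 if unreachable."""
    req = urllib.request.Request(
        url, data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"}, method="POST")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # server answered with a non-2xx status
    except (urllib.error.URLError, OSError):
        return 0  # connection refused, timeout, and similar

def first_failure(send, total):
    """Return the 1-based index of the first non-200 response, else None."""
    for i in range(1, total + 1):
        if send(build_payload(i)) != 200:
            return i
    return None
```

Calling first_failure(http_send, 100) against a running server returns the index of the first request that did not get HTTP 200, or None if all of them succeeded.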

Error Log

(EngineCore_DP0 pid=405)   File ".../vllm/model_executor/layers/pooler/tokwise/methods.py", line 50, in forward
(EngineCore_DP0 pid=405)     hidden_states_all = hidden_states.split(
(EngineCore_DP0 pid=405)                         ^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=405) RuntimeError: split_with_sizes expects split_sizes to sum exactly to 4192 (input tensor's size at dimension 0), but got split_sizes=[74, 43, 230, 510, 646, 348, 205, 77]
(APIServer pid=1) ERROR [async_llm.py:708] AsyncLLM output_handler failed.
(APIServer pid=1) ERROR [async_llm.py:708] vllm.v1.engine.exceptions.EngineDeadError: EngineCore encountered an issue.

Behavior

Single requests to /pooling (both embed and token_classify tasks) work correctly
After ~50-100 sequential single-item requests, the engine crashes with split_with_sizes mismatch
The crash is fatal — the engine enters EngineDeadError state and all subsequent requests return 500
Batch requests (multiple items in input array) crash immediately on the first request
Input text length does not matter — crash occurs even with very short texts (~10 tokens)

Workaround

Restart the vLLM container after every ~50 requests. Deterministic point IDs allow resumable indexing.

Related

BGE-M3 support PR: #14526

extent analysis

Fix Plan

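The plan below guards the symptom in the forward method of tokwise/methods.py by handling the split_with_sizes mismatch. It is not the change that was merged: PRs #37873 and #37884 fix the position_ids accumulation instead.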

Modify the forward method to validate the split_sizes before calling split_with_sizes:

    def forward(self, hidden_states, split_sizes):
        if sum(split_sizes) != hidden_states.size(0):
            # Handle the mismatch, e.g., by padding or truncating hidden_states
            hidden_states = self._handle_split_size_mismatch(hidden_states, split_sizes)
        hidden_states_all = hidden_states.split(split_sizes)
        # ... rest of the method remains the same

*   Implement the `_handle_split_size_mismatch` method to either pad or truncate `hidden_states` to match the sum of `split_sizes`:
    ```python
    import torch

    def _handle_split_size_mismatch(self, hidden_states, split_sizes):
        target_size = sum(split_sizes)
        current_size = hidden_states.size(0)
        if target_size < current_size:
            # Truncate hidden_states to the rows covered by split_sizes
            hidden_states = hidden_states[:target_size]
        elif target_size > current_size:
            # Pad with zeros that match the dtype and device of hidden_states
            padding = hidden_states.new_zeros(
                (target_size - current_size, *hidden_states.shape[1:]))
            hidden_states = torch.cat((hidden_states, padding), dim=0)
        return hidden_states
    ```

Verification

To verify the fix, run the same sequence of requests as before and check if the engine crashes:

for i in $(seq 1 100); do
  curl -s http://localhost:9001/pooling \
    -H "Content-Type: application/json" \
    -d '{"model": "BAAI/bge-m3", "task": "embed", "input": ["test sentence number '$i'"]}' > /dev/null
done

If the engine no longer crashes, the fix is successful.

Extra Tips

Consider adding logging to monitor the frequency of split_with_sizes mismatches and the effectiveness of the padding/truncation strategy.
If the issue persists, investigate other potential causes, such as incorrect model configuration or data corruption.
