vllm - ✅(Solved) Fix [Bug]: MoRI Connector hangs at >=128 concurrency [1 pull requests]

Error Message

No error is raised; the benchmark hangs indefinitely. The log is reproduced under "Describe the bug" below.

Fix Action

Fix / Workaround

docker run -d \
  --name moriio-toy-proxy \
  --network host \
  "${VLLM_IMAGE}" \
  bash -c "pip install --quiet --ignore-installed quart aiohttp msgpack && \
           python3 /tmp/patch_toy_proxy.py && \
           python3 -u ${TOY_PROXY_CONTAINER_PATH}"

PR fix notes

PR #40344: [Fix] Resolve MoRI connector hangs at high concurrency

Description (problem / solution / changelog)

Purpose

Fixes #40340.

There are a few parts of the MoRI-IO connector code that can cause indefinite hangs of the connector. This PR resolves them:

  1. Disable MoRI's in-band notifications (set enable_notification=False in RdmaBackendConfig), since ZMQ already carries completion notifications. Under high concurrency those notifications poison the transfer statuses because the QP send queue is exhausted, which leaves requests stuck in WAITING_FOR_REMOTE_KVS.
  2. Handle Failed() transfers in _pop_done_transfers so that prefill frees its blocks. Previously these transfers stayed in _recving_transfers forever.
  3. Replace the asserts in _update_from_kv_xfer_finished with checks, because async-polling connectors can deliver finished_recving/finished_sending after a request has already been removed or has advanced past WAITING_FOR_REMOTE_KVS.
  4. Replace the status.Wait() infinite busy-spin with deadline-based polling.
  5. Add a 1 ms sleep to the busy-spinning while True loops.
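
Items 2, 4, and 5 can be illustrated with a minimal Python sketch. It is not the PR's code: FakeStatus, wait_for_transfer, and pop_done_transfers are hypothetical names, and the Succeeded() method is an assumption (the PR text only mentions Wait() and Failed()). The sketch shows a wait loop bounded by a deadline with a 1 ms sleep, and a sweep that removes failed transfers instead of leaving them pending.

```python
import time


class FakeStatus:
    """Hypothetical stand-in for a transfer status object (not the MoRI API)."""

    def __init__(self, state="pending"):
        self.state = state  # one of "pending", "ok", "failed"

    def Succeeded(self):  # assumed method name
        return self.state == "ok"

    def Failed(self):  # method name taken from the PR description
        return self.state == "failed"


def wait_for_transfer(status, timeout_s=30.0, poll_interval_s=0.001):
    """Poll until success, failure, or deadline instead of spinning forever."""
    deadline = time.monotonic() + timeout_s
    while True:
        if status.Succeeded():
            return "done"
        if status.Failed():
            return "failed"
        if time.monotonic() >= deadline:
            return "timeout"
        time.sleep(poll_interval_s)  # yield the CPU between polls


def pop_done_transfers(recving):
    """Remove finished and failed requests from the pending map.

    recving maps a request id to the list of its transfer statuses.
    Returns (finished_ids, failed_ids) so the caller can free blocks for both.
    """
    finished, failed = set(), set()
    for req_id, statuses in list(recving.items()):
        if any(s.Failed() for s in statuses):
            failed.add(req_id)
            del recving[req_id]
        elif all(s.Succeeded() for s in statuses):
            finished.add(req_id)
            del recving[req_id]
    return finished, failed
```

Reporting failed requests separately from finished ones lets the caller release their resources in both cases, which is the behaviour item 2 describes.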

Test Plan

  1. Build an image from this branch, including the relevant NIC drivers and userspace libraries. Alternatively, if you run on MI300X nodes with Thor2 NICs, you can pull the image I built:
docker pull ghcr.io/simondanielsson/vllm-rocm-moriio:dev-hang-fixes
  2. Test a 1P1D deployment with vllm bench serve at 256 concurrency:
# Set on both nodes before running any command
export PREFILL_IP=<node1-ip>
export DECODE_IP=<node2-ip>

# Node 1 (prefill node) — command 1: start toy proxy
docker run -d \
  --name moriio-toy-proxy \
  --network host \
  --rm \
  ghcr.io/simondanielsson/vllm-rocm-moriio:dev-hang-fixes \
  bash -c "pip install --quiet --ignore-installed quart aiohttp msgpack && \
           python3 -u /app/vllm/examples/online_serving/disaggregated_serving/moriio_toy_proxy_server.py"

# Node 1 (prefill node) — command 2: start prefill instance
docker run -d \
  --rm \
  --name moriio-prefill \
  --init --network host --ipc host --privileged \
  --cap-add SYS_PTRACE --security-opt seccomp=unconfined \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  --shm-size 256G \
  --group-add video --group-add render \
  --device /dev/kfd --device /dev/dri --device /dev/infiniband \
  -v /sys:/sys \
  -v "${HOME}/.cache/huggingface:/root/.cache/huggingface" \
  -e HF_HOME=/root/.cache/huggingface \
  -e HF_HUB_ENABLE_HF_TRANSFER=0 \
  -e VLLM_MORIIO_CONNECTOR_READ_MODE=1 \
  -e NCCL_MIN_NCHANNELS=112 \
  -e VLLM_USE_V1=1 \
  -e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
  -e VLLM_SERVER_DEV_MODE=1 \
  -e VLLM_ROCM_USE_AITER=1 \
  -e VLLM_ROCM_USE_AITER_PAGED_ATTN=0 \
  -e VLLM_ROCM_USE_AITER_RMSNORM=1 \
  -e VLLM_USE_AITER_TRITON_SILU_MUL=0 \
  ghcr.io/simondanielsson/vllm-rocm-moriio:dev-hang-fixes \
  vllm serve deepseek-ai/DeepSeek-R1-0528 \
    --port 8100 \
    --tensor-parallel-size 8 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.7 \
    --max-num-batched-tokens 32768 \
    --max-model-len 16384 \
    --trust-remote-code \
    --no-enable-prefix-caching \
    --block-size 1 \
    --enforce-eager \
    --load-format dummy \
    --kv-transfer-config '{
      "kv_connector": "MoRIIOConnector",
      "kv_role": "kv_producer",
      "kv_connector_extra_config": {
        "proxy_ip": "'"${PREFILL_IP}"'",
        "proxy_ping_port": "36367",
        "http_port": "8100",
        "handshake_port": "6301",
        "notify_port": "61005"
      }
    }'

# Node 2 (decode node) — command 3: start decode instance
docker run -d \
  --rm \
  --name moriio-decode \
  --init --network host --ipc host --privileged \
  --cap-add SYS_PTRACE --security-opt seccomp=unconfined \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  --shm-size 256G \
  --group-add video --group-add render \
  --device /dev/kfd --device /dev/dri --device /dev/infiniband \
  -v /sys:/sys \
  -v "${HOME}/.cache/huggingface:/root/.cache/huggingface" \
  -e HF_HOME=/root/.cache/huggingface \
  -e HF_HUB_ENABLE_HF_TRANSFER=0 \
  -e VLLM_MORIIO_CONNECTOR_READ_MODE=1 \
  -e NCCL_MIN_NCHANNELS=112 \
  -e VLLM_USE_V1=1 \
  -e VLLM_ENGINE_READY_TIMEOUT_S=3600 \
  -e VLLM_SERVER_DEV_MODE=1 \
  -e VLLM_ROCM_USE_AITER=1 \
  -e VLLM_ROCM_USE_AITER_PAGED_ATTN=0 \
  -e VLLM_ROCM_USE_AITER_RMSNORM=1 \
  -e VLLM_USE_AITER_TRITON_SILU_MUL=0 \
  ghcr.io/simondanielsson/vllm-rocm-moriio:dev-hang-fixes \
  vllm serve deepseek-ai/DeepSeek-R1-0528 \
    --port 8200 \
    --tensor-parallel-size 8 \
    --kv-cache-dtype fp8 \
    --gpu-memory-utilization 0.7 \
    --max-num-batched-tokens 32768 \
    --max-model-len 16384 \
    --trust-remote-code \
    --no-enable-prefix-caching \
    --block-size 1 \
    --enable-expert-parallel \
    --all2all-backend mori \
    --compilation-config '{"cudagraph_mode": "PIECEWISE"}' \
    --load-format dummy \
    --kv-transfer-config '{
      "kv_connector": "MoRIIOConnector",
      "kv_role": "kv_consumer",
      "kv_connector_extra_config": {
        "proxy_ip": "'"${PREFILL_IP}"'",
        "proxy_ping_port": "36367",
        "http_port": "8200",
        "handshake_port": "6301",
        "notify_port": "61005"
      }
    }'

# Node 1 (prefill node) — command 4: verify both instances registered with toy proxy
docker logs moriio-toy-proxy 2>&1 | grep -E "Registered (Prefill|Decode)"

# Node 1 (prefill node) — command 5: run vllm bench serve
docker exec moriio-prefill \
  vllm bench serve \
    --base-url http://localhost:10001 \
    --backend vllm \
    --model deepseek-ai/DeepSeek-R1-0528 \
    --dataset-name random \
    --random-input-len 1000 \
    --random-output-len 1000 \
    --max-concurrency 256 \
    --num-warmups 512 \
    --num-prompts 2560 \
    --seed 1234
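
The two --kv-transfer-config payloads in the commands above differ only in kv_role and http_port. As a rough sketch, they could be generated from a single helper to keep the shared ports in sync. build_kv_transfer_config is a hypothetical function, not part of vLLM, and the IP address in the usage lines is a placeholder; the keys and default port values are copied from the commands above.

```python
import json


def build_kv_transfer_config(role, proxy_ip, http_port,
                             proxy_ping_port=36367,
                             handshake_port=6301,
                             notify_port=61005):
    """Return the JSON string passed to --kv-transfer-config (sketch only)."""
    if role not in ("kv_producer", "kv_consumer"):
        raise ValueError(f"unexpected kv_role: {role!r}")
    config = {
        "kv_connector": "MoRIIOConnector",
        "kv_role": role,
        "kv_connector_extra_config": {
            "proxy_ip": proxy_ip,
            # The commands above pass every port as a string, so do the same.
            "proxy_ping_port": str(proxy_ping_port),
            "http_port": str(http_port),
            "handshake_port": str(handshake_port),
            "notify_port": str(notify_port),
        },
    }
    return json.dumps(config)


# Placeholder address; substitute the value of PREFILL_IP.
prefill_cfg = build_kv_transfer_config("kv_producer", "10.0.0.1", 8100)
decode_cfg = build_kv_transfer_config("kv_consumer", "10.0.0.1", 8200)
```

Both roles point proxy_ip at the prefill node, matching the commands above, where the toy proxy runs on node 1.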

Test Result



Changed files

  • vllm/distributed/kv_transfer/kv_connector/v1/moriio/moriio_connector.py (modified, +58/-8)
  • vllm/distributed/kv_transfer/kv_connector/v1/moriio/moriio_engine.py (modified, +76/-14)
  • vllm/envs.py (modified, +10/-0)
  • vllm/v1/core/sched/scheduler.py (modified, +57/-4)
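
The scheduler change most likely corresponds to item 3 of the PR description, replacing asserts with checks. The following is a simplified sketch of that idea, not the actual vLLM scheduler code: RequestStatus here is a two-member stand-in enum, and update_from_kv_xfer_finished is a hypothetical free function operating on a plain dict of request states.

```python
import logging
from enum import Enum, auto

logger = logging.getLogger("kv_xfer_sketch")


class RequestStatus(Enum):
    """Stand-in for the scheduler's request states (illustrative subset)."""
    WAITING_FOR_REMOTE_KVS = auto()
    RUNNING = auto()


def update_from_kv_xfer_finished(requests, finished_recving):
    """Tolerate late notifications instead of asserting on them.

    requests maps request id to RequestStatus. Returns (ready, skipped):
    ids whose KV transfer completed while still waiting, and ids that were
    already removed or had already advanced to another state.
    """
    ready, skipped = [], []
    for req_id in finished_recving:
        status = requests.get(req_id)
        if status is None:
            # Request already finished or aborted; an assert would crash here.
            logger.debug("finished_recving for unknown request %s", req_id)
            skipped.append(req_id)
        elif status is not RequestStatus.WAITING_FOR_REMOTE_KVS:
            logger.debug("request %s already in state %s", req_id, status.name)
            skipped.append(req_id)
        else:
            ready.append(req_id)
    return ready, skipped
```

Logging and skipping keeps the scheduler loop alive when an async-polling connector reports completion late, which is the situation the PR describes.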

Your current environment

<details> <summary>The output of <code>python collect_env.py</code></summary>
Your output of `python collect_env.py` here
</details>

Running vllm bench serve with --max-concurrency 128 or 256 at 1k/1k input/output sequence length (ISL/OSL) hangs indefinitely after a while. The setup is a 1P1D (one prefill, one decode) deployment of DeepSeek-R1 with TP8 and EP8 on two 8xMI300X nodes.

# ── vLLM serve flags shared between prefill and decode ───────────────────────
VLLM_SERVE_ARGS=(
    --tensor-parallel-size 8
    --kv-cache-dtype fp8
    --gpu-memory-utilization 0.7
    --max-num-batched-tokens 32768
    --max-model-len 16384
    --trust-remote-code
    --no-enable-prefix-caching
    --block-size 1
)

# ── Role-specific vLLM serve flags ───────────────────────────────────────────
PREFILL_EXTRA_ARGS=(
    --enforce-eager
)

DECODE_EXTRA_ARGS=(
    --enable-expert-parallel
    --all2all-backend mori
    --compilation-config '{"cudagraph_mode": "PIECEWISE"}'
)

PREFILL_KV_CONFIG=$(cat <<EOF
{
  "kv_connector": "MoRIIOConnector",
  "kv_role": "kv_producer",
  "kv_connector_extra_config": {
    "proxy_ip": "${PREFILL_IP}",
    "proxy_ping_port": "${PROXY_PING_PORT}",
    "http_port": "${PREFILL_PORT}",
    "handshake_port": "${HANDSHAKE_PORT}",
    "notify_port": "${NOTIFY_PORT}"
  }
}
EOF
)

DECODE_KV_CONFIG=$(cat <<EOF
{
  "kv_connector": "MoRIIOConnector",
  "kv_role": "kv_consumer",
  "kv_connector_extra_config": {
    "proxy_ip": "${PREFILL_IP}",
    "proxy_ping_port": "${PROXY_PING_PORT}",
    "http_port": "${DECODE_PORT}",
    "handshake_port": "${HANDSHAKE_PORT}",
    "notify_port": "${NOTIFY_PORT}"
  }
}
EOF
)

# ── Common docker run flags ───────────────────────────────────────────────────
VLLM_COMMON_ARGS=(
    --init
    --network host
    --ipc host
    --privileged
    --cap-add SYS_PTRACE
    --security-opt seccomp=unconfined
    --ulimit memlock=-1
    --ulimit stack=67108864
    --shm-size "${SHM_SIZE}"
    --group-add video
    --group-add render
    --device /dev/kfd
    --device /dev/dri
    --device /dev/infiniband
    -v /sys:/sys
    -v "${HF_HOME}:/root/.cache/huggingface"
    -e HF_HOME=/root/.cache/huggingface
    -e HF_HUB_ENABLE_HF_TRANSFER=0
    -e VLLM_MORIIO_CONNECTOR_READ_MODE=1
    -e NCCL_MIN_NCHANNELS=112
    -e VLLM_USE_V1=1
    -e VLLM_ENGINE_READY_TIMEOUT_S=3600
    -e VLLM_SERVER_DEV_MODE=1
    -e VLLM_ROCM_USE_AITER=1
    -e VLLM_ROCM_USE_AITER_PAGED_ATTN=0
    -e VLLM_ROCM_USE_AITER_RMSNORM=1
    -e VLLM_USE_AITER_TRITON_SILU_MUL=0
)

# On node 1, run P and the proxy
docker run -d \
    --name moriio-prefill \
    "${VLLM_COMMON_ARGS[@]}" \
    "${VLLM_IMAGE}" \
    vllm serve "${MODEL}" \
        --port "${PREFILL_PORT}" \
        "${VLLM_SERVE_ARGS[@]}" \
        "${PREFILL_EXTRA_ARGS[@]}" \
        --kv-transfer-config "${PREFILL_KV_CONFIG}"


docker run -d \
    --name moriio-toy-proxy \
    --network host \
    "${VLLM_IMAGE}" \
    bash -c "pip install --quiet --ignore-installed quart aiohttp msgpack && \
             python3 /tmp/patch_toy_proxy.py && \
             python3 -u ${TOY_PROXY_CONTAINER_PATH}"

# On node 2, run D
docker run -d \
    --name moriio-decode \
    "${VLLM_COMMON_ARGS[@]}" \
    "${VLLM_IMAGE}" \
    vllm serve "${MODEL}" \
        --port "${DECODE_PORT}" \
        "${VLLM_SERVE_ARGS[@]}" \
        "${DECODE_EXTRA_ARGS[@]}" \
        --kv-transfer-config "${DECODE_KV_CONFIG}"


# On node 1 again run the benchmark
BENCH_ARGS=(
    --backend vllm
    --model "${MODEL}"
    --dataset-name random
    --random-input-len 1000
    --random-output-len 1000
    --max-concurrency "${BENCH_MAX_CONCURRENCY}"
    --num-warmups "${BENCH_NUM_WARMUPS}"
    --num-prompts "${BENCH_NUM_PROMPTS}"
    --ready_check_timeout_sec 3000
    --seed 1234
)
docker exec moriio-prefill \
    vllm bench serve \
        --base-url "http://localhost:${ROUTER_PORT}" \
        "${BENCH_ARGS[@]}" 2>&1 | tee -a "${BENCH_LOG}"

🐛 Describe the bug

No error but hangs forever here:

Waiting for endpoint to become up in 3000 seconds
 |          | 00:21 elapsed, 25540:25:12 remaining
Initial test run completed.
Warming up with 512 requests...
100%|██████████| 512/512 [01:00<00:00,  8.47it/s]
Warmup run completed.
Starting main benchmark run...
Traffic request rate: inf
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: 256
 34%|███▍      | 864/2560 [01:58<02:36, 10.84it/s]
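
Because the hang produces no error, an unattended benchmark run can block forever. One option is to wrap the command in an overall deadline so that a stalled run fails fast. This is a minimal sketch using Python's standard subprocess module; run_with_deadline is a hypothetical helper, and the timeout value is something you would choose for your own setup.

```python
import subprocess
import sys


def run_with_deadline(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it exceeded timeout_s.

    subprocess.run kills the child process when the timeout expires.
    """
    try:
        completed = subprocess.run(cmd, timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode


# Example with a harmless command; a real run would pass the benchmark argv.
exit_code = run_with_deadline([sys.executable, "-c", "pass"], timeout_s=10)
```

Note that when the command is docker exec, killing the local client may not stop the process inside the container, so the container may still need to be cleaned up separately.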


extent analysis

TL;DR

The hang comes from bugs in the MoRI-IO connector that surface under high concurrency, not from insufficient system resources. PR #40344 addresses it; lowering --max-concurrency only avoids the trigger.

Guidance

  • Apply PR #40344, or build an image from a branch that includes it. It disables MoRI's in-band notifications, handles failed transfers, replaces asserts in the scheduler with checks, and bounds the wait loops with a deadline.
  • Do not rely on reducing --max-concurrency to 128: the issue reports the hang at both 128 and 256.
  • When diagnosing a similar hang, look for requests stuck in WAITING_FOR_REMOTE_KVS and transfers that never leave _recving_transfers; the PR description names both as symptoms.
  • Neither the issue nor the PR identifies VLLM_SERVE_ARGS, DECODE_EXTRA_ARGS, or BENCH_ARGS as a cause, so retuning them is unlikely to help.

Example

No configuration snippet applies. The fix is a code change to the files listed under "Changed files".

Notes

The issue was reported for a 1P1D DeepSeek-R1 deployment on two 8xMI300X nodes, and the PR's test plan uses the same topology at 256 concurrency. The PR description as captured here lists no test results.

Recommendation

Apply the changes from PR #40344. If that is not possible yet, keeping concurrency below 128 may serve as a temporary workaround, since the issue reports hangs only at 128 and above.
