vllm - ✅ (Solved) NixlConnector hardcodes backends=["UCX"] as the default with no env-var override path; LIBFABRIC/EFA operators must discover kv_connector_extra_config.backends from source [1 pull request, 1 participant]

GitHub stats
vllm-project/vllm#41814 · Fetched 2026-05-07 03:32:49
Comments: 0 · Participants: 1 · Timeline: 2 (cross-referenced ×1, referenced ×1) · Reactions: 0

Error Message

NIXL INFO    _api.py:361 Backend UCX was instantiated
...
NIXL transfer failure: handshake_failed

Root Cause

NixlConnector hardcodes backends=["UCX"] as the default in vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:1023. There is no environment-variable fallback for backend selection, so NIXL_BACKEND=LIBFABRIC is silently ignored, and the default is undocumented. On AWS EFA, UCX cannot establish cross-node handshakes, while libfabric can.

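Fix / Workaround

Override the default through kv_connector_extra_config.backends:

--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_connector_extra_config":{"backends":["LIBFABRIC"]}}'

Once the workaround is applied, LIBFABRIC instantiates and EFA handshakes succeed.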

PR fix notes

PR #72: Add Dynamo combined image (vLLM + TRT-LLM) with EFA/NIXL RDMA

Description (problem / solution / changelog)

Summary

  • What: Adds a self-contained Dockerfile and deployment manifests for a combined Dynamo inference image containing both vLLM 0.17.1 and TRT-LLM 1.3.0rc7 backends with NIXL 0.10.1 KV-cache transfer over AWS EFA RDMA.
  • Why: A single image simplifies deployment for disaggregated inference workloads that need backend flexibility. Instead of maintaining separate vLLM and TRT-LLM images, operators deploy one image and select the backend at runtime (python -m dynamo.vllm or python -m dynamo.trtllm).
  • Image: public.ecr.aws/v9l4g5s4/dynamo-combined:latest (~35 GB)
  • Tested on: 2x P5en.48xlarge (32x H200, 32x EFA) running disaggregated inference with Nemotron-Mini-4B-Instruct

Changes

New files

File | Description
Dockerfile.dynamo-combined-efa | 7-stage multi-stage build from NGC base images (no dependency on the existing Dockerfile.efa base)
k8s/dynamo-combined-disagg-1gpu.yaml | K8s manifest: 1-GPU prefill + 1-GPU decode with EFA
k8s/dynamo-combined-disagg-8gpu.yaml | K8s manifest: 8-GPU DP prefill + 8-GPU DP decode with 16 EFA rails
sbom/dynamo-combined-sbom.csv | Software Bill of Materials (530+ Python + system packages)
sbom/dynamo-combined-pip-freeze.txt | Full pip freeze output

Modified files

File | Change
README.md | Added combined image build/deploy docs, K8s deployment section, EFA/NIXL env var reference
build.sh | Added combined build target (./build.sh -b combined)
ATTRIBUTION.md | Added GDRCopy, FlashInfer, LMCache, FFmpeg attributions

Architecture

The Dockerfile uses a 7-stage multi-stage build:

  1. dynamo_base -- Rust 1.93.1, NATS v2.10.28, etcd v3.5.21, uv, sccache
  2. wheel_builder_base -- UCX v1.20.x (EFA/GDRCopy/CUDA), libfabric v2.3.0 (EFA provider), GDRCopy v2.5.1, FFmpeg 7.1, AWS SDK C++
  3. wheel_builder -- NIXL 0.10.1 native + Python wheels, Dynamo runtime wheels
  4. pytorch_base -- NGC PyTorch 25.12 (torch 2.10.0)
  5. trtllm_framework -- TRT-LLM 1.3.0rc7 + TensorRT 10.14 in venv
  6. vllm_framework -- vLLM 0.17.1 + FlashInfer 0.6.4 + LMCache 0.4.1
  7. final -- Combined runtime: TRT-LLM venv as base, vLLM packages overlaid, NIXL + UCX + libfabric + EFA installer, SBOM generation

Key design decisions

  • Self-contained build: Does not depend on the existing Dockerfile.efa base image. Builds UCX, libfabric, NIXL, and EFA from source for full version control.
  • Shared PyTorch: Both vLLM and TRT-LLM share the same NGC PyTorch (2.10.0) to avoid conflicts. vLLM-specific packages are overlaid on top of TRT-LLM's venv.
  • EFA-first networking: NIXL is configured with libfabric transport (NIXL_BACKEND=LIBFABRIC) for direct EFA RDMA KV-cache transfer between nodes.
  • SBOM included: /SBOM.txt and /THIRD-PARTY-LICENSES are generated inside the image at build time.

Test plan

  • Built and pushed to ECR (public.ecr.aws/v9l4g5s4/dynamo-combined:latest)
  • Tested disaggregated inference (prefill + decode) with Nemotron-Mini-4B on 2x P5en.48xlarge
  • Verified NIXL KV-cache transfer over EFA RDMA (NIXL_BACKEND=LIBFABRIC)
  • Verified both backends: python -m dynamo.trtllm and python -m dynamo.vllm
  • Verified K8s manifests deploy correctly on EKS with EFA device plugin
  • Community review of Dockerfile conventions and documentation

Changed files

  • .gitignore (added, +2/-0)
  • 2.projects/dynamo-inference/ATTRIBUTION.md (modified, +29/-1)
  • 2.projects/dynamo-inference/Dockerfile.dynamo-combined-efa (added, +443/-0)
  • 2.projects/dynamo-inference/Dockerfile.dynamo-trtllm-efa (modified, +41/-0)
  • 2.projects/dynamo-inference/Dockerfile.dynamo-vllm-efa (modified, +41/-0)
  • 2.projects/dynamo-inference/Dockerfile.efa (modified, +405/-519)
  • 2.projects/dynamo-inference/Dockerfile.overlay (added, +263/-0)
  • 2.projects/dynamo-inference/LICENSE (added, +21/-0)
  • 2.projects/dynamo-inference/README.md (modified, +210/-9)
  • 2.projects/dynamo-inference/REPRODUCIBILITY.md (added, +185/-0)
  • 2.projects/dynamo-inference/THIRD-PARTY-LICENSES (added, +2309/-0)
  • 2.projects/dynamo-inference/UTILITY-LICENSES (added, +35/-0)
  • 2.projects/dynamo-inference/build.sh (modified, +199/-85)
  • 2.projects/dynamo-inference/buildspec.yml (added, +80/-0)
  • 2.projects/dynamo-inference/ci/CODEBUILD-SETUP.md (added, +273/-0)
  • 2.projects/dynamo-inference/ci/codebuild-post-build.sh (added, +74/-0)
  • 2.projects/dynamo-inference/ci/commercial-licenses.md (added, +58/-0)
  • 2.projects/dynamo-inference/entrypoint.sh (added, +52/-0)
  • 2.projects/dynamo-inference/k8s/dgd-dynamo-combined-trtllm.yaml (added, +239/-0)
  • 2.projects/dynamo-inference/k8s/dgd-dynamo-combined-vllm.yaml (added, +266/-0)
  • 2.projects/dynamo-inference/k8s/legacy/dynamo-combined-disagg-1gpu.yaml (added, +263/-0)
  • 2.projects/dynamo-inference/k8s/legacy/dynamo-combined-disagg-8gpu.yaml (added, +294/-0)
  • 2.projects/dynamo-inference/sbom/CVE-SUMMARY.md (added, +63/-0)
  • 2.projects/dynamo-inference/sbom/README.md (added, +100/-0)
  • 2.projects/dynamo-inference/sbom/awsi-dynamo-combined-efa-v8/awsi-dynamo-combined-efa_v8.cyclonedx.json (added, +1/-0)
  • 2.projects/dynamo-inference/sbom/awsi-dynamo-combined-efa-v8/awsi-dynamo-combined-efa_v8.licenses.md (added, +16279/-0)
  • 2.projects/dynamo-inference/sbom/awsi-dynamo-combined-efa-v8/awsi-dynamo-combined-efa_v8.spdx.json (added, +1/-0)
  • 2.projects/dynamo-inference/sbom/awsi-dynamo-combined-efa-v8/awsi-dynamo-combined-efa_v8.trivy-cve-critical-high.txt (added, +2725/-0)
  • 2.projects/dynamo-inference/sbom/awsi-dynamo-combined-efa-v9/awsi-dynamo-combined-efa_v9.trivy-cve-critical-high.txt (added, +2725/-0)
  • 2.projects/dynamo-inference/sbom/awsi-efa-base-v1/awsi-efa-base_v1.cyclonedx.json (added, +1/-0)
  • 2.projects/dynamo-inference/sbom/awsi-efa-base-v1/awsi-efa-base_v1.licenses.md (added, +24327/-0)
  • 2.projects/dynamo-inference/sbom/awsi-efa-base-v1/awsi-efa-base_v1.spdx.json (added, +1/-0)
  • 2.projects/dynamo-inference/sbom/awsi-efa-base-v1/awsi-efa-base_v1.trivy-cve-critical-high.txt (added, +596/-0)
  • 2.projects/dynamo-inference/sbom/dynamo-combined-efa-v1/dynamo-combined-efa_v1.cyclonedx.json (added, +80224/-0)
  • 2.projects/dynamo-inference/sbom/dynamo-combined-efa-v1/dynamo-combined-efa_v1.licenses.md (added, +708/-0)
  • 2.projects/dynamo-inference/sbom/dynamo-combined-efa-v1/dynamo-combined-efa_v1.spdx.json (added, +85027/-0)
  • 2.projects/dynamo-inference/sbom/dynamo-trtllm-efa-v1/dynamo-trtllm-efa_v1.cyclonedx.json (added, +80190/-0)
  • 2.projects/dynamo-inference/sbom/dynamo-trtllm-efa-v1/dynamo-trtllm-efa_v1.licenses.md (added, +708/-0)
  • 2.projects/dynamo-inference/sbom/dynamo-trtllm-efa-v1/dynamo-trtllm-efa_v1.spdx.json (added, +84496/-0)
  • 2.projects/dynamo-inference/sbom/dynamo-trtllm-v4/dynamo-trtllm_v4.cyclonedx.json (added, +77473/-0)
  • 2.projects/dynamo-inference/sbom/dynamo-trtllm-v4/dynamo-trtllm_v4.licenses.md (added, +1673/-0)
  • 2.projects/dynamo-inference/sbom/dynamo-trtllm-v4/dynamo-trtllm_v4.spdx.json (added, +115225/-0)
  • 2.projects/dynamo-inference/sbom/dynamo-vllm-efa-v1/dynamo-vllm-efa_v1.cyclonedx.json (added, +79950/-0)
  • 2.projects/dynamo-inference/sbom/dynamo-vllm-efa-v1/dynamo-vllm-efa_v1.licenses.md (added, +708/-0)
  • 2.projects/dynamo-inference/sbom/dynamo-vllm-efa-v1/dynamo-vllm-efa_v1.spdx.json (added, +84067/-0)
  • 2.projects/dynamo-inference/sbom/dynamo-vllm-v4/dynamo-vllm_v4.cyclonedx.json (added, +57915/-0)
  • 2.projects/dynamo-inference/sbom/dynamo-vllm-v4/dynamo-vllm_v4.licenses.md (added, +1370/-0)
  • 2.projects/dynamo-inference/sbom/dynamo-vllm-v4/dynamo-vllm_v4.spdx.json (added, +77670/-0)
  • 2.projects/dynamo-inference/sbom/efa-base-v1/efa-base_v1.cyclonedx.json (added, +30936/-0)
  • 2.projects/dynamo-inference/sbom/efa-base-v1/efa-base_v1.licenses.md (added, +686/-0)
  • 2.projects/dynamo-inference/sbom/efa-base-v1/efa-base_v1.spdx.json (added, +24248/-0)
  • 2.projects/dynamo-inference/sbom/networking-base-v5/networking-base_v5.cyclonedx.json (added, +35544/-0)
  • 2.projects/dynamo-inference/sbom/networking-base-v5/networking-base_v5.licenses.md (added, +904/-0)
  • 2.projects/dynamo-inference/sbom/networking-base-v5/networking-base_v5.spdx.json (added, +32075/-0)
  • 2.projects/dynamo-inference/sbom/trivy/dynamo-combined-efa-v1-cve.txt (added, +13/-0)
  • 2.projects/dynamo-inference/sbom/trivy/dynamo-trtllm-efa-v1-cve.txt (added, +8/-0)
  • 2.projects/dynamo-inference/sbom/trivy/dynamo-trtllm-v4-cve.txt (added, +1072/-0)
  • 2.projects/dynamo-inference/sbom/trivy/dynamo-vllm-efa-v1-cve.txt (added, +8/-0)
  • 2.projects/dynamo-inference/sbom/trivy/dynamo-vllm-v4-cve.txt (added, +1392/-0)
  • 2.projects/dynamo-inference/sbom/trivy/efa-base-v1-cve.txt (added, +9/-0)
  • 2.projects/dynamo-inference/sbom/trivy/networking-base-v5-cve.txt (added, +341/-0)
  • 2.projects/dynamo-inference/scripts/audit.py (added, +93/-0)
  • 2.projects/dynamo-inference/scripts/build-orchestrator.sh (added, +80/-0)
  • 2.projects/dynamo-inference/scripts/efa/detect-efa.sh (added, +9/-0)
  • 2.projects/dynamo-inference/scripts/efa/efatop.sh (added, +18/-0)
  • 2.projects/dynamo-inference/scripts/sbom.sh (added, +70/-0)
  • 2.projects/dynamo-inference/tests/README.md (added, +120/-0)
  • 2.projects/dynamo-inference/tests/e2e-evidence/awsi-dynamo-combined-efa_v1_vllm-inference.md (added, +44/-0)
  • 2.projects/dynamo-inference/tests/e2e-evidence/awsi-dynamo-combined-efa_v9_2node-nccl-rdma.md (added, +75/-0)
  • 2.projects/dynamo-inference/tests/e2e-evidence/awsi-efa-base_v1_rdma-validation.md (added, +85/-0)
  • 2.projects/dynamo-inference/tests/e2e-evidence/nixl-multinode-2h200.md (added, +37/-0)
  • 2.projects/dynamo-inference/tests/multinode/nccl-allreduce.yaml (added, +88/-0)
  • 2.projects/dynamo-inference/tests/smoke/smoke-pod.yaml (added, +88/-0)
  • 2.projects/dynamo-inference/tests/smoke/smoke.sh (added, +238/-0)
  • docs/KR-2.3-trtllm-on-1.2.0-decision.md (added, +111/-0)
  • docs/OKRS-2026-05-06.md (added, +129/-0)
  • docs/T12-HYPOTHESES-AND-FINDINGS.md (added, +244/-0)
  • docs/evidence/multinode-2026-05-05-rev2/README.md (added, +104/-0)
  • docs/evidence/multinode-2026-05-05-rev2/t11-torch-allreduce.py (added, +55/-0)
  • docs/evidence/multinode-2026-05-05-rev2/t12-decode-full.log (added, +129/-0)
  • docs/evidence/multinode-2026-05-05-rev2/t12-dgd-applied.yaml (added, +272/-0)
  • docs/evidence/multinode-2026-05-05-rev2/t12-dgds.txt (added, +4/-0)
  • docs/evidence/multinode-2026-05-05-rev2/t12-pods.txt (added, +3/-0)
  • docs/evidence/multinode-2026-05-05-rev2/t12-prefill-full.log (added, +141/-0)
  • docs/evidence/multinode-2026-05-05-rev3/README.md (added, +96/-0)
  • docs/evidence/multinode-2026-05-05/README.md (added, +113/-0)
  • docs/evidence/multinode-2026-05-05/nccl-pods.txt (added, +3/-0)
  • docs/evidence/multinode-2026-05-05/t11-cross-node.txt (added, +2/-0)
  • docs/evidence/multinode-2026-05-05/t11-efa-proof.txt (added, +30/-0)
  • docs/evidence/multinode-2026-05-05/t11-intra-node.txt (added, +1/-0)
  • docs/evidence/multinode-2026-05-05/t11-pod0-env.txt (added, +4/-0)
  • docs/evidence/multinode-2026-05-05/t11-pods.txt (added, +3/-0)
  • docs/evidence/multinode-2026-05-05/t11-rank0-full.log (added, +183/-0)
  • docs/evidence/multinode-2026-05-05/t11-results.json (added, +41/-0)
  • docs/evidence/multinode-2026-05-05/t11-torch-allreduce.py (added, +55/-0)
  • docs/evidence/multinode-2026-05-05/t12-dgd-patched.yaml (added, +275/-0)
  • docs/evidence/multinode-2026-05-05/t12-dgds.txt (added, +3/-0)
  • docs/evidence/multinode-2026-05-05/t12-pods.txt (added, +1/-0)
  • docs/evidence/multinode-2026-05-05/t12-prefill-full.log (added, +95/-0)
  • docs/evidence/multinode-2026-05-06-rev4/README.md (added, +70/-0)

Code Example

# nixl_connector.py:1022-1024
self.nixl_backends = vllm_config.kv_transfer_config.get_from_extra_config(
    "backends", ["UCX"]
)

---

NIXL INFO    _api.py:361 Backend UCX was instantiated
...
NIXL transfer failure: handshake_failed

---

--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_connector_extra_config":{"backends":["LIBFABRIC"]}}'
Raw issue body

Your current environment

  • vLLM 0.17.1 (bundled in nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.0)
  • Dynamo 1.1.0 runtime (for disaggregated prefill/decode serving)
  • NIXL nixl_cu12 1.0.1 with LIBFABRIC + UCX plugins both present on disk
  • AWS EFA hardware (SRD transport, libfabric provider efa)
  • 2× P5.48xlarge H100 HyperPod nodes
  • libplugin_LIBFABRIC.so and libplugin_UCX.so both available at /opt/dynamo/venv/lib/python3.12/site-packages/.nixl_cu12.mesonpy.libs/plugins/

🐛 Describe the bug

NixlConnector hardcodes backends=["UCX"] as the default in vllm/distributed/kv_transfer/kv_connector/v1/nixl_connector.py:1023. There is no environment-variable fallback for backend selection, and this default is not documented.

Operators running vLLM + NIXL on EFA (where UCX can't establish cross-node handshakes, but libfabric works) have no way to switch backends short of reading the source. The NIXL library itself supports multiple backends — the limitation is entirely in vLLM's default.

# nixl_connector.py:1022-1024
self.nixl_backends = vllm_config.kv_transfer_config.get_from_extra_config(
    "backends", ["UCX"]
)

Setting NIXL_BACKEND=LIBFABRIC or VLLM_NIXL_KVCACHE_BACKEND=LIBFABRIC as environment variables does nothing — neither string appears in nixl_connector.py. Operators routinely discover this by tailing logs and seeing:

NIXL INFO    _api.py:361 Backend UCX was instantiated
...
NIXL transfer failure: handshake_failed
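
To make the lookup behavior concrete, here is a minimal sketch. It assumes that get_from_extra_config behaves like a dictionary lookup with a default, which is what the quoted call suggests. FakeKVTransferConfig and select_backends are hypothetical stand-ins written for illustration; they are not vLLM classes or functions.

```python
import os
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FakeKVTransferConfig:
    """Hypothetical stand-in for vLLM's KV transfer config (illustration only)."""

    kv_connector_extra_config: Dict[str, Any] = field(default_factory=dict)

    def get_from_extra_config(self, key: str, default: Any) -> Any:
        # Assumed semantics: plain dict lookup that falls back to `default`.
        return self.kv_connector_extra_config.get(key, default)


def select_backends(cfg: FakeKVTransferConfig) -> List[str]:
    # Mirrors the quoted call: the process environment is never consulted.
    return cfg.get_from_extra_config("backends", ["UCX"])


# The env var is set, but nothing in the lookup path reads it.
os.environ["NIXL_BACKEND"] = "LIBFABRIC"

default_choice = select_backends(FakeKVTransferConfig())
explicit_choice = select_backends(
    FakeKVTransferConfig({"backends": ["LIBFABRIC"]})
)
print(default_choice, explicit_choice)
```

Under that assumption, only the explicit extra-config entry changes the selected backend, which matches the behavior described in this report.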

🛠️ How to reproduce

  1. Deploy disaggregated vLLM with NixlConnector on AWS EFA (or any non-RDMA-over-Ethernet fabric where UCX can't handshake).
  2. Set --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'.
  3. Set env NIXL_BACKEND=LIBFABRIC and FI_PROVIDER=efa on both workers.
  4. Observe: NIXL _api.py:361 Backend UCX was instantiated followed by cross-node handshake_failed.
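
Step 4 can be checked mechanically. The following is a rough sketch that scans captured worker logs for the two signatures quoted in this report. The helper name diagnose and the assumption that the log wording matches the lines above exactly are mine, not part of NIXL or vLLM.

```python
import re
from typing import Optional

# Pattern taken from the log line quoted in this issue; other NIXL versions
# may word it differently (assumption).
BACKEND_RE = re.compile(r"Backend (\w+) was instantiated")


def diagnose(log_text: str, expected_backend: str) -> Optional[str]:
    """Return a hint when the log shows a backend other than the expected one."""
    backends = BACKEND_RE.findall(log_text)
    if not backends or expected_backend in backends:
        return None
    hint = (
        f"expected {expected_backend}, "
        f"but NIXL instantiated {', '.join(backends)}"
    )
    if "handshake_failed" in log_text:
        hint += "; handshake_failed also present"
    return hint


sample_log = (
    "NIXL INFO    _api.py:361 Backend UCX was instantiated\n"
    "NIXL transfer failure: handshake_failed\n"
)
print(diagnose(sample_log, "LIBFABRIC"))
```

A None result only means the expected backend name appeared in the log; it does not prove that transfers succeed.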

✅ Workaround

The only way to override the default today is via kv_connector_extra_config.backends:

--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both","kv_connector_extra_config":{"backends":["LIBFABRIC"]}}'

Once the workaround is applied, LIBFABRIC instantiates and EFA handshakes succeed.
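
Hand-writing nested JSON inside shell quotes is error-prone, so a launcher script may prefer to generate the flag. This is a small sketch using only the Python standard library; kv_transfer_config_arg is a hypothetical helper name, and the keys are exactly those shown in the workaround above.

```python
import json
import shlex
from typing import List


def kv_transfer_config_arg(backends: List[str]) -> str:
    """Build the --kv-transfer-config flag with an explicit NIXL backend list."""
    config = {
        "kv_connector": "NixlConnector",
        "kv_role": "kv_both",
        "kv_connector_extra_config": {"backends": backends},
    }
    # Compact separators keep the JSON free of spaces; shlex.quote adds the
    # single quotes a POSIX shell needs around the braces and double quotes.
    payload = json.dumps(config, separators=(",", ":"))
    return "--kv-transfer-config " + shlex.quote(payload)


arg = kv_transfer_config_arg(["LIBFABRIC"])
print(arg)
```

The same helper accepts any backend list, for example ["UCX"] to make the default explicit in a deployment manifest.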

💡 Suggested fix

Any of the following:

  1. Read NIXL_BACKEND / VLLM_NIXL_BACKEND from env as a fallback when kv_connector_extra_config.backends is absent — consistent with how NCCL + other vLLM networking knobs are exposed.
  2. Document the kv_connector_extra_config.backends path prominently in docs/source/models/kv_transfer.md (or equivalent). Currently the JSON schema is discoverable only by reading connector source.
  3. Surface a startup WARN when NIXL_BACKEND env is set but is being ignored because the connector is bypassing it — catches the common operator mistake.
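
Options 1 and 3 could be combined roughly as follows. This is a sketch of one possible resolution order, not a patch against vLLM: resolve_backends is a hypothetical function, the comma-separated env format is an assumption, and NIXL_BACKEND is simply the variable name used in this issue.

```python
import logging
import os
from typing import Any, List, Mapping, Optional

logger = logging.getLogger("nixl_backend_sketch")

ENV_VAR = "NIXL_BACKEND"  # name from this issue; not read by vLLM per the report


def resolve_backends(
    extra_config: Mapping[str, Any],
    environ: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Explicit config wins, then the env var, then the current UCX default."""
    environ = os.environ if environ is None else environ
    env_value = environ.get(ENV_VAR, "").strip()

    if "backends" in extra_config:
        chosen = list(extra_config["backends"])
        if env_value and env_value not in chosen:
            # Option 3: tell the operator the env var is being ignored.
            logger.warning(
                "%s=%s is ignored; kv_connector_extra_config.backends=%s takes precedence",
                ENV_VAR, env_value, chosen,
            )
        return chosen

    if env_value:
        # Option 1: env fallback. Comma-separated lists are an assumption.
        return [b.strip() for b in env_value.split(",") if b.strip()]

    return ["UCX"]
```

Passing environ explicitly keeps the function easy to test; for instance, resolve_backends({}, {"NIXL_BACKEND": "LIBFABRIC"}) falls through to the env branch, while an empty mapping for both arguments yields the existing default.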

Evidence

Reproducible from commit <will-fill-in-once-filed> on our PR branch. Verified against nvcr.io/nvidia/ai-dynamo/vllm-runtime:1.1.0. Cross-references:

  • Downstream blocker: aws-samples/awsome-inference#72 — this limitation was the real root cause mis-diagnosed as a Dynamo operator issue in the related ai-dynamo/dynamo#9200.

cc whoever owns the NixlConnector module.
