vllm - [RFC]: Support Dynamic Model Switching and Flexible Collective Communication in External Launcher Mode

GitHub stats
vllm-project/vllm#38231 · Fetched 2026-04-08 01:37:11
Comments: 0 · Participants: 1 · Timeline events: 2 (labeled ×1, renamed ×1) · Reactions: 0

Motivation.

The current architecture of vLLM's external_launcher backend lacks the necessary flexibility to support high-efficiency resource pooling in dynamic, multi-tenant online inference scenarios.

  • Coarse-grained Resource Granularity: Resource pooling is essential for maximizing hardware efficiency in multi-task environments. However, vLLM’s default multiprocessing backend couples fixed resource pools to specific instances. This rigidity prevents the fluid reallocation of resources to models with varying parallelism strategies (e.g., repartitioning a TP=8 cluster into multiple TP=2 groups) without significant system friction.
  • Initialization Bottlenecks: While the external_launcher backend allows for granular accelerator mapping, it is currently restricted to offline inference. Switching models requires terminating and respawning launcher processes, which triggers a full re-initialization of the software stack. For example, import torch/torch_npu can take approximately 9s, incurring a massive cold-start penalty that fails to meet the latency requirements of real-time online services.
  • Static Communication Groups: Current external_launcher implementations lack the capability to dynamically rebuild collective communication groups (e.g., NCCL/HCCL). This prevents the system from adapting to live changes in model parallelism or elastic scaling without a full restart of the execution environment.
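
The repartitioning mentioned in the first bullet (one TP=8 pool split into several TP=2 groups) comes down to deciding which ranks form each group. A minimal sketch, assuming contiguous rank assignment; the helper name `partition_ranks` is hypothetical and not part of vLLM:

```python
def partition_ranks(world_size: int, tp_size: int) -> list[list[int]]:
    """Split ranks 0..world_size-1 into contiguous groups of tp_size ranks."""
    if world_size <= 0 or tp_size <= 0 or world_size % tp_size != 0:
        raise ValueError("world_size must be a positive multiple of tp_size")
    return [
        list(range(start, start + tp_size))
        for start in range(0, world_size, tp_size)
    ]


# Eight accelerators served as four independent TP=2 groups.
groups = partition_ranks(world_size=8, tp_size=2)
print(groups)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
```

Each inner list is the kind of rank set a communication group would be built over; real deployments may prefer a topology-aware assignment instead of contiguous ranks.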

Proposed Change.

We propose upgrading the external_launcher from a "one-shot" execution model to a persistent online model capable of:

  • Process Resident Reuse: Keeping the launcher processes alive across different inference sessions to bypass the startup penalty.
  • Dynamic Model Switching: Implementing a signaling mechanism to trigger model offloading/loading within the existing process boundary.
  • Flexible Collective Communication: Providing APIs to tear down and rebuild communication groups (e.g., re-initializing ProcessGroup or NCCL/HCCL contexts) based on new parallelism requirements.
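
One way to picture the signaling mechanism is a resident worker that serves commands from a queue instead of exiting after one job. The sketch below is an assumption for illustration only: it uses a thread in place of a launcher process, plain callables in place of real models, and the command names (`load`, `unload`, `infer`, `shutdown`) are hypothetical rather than taken from vLLM.

```python
import queue
import threading


def resident_worker(commands: queue.Queue, results: queue.Queue, loaders: dict) -> None:
    """Serve commands until 'shutdown'; the worker itself is never respawned."""
    model = None
    while True:
        op, arg = commands.get()
        if op == "load":
            model = loaders[arg]()  # swap the model inside the live worker
            results.put(("loaded", arg))
        elif op == "unload":
            model = None
            results.put(("unloaded", None))
        elif op == "infer":
            if model is None:
                results.put(("error", "no model loaded"))
            else:
                results.put(("output", model(arg)))
        elif op == "shutdown":
            results.put(("stopped", None))
            return
        else:
            results.put(("error", f"unknown op {op}"))


# Stand-in "models": factories returning plain callables (illustrative only).
loaders = {
    "double": lambda: (lambda x: 2 * x),
    "square": lambda: (lambda x: x * x),
}


def run_session(ops):
    """Send ops to one resident worker, then shut it down; return all replies."""
    commands, results = queue.Queue(), queue.Queue()
    worker = threading.Thread(target=resident_worker, args=(commands, results, loaders))
    worker.start()
    replies = []
    for op in list(ops) + [("shutdown", None)]:
        commands.put(op)
        replies.append(results.get())
    worker.join()
    return replies


print(run_session([("load", "double"), ("infer", 3), ("load", "square"), ("infer", 3)]))
```

The point of the design is that `load` replaces the model without leaving the loop, so whatever was imported or initialized at startup is paid for once.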

Feedback Period.

No response

CC List.

No response

Any Other Things.

No response


extent analysis

Fix Plan

To address the issues with the external_launcher backend, we will implement the following changes:

  • Process Resident Reuse: Modify the launcher process to remain alive across different inference sessions.
  • Dynamic Model Switching: Implement a signaling mechanism to trigger model offloading/loading within the existing process boundary.
  • Flexible Collective Communication: Provide APIs to tear down and rebuild communication groups based on new parallelism requirements.

Example Code Changes

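import torch
import torch.distributed as dist

# Resident launcher process: stays alive while models and groups are swapped.
# Illustrative sketch only, not vLLM's actual API.
class LauncherProcess:
    def __init__(self):
        self.model = None
        self.comm_group = None

    def load_model(self, model_path, ranks):
        # Load a fully pickled nn.Module. weights_only=False unpickles arbitrary
        # objects, so use it only with trusted files.
        self.model = torch.load(model_path, weights_only=False)
        self.model.eval()
        # Build a subgroup for the new parallel layout. Every rank in the
        # default group must call new_group(), even ranks outside `ranks`.
        self.comm_group = dist.new_group(ranks=ranks)

    def unload_model(self):
        # Drop the model and destroy only its subgroup; the default group survives.
        self.model = None
        if self.comm_group is not None:
            dist.destroy_process_group(self.comm_group)
            self.comm_group = None

    def run_inference(self, input_data):
        # Run inference using the loaded model
        if self.model is None:
            raise RuntimeError("Model not loaded")
        with torch.no_grad():
            return self.model(input_data)

# Initialize the default process group once per process lifetime. The external
# launcher (e.g. torchrun) is assumed to set RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT.
dist.init_process_group(backend="gloo")  # "nccl" on GPUs; "hccl" with torch_npu

# Create a launcher process and keep it alive
launcher = LauncherProcess()
input_data = torch.randn(1, 16)  # placeholder input; the shape depends on the model

# Load a model and run inference
launcher.load_model("model1.pt", ranks=[0])
output = launcher.run_inference(input_data)

# Unload the model and load a new one without restarting the process
launcher.unload_model()
launcher.load_model("model2.pt", ranks=[0])
output = launcher.run_inference(input_data)

# Tear down the default group only when the process finally exits
dist.destroy_process_group()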

Verification

To verify that the fix worked, you can test the following scenarios:

  • Load a model and run inference multiple times to ensure that the launcher process remains alive.
  • Unload a model and load a new one to ensure that the communication group is rebuilt correctly.
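  • Measure model-switch latency in the resident process and compare it with a cold start (process spawn plus imports, which the issue reports at approximately 9s for import torch/torch_npu) to confirm that the startup penalty is bypassed.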
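
The cold-start comparison can be approximated with a small timing harness. This is a rough sketch, not a benchmark methodology: it times a fresh interpreter importing a module against a repeat import inside an already running process, and the function names are hypothetical. The standard-library module `json` is used as a stand-in so the snippet runs anywhere; substituting `torch` or `torch_npu` would measure the case described in the issue.

```python
import importlib
import subprocess
import sys
import time


def time_cold_import(module: str) -> float:
    """Seconds to spawn a fresh interpreter and import `module` in it."""
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", f"import {module}"], check=True)
    return time.perf_counter() - start


def time_warm_import(module: str) -> float:
    """Seconds to re-import `module` in this process once it is already cached."""
    importlib.import_module(module)  # make sure the module is in sys.modules
    start = time.perf_counter()
    importlib.import_module(module)
    return time.perf_counter() - start


module_name = "json"  # stand-in; replace with "torch" to measure the real penalty
cold = time_cold_import(module_name)
warm = time_warm_import(module_name)
print(f"cold start: {cold:.4f}s, resident process: {warm:.6f}s")
```

A resident launcher only ever pays the first number once, which is the saving the proposal is after.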

Extra Tips

  • Make sure to handle errors and exceptions properly when loading and unloading models.
  • Consider implementing a caching mechanism to store loaded models and reduce the overhead of loading and unloading.
  • Use the torch.distributed module to manage the communication group and ensure that it is properly destroyed when the model is unloaded.
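
The caching tip could take the form of a small least-recently-used cache that sits in front of the model loader. A minimal sketch under stated assumptions: `ModelCache` and its `loader` callback are hypothetical names, and eviction here only drops the Python reference, whereas a real implementation would also need to release accelerator memory.

```python
from collections import OrderedDict
from typing import Any, Callable


class ModelCache:
    """Keep up to `capacity` loaded models, evicting the least recently used."""

    def __init__(self, capacity: int, loader: Callable[[str], Any]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._loader = loader
        self._models: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, name: str) -> Any:
        if name in self._models:
            self._models.move_to_end(name)  # mark as most recently used
            return self._models[name]
        model = self._loader(name)  # cache miss: pay the load cost once
        self._models[name] = model
        if len(self._models) > self._capacity:
            self._models.popitem(last=False)  # evict the least recently used
        return model


# Stand-in loader that records which names were actually loaded.
loaded = []
cache = ModelCache(capacity=2, loader=lambda n: loaded.append(n) or f"<{n}>")
for name in ["model1", "model2", "model1", "model3", "model2"]:
    cache.get(name)
print(loaded)  # ['model1', 'model2', 'model3', 'model2']
```

In the run above, `model2` is loaded twice because adding `model3` evicted it, which shows the trade-off between cache capacity and reload overhead.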
