vllm - 💡(How to fix) Fix [Feature]: Support iterative in-place weight editing on TP workers (online RLHF / steering / abliteration) [1 participants]

GitHub stats
vllm-project/vllm#40536 · Fetched 2026-04-22 07:43:57
Comments: 0 · Participants: 1 · Timeline: 0 · Reactions: 0


Motivation

We built a TP=4 abliteration pipeline for openai/gpt-oss-120b (117B params, 128 experts × 36 layers, MoE) that runs 150 Optuna trials each doing:

  1. Edit attn.o_proj.weight in-place on all TP workers
  2. Expert-granular edits to fused mlp.experts.w2_weight (128 experts × 36 layers)
  3. MoE router logit suppression
  4. Run an 800-prompt benchmark: (100 benign + 100 harmful) × 4 datasets
  5. Restore weights
  6. Next trial with new edit plan
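
The edit, benchmark, restore cycle in steps 1 to 6 can be sketched as a context manager that snapshots the values it is about to overwrite and puts them back afterwards. This is a minimal, framework-free sketch: `weights` is a plain dict of lists standing in for per-worker tensors, and `edited_weights` and `run_benchmark` are hypothetical names, not vLLM or Optuna APIs.

```python
from contextlib import contextmanager


@contextmanager
def edited_weights(weights, plan):
    """Apply `plan` in place, then restore the original values on exit.

    `weights` maps a parameter name to a mutable list (a stand-in for a tensor);
    `plan` maps a parameter name to the replacement values.
    """
    # Snapshot only the entries the plan touches, so the restore stays cheap.
    backup = {name: list(weights[name]) for name in plan}
    try:
        for name, new_values in plan.items():
            # Slice assignment mutates the existing list, mirroring tensor.copy_().
            weights[name][:] = new_values
        yield weights
    finally:
        # Restore even if the benchmark raised, so the next trial starts clean.
        for name, old_values in backup.items():
            weights[name][:] = old_values


def run_benchmark(weights):
    # Hypothetical scoring function; a real trial would run the prompt benchmark.
    return sum(sum(values) for values in weights.values())


weights = {"layers.0.o_proj": [1.0, 2.0], "layers.1.o_proj": [3.0, 4.0]}
trial_plans = [
    {"layers.0.o_proj": [0.0, 0.0]},
    {"layers.1.o_proj": [1.0, 1.0]},
]

scores = []
for plan in trial_plans:
    with edited_weights(weights, plan) as w:
        scores.append(run_benchmark(w))
```

Restoring in a `finally` block means a failed trial cannot leak its edit into the next one, which matters when 150 trials share one loaded model.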

Per-trial end-to-end time dropped from ~2 min (HF pipeline-parallel) to ~60s (vLLM TP=4 + collective_rpc in-place edit) — a ~2× throughput improvement. Same pattern applies to:

  • Online RLHF reward-model updates (single-process, no trainer/server split)
  • Test-time quantization experiments
  • LoRA merging at runtime
  • Abliteration / refusal suppression research
  • Quantization-aware steering

What works today

import os

# (2) Set before the engine starts so that worker processes inherit it
os.environ["VLLM_ALLOW_INSECURE_SERIALIZATION"] = "1"

from vllm import LLM

llm = LLM(
    model="/path/to/model",
    tensor_parallel_size=4,
    enforce_eager=True,            # (1) REQUIRED
    # ...
)

def _worker_edit(worker, plan):
    # Runs on each TP worker; the worker object is passed as the first argument
    model = worker.model_runner.model
    for layer_idx, w_new in plan.items():
        model.layers[layer_idx].self_attn.o_proj.weight.data.copy_(w_new)
    return len(plan)

# plan: dict mapping layer index -> new o_proj weight, built by the caller
# Ship edit to all TP workers (the callable is pickled, hence the env var above)
results = llm.llm_engine.collective_rpc(_worker_edit, args=(plan,))
llm.reset_prefix_cache()  # (3) REQUIRED after each edit

This works correctly, with throughput of ~2.5× HF pipeline-parallel for the 120B MoE model.

Workarounds required + cost

| Workaround | Reason | Perf cost |
| --- | --- | --- |
| enforce_eager=True | CUDA graphs capture weight pointers at graph build time. Post-capture mutations aren't seen. Without enforce_eager, subsequent forwards use stale weights silently. | ~30-50% throughput vs captured graphs |
| VLLM_ALLOW_INSECURE_SERIALIZATION=1 | collective_rpc refuses to pickle arbitrary callables by default (security: arbitrary code exec risk from client). Forces users to take on ALL serialization risk for any use case. | Security posture regression |
| VLLM_FUSED_MOE_UNQUANTIZED_BACKEND=triton (MoE only) | FLASHINFER_TRTLLM backend's process_weights_after_loading repacks w2_weight into an opaque block layout. In-place writes hit a tensor that's no longer used by the kernel: a silent no-op. | ~10-15% MoE throughput (triton is slower than TRTLLM) |

Net: with all workarounds, we get ~60% of vLLM's potential throughput. A first-class API would let users keep CUDA graphs + secure serialization + the fastest MoE kernel.
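
As a rough check on the ~60% figure, the two throughput costs in the table can be compounded. This sketch assumes the costs multiply and that the MoE backend penalty applies to end-to-end throughput, neither of which the issue states; `remaining_throughput` is a hypothetical helper.

```python
def remaining_throughput(*fractional_costs):
    """Fraction of baseline throughput left after compounding each cost."""
    remaining = 1.0
    for cost in fractional_costs:
        remaining *= 1.0 - cost
    return remaining


# Cost ranges quoted in the table: enforce_eager (~30-50%), triton MoE (~10-15%).
best_case = remaining_throughput(0.30, 0.10)   # smallest quoted costs
worst_case = remaining_throughput(0.50, 0.15)  # largest quoted costs

print(f"best case:  {best_case:.0%}")
print(f"worst case: {worst_case:.1%}")
```

Under these assumptions the remaining throughput falls between roughly 42% and 63% of the baseline, so the ~60% estimate corresponds to the low end of each quoted cost range.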

Proposal: LLM.patch_weights(plan) -> PatchResult

# Proposed high-level API
result = llm.patch_weights({
    "model.layers.0.self_attn.o_proj.weight": new_weight_tensor,
    "model.layers.0.mlp.experts.w2_weight":   new_w2_weight,  # fused MoE
})
# result: {"applied": 73, "errors": [], "cudagraphs_rebuilt": True}

Semantics:

  1. Broadcast plan to all TP workers via existing TP comm (no Python pickle needed — tensors are native).
  2. Each worker: resolve module path → locate base weight → copy_() (respecting any process_weights_after_loading re-packs).
  3. Invalidate CUDA graphs (rebuild lazily on next forward).
  4. reset_prefix_cache().
  5. Return per-layer status.

This unblocks enforce_eager=False + safe serialization + FlashInfer-TRTLLM MoE for the weight-editing use case.
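
Steps 2 and 5 of the semantics (resolve a dotted path, copy in place, report per-entry status) can be sketched without vLLM. The following is a toy model under stated assumptions: plain Python objects and lists stand in for modules and tensors, and `resolve`, `patch_weights`, and the result keys are hypothetical, modelled on the `applied` and `errors` fields in the proposed result. It does not cover the broadcast, CUDA graph invalidation, or repacked MoE layouts.

```python
from types import SimpleNamespace


def resolve(root, dotted_path):
    """Walk a dotted path such as 'model.layers.0.o_proj.weight'.

    Numeric components index into lists; other components are attributes.
    """
    obj = root
    for part in dotted_path.split("."):
        obj = obj[int(part)] if part.isdigit() else getattr(obj, part)
    return obj


def patch_weights(root, plan):
    """Copy each planned value into the existing buffer and collect errors."""
    result = {"applied": 0, "errors": []}
    for path, new_values in plan.items():
        try:
            target = resolve(root, path)
            if len(target) != len(new_values):
                raise ValueError(
                    f"shape mismatch: {len(target)} vs {len(new_values)}"
                )
            target[:] = new_values  # in place, so existing references see the edit
            result["applied"] += 1
        except (AttributeError, IndexError, TypeError, ValueError) as exc:
            # A bad entry is reported rather than aborting the whole plan.
            result["errors"].append(f"{path}: {exc}")
    return result


# Toy model: two layers, each with an o_proj whose weight is a flat list.
layers = [SimpleNamespace(o_proj=SimpleNamespace(weight=[0.0, 0.0])) for _ in range(2)]
root = SimpleNamespace(model=SimpleNamespace(layers=layers))
kept_reference = layers[0].o_proj.weight  # stands in for a pointer held elsewhere

result = patch_weights(root, {
    "model.layers.0.o_proj.weight": [1.0, 2.0],
    "model.layers.5.o_proj.weight": [9.0, 9.0],   # no such layer
    "model.layers.1.o_proj.weight": [1.0],        # wrong length
})
```

Validating the shape before writing keeps a malformed entry from partially overwriting a buffer, and collecting errors per path matches the "return per-layer status" step rather than failing the whole plan on the first bad entry.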

Complementarity with #39451

  • #39451 / #40096 (bedeks et al.): Trainer-process → vLLM-process over NCCL with sparse masks. Great for async RLHF where the trainer is separate.
  • This proposal: Driver/user code → TP workers in the same process. Great for closed-loop tools (Optuna, evolutionary search, test-time adaptation) where an external trainer is overkill.

They share the "patch weights then resume inference" shape but differ in the plumbing layer. The high-level LLM.patch_weights() could be the user-facing API for BOTH — with NCCL-sparse as the backend when source is another process, and collective-broadcast as the backend when source is the driver.

Benchmark (reference, not required)

gpt-oss-120b abliteration, 4× RTX PRO 6000 96GB, TP=4, 150 Optuna trials:

| Config | Per-trial time | Total wall-clock |
| --- | --- | --- |
| HF pipeline-parallel | ~125s | ~5.2h |
| vLLM TP=4 + collective_rpc + enforce_eager + triton MoE (current) | ~60s | ~2.5h |
| vLLM TP=4 + patch_weights() (proposed, graphs on, TRTLLM MoE) | ~30s est. | ~1.3h est. |
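
The total wall-clock column follows from the per-trial times and the 150-trial count, which can be checked directly. The configuration labels below are shortened, and the ~30s entry is the issue's estimate, not a measurement.

```python
TRIALS = 150

per_trial_seconds = {
    "HF pipeline-parallel": 125,
    "vLLM TP=4, current workarounds": 60,
    "vLLM TP=4, patch_weights() (estimated)": 30,
}

# Total time is trial count times per-trial seconds, converted to hours.
total_hours = {name: TRIALS * secs / 3600 for name, secs in per_trial_seconds.items()}

# Speedup of each configuration relative to the HF baseline.
baseline = per_trial_seconds["HF pipeline-parallel"]
speedup = {name: baseline / secs for name, secs in per_trial_seconds.items()}

for name in per_trial_seconds:
    print(f"{name}: {total_hours[name]:.2f} h, {speedup[name]:.1f}x vs HF")
```

The computed totals are about 5.2 h, 2.5 h, and 1.25 h, which agree with the table after rounding.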

What would make this issue actionable

  1. Confirmation that patch_weights() is within scope for vllm.LLM public API.
  2. Guidance on:
    • How to mark weights as "editable" so process_weights_after_loading doesn't repack (MoE specifically).
    • How to invalidate + rebuild CUDA graphs without full engine re-init.
  3. Whether this should land as a separate method or extend reload_weights().

I'm happy to prototype the driver-initiated path (broadcast plan → apply_model(fn) → reset_prefix_cache()) as a follow-up PR if the design is approved.

Evidence / code

Current abliterix implementation (for reference, not a submission):

extent analysis

TL;DR

Implement a patch_weights() method in the vllm.LLM class to enable efficient and secure weight editing for TP workers.

Guidance

  1. Define the patch_weights() API: Determine the exact semantics and parameters of the patch_weights() method, including how to handle errors and invalidation of CUDA graphs.
  2. Implement weight editing: Use the existing TP communication infrastructure to broadcast the weight editing plan to all TP workers and apply the changes to the corresponding weights.
  3. Invalidate and rebuild CUDA graphs: Develop a mechanism to invalidate the CUDA graphs after weight editing and rebuild them lazily on the next forward pass.
  4. Integrate with reset_prefix_cache(): Ensure that the reset_prefix_cache() method is called after weight editing to maintain consistency.

Example

def _apply_weight_edit(worker, plan):
    # Runs on each TP worker; collective_rpc passes the worker as the first argument,
    # so this is a module-level function rather than a method of LLM
    model = worker.model_runner.model
    for layer_idx, w_new in plan.items():
        model.layers[layer_idx].self_attn.o_proj.weight.data.copy_(w_new)
    return len(plan)

class LLM:
    #...

    def patch_weights(self, plan):
        # Broadcast plan to all TP workers and apply it on each one
        results = self.llm_engine.collective_rpc(_apply_weight_edit, args=(plan,))

        # Invalidate CUDA graphs and rebuild lazily
        self._invalidate_cudagraphs()

        # Reset prefix cache
        self.reset_prefix_cache()

        return results

    def _invalidate_cudagraphs(self):
        # Placeholder: how to invalidate and rebuild graphs without a full
        # engine re-init is one of the open questions raised in the issue
        pass

Notes

The implementation of patch_weights() will require careful consideration of the trade-offs between performance, security, and usability. The proposed API should be designed to accommodate various use cases, including online RLHF reward-model updates, test-time quantization experiments, and LoRA merging at runtime.

Recommendation

Apply the proposed patch_weights() method to the vllm.LLM class, as it provides a more efficient and secure way to edit weights for TP workers, allowing for better performance and usability.
