vllm - 💡(How to fix) Fix [Feature]: Support iterative in-place weight editing on TP workers (online RLHF / steering / abliteration) [1 participants]

vllm · 2026-04-21 17:32:25

vllm-project/vllm#40536•Fetched 2026-04-22 07:43:57

Author

wuwangzhang1216

Participants

wuwangzhang1216


Motivation

We built a TP=4 abliteration pipeline for openai/gpt-oss-120b (117B params, 128 experts × 36 layers, MoE) that runs 150 Optuna trials each doing:

1. Edit `attn.o_proj.weight` in place on all TP workers.
2. Apply expert-granular edits to the fused `mlp.experts.w2_weight` (128 experts × 36 layers).
3. Suppress MoE router logits.
4. Run an 800-prompt benchmark (100 benign + 100 harmful × 4 datasets).
5. Restore the weights.
6. Start the next trial with a new edit plan.
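The edit, benchmark, restore cycle above can be sketched as a context manager that snapshots the targeted weights and always restores them, even when the benchmark raises. This is a minimal sketch, not the abliterix implementation: plain Python lists stand in for weight tensors, and `temporary_edit`, `weights`, and `plan` are hypothetical names introduced for illustration.

```python
from contextlib import contextmanager
from copy import deepcopy


@contextmanager
def temporary_edit(weights, plan):
    """Apply `plan` to `weights` in place, then restore the originals on exit.

    `weights` maps a parameter name to a mutable list of floats (a stand-in
    for a tensor); `plan` maps the same names to replacement values.
    """
    # Snapshot only the entries the plan touches.
    snapshot = {name: deepcopy(weights[name]) for name in plan}
    try:
        for name, new_values in plan.items():
            if len(new_values) != len(weights[name]):
                raise ValueError(f"shape mismatch for {name}")
            weights[name][:] = new_values  # in-place write, like copy_()
        yield weights
    finally:
        # Runs on normal exit and on exceptions, so trials never leak edits.
        for name, original in snapshot.items():
            weights[name][:] = original


weights = {"layers.0.o_proj": [1.0, 2.0], "layers.1.o_proj": [3.0, 4.0]}
plan = {"layers.0.o_proj": [0.5, 0.5]}

with temporary_edit(weights, plan) as edited:
    during = list(edited["layers.0.o_proj"])  # what the benchmark would see

after = list(weights["layers.0.o_proj"])
```

In a real pipeline the snapshot would live on each TP worker next to the tensors it protects, so restoring would not require shipping the original weights back from the driver.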

Per-trial end-to-end time dropped from ~2 min (HF pipeline-parallel) to ~60s (vLLM TP=4 + collective_rpc in-place edit) — a ~2× throughput improvement. Same pattern applies to:

- Online RLHF reward-model updates (single-process, no trainer/server split)
- Test-time quantization experiments
- LoRA merging at runtime
- Abliteration / refusal suppression research
- Quantization-aware steering

What works today

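import os

# (2) REQUIRED: collective_rpc refuses to pickle callables without this.
# Setting it before the LLM is constructed ensures that spawned worker
# processes inherit it.
os.environ["VLLM_ALLOW_INSECURE_SERIALIZATION"] = "1"

from vllm import LLM

llm = LLM(
    model="/path/to/model",
    tensor_parallel_size=4,
    enforce_eager=True,            # (1) REQUIRED
    # ...
)

def _worker_edit(worker, plan):
    # Runs on every TP worker. `plan` maps layer index -> new o_proj weight;
    # each w_new must be shaped for this worker's local shard of o_proj.
    model = worker.model_runner.model
    for layer_idx, w_new in plan.items():
        model.layers[layer_idx].self_attn.o_proj.weight.data.copy_(w_new)
    return len(plan)

# Ship the edit to all TP workers (`plan` is built by the caller)
results = llm.llm_engine.collective_rpc(_worker_edit, args=(plan,))
llm.reset_prefix_cache()  # (3) REQUIRED after each edit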

Works correctly, with throughput ~2× HF PP for the 120B MoE (~60s vs ~125s per trial).

Workarounds required + cost

| Workaround | Reason | Perf cost |
| --- | --- | --- |
| `enforce_eager=True` | CUDA graphs capture weight pointers at graph build time. Post-capture mutations aren't seen. Without `enforce_eager`, subsequent forwards silently use stale weights. | ~30-50% lower throughput than with captured graphs |
| `VLLM_ALLOW_INSECURE_SERIALIZATION=1` | `collective_rpc` refuses to pickle arbitrary callables by default (security: arbitrary code exec risk from client). Setting the variable forces users to take on ALL serialization risk for any use case. | Security posture regression |
| `VLLM_FUSED_MOE_UNQUANTIZED_BACKEND=triton` (MoE only) | The `FLASHINFER_TRTLLM` backend's `process_weights_after_loading` repacks `w2_weight` into an opaque block layout. In-place writes hit a tensor that the kernel no longer uses, so the edit is a silent no-op. | ~10-15% lower MoE throughput (triton is slower than TRTLLM) |

Net: with all workarounds, we get ~60% of vLLM's potential throughput. A first-class API would let users keep CUDA graphs + secure serialization + the fastest MoE kernel.
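As a rough sanity check on the ~60% figure, the two throughput penalties in the table can be compounded. This sketch assumes the penalties multiply and that the MoE penalty applies to end-to-end throughput, neither of which the issue states; the serialization workaround has no listed throughput cost and is left out.

```python
# Throughput penalties from the table, as (low, high) fractional losses.
eager_loss = (0.30, 0.50)       # enforce_eager=True vs captured CUDA graphs
triton_moe_loss = (0.10, 0.15)  # triton vs FLASHINFER_TRTLLM MoE backend


def remaining(*losses):
    """Fraction of throughput left after applying each loss in sequence."""
    kept = 1.0
    for loss in losses:
        kept *= 1.0 - loss
    return kept


best_case = remaining(eager_loss[0], triton_moe_loss[0])   # 0.70 * 0.90
worst_case = remaining(eager_loss[1], triton_moe_loss[1])  # 0.50 * 0.85

print(f"remaining throughput: {worst_case:.1%} to {best_case:.1%}")
```

Under these assumptions the remaining throughput falls between roughly 43% and 63%, so the quoted ~60% sits near the optimistic end of the range.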

Proposal: `LLM.patch_weights(plan) -> PatchResult`

# Proposed high-level API
result = llm.patch_weights({
    "model.layers.0.self_attn.o_proj.weight": new_weight_tensor,
    "model.layers.0.mlp.experts.w2_weight":   new_w2_weight,  # fused MoE
})
# result: {"applied": 73, "errors": [], "cudagraphs_rebuilt": True}

Semantics:

1. Broadcast the plan to all TP workers via the existing TP comm (no Python pickle needed, since tensors are native).
2. On each worker: resolve module path → locate base weight → `copy_()` (respecting any `process_weights_after_loading` re-packs).
3. Invalidate CUDA graphs (rebuild lazily on the next forward).
4. Call `reset_prefix_cache()`.
5. Return per-layer status.

This unblocks enforce_eager=False + safe serialization + FlashInfer-TRTLLM MoE for the weight-editing use case.
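The worker-side steps (resolve the module path, locate the weight, copy in place, report status) might look like the sketch below. It is not vLLM code: `PatchResult`, `resolve`, and `apply_plan` are hypothetical, `SimpleNamespace` objects and Python lists stand in for modules and tensors, and repack handling and CUDA graph invalidation are omitted because the issue leaves their design open.

```python
from dataclasses import dataclass, field
from types import SimpleNamespace


@dataclass
class PatchResult:
    applied: int = 0
    errors: list = field(default_factory=list)


def resolve(root, dotted_path):
    """Walk a path such as 'model.layers.0.self_attn.o_proj.weight' from `root`.

    Numeric components index into lists; other components are attributes.
    """
    obj = root
    for part in dotted_path.split("."):
        obj = obj[int(part)] if part.isdigit() else getattr(obj, part)
    return obj


def apply_plan(root, plan):
    """Copy each new value into the resolved weight, collecting per-key errors."""
    result = PatchResult()
    for path, new_values in plan.items():
        try:
            target = resolve(root, path)
            if len(target) != len(new_values):
                raise ValueError(
                    f"expected {len(target)} values, got {len(new_values)}"
                )
            target[:] = new_values  # in-place write, analogous to copy_()
            result.applied += 1
        except (AttributeError, IndexError, TypeError, ValueError) as exc:
            result.errors.append((path, str(exc)))
    return result


# Toy model: one layer with an o_proj weight of four values.
o_proj = SimpleNamespace(weight=[0.0, 0.0, 0.0, 0.0])
layer = SimpleNamespace(self_attn=SimpleNamespace(o_proj=o_proj))
root = SimpleNamespace(model=SimpleNamespace(layers=[layer]))

result = apply_plan(root, {
    "model.layers.0.self_attn.o_proj.weight": [1.0, 2.0, 3.0, 4.0],
    "model.layers.7.self_attn.o_proj.weight": [1.0],  # no such layer
})
```

Collecting errors per key instead of raising on the first failure matches the `{"applied": ..., "errors": [...]}` shape of the proposed return value; whether a partially applied plan should instead be rolled back is a design choice the issue does not settle.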

Complementarity with #39451

- #39451 / #40096 (bedeks et al.): Trainer-process → vLLM-process over NCCL with sparse masks. Great for async RLHF where the trainer is separate.
- This proposal: Driver/user code → TP workers in the same process. Great for closed-loop tools (Optuna, evolutionary search, test-time adaptation) where an external trainer is overkill.

They share the "patch weights then resume inference" shape but differ in the plumbing layer. The high-level LLM.patch_weights() could be the user-facing API for BOTH — with NCCL-sparse as the backend when source is another process, and collective-broadcast as the backend when source is the driver.
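One way to picture a single entry point over both plumbing layers is a small dispatch on where the new weights come from. Everything here is hypothetical: `WeightSource`, the two backend functions, and their return values are placeholders for illustration, and neither #39451 nor this proposal defines such an interface.

```python
from enum import Enum
from typing import Callable, Dict


class WeightSource(Enum):
    DRIVER = "driver"            # user code in the driver process
    REMOTE_TRAINER = "trainer"   # separate trainer process


def collective_broadcast_backend(plan: dict) -> str:
    # Placeholder for the driver-initiated path proposed in this issue.
    return f"broadcast {len(plan)} tensors to TP workers"


def nccl_sparse_backend(plan: dict) -> str:
    # Placeholder for the trainer-to-vLLM path discussed in #39451 / #40096.
    return f"received {len(plan)} sparse updates over NCCL"


BACKENDS: Dict[WeightSource, Callable[[dict], str]] = {
    WeightSource.DRIVER: collective_broadcast_backend,
    WeightSource.REMOTE_TRAINER: nccl_sparse_backend,
}


def patch_weights(plan: dict, source: WeightSource = WeightSource.DRIVER) -> str:
    """Route a patch plan to the backend that matches its source."""
    return BACKENDS[source](plan)


plan = {"model.layers.0.self_attn.o_proj.weight": [0.1, 0.2]}
driver_msg = patch_weights(plan)
trainer_msg = patch_weights(plan, source=WeightSource.REMOTE_TRAINER)
```

Keeping the backends behind one function means the steps after the transfer (graph invalidation, prefix-cache reset, status reporting) could be shared regardless of how the tensors arrive.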

Benchmark (reference, not required)

gpt-oss-120b abliteration, 4× RTX PRO 6000 96GB, TP=4, 150 Optuna trials:

| Config | Per-trial time | Total wall-clock |
| --- | --- | --- |
| HF pipeline-parallel | ~125s | ~5.2h |
| vLLM TP=4 + collective_rpc + enforce_eager + triton MoE (current) | ~60s | ~2.5h |
| vLLM TP=4 + `patch_weights()` (proposed, graphs on, TRTLLM MoE) | ~30s est. | ~1.3h est. |
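The total wall-clock column follows from the per-trial times and the 150-trial count. The short check below reproduces it; the 30s figure is the issue's own estimate, not a measurement, and the dictionary keys are shortened labels for the three table rows.

```python
TRIALS = 150

per_trial_seconds = {
    "HF pipeline-parallel": 125,
    "vLLM TP=4, current workarounds": 60,
    "vLLM TP=4, patch_weights() (estimated)": 30,
}

# Total hours = trials * seconds per trial / 3600.
total_hours = {
    name: TRIALS * seconds / 3600 for name, seconds in per_trial_seconds.items()
}

for name, hours in total_hours.items():
    print(f"{name}: {hours:.2f} h")
```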

What would make this issue actionable

- Confirmation that `patch_weights()` is within scope for the `vllm.LLM` public API.
- Guidance on:
  - How to mark weights as "editable" so `process_weights_after_loading` doesn't repack (MoE specifically).
  - How to invalidate + rebuild CUDA graphs without full engine re-init.
- Whether this should land as a separate method or extend `reload_weights()`.

I'm happy to prototype the driver-initiated path (broadcast plan → apply_model(fn) → reset_prefix_cache()) as a follow-up PR if the design is approved.

Evidence / code

Current abliterix implementation (for reference, not a submission):

- VLLMAttentionEditor (handles fused qkv_proj slicing on TP): https://github.com/wuwangzhang1216/abliterix/blob/master/src/abliterix/core/vllm_moe_editor.py#L1365
- VLLMExpertEditor (MoE w2_weight EGA): same file, earlier class
- collective_rpc workflow: https://github.com/wuwangzhang1216/abliterix/blob/master/src/abliterix/core/steering.py (search _apply_direct_steering_vllm)
- Abliterated model (shipped): https://huggingface.co/wangzhang/gpt-oss-120b-abliterated

extent analysis

TL;DR

Implement a patch_weights() method in the vllm.LLM class to enable efficient and secure weight editing for TP workers.

Guidance

- Define the patch_weights() API: Determine the exact semantics and parameters of the patch_weights() method, including how to handle errors and invalidation of CUDA graphs.
- Implement weight editing: Use the existing TP communication infrastructure to broadcast the weight editing plan to all TP workers and apply the changes to the corresponding weights.
- Invalidate and rebuild CUDA graphs: Develop a mechanism to invalidate the CUDA graphs after weight editing and rebuild them lazily on the next forward pass.
- Integrate with reset_prefix_cache(): Ensure that the reset_prefix_cache() method is called after weight editing to maintain consistency.

Example

def _apply_weight_edit(worker, plan):
    # Runs on each TP worker, which collective_rpc passes as the first
    # argument. Kept at module level so the driver-side LLM object is not
    # serialized along with a bound method.
    model = worker.model_runner.model
    for layer_idx, w_new in plan.items():
        model.layers[layer_idx].self_attn.o_proj.weight.data.copy_(w_new)
    return len(plan)


class LLM:
    #...

    def patch_weights(self, plan):
        # Send the plan to all TP workers and apply it there
        results = self.llm_engine.collective_rpc(_apply_weight_edit, args=(plan,))

        # Invalidate CUDA graphs so they are rebuilt lazily
        self._invalidate_cudagraphs()

        # Reset prefix cache
        self.reset_prefix_cache()

        return results

    def _invalidate_cudagraphs(self):
        # Placeholder: how to invalidate and rebuild CUDA graphs without a
        # full engine re-init is an open question in the issue.
        pass

Notes

The implementation of patch_weights() will require careful consideration of the trade-offs between performance, security, and usability. The proposed API should be designed to accommodate various use cases, including online RLHF reward-model updates, test-time quantization experiments, and LoRA merging at runtime.

Recommendation

Add the proposed patch_weights() method to the vllm.LLM class: it would give weight editing on TP workers a more efficient and secure path than the current workarounds.
