vllm - [RFC] External post-generation classifier hook API

Motivation

Many production deployments need to attach a post-generation classifier to vLLM so that every completion is scored for safety, hallucination, audit, or A/B routing before the response is returned to the client. Today the only ways to do this are:

  1. Monkey-patch build_app() in vllm.entrypoints.openai.api_server to inject a Starlette middleware that wraps the response. This is fragile across vLLM minor releases and breaks with stream=True, because the response body is already partially flushed by the time the middleware sees it.
  2. Run vLLM behind a reverse proxy that re-parses the OpenAI JSON, calls the classifier, and re-serializes. This adds a second parse/serialize pass to every request and loses access to per-engine signals (prompt logprobs, KV state, etc.).
  3. Subclass OpenAIServingChat / OpenAIServingCompletion and override create_chat_completion. This survives one or two releases at most, because vLLM iterates fast on those classes.

None of these is comfortable to maintain across vLLM upgrades. In practice this pushes safety vendors into pinning vLLM versions, which hurts the ecosystem.

What is missing is a small, public, opt-in hook surface where an external classifier registers a callable that receives the request, the generated text, and the engine's metadata for that request, and either returns scores (non-blocking) or returns a decision (blocking, with an optional replacement body).

Proposed API

from dataclasses import dataclass
from typing import Any, Protocol

@dataclass
class ScoringContext:
    request_id: str
    prompt: str
    generated_text: str
    extra_fields: dict[str, Any]   # from OpenAIBaseModel.get_extra_fields()
    finish_reason: str
    prompt_token_ids: list[int]
    output_token_ids: list[int]
    request_metadata: dict[str, Any]   # opaque per-request state

class ExternalClassifierHook(Protocol):
    name: str
    blocking: bool                  # True = decision; False = annotation only
    timeout_ms: int                 # per-call timeout

    async def score(self, ctx: ScoringContext) -> dict[str, Any]:
        """Return a dict of scores. If `blocking` is True and the dict
        contains the key 'block' with a truthy value, the response is
        replaced with `dict['replacement']` or with a default refusal."""
        ...

# Proposed registration on LLMEngine / AsyncLLMEngine
# (`engine` and `hook` are constructed elsewhere):
engine.register_external_classifier_hook(hook)

# Propagation: every classifier's output appears under
# RequestOutput.metadata["external_scores"][hook.name].
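
To make the contract concrete, here is a minimal sketch of a hook that would satisfy the proposed protocol. Everything in it is illustrative: `KeywordBlocklistHook` is a hypothetical classifier, and the `ScoringContext` defined here is a local stand-in that mirrors the proposed fields (with defaults added for brevity), not an existing vLLM class.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ScoringContext:
    """Local stand-in mirroring the proposed fields (not a vLLM class)."""
    request_id: str
    prompt: str
    generated_text: str
    finish_reason: str = "stop"
    extra_fields: dict[str, Any] = field(default_factory=dict)
    prompt_token_ids: list[int] = field(default_factory=list)
    output_token_ids: list[int] = field(default_factory=list)
    request_metadata: dict[str, Any] = field(default_factory=dict)


class KeywordBlocklistHook:
    """Hypothetical blocking hook: flags completions containing banned terms."""
    name = "keyword_blocklist"
    blocking = True
    timeout_ms = 50

    def __init__(self, banned: set[str]) -> None:
        self._banned = {term.lower() for term in banned}

    async def score(self, ctx: ScoringContext) -> dict[str, Any]:
        text = ctx.generated_text.lower()
        hits = sorted(term for term in self._banned if term in text)
        result: dict[str, Any] = {"hits": hits, "block": bool(hits)}
        if hits:
            # Optional replacement body; a default refusal would apply otherwise.
            result["replacement"] = "Sorry, I can't help with that."
        return result


async def _demo() -> dict[str, Any]:
    hook = KeywordBlocklistHook({"secret-token"})
    ctx = ScoringContext(
        request_id="req-1",
        prompt="hi",
        generated_text="the SECRET-TOKEN is 123",
    )
    return await hook.score(ctx)


demo_result = asyncio.run(_demo())
```

Because the hook is a structural `Protocol`, a vendor class like this would not need to inherit from anything in vLLM; it only needs the three attributes and an async `score` method.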

Lifecycle

  1. Generation completes (or finishes a token boundary, for streaming).
  2. The engine builds a ScoringContext.
  3. All registered hooks are invoked concurrently with asyncio.gather, each bounded by its declared timeout.
  4. Hook return values are merged into RequestOutput.metadata["external_scores"].
  5. If any blocking hook returned a "block" decision, the response is replaced before serialization.
  6. The OpenAI serving layer serializes the modified response.
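
The six steps above can be sketched as a small dispatcher. This is only an illustration of the intended flow, not engine code: `run_hooks` and `DEFAULT_REFUSAL` are hypothetical names, and the fail-open handling of timeouts and exceptions (record an error entry, do not block) is an assumption, since the RFC does not yet specify that policy.

```python
import asyncio
from typing import Any, Optional

DEFAULT_REFUSAL = "The response was withheld by a content policy."  # assumed default


async def run_hooks(hooks: list[Any], ctx: Any) -> tuple[dict[str, Any], Optional[str]]:
    """Run all hooks concurrently; return (external_scores, replacement or None)."""

    async def call(hook: Any) -> dict[str, Any]:
        try:
            # Each hook is bounded by its own declared timeout.
            return await asyncio.wait_for(hook.score(ctx), hook.timeout_ms / 1000)
        except asyncio.TimeoutError:
            return {"error": "timeout"}  # assumption: fail open on timeout
        except Exception as exc:
            return {"error": repr(exc)}  # assumption: fail open on hook errors

    results = await asyncio.gather(*(call(h) for h in hooks))
    scores = {h.name: r for h, r in zip(hooks, results)}

    replacement: Optional[str] = None
    for hook, result in zip(hooks, results):
        # Only hooks registered as blocking may replace the response.
        if hook.blocking and result.get("block"):
            replacement = result.get("replacement", DEFAULT_REFUSAL)
            break
    return scores, replacement


class _SlowAnnotator:
    name, blocking, timeout_ms = "slow", False, 10

    async def score(self, ctx: Any) -> dict[str, Any]:
        await asyncio.sleep(1)  # exceeds the 10 ms budget
        return {"score": 0.1}


class _Blocker:
    name, blocking, timeout_ms = "blocker", True, 100

    async def score(self, ctx: Any) -> dict[str, Any]:
        return {"block": True}


scores, replacement = asyncio.run(run_hooks([_SlowAnnotator(), _Blocker()], ctx=None))
```

In this sketch the slow annotator is cut off without delaying the blocking decision beyond its own timeout, which is the head-of-line property the async-only design is after.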

For streaming responses, blocking hooks are invoked at end-of-stream and can replace the final chunk only. Hooks that want to inspect mid-stream tokens go through the (future) forward-hook API, not this one.
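
One way to picture the streaming rule is a wrapper that forwards chunks as they arrive while holding back only the most recent one, so that the last chunk can still be swapped at end-of-stream. This is a sketch under stated assumptions: `guard_final_chunk` and `decide` are hypothetical names, chunks are plain strings rather than OpenAI delta objects, and the real serving layer may buffer differently.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable, Optional


async def guard_final_chunk(
    chunks: AsyncIterator[str],
    decide: Callable[[str], Awaitable[Optional[str]]],
) -> AsyncIterator[str]:
    """Forward chunks, holding back the latest one until the stream ends.

    At end-of-stream, `decide(full_text)` may return a replacement for the
    final chunk. Earlier chunks were already sent and cannot be recalled.
    """
    seen: list[str] = []
    pending: Optional[str] = None
    async for chunk in chunks:
        if pending is not None:
            yield pending
        seen.append(chunk)
        pending = chunk
    if pending is None:
        return  # empty stream: nothing to score or replace
    replacement = await decide("".join(seen))
    yield replacement if replacement is not None else pending


async def _demo() -> list[str]:
    async def source() -> AsyncIterator[str]:
        for piece in ["Hello, ", "the password is ", "hunter2"]:
            yield piece

    async def decide(full_text: str) -> Optional[str]:
        return "[redacted]" if "hunter2" in full_text else None

    return [c async for c in guard_final_chunk(source(), decide)]


streamed = asyncio.run(_demo())
```

The sketch also shows the limitation stated above: anything emitted before the final chunk has already reached the client, which is why mid-stream inspection is left to the forward-hook API.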

Non-goals

  • Mid-stream token mutation. That belongs in a separate RFC about the forward-hook API.
  • Synchronous hooks. Everything is async to allow concurrent classifiers without head-of-line blocking.
  • Registering hooks via env var or CLI. Registration is programmatic (engine.register_external_classifier_hook(hook)) and is meant to be invoked by a vllm.general_plugins entry-point.
  • New auth/quota surface. Hooks run with the same trust level as the engine; vendors are expected to harden their own hook code.

Alternatives considered

  • Middleware-only. Cannot see prompt_token_ids, prompt logprobs, or the per-request metadata the engine keeps. It is also bypassed on the raw-text /v1/completions path.
  • gRPC sidecar. Adds a hop, complicates deployment, and makes the per-token APIs (logprobs) unavailable.
  • RequestOutput post-processor decorator. Closer in spirit but does not address streaming or the blocking case, and forces every consumer to register at app build time.

Open questions

  1. Should score() be allowed to set sampling parameters for a retry? Useful for "if the first generation was flagged, regenerate at temperature=0" but it complicates the contract considerably.
  2. Should hook outputs be serialized into the OpenAI response automatically (under a metadata field) or kept on the Python side only? My current preference is the latter so that we do not pollute the OpenAI schema.
  3. Where exactly to mount the registration: on LLMEngine, on AsyncLLMEngine, on the OpenAIServing* classes, or all of the above? My current preference is AsyncLLMEngine because that is what every entrypoint constructs and it shares the lifetime of the engine.
  4. Should the hook be allowed to access the KV cache? Probably no for v1, yes for a future RFC.

Prior art

  • Anthropic's "guardrails" approach, conceptually similar but server-internal.
  • NeMo Guardrails, runs in front of the model; this RFC would let it run inside.
  • Llama Guard, currently has to be wired manually by every vLLM user.

Happy to iterate on the shape. The minimum viable API for me is "an async callable that runs after generation and whose output ends up on RequestOutput.metadata". Everything else (timeouts, blocking semantics, replacement body) is negotiable.
