litellm - ✅(Solved) Fix WebSocket /v1/responses requires ?model= query param, breaking OpenAI spec compatibility [1 pull requests, 1 participants]

GitHub stats
BerriAI/litellm#25532 (fetched 2026-04-11 06:13:33)
Comments: 0
Participants: 1
Timeline: 1 (labeled ×1)
Reactions: 0

Root Cause

In litellm/proxy/response_api_endpoints/endpoints.py:

@router.websocket("/v1/responses")
@router.websocket("/responses")
async def responses_websocket_endpoint(
    websocket: WebSocket,
    model: str = fastapi.Query(
        ..., description="The model to use for the responses WebSocket session."
    ),
    user_api_key_dict=Depends(user_api_key_auth_websocket),
):

model: str = fastapi.Query(...) makes it a required URL query parameter. When a client connects without it, FastAPI returns 403 at the framework level before the handler or auth logic runs.

Fix / Workaround

Until the proxy is patched, passing ?model= in the URL works, although it is not part of the OpenAI spec:

async with connect(
    "ws://localhost:4000/v1/responses?model=gpt-5.4",
    additional_headers={"Authorization": "Bearer sk-1234"},
) as ws:
    await ws.send('{"type":"response.create","model":"gpt-5.4","input":"say hi"}')

PR fix notes

PR #25702: fix(responses-ws): make model query param optional, extract from first message

Description (problem / solution / changelog)

Summary

The WebSocket endpoint for /v1/responses requires model as a mandatory URL query parameter (fastapi.Query(...)). However, the OpenAI WebSocket spec sends model inside the response.create message payload, not as a URL query param.

This means any spec-compliant client (e.g. Codex CLI) connecting to ws://.../v1/responses without ?model= gets a 403 Forbidden, because FastAPI rejects the missing required query param before the handler or auth logic even runs.

Changes

  1. _ReplayWebSocket wrapper class: thin wrapper that replays one pre-read message on the first receive_text() call, so downstream handlers (ManagedResponsesWebSocketHandler, ResponsesWebSocketStreaming) see the message as normal.

  2. Make model query param optional (default None): clients can still pass ?model= for backward compatibility.

  3. Extract model from first message: when model is None, the handler reads the first WebSocket message after accept(), extracts the model from the response.create payload (supporting both nested {"response": {"model": "..."}} and flat {"model": "..."} formats), wraps the websocket with _ReplayWebSocket so the message is not lost, then proceeds with normal pre-call processing and routing.
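The replay-and-extract approach described in the changes above can be sketched roughly as follows. This is a minimal illustration, not the code from PR #25702: the names ReplayWebSocket and extract_model are hypothetical, and the wrapped socket is only assumed to expose an async receive_text() method, as Starlette's WebSocket does.

```python
import asyncio
import json
from typing import Any, Optional


class ReplayWebSocket:
    """Hypothetical wrapper: replays one pre-read text message, then delegates."""

    def __init__(self, websocket: Any, first_message: str) -> None:
        self._websocket = websocket
        self._pending: Optional[str] = first_message

    async def receive_text(self) -> str:
        # The first call returns the message that was already consumed.
        if self._pending is not None:
            message, self._pending = self._pending, None
            return message
        return await self._websocket.receive_text()

    def __getattr__(self, name: str) -> Any:
        # Everything else (send_text, close, ...) goes to the real socket.
        return getattr(self._websocket, name)


def extract_model(raw_message: str) -> Optional[str]:
    """Return the model from a nested or flat response.create payload, else None."""
    try:
        payload = json.loads(raw_message)
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    nested = payload.get("response")
    if isinstance(nested, dict) and isinstance(nested.get("model"), str):
        return nested["model"]
    model = payload.get("model")
    return model if isinstance(model, str) else None
```

In a handler, the first message would be read once, passed to extract_model, and the socket wrapped in ReplayWebSocket before it is handed to the downstream handlers, so they still receive that message on their first receive_text() call.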

Reproduction

import asyncio
from websockets.asyncio.client import connect

async def main():
    # Before fix: 403 Forbidden
    # After fix: connects and streams normally
    async with connect(
        "ws://localhost:4000/v1/responses",
        additional_headers={"Authorization": "Bearer sk-1234"},
    ) as ws:
        await ws.send('{"type":"response.create","response":{"model":"gpt-4o","input":"say hi"}}')
        async for msg in ws:
            print(msg)

asyncio.run(main())

Testing

Verified end-to-end on a live GKE cluster:

  • Deployed patched LiteLLM v1.82.3 pod using init-container overlay
  • Pointed Codex CLI 0.118.0 at the patched proxy
  • Codex connected via WebSocket (no 403), model extracted from first response.create message
  • LLM call routed successfully, full response received (input_tokens: 31496, output_tokens: 178)
  • ?model= query param still works (backward compatible)

Related

  • Fixes #25532
  • Related to #22051 (WebSocket mode support)

Changed files

  • litellm/proxy/response_api_endpoints/endpoints.py (modified, +75/-2)

Bug

The WebSocket endpoint for /v1/responses requires model as a URL query parameter (fastapi.Query(...)), but the OpenAI WebSocket spec sends model inside the response.create message payload, not as a query param.

This means any spec-compliant client (e.g. Codex CLI) connecting to ws://.../v1/responses without ?model= gets a 403 Forbidden — FastAPI rejects the missing required query param before auth even runs.

Reproduction

import asyncio
from websockets.asyncio.client import connect

async def main():
    # This FAILS with 403 (how Codex CLI and spec-compliant clients connect)
    async with connect(
        "ws://localhost:4000/v1/responses",
        additional_headers={"Authorization": "Bearer sk-1234"},
    ) as ws:
        await ws.send('{"type":"response.create","model":"gpt-5.4","input":"say hi"}')

    # This WORKS (non-spec workaround)
    async with connect(
        "ws://localhost:4000/v1/responses?model=gpt-5.4",
        additional_headers={"Authorization": "Bearer sk-1234"},
    ) as ws:
        await ws.send('{"type":"response.create","model":"gpt-5.4","input":"say hi"}')

asyncio.run(main())

Expected behavior

The endpoint should accept model from the response.create WebSocket message payload (matching the OpenAI spec), not require it as a URL query parameter. The model query param should either be optional or removed entirely.
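One way to picture the "optional query param" variant of this behavior is a small resolution helper, sketched below with the standard library only. The function names are hypothetical and this is not LiteLLM code. It assumes the query parameter, when present, takes precedence over the payload, and that a missing model in both places is treated as an error.

```python
import json
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def model_from_query(url: str) -> Optional[str]:
    """Return the ?model= value from the connection URL, if any."""
    values = parse_qs(urlsplit(url).query).get("model")
    return values[0] if values else None


def model_from_payload(raw_message: str) -> Optional[str]:
    """Return the model from a response.create payload (nested or flat)."""
    payload = json.loads(raw_message)
    nested = payload.get("response")
    if isinstance(nested, dict) and nested.get("model"):
        return nested["model"]
    return payload.get("model")


def resolve_model(url: str, first_message: str) -> str:
    """Prefer the query param (backward compatible), else the payload."""
    model = model_from_query(url) or model_from_payload(first_message)
    if not model:
        raise ValueError("no model in the URL or in the response.create payload")
    return model
```

With this ordering, existing clients that pass ?model= keep working unchanged, while spec-compliant clients that connect to the bare URL are served from the first message.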

Impact

  • Codex CLI always connects to bare ws://.../v1/responses per the OpenAI spec → gets 403 → retries 5 times (~6-8 seconds) → falls back to HTTP. This adds ~6-8s latency per turn.
  • Any client following the OpenAI WebSocket spec will hit the same issue.

Environment

  • LiteLLM version: tested on both v1.82.3-stable and v1.83.5-nightly (same behavior)
  • Codex CLI: @openai/codex
  • Running LiteLLM as proxy in router mode with models configured via config.yaml

Related issues

  • #22051 — WebSocket mode support for Responses API (covers streaming bugs after connection, not the connection failure itself)
  • #18456 — WebSocket auth header issue for /v1/realtime (different endpoint)

extent analysis

TL;DR

The WebSocket endpoint for /v1/responses should be modified to accept the model from the response.create message payload instead of requiring it as a URL query parameter.

Guidance

  • Modify the responses_websocket_endpoint function to remove the model query parameter requirement, allowing the endpoint to accept the model from the WebSocket message payload.
  • Update the endpoint to parse the model from the response.create message payload and use it for the WebSocket session.
  • Consider adding validation to ensure the model is provided in the message payload to maintain security and functionality.
  • Review related issues, such as #22051 and #18456, to ensure that the changes do not introduce new bugs or conflicts with existing functionality.

Example

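import json

# router, Depends, WebSocket and user_api_key_auth_websocket come from the
# existing endpoints.py module.
@router.websocket("/v1/responses")
@router.websocket("/responses")
async def responses_websocket_endpoint(
    websocket: WebSocket,
    user_api_key_dict=Depends(user_api_key_auth_websocket),
):
    # Assumes the connection has already been accepted before reading.
    # Parse the model from the response.create message payload
    async for raw in websocket.iter_text():
        message = json.loads(raw)
        if message.get("type") == "response.create":
            # The model may be nested under "response" or sit at the top level
            nested = message.get("response") or {}
            model = nested.get("model") or message.get("model")
            # Use the model for the WebSocket session
            # ...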

Notes

The provided solution assumes that the model is always provided in the response.create message payload. Additional validation and error handling may be necessary to ensure the endpoint behaves correctly in all scenarios.

Recommendation

Modify the responses_websocket_endpoint function to accept the model from the message payload. This aligns with the OpenAI WebSocket spec and resolves the connection failure for spec-compliant clients. PR #25702 implements this while keeping ?model= as an optional query parameter for backward compatibility.
