openclaw - ✅(Solved) Fix [Bug]: voice-call OpenAI realtime transcription times out during Twilio media stream while direct WebSocket succeeds [1 pull request, 2 comments, 3 participants]

openclaw • 2026-04-30 17:50:09


openclaw/openclaw#75197 • Fetched 2026-05-01 05:36:58


Twilio inbound voice-call media streams connect and initial TTS plays, but OpenAI realtime transcription times out during the live call, so caller speech is never transcribed or routed to the agent.

Error Message

07:55:06 [voice-call] Inbound call accepted: +<PHONE_NUMBER_REDACTED> is in allowlist
07:55:06 [voice-call] Created inbound call record: 41be546b-d1db-4f1a-b613-b4155a8821db from +<PHONE_NUMBER_REDACTED>
07:55:07 [MediaStream] Twilio connected
07:55:07 [MediaStream] Stream started: MZd0ddb4a2aa6561e185e88e481c1523b0 (call: CA0c67464cb2ddbccd522404560efbe0e5)
07:55:07 [voice-call] Media stream connected: CA0c67464cb2ddbccd522404560efbe0e5 -> MZd0ddb4a2aa6561e185e88e481c1523b0
07:55:07 [voice-call] Speaking initial message for call 41be546b-d1db-4f1a-b613-b4155a8821db (mode: conversation)
07:55:19 [MediaStream] Transcription session error: OpenAI realtime transcription connection timeout
07:55:19 [MediaStream] STT connection failed (TTS still works): OpenAI realtime transcription connection timeout
07:57:04 [MediaStream] Stream stopped: MZd0ddb4a2aa6561e185e88e481c1523b0
07:57:04 [voice-call] Media stream disconnected: CA0c67464cb2ddbccd522404560efbe0e5 (MZd0ddb4a2aa6561e185e88e481c1523b0)
07:57:05 [MediaStream] WebSocket closed (code: 1005, reason: none)
07:57:06 [voice-call] Auto-ending call 41be546b-d1db-4f1a-b613-b4155a8821db after stream disconnect grace

Persisted call record evidence shows only the bot greeting transcript, with no user transcript:

{
  "callId": "41be546b-d1db-4f1a-b613-b4155a8821db",
  "state": "speaking",
  "transcript": [
    {
      "speaker": "bot",
      "text": "Hello! How can I help you today?",
      "isFinal": true
    }
  ]
}

Impact and severity

Affected: voice-call plugin users using Twilio inbound calls with OpenAI realtime transcription.
Severity: High; inbound conversation mode is unusable because caller speech is not transcribed.
Frequency: Observed repeatedly across multiple inbound call attempts in this setup.
Consequence: Calls connect and may play the greeting, but the assistant cannot hear or respond to the caller.

Fix Action

Fixed

Fixed by PR: fix(voice-call): await STT readiness before initial greeting (#75197) (https://github.com/openclaw/openclaw/pull/75257)

PR fix notes

PR #75257: fix(voice-call): await STT readiness before initial greeting (#75197)

Repository: openclaw/openclaw
Author: PfanP
State: open | merged: False
Link: https://github.com/openclaw/openclaw/pull/75257

Description (problem / solution / changelog)

The Twilio media-stream startup raced TTS playback against the OpenAI realtime transcription WebSocket handshake: handleStart called onConnect (which fires manager.speakInitialMessage immediately) and then started sttSession.connect() fire-and-forget. Under event-loop contention from TTS startup the STT WS handshake timed out at 10s, leaving the call half-functional - greeting played, caller speech never reached the agent - while a direct OpenAI realtime WebSocket probe from the same host succeeded in ~1.1s.

Establish STT readiness before firing onConnect so TTS startup cannot starve the STT handshake. When the STT connect rejects, close the STT session, end the Twilio media stream with a 1011 close code, and fire onDisconnect so the voice-call manager hangs up the call on the existing grace path instead of silently leaving the caller on a deaf stream.

Fixes #75197.
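
The reordering described in the PR text can be sketched as follows. This is a minimal illustration, not the code from PR #75257: the SttSession, MediaSocket, and StreamCallbacks interfaces and the handleStart signature are assumptions made for the example. Only the ordering comes from the PR description: await STT first, then call onConnect; on failure, close the stream with code 1011 and call onDisconnect.

```typescript
// Hypothetical shapes for illustration; the real types live in
// extensions/voice-call/src/media-stream.ts and may differ.
interface SttSession {
  connect(): Promise<void>;
  close(): void;
}

interface MediaSocket {
  close(code: number, reason: string): void;
}

interface StreamCallbacks {
  onConnect(callId: string, streamId: string): void;
  onDisconnect(callId: string, streamId: string): void;
}

// 1011 is the WebSocket "internal error" close code (RFC 6455).
const WS_INTERNAL_ERROR = 1011;

async function handleStart(
  callId: string,
  streamId: string,
  stt: SttSession,
  socket: MediaSocket,
  callbacks: StreamCallbacks,
): Promise<boolean> {
  try {
    // Await STT readiness first, so greeting playback cannot begin
    // while the transcription handshake is still pending.
    await stt.connect();
  } catch {
    // Fail the whole stream rather than leave the caller on a call
    // that can speak but not listen.
    stt.close();
    socket.close(WS_INTERNAL_ERROR, "STT connection failed");
    callbacks.onDisconnect(callId, streamId);
    return false;
  }
  // Only now notify the manager, which triggers the initial greeting.
  callbacks.onConnect(callId, streamId);
  return true;
}
```

The trade-off in this ordering is that the greeting is delayed by however long the STT handshake takes, in exchange for never starting a conversation the agent cannot hear.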

Changed files

CHANGELOG.md (modified, +1/-0)
extensions/voice-call/src/media-stream.test.ts (modified, +110/-0)
extensions/voice-call/src/media-stream.ts (modified, +29/-7)


Bug type

Behavior bug (incorrect output/state without crash)

Beta release blocker

Summary

Twilio inbound voice-call media streams connect and initial TTS plays, but OpenAI realtime transcription times out during the live call, so caller speech is never transcribed or routed to the agent.

Steps to reproduce

1. Start OpenClaw 2026.4.27 on Ubuntu 24.04 with voice-call enabled.
2. Configure voice-call with Twilio, streaming.enabled: true, streaming.provider: "openai", and streaming.providers.openai.model: "gpt-4o-transcribe".
3. Configure TTS with OpenAI gpt-4o-mini-tts.
4. Call the configured Twilio number from an allowlisted caller.
5. Observe the Twilio media stream connect.
6. Speak during or after the initial greeting.

Expected behavior

After Twilio media stream connects, OpenAI realtime transcription should connect successfully, caller speech should be transcribed, and the transcript should be routed to the voice-call agent response path.

Actual behavior

The Twilio media stream connects and the initial greeting eventually plays, but STT fails with OpenAI realtime transcription connection timeout. No user transcript is recorded, and the call remains effectively deaf until disconnect/end.

OpenClaw version

2026.4.27

Operating system

Ubuntu 24.04.4 LTS / Linux 6.8.0-110-generic x86_64

Install method

npm global

Model

Voice-call streaming STT: openai/gpt-4o-transcribe, Voice-call TTS: openai/gpt-4o-mini-tts, agent model codex-5.5

Provider / routing chain

Twilio inbound call -> Tailscale Funnel HTTPS/WSS -> OpenClaw voice-call webhook/media stream -> OpenAI Realtime transcription API

Additional provider/model setup details

Relevant redacted voice-call config:

{
  "provider": "twilio",
  "publicUrl": "https://<tailscale-host>/voice/webhook",
  "serve": {
    "port": 3334,
    "bind": "127.0.0.1",
    "path": "/voice/webhook"
  },
  "inboundPolicy": "allowlist",
  "streaming": {
    "enabled": true,
    "provider": "openai",
    "streamPath": "/voice/stream",
    "providers": {
      "openai": {
        "apiKey": "***",
        "model": "gpt-4o-transcribe",
        "silenceDurationMs": 800,
        "vadThreshold": 0.5
      }
    }
  },
  "realtime": {
    "enabled": false
  },
  "tts": {
    "provider": "openai",
    "providers": {
      "openai": {
        "apiKey": "***",
        "model": "gpt-4o-mini-tts",
        "voice": "alloy"
      }
    },
    "timeoutMs": 30000
  }
}

Direct probes from the same machine succeeded:

OpenAI gpt-4o-mini-tts request returned 200 in about 1.2s.
Direct OpenAI realtime transcription WebSocket opened and returned transcription_session.created in about 1.1s.
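
To compare the in-call handshake with a direct probe like the ones listed, it helps to time both under the same deadline. The helper below is a generic sketch: timeWithDeadline and ProbeResult are hypothetical names, and the connect argument stands in for whatever function opens the OpenAI realtime transcription WebSocket and resolves once transcription_session.created arrives. Nothing in it calls OpenAI directly.

```typescript
// Outcome of one timed connection attempt.
type ProbeResult =
  | { ok: true; elapsedMs: number }
  | { ok: false; elapsedMs: number; error: string };

// Runs `connect` and reports how long it took, failing if it does not
// settle within `deadlineMs`. `connect` is a stand-in for the real
// WebSocket handshake.
async function timeWithDeadline(
  connect: () => Promise<void>,
  deadlineMs: number,
): Promise<ProbeResult> {
  const startedAt = Date.now();
  let rejectDeadline: (reason: Error) => void = () => {};
  // The executor runs synchronously, so rejectDeadline is set before use.
  const deadline = new Promise<never>((_resolve, reject) => {
    rejectDeadline = reject;
  });
  const timer = setTimeout(
    () => rejectDeadline(new Error("connection timeout")),
    deadlineMs,
  );
  try {
    await Promise.race([connect(), deadline]);
    return { ok: true, elapsedMs: Date.now() - startedAt };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, elapsedMs: Date.now() - startedAt, error: message };
  } finally {
    // Always clear the timer so a finished probe leaves nothing pending.
    clearTimeout(timer);
  }
}
```

Running the same connect function once standalone and once inside a live call, with the 10s deadline mentioned in the PR description, would show whether the difference lies in the network path or in the runtime the handshake shares with TTS startup.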

Logs, screenshots, and evidence

07:55:06 [voice-call] Inbound call accepted: +<PHONE_NUMBER_REDACTED> is in allowlist
07:55:06 [voice-call] Created inbound call record: 41be546b-d1db-4f1a-b613-b4155a8821db from +<PHONE_NUMBER_REDACTED>
07:55:07 [MediaStream] Twilio connected
07:55:07 [MediaStream] Stream started: MZd0ddb4a2aa6561e185e88e481c1523b0 (call: CA0c67464cb2ddbccd522404560efbe0e5)
07:55:07 [voice-call] Media stream connected: CA0c67464cb2ddbccd522404560efbe0e5 -> MZd0ddb4a2aa6561e185e88e481c1523b0
07:55:07 [voice-call] Speaking initial message for call 41be546b-d1db-4f1a-b613-b4155a8821db (mode: conversation)
07:55:19 [MediaStream] Transcription session error: OpenAI realtime transcription connection timeout
07:55:19 [MediaStream] STT connection failed (TTS still works): OpenAI realtime transcription connection timeout
07:57:04 [MediaStream] Stream stopped: MZd0ddb4a2aa6561e185e88e481c1523b0
07:57:04 [voice-call] Media stream disconnected: CA0c67464cb2ddbccd522404560efbe0e5 (MZd0ddb4a2aa6561e185e88e481c1523b0)
07:57:05 [MediaStream] WebSocket closed (code: 1005, reason: none)
07:57:06 [voice-call] Auto-ending call 41be546b-d1db-4f1a-b613-b4155a8821db after stream disconnect grace

Persisted call record evidence shows only the bot greeting transcript, with no user transcript:

{
  "callId": "41be546b-d1db-4f1a-b613-b4155a8821db",
  "state": "speaking",
  "transcript": [
    {
      "speaker": "bot",
      "text": "Hello! How can I help you today?",
      "isFinal": true
    }
  ]
}
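
A record like this one can be flagged automatically. The sketch below assumes the record shape shown in the evidence; the CallRecord and TranscriptEntry type names and the looksDeaf helper are hypothetical, and any speaker value other than "bot" is assumed to denote the caller.

```typescript
// Assumed shape, inferred from the persisted record in this report.
interface TranscriptEntry {
  speaker: string;
  text: string;
  isFinal: boolean;
}

interface CallRecord {
  callId: string;
  state: string;
  transcript: TranscriptEntry[];
}

// True when the bot has spoken but no non-empty caller entry exists.
function looksDeaf(record: CallRecord): boolean {
  const botSpoke = record.transcript.some((e) => e.speaker === "bot");
  const callerHeard = record.transcript.some(
    (e) => e.speaker !== "bot" && e.text.trim() !== "",
  );
  return botSpoke && !callerHeard;
}

const observed: CallRecord = {
  callId: "41be546b-d1db-4f1a-b613-b4155a8821db",
  state: "speaking",
  transcript: [
    { speaker: "bot", text: "Hello! How can I help you today?", isFinal: true },
  ],
};

console.log(looksDeaf(observed)); // true
```

A caller who stays silent produces the same record, so a check like this is only a signal to correlate with the STT timeout lines in the log, not proof of the bug on its own.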

Impact and severity

Additional information

A direct OpenAI realtime transcription WebSocket probe from the same host succeeds quickly, so this does not appear to be basic OpenAI network reachability. The failure appears specific to the live voice-call media stream runtime path.

Potentially relevant observation: the initial greeting begins immediately after media stream connect, while STT connection is still pending. In observed calls, STT times out and user speech is never captured.
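
One way to test whether the live path is short of event-loop time while STT is connecting (the explanation offered in the linked PR description) is to sample timer lag during a call. The sampler below is a generic sketch with a hypothetical name, startLagSampler. It is not part of OpenClaw and only measures how late a periodic timer fires.

```typescript
// Samples event-loop lag: schedules a timer every `intervalMs` and
// records how much later than requested each tick actually fired.
function startLagSampler(intervalMs: number): { stop(): number } {
  let maxLagMs = 0;
  let expected = Date.now() + intervalMs;
  const timer = setInterval(() => {
    const now = Date.now();
    maxLagMs = Math.max(maxLagMs, now - expected);
    // Re-base on the actual fire time so one stall is counted once.
    expected = now + intervalMs;
  }, intervalMs);
  return {
    // Stops sampling and returns the worst lag seen, in milliseconds.
    stop(): number {
      clearInterval(timer);
      return maxLagMs;
    },
  };
}
```

Starting the sampler when the media stream connects and stopping it when STT either connects or times out would show whether multi-second stalls coincide with greeting playback. Node's perf_hooks.monitorEventLoopDelay offers a histogram-based alternative.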

extent analysis

TL;DR

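The most likely fix is to establish the STT connection before the initial greeting plays, by reordering the startup sequence so the greeting waits on STT readiness (the approach described in PR #75257). A fixed delay before the greeting is a weaker workaround, because it does not guarantee that the connection is ready.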

Guidance

Reorder the startup sequence so that STT connection establishment completes before the initial greeting is played.
As a stopgap, consider adding a delay between the media stream connection and the initial greeting playback (for example with setTimeout) to give the STT handshake time to complete.
The silenceDurationMs and vadThreshold settings appear to govern voice-activity detection once a transcription session exists, so tuning them is unlikely to resolve a connection timeout.
Verify that the OpenAI API keys and model configurations are correct and consistent across the application, although the successful direct probes from the same machine make a credentials problem unlikely.

Example

"streaming": {
  "enabled": true,
  "provider": "openai",
  "streamPath": "/voice/stream",
  "providers": {
    "openai": {
      "apiKey": "***",
      "model": "gpt-4o-transcribe",
      "silenceDurationMs": 1000, // raised from 800; VAD tuning, not expected to affect the connection timeout
      "vadThreshold": 0.5
    }
  },
  "delayBeforeGreetingMs": 2000 // hypothetical key for illustration; it does not appear in the reported config
}

Notes

The provided logs and configuration suggest that the issue is specific to the live voice-call media stream runtime path, and the direct OpenAI realtime transcription WebSocket probe succeeds quickly. Therefore, the focus should be on adjusting the timing and configuration of the STT connection within the application.

Recommendation

Until the fix from PR #75257 is available in your build, a workaround is to delay the initial greeting so the STT connection can establish first. Note that delayBeforeGreetingMs is a hypothetical setting: it does not appear in the reported configuration, so the delay would have to be implemented in the plugin's greeting path before such a key could take effect.
