openclaw - 💡(How to fix) Fix Realtime voice consult is too slow/fragile for live calls; add fast memory context path [1 comments, 2 participants]

openclaw/openclaw#71849 (fetched 2026-04-26 05:07:31)


Summary

In realtime mode, the voice-call plugin currently exposes openclaw_agent_consult directly to the realtime provider via realtime.toolPolicy. That is the right conceptual affordance, but in live phone calls the consult is too slow and brittle for questions that depend on remembered context.

In testing with Twilio + Google Gemini Live realtime voice, ordinary small talk worked, but questions requiring remembered context caused one of these bad outcomes:

  1. Google Live tool-call/consult handling closed the realtime bridge before a useful answer reached the call.
  2. Embedded consult ran but exceeded useful voice latency, causing silence / repeated “still checking” responses.
  3. Partial transcript chunks prevented reliable context-trigger detection and follow-up consults.

A local prototype showed a better product shape: perform a fast memory/context lookup in the voice-call hot path, inject the retrieved snippets into the realtime session, and reserve full openclaw_agent_consult for slower fallback/deeper work.

Environment

  • OpenClaw: 2026.4.24
  • Plugin: voice-call
  • Provider: twilio
  • Realtime provider: google
  • Realtime model: gemini-2.5-flash-native-audio-preview-12-2025
  • Voice: Kore
  • realtime.toolPolicy: safe-read-only
  • Public webhook: stable HTTPS tunnel to local voice-call webhook

Current behavior

When the caller asks a remembered-context question during a realtime phone call, for example:

What’s the topic of tonight’s overnight essay?

The realtime session may say something like:

Let me check for you.

Then either the bridge closes, the embedded consult times out, or the call stays alive while the model loops on filler such as “still checking” / “almost there” without producing the answer.

Representative sanitized log excerpts:

[voice-call-debug] transcript consult trigger callId=... text=... Can you tell me the to pic of ton ight 's over night essay?
[voice-call-debug] transcript consult start callId=... question=...

{"subsystem":"agent/embedded"} embedded run timeout: runId=voice-realtime-consult:... timeoutMs=15000
{"subsystem":"agent/embedded"} ... provider='google' model='gemini-3-flash-preview' failoverReason='timeout' timedOut=true

[voice-call-debug] transcript consult result callId=... durationMs=12200 result={"text":"I need a moment to verify that before answering."}
[voice-call-debug] transcript consult injected callId=...

In one live run, the caller waited more than three minutes while the voice model continued responding with variants of “still checking.”

Expected behavior

Realtime phone voice needs a much tighter latency contract than chat.

For memory/context questions, the voice-call plugin should be able to:

  1. detect that the caller’s utterance needs remembered context,
  2. perform fast memory/session-context retrieval with a hard deadline,
  3. inject the top snippets into the realtime session,
  4. let the realtime model answer naturally,
  5. fall back to full openclaw_agent_consult only when fast context is insufficient.

Why this matters

A 6–15 second embedded agent/tool run may be acceptable in text chat, but in a phone call it feels broken. The caller hears silence or repeated filler. This makes realtime voice feel unreliable precisely when it needs memory most.

This is not specific to one setup. Any realtime voice surface with memory, prior context, current plans, or current-state questions needs a fast context layer before slower agentic work.

Prototype finding

A local hot-path prototype materially improved the behavior:

  • full embedded consult path: timed out / stalled
  • fast local memory retrieval: returned relevant context in tens of milliseconds
  • realtime voice model answered the remembered topic correctly

The useful primitive seems to be:

fast context retrieval first; full agent consult later if needed.

This is not a request to replace openclaw_agent_consult. It is a request to avoid putting full agent execution in the live audio hot path when a bounded context lookup would solve the turn.

Proposed design

Add a first-class fast context resolver to voice-call realtime mode.

Config sketch:

realtime: {
  enabled: true,
  provider: "google",
  toolPolicy: "safe-read-only",

  fastContext: {
    enabled: true,
    timeoutMs: 800,
    sources: ["memory", "sessions"],
    maxResults: 3,
    fallbackToConsult: true
  }
}
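
To make the config sketch concrete, here is one way the fastContext block could be typed and validated. This is a minimal sketch: the field names come from the sketch above, while the type names, the defaults, and the validation rules are assumptions for illustration, not existing OpenClaw behavior.

```typescript
// Hypothetical types mirroring the fastContext config sketch.
// Defaults and validation rules are assumptions, not shipped behavior.
type FastContextSource = "memory" | "sessions";

interface FastContextConfig {
  enabled: boolean;
  timeoutMs: number;
  sources: FastContextSource[];
  maxResults: number;
  fallbackToConsult: boolean;
}

const FAST_CONTEXT_DEFAULTS: FastContextConfig = {
  enabled: false,
  timeoutMs: 800,
  sources: ["memory", "sessions"],
  maxResults: 3,
  fallbackToConsult: true,
};

// Merge user overrides onto the defaults and reject values that would
// make the hard deadline meaningless.
function resolveFastContextConfig(
  partial: Partial<FastContextConfig> = {},
): FastContextConfig {
  const merged = { ...FAST_CONTEXT_DEFAULTS, ...partial };
  if (!Number.isFinite(merged.timeoutMs) || merged.timeoutMs <= 0) {
    throw new RangeError("fastContext.timeoutMs must be a positive number");
  }
  if (!Number.isInteger(merged.maxResults) || merged.maxResults < 1) {
    throw new RangeError("fastContext.maxResults must be a positive integer");
  }
  if (merged.enabled && merged.sources.length === 0) {
    throw new RangeError("fastContext.sources must not be empty when enabled");
  }
  return merged;
}
```

Rejecting a non-positive timeoutMs at load time keeps a misconfiguration from silently reintroducing an unbounded wait in the audio path.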

Suggested flow:

caller transcript
→ accumulate/normalize transcript safely
→ classify whether it needs remembered context
→ fastContext retrieval under hard deadline
→ inject retrieved context into realtime session
→ realtime model answers
→ fallback to openclaw_agent_consult only when needed
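
The flow above can be sketched as a small per-turn planner. Everything named here (planTurn, TurnDeps, the action kinds) is hypothetical; only the ordering of steps is taken from the suggested flow. The classifier and the retriever are injected, and the retriever is assumed to enforce its own deadline.

```typescript
// Hypothetical per-turn planner; only the step ordering comes from the flow.
type Retrieval =
  | { status: "hit"; snippets: string[] }
  | { status: "miss"; reason: "empty" | "timeout" | "error" };

type TurnAction =
  | { kind: "answer-directly" }
  | { kind: "inject-context"; snippets: string[] }
  | { kind: "fallback-consult"; question: string }
  | { kind: "graceful-miss"; reason: string };

interface TurnDeps {
  // Decides whether the utterance needs remembered context.
  needsContext: (utterance: string) => boolean;
  // Fast lookup; assumed to resolve within the configured deadline.
  retrieve: (utterance: string) => Promise<Retrieval>;
  // Mirrors fastContext.fallbackToConsult from the config sketch.
  fallbackToConsult: boolean;
}

async function planTurn(utterance: string, deps: TurnDeps): Promise<TurnAction> {
  // Small talk never touches retrieval or the agent.
  if (!deps.needsContext(utterance)) return { kind: "answer-directly" };

  const result = await deps.retrieve(utterance);
  if (result.status === "hit") {
    return { kind: "inject-context", snippets: result.snippets };
  }
  // Fast context was insufficient: hand off to the slower consult only if allowed.
  if (deps.fallbackToConsult) {
    return { kind: "fallback-consult", question: utterance };
  }
  return { kind: "graceful-miss", reason: result.reason };
}
```

Returning an action rather than performing it keeps the planner testable without a live call; the caller decides how to inject snippets or start the consult.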

Implementation notes from live testing

Partial transcripts

Google Live streamed user input in small chunks such as:

"to" + "pic"
"over" + "night"
"essa" + "y"
"world" + "ma" + "king"

So context detection should not depend only on clean final transcripts. It likely needs transcript accumulation plus normalization before classification/retrieval.
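
One possible shape for accumulation plus normalization is sketched below. The class and function names are hypothetical, and the matching strategy (comparing against a lowercased buffer with all whitespace and punctuation removed) is an assumption chosen because it tolerates chunk boundaries that fall inside words. It can produce false positives across word boundaries, so it suits a cheap first-pass trigger rather than a final classifier.

```typescript
// Hypothetical accumulator for partial transcript chunks.
class TranscriptBuffer {
  private chunks: string[] = [];

  append(chunk: string): void {
    const trimmed = chunk.trim();
    if (trimmed.length > 0) this.chunks.push(trimmed);
  }

  // Human-readable form; may still contain split words such as "to pic".
  text(): string {
    return this.chunks.join(" ");
  }

  // Lowercased with everything except ASCII letters and digits removed,
  // so "to" + "pic" and "topic" normalize to the same string.
  squashed(): string {
    return squash(this.chunks.join(""));
  }

  reset(): void {
    this.chunks = [];
  }
}

function squash(value: string): string {
  return value.toLowerCase().replace(/[^a-z0-9]/g, "");
}

// True when any trigger phrase appears in the squashed transcript.
function matchesAnyPhrase(buffer: TranscriptBuffer, phrases: string[]): boolean {
  const haystack = buffer.squashed();
  return phrases
    .map(squash)
    .filter((needle) => needle.length > 0) // an empty needle would match everything
    .some((needle) => haystack.includes(needle));
}
```

For example, appending the chunks "to", "pic", "over", "night", "essa", "y" yields a squashed buffer that contains both "topic" and "overnightessay", even though no single chunk does.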

Multiple context questions per call

After one successful context lookup, a follow-up question should be able to trigger another lookup. The consult/context gate should reset after injection/answer and should not depend solely on a final transcript event that may not arrive.
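
A gate with this reset behavior could look roughly like the following. ContextGate and its method names are hypothetical; the point of the sketch is that the gate reopens when the lookup settles (injection, miss, or timeout) rather than when a final transcript event arrives.

```typescript
// Hypothetical gate that allows one lookup at a time and reopens as soon as
// that lookup settles, independent of any final-transcript event.
class ContextGate {
  private inFlight = false;
  private completed = 0;

  // Returns true if the caller may start a lookup now.
  tryBegin(): boolean {
    if (this.inFlight) return false;
    this.inFlight = true;
    return true;
  }

  // Call after injection, an explicit miss, or a timeout.
  finish(): void {
    if (!this.inFlight) return;
    this.inFlight = false;
    this.completed += 1;
  }

  get lookupsCompleted(): number {
    return this.completed;
  }
}
```

Calling finish() from the retrieval path's own completion handler means a second question in the same call can pass tryBegin() even if the provider never emits a final transcript for the first one.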

No dead-air loop

If fast context does not return within the configured deadline, the realtime model should receive an explicit empty/failure signal and answer gracefully. It should not remain in an open-ended “still checking” state.
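
A hard deadline with an explicit outcome can be sketched with Promise.race. The helper names and the wording of the messages handed to the model are assumptions for illustration. Note that Promise.race does not cancel the losing lookup; it only stops waiting for it.

```typescript
// Hypothetical deadline wrapper: always resolves with an explicit status,
// never leaves the caller waiting past timeoutMs.
type DeadlineResult<T> =
  | { status: "ok"; value: T; elapsedMs: number }
  | { status: "timeout"; elapsedMs: number }
  | { status: "error"; message: string; elapsedMs: number };

async function withDeadline<T>(
  work: () => Promise<T>,
  timeoutMs: number,
): Promise<DeadlineResult<T>> {
  const started = Date.now();
  let cancelTimer = (): void => {};
  const deadline = new Promise<"timeout">((resolve) => {
    const timer = setTimeout(() => resolve("timeout"), timeoutMs);
    cancelTimer = () => clearTimeout(timer);
  });
  try {
    const outcome = await Promise.race([
      work().then((value) => ({ value })),
      deadline,
    ]);
    const elapsedMs = Date.now() - started;
    if (outcome === "timeout") return { status: "timeout", elapsedMs };
    return { status: "ok", value: outcome.value, elapsedMs };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { status: "error", message, elapsedMs: Date.now() - started };
  } finally {
    cancelTimer(); // avoid a stray timer when the lookup wins the race
  }
}

// Turns every outcome into an explicit instruction for the realtime model,
// so a miss is stated rather than left as an open-ended wait.
function describeForModel(result: DeadlineResult<string[]>): string {
  switch (result.status) {
    case "ok":
      return result.value.length > 0
        ? `Context found:\n${result.value.join("\n")}`
        : "No stored context matched. Say so briefly and offer to follow up.";
    case "timeout":
      return "Context lookup timed out. Say it could not be retrieved right now; do not keep checking.";
    case "error":
      return "Context lookup failed. Say it could not be retrieved right now; do not keep checking.";
  }
}
```

Because every branch produces a message, the realtime session always receives something to act on within the deadline, which is the property the "still checking" loop lacked.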

Google Live function response shape

During debugging, malformed or missing function response fields caused Google Live errors, such as a failure to parse the function response. A separate, smaller bugfix may also be needed for the Google realtime function response shape used by openclaw_agent_consult.
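
As a guard against sending an incomplete function response, a small builder can validate the fields before anything goes on the wire. The shape used here (id, name, and an object-valued response) is an assumption about what the provider expects and should be checked against the provider's current documentation. The builder name is hypothetical.

```typescript
// Assumed function response shape; verify against the provider's docs.
interface FunctionResponseLike {
  id: string;
  name: string;
  response: Record<string, unknown>;
}

// Hypothetical builder: fails loudly on missing fields instead of letting
// the provider reject a malformed payload mid-call.
function buildFunctionResponse(
  callId: string | undefined,
  name: string | undefined,
  result: unknown,
): FunctionResponseLike {
  if (!callId) throw new Error("function response is missing the tool call id");
  if (!name) throw new Error("function response is missing the function name");

  // Wrap primitives and arrays so the response field is always a plain object.
  const isPlainObject =
    typeof result === "object" && result !== null && !Array.isArray(result);
  const response = isPlainObject
    ? (result as Record<string, unknown>)
    : { result };

  return { id: callId, name, response };
}
```

Failing in the plugin with a specific message makes this class of bug visible in local logs rather than only as a provider-side parse error.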

Acceptance criteria

  • Realtime voice memory/context query returns an answer or graceful miss within a configured deadline.
  • No repeated “still checking” loop after backend timeout.
  • Partial transcript chunks are accumulated/normalized well enough to detect common context-seeking utterances.
  • Multiple memory/context questions in one call can each trigger retrieval.
  • Existing openclaw_agent_consult behavior remains available as fallback.
  • Docs explain the latency tradeoff: fast context for live voice, full consult for deeper work.

Docs to update

Primary docs:

  • docs.openclaw.ai/plugins/voice-call

Potential secondary docs:

  • docs.openclaw.ai/cli/voicecall#voicecall, only if setup/smoke output gains fast-context readiness checks.

Related issues

  • #71195 — realtime Talk/macOS parity with voice-call; related realtime voice architecture, but not this phone-call memory hot-path issue.
  • #71784 — memory search reliability; related retrieval reliability but not specific to realtime voice UX.
  • #71812 — plugin runtime deps cleanup issue; unrelated to this feature, but surfaced during local testing.

extent analysis

TL;DR

Implement a fast context resolver in the voice-call plugin to reduce latency for remembered-context questions during realtime phone calls.

Guidance

  1. Add a fast context resolver: Introduce a new component that performs fast memory/context retrieval with a hard deadline to improve the responsiveness of the voice-call plugin.
  2. Configure the fast context resolver: Set up the resolver with a suitable timeout (e.g., 800ms) and specify the sources of context (e.g., memory, sessions) to ensure efficient retrieval.
  3. Integrate with the realtime model: Inject the retrieved context into the realtime session, allowing the model to answer naturally, and fall back to openclaw_agent_consult only when necessary.
  4. Handle partial transcripts and multiple context questions: Accumulate and normalize partial transcripts to detect context-seeking utterances, and enable multiple context lookups per call.
  5. Prevent dead-air loops: Ensure the realtime model receives an explicit empty/failure signal if fast context retrieval times out, allowing it to answer gracefully.


Notes

The proposed design and implementation notes provide a solid foundation for addressing the issue. However, additional testing and refinement may be necessary to ensure the fast context resolver works seamlessly with the voice-call plugin and the Google Live realtime provider.

Recommendation

Apply the proposed design and implementation to introduce a fast context resolver in the voice-call plugin, as it addresses the latency issues with remembered-context questions during realtime phone calls.
