openclaw - Feature Request: Voice Recognition / Speaker Identification for Voice Interface

Feature Request: Voice Recognition / Speaker Identification

Problem

When using OpenClaw's voice interface (Voice Wake, Push-to-Talk, or Talk mode), background voices — such as TV, family members, or other people in the room — can inadvertently trigger responses or contaminate voice sessions. There is currently no way to train the system to recognize and exclusively respond to the primary user's voice, ignoring all other speakers.

Expected Behavior

  • User Enrollment: On first use (or via settings), the user records a short voice sample (e.g., 10–30 seconds of reading a standardized phrase) to build a voice profile
  • Exclusive Recognition: During Voice Wake, PTT, and Talk mode, only the enrolled user's voice triggers or is acted upon; other speakers are ignored
  • Background Suppression: TV, radio, meeting-room participants, or phone calls in the background are suppressed and do not produce spurious wake events or transcriptions
  • Fallback: If the enrolled voice cannot be matched above a confidence threshold, the system falls back to the existing behavior (e.g., silence timeout, manual trigger)
  • Multi-user (future): Ideally, the system could support multiple enrolled voices, enabling group or shared-device scenarios

Rationale

  1. Privacy: Households with shared workspaces, open-plan offices, or frequent video calls create situations where non-primary voices should not trigger AI actions
  2. Accuracy: A speaker-identified transcript is cleaner — no cross-talk, no third-party audio bleeding into context
  3. Security: Preventing unauthorized voices from triggering commands reduces attack surface, especially if device automation (home control, messaging) is linked
  4. User Experience: In noisy real-world environments (home with TV, café, office), voice activation without speaker filtering is unreliable and frustrating

Suggested Implementation Notes

Approach 1: On-Device Speaker Embedding (Recommended)

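  • Use a lightweight speaker embedding model such as Resemblyzer, or a platform-native speaker-verification API on macOS if one is available. Silero VAD can supply the voice-activity segmentation upstream, but it does not itself produce speaker embeddings or diarization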
  • Capture a 20–30 second enrollment sample → generate a fixed-dimension embedding vector stored locally in ~/.openclaw/settings/voiceprint.json
  • At runtime: for each voice segment detected by VAD, compare against the stored embedding using cosine similarity; threshold to decide accept/reject
  • Pros: fully local, no API dependency, fast, privacy-preserving
  • Cons: enrollment step required; quality of enrollment sample matters
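
The runtime check in Approach 1 can be sketched as follows. This is a minimal illustration, assuming the embeddings have already been produced by some speaker-embedding model. The names `cosine_similarity` and `is_enrolled_speaker` and the toy 3-dimensional vectors are hypothetical, and 0.75 is only the default threshold proposed in this request, not a tuned value.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine similarity of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector carries no speaker information
    return dot / (norm_a * norm_b)


def is_enrolled_speaker(
    segment_embedding: Sequence[float],
    enrolled_embedding: Sequence[float],
    threshold: float = 0.75,  # assumed default, taken from the proposed config
) -> bool:
    """Accept the segment only if it is close enough to the stored voiceprint."""
    return cosine_similarity(segment_embedding, enrolled_embedding) >= threshold


# Toy vectors for illustration only; real embeddings have far more dimensions.
enrolled = [0.9, 0.1, 0.4]
same_voice = [0.8, 0.2, 0.5]
other_voice = [-0.2, 0.9, -0.1]

print(is_enrolled_speaker(same_voice, enrolled))
print(is_enrolled_speaker(other_voice, enrolled))
```

In practice the threshold would need tuning against real recordings, since a value that is too high rejects the enrolled user and one that is too low admits other speakers.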

Approach 2: Provider-Based Speaker Diarization

  • Route audio through a provider that supports speaker diarization (e.g., Deepgram, AssemblyAI, Google Cloud Speech-to-Text with speaker diarization)
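  • Filter transcript segments to those carrying the primary speaker's diarization label. Diarization labels are anonymous (for example, speaker 0, speaker 1), so the primary speaker has to be inferred, for instance as the speaker with the most talk time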
  • Pros: no enrollment step, leverages existing provider infrastructure
  • Cons: adds latency, cost per request, audio must leave the device; not suitable for fully local workflows
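
The filtering step in Approach 2 might look like the sketch below. The `Segment` shape is hypothetical and does not reproduce any provider's response format, and treating the speaker with the most total talk time as the primary speaker is an assumption made here for illustration, not something the request specifies.

```python
from collections import defaultdict
from typing import Dict, List, TypedDict


class Segment(TypedDict):
    speaker: int  # anonymous diarization label (hypothetical field name)
    start: float  # seconds
    end: float    # seconds
    text: str


def primary_speaker(segments: List[Segment]) -> int:
    """Pick the label with the most total talk time (an assumed heuristic)."""
    talk_time: Dict[int, float] = defaultdict(float)
    for seg in segments:
        talk_time[seg["speaker"]] += seg["end"] - seg["start"]
    return max(talk_time, key=lambda label: talk_time[label])


def filter_to_primary(segments: List[Segment]) -> List[Segment]:
    """Keep only the segments attributed to the inferred primary speaker."""
    if not segments:
        return []
    label = primary_speaker(segments)
    return [seg for seg in segments if seg["speaker"] == label]


segments: List[Segment] = [
    {"speaker": 0, "start": 0.0, "end": 2.5, "text": "open the calendar"},
    {"speaker": 1, "start": 2.5, "end": 3.0, "text": "breaking news tonight"},
    {"speaker": 0, "start": 3.0, "end": 4.0, "text": "for tomorrow"},
]

print(" ".join(seg["text"] for seg in filter_to_primary(segments)))
```

The heuristic fails when a background source such as a TV talks more than the user within a request, which is one reason the on-device approach with an enrolled voiceprint is the more robust of the two.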

Enrollment UX

  • New settings panel in macOS app: Voice → Voice Recognition → Enroll My Voice
  • Guided 3-step flow: (1) explain feature, (2) record sample in quiet environment, (3) confirm enrollment
  • Visual feedback: waveform during recording, confirmation on success
  • Option to re-enroll at any time

Config Surface

{
  "voice": {
    "speakerRecognition": {
      "enabled": true,         // default false
      "model": "on-device",   // "on-device" | "provider-diarization"
      "provider": "deepgram", // used if model === "provider-diarization"
      "threshold": 0.75,       // cosine similarity threshold for on-device
      "enrollmentSample": "<path or base64>",
      "enrolledAt": "2026-01-01T00:00:00Z"
    }
  }
}
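
One way the proposed config block could be read and validated is sketched below. It assumes the config has already been parsed into a dictionary (the `//` comments above are not valid in strict JSON). The `SpeakerRecognitionConfig` class, the `parse_speaker_recognition` function, and the validation rules are hypothetical and only mirror the field names and defaults shown in the request.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional

VALID_MODELS = ("on-device", "provider-diarization")


@dataclass(frozen=True)
class SpeakerRecognitionConfig:
    enabled: bool = False
    model: str = "on-device"
    provider: Optional[str] = None
    threshold: float = 0.75
    enrollment_sample: Optional[str] = None
    enrolled_at: Optional[str] = None


def parse_speaker_recognition(config: Mapping[str, Any]) -> SpeakerRecognitionConfig:
    """Read voice.speakerRecognition, applying defaults for missing keys."""
    raw = config.get("voice", {}).get("speakerRecognition", {})
    cfg = SpeakerRecognitionConfig(
        enabled=bool(raw.get("enabled", False)),
        model=raw.get("model", "on-device"),
        provider=raw.get("provider"),
        threshold=float(raw.get("threshold", 0.75)),
        enrollment_sample=raw.get("enrollmentSample"),
        enrolled_at=raw.get("enrolledAt"),
    )
    if cfg.model not in VALID_MODELS:
        raise ValueError(f"model must be one of {VALID_MODELS}, got {cfg.model!r}")
    # Cosine similarity is bounded by [-1, 1], so the threshold must be too.
    if not -1.0 <= cfg.threshold <= 1.0:
        raise ValueError("threshold must be between -1.0 and 1.0")
    # Assumed rule: the provider is only required for the diarization model.
    if cfg.model == "provider-diarization" and not cfg.provider:
        raise ValueError("provider is required when model is 'provider-diarization'")
    return cfg


defaults = parse_speaker_recognition({})
print(defaults.enabled, defaults.model, defaults.threshold)
```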

VAD Integration

  • Speaker recognition should run as a stage after the existing VAD segmentation and before the wake-word/silence-send logic
  • Only accepted segments (matching enrolled speaker) should trigger the wake-word/silence-send logic
  • Rejected segments are dropped silently
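
The placement of the speaker check between VAD and the downstream logic can be sketched as a simple gate. Everything here is hypothetical: the `(segment_id, embedding)` pair stands in for whatever the VAD and embedding stages would emit, and `is_match` is a placeholder for the real speaker comparison.

```python
from typing import Callable, Iterable, Iterator, Sequence, Tuple

Embedding = Sequence[float]
# Hypothetical output of the VAD + embedding stages: (segment id, embedding).
VadSegment = Tuple[str, Embedding]


def gate_segments(
    segments: Iterable[VadSegment],
    is_match: Callable[[Embedding], bool],
) -> Iterator[str]:
    """Yield only the ids of segments that match the enrolled speaker."""
    for segment_id, embedding in segments:
        if is_match(embedding):
            yield segment_id
        # Rejected segments are dropped silently: no event, no transcript.


def stand_in_matcher(embedding: Embedding) -> bool:
    """Placeholder for the real comparison against the stored voiceprint."""
    return embedding[0] > 0.5


vad_output = [
    ("seg-1", [0.9, 0.1]),  # enrolled user
    ("seg-2", [0.1, 0.8]),  # background TV
    ("seg-3", [0.7, 0.3]),  # enrolled user
]

# Only the accepted segments reach the wake-word / silence-send logic.
accepted = list(gate_segments(vad_output, stand_in_matcher))
print(accepted)
```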

Fallback Behavior

  • If enrollment is not complete or model is unavailable, fall back to existing behavior (wake word / silence timeout only)
  • Show a warning in the Voice Wake overlay if speaker recognition is enabled but no voice has been enrolled yet
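
The fallback rules above could be expressed as a single decision function. This is a sketch under assumptions: the `Mode` and `Decision` types and the `choose_mode` function are hypothetical, and the warning text is illustrative. The request only asks for a warning in the not-enrolled case, so the model-unavailable case returns none here.

```python
from enum import Enum
from typing import NamedTuple, Optional


class Mode(Enum):
    SPEAKER_GATED = "speaker-gated"
    LEGACY = "legacy"  # existing behavior: wake word / silence timeout only


class Decision(NamedTuple):
    mode: Mode
    warning: Optional[str]  # text for the Voice Wake overlay, if any


def choose_mode(enabled: bool, enrolled: bool, model_available: bool) -> Decision:
    """Decide whether speaker gating can run or the legacy path applies."""
    if not enabled:
        return Decision(Mode.LEGACY, None)
    if not enrolled:
        return Decision(
            Mode.LEGACY,
            "Speaker recognition is enabled but no voice is enrolled.",
        )
    if not model_available:
        return Decision(Mode.LEGACY, None)
    return Decision(Mode.SPEAKER_GATED, None)


print(choose_mode(enabled=True, enrolled=False, model_available=True))
```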

Environment

  • macOS (Voice Wake + Push-to-Talk + Talk mode — primary use case)
  • iOS / Android (Talk mode) — native OS voice frameworks may offer built-in solutions
  • OpenClaw version: 2026.5.27+

Priority

Medium — valuable for real-world voice usability, but existing workarounds (push-to-talk, wake-word silence window) mitigate the most urgent cases.


Requested by: iqbalbhawana / Commander Iqbal Bhawana
