openclaw - Feature Request: Voice Recognition / Speaker Identification for Voice Interface

Feature Request: Voice Recognition / Speaker Identification

Problem

When using OpenClaw's voice interface (Voice Wake, Push-to-Talk, or Talk mode), background voices — such as TV, family members, or other people in the room — can inadvertently trigger responses or contaminate voice sessions. There is currently no way to train the system to recognize and exclusively respond to the primary user's voice, ignoring all other speakers.

Expected Behavior

User Enrollment: On first use (or via settings), the user records a short voice sample (e.g., 10–30 seconds of reading a standardized phrase) to build a voice profile
Exclusive Recognition: During Voice Wake, PTT, and Talk mode, only the enrolled user's voice triggers or is acted upon; other speakers are ignored
Background Suppression: TV, radio, meeting-room participants, or phone calls in the background are suppressed and do not produce spurious wake events or transcriptions
Fallback: If the enrolled voice cannot be matched above a confidence threshold, the system falls back to the existing behavior (e.g., silence timeout, manual trigger)
Multi-user (future): Ideally, the system could support multiple enrolled voices, enabling group or shared-device scenarios

Rationale

Privacy: Households with shared workspaces, open-plan offices, or frequent video calls create situations where non-primary voices should not trigger AI actions
Accuracy: A speaker-identified transcript is cleaner — no cross-talk, no third-party audio bleeding into context
Security: Preventing unauthorized voices from triggering commands reduces the attack surface, especially if device automation (home control, messaging) is linked
User Experience: In noisy real-world environments (home with TV, café, office), voice activation without speaker filtering is unreliable and frustrating

Suggested Implementation Notes

Approach 1: On-Device Speaker Embedding (Recommended)

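Use a lightweight speaker embedding model such as Resemblyzer, optionally paired with Silero VAD for speech segmentation (Silero VAD detects speech but does not identify speakers on its own), or a native macOS speaker-verification API if a suitable public one exists (the "VoicePrint" framework name is unverified)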
Capture a 20–30 second enrollment sample → generate a fixed-dimension embedding vector stored locally in ~/.openclaw/settings/voiceprint.json
At runtime: for each voice segment detected by VAD, compare against the stored embedding using cosine similarity; threshold to decide accept/reject
Pros: fully local, no API dependency, fast, privacy-preserving
Cons: enrollment step required; quality of enrollment sample matters
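
The runtime comparison described above can be sketched as follows. This is a minimal illustration, not OpenClaw code: `cosineSimilarity` and `isEnrolledSpeaker` are hypothetical names, embeddings are assumed to be plain number arrays produced by whichever embedding model is chosen, and the 0.75 default simply mirrors the example threshold in this request.

```typescript
// Hypothetical helpers for illustration; not part of any existing OpenClaw API.

/** Cosine similarity between two equal-length embedding vectors. */
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("embeddings must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // Treat an all-zero vector as "no match" instead of dividing by zero.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

/** Accept a segment only if it is at least `threshold` similar to the enrolled voice. */
function isEnrolledSpeaker(
  segmentEmbedding: number[],
  enrolledEmbedding: number[],
  threshold: number = 0.75 // assumed default, taken from the example config
): boolean {
  return cosineSimilarity(segmentEmbedding, enrolledEmbedding) >= threshold;
}
```

A suitable threshold depends on the embedding model and the recording conditions, so it would likely need tuning against real enrollment data.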

Approach 2: Provider-Based Speaker Diarization

Route audio through a provider that supports speaker diarization (e.g., Deepgram, AssemblyAI, Google Cloud Speech-to-Text with speaker diarization)
Filter transcript segments to only include the primary speaker's diarization label
Pros: no enrollment step, leverages existing provider infrastructure
Cons: adds latency, cost per request, audio must leave the device; not suitable for fully local workflows
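
A rough sketch of the filtering step follows. The `DiarizedSegment` shape is hypothetical and does not match any particular provider's response format. Because this approach has no enrollment step, the sketch also assumes that the primary speaker is the label with the most total speaking time, which is only one possible heuristic.

```typescript
// Hypothetical segment shape for illustration; each provider has its own response format.
interface DiarizedSegment {
  speaker: number; // diarization label assigned by the provider
  start: number;   // segment start, in seconds
  end: number;     // segment end, in seconds
  text: string;    // transcript text for this segment
}

// Assumption: the primary speaker is the label with the most total speaking time.
function pickPrimarySpeaker(segments: DiarizedSegment[]): number | undefined {
  const totals: Record<number, number> = {};
  for (const s of segments) {
    totals[s.speaker] = (totals[s.speaker] || 0) + (s.end - s.start);
  }
  let best: number | undefined = undefined;
  let bestTime = -Infinity;
  for (const key of Object.keys(totals)) {
    const speaker = Number(key);
    if (totals[speaker] > bestTime) {
      best = speaker;
      bestTime = totals[speaker];
    }
  }
  return best;
}

// Keep only the transcript text attributed to the primary speaker.
function primarySpeakerTranscript(segments: DiarizedSegment[]): string {
  const primary = pickPrimarySpeaker(segments);
  return segments
    .filter((s) => s.speaker === primary)
    .map((s) => s.text)
    .join(" ");
}
```

The speaking-time heuristic would fail when a TV or another person talks more than the user, which is one reason the on-device approach with explicit enrollment may be more robust.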

Enrollment UX

New settings panel in macOS app: Voice → Voice Recognition → Enroll My Voice
Guided 3-step flow: (1) explain feature, (2) record sample in quiet environment, (3) confirm enrollment
Visual feedback: waveform during recording, confirmation on success
Option to re-enroll at any time

Config Surface

{
  "voice": {
    "speakerRecognition": {
      "enabled": true,                         // default false
      "model": "on-device",                    // "on-device" | "provider-diarization"
      "provider": "deepgram",                  // used if model === "provider-diarization"
      "threshold": 0.75,                       // cosine similarity threshold for on-device
      "enrollmentSample": "<path or base64>",  // enrollment audio sample, as a file path or base64 string
      "enrolledAt": "2026-01-01T00:00:00Z"     // ISO 8601 timestamp of the enrollment
    }
  }
}
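
One way this config surface might be typed and validated is sketched below. The type names and `resolveSpeakerRecognition` are hypothetical. Only `enabled: false` is a default stated in this request; treating `"on-device"` and `0.75` as defaults is an assumption based on the example values.

```typescript
// Hypothetical types mirroring the proposed "voice.speakerRecognition" block.
type SpeakerRecognitionModel = "on-device" | "provider-diarization";

interface SpeakerRecognitionConfig {
  enabled: boolean;
  model: SpeakerRecognitionModel;
  provider?: string;         // only needed for "provider-diarization"
  threshold: number;         // cosine similarity threshold for "on-device"
  enrollmentSample?: string; // path or base64, as in the proposal
  enrolledAt?: string;       // ISO 8601 timestamp
}

// Assumed defaults: only "enabled: false" is stated in the request itself.
const SPEAKER_RECOGNITION_DEFAULTS: SpeakerRecognitionConfig = {
  enabled: false,
  model: "on-device",
  threshold: 0.75,
};

/** Merge user settings over the defaults and reject inconsistent values. */
function resolveSpeakerRecognition(
  raw: Partial<SpeakerRecognitionConfig> = {}
): SpeakerRecognitionConfig {
  const cfg: SpeakerRecognitionConfig = { ...SPEAKER_RECOGNITION_DEFAULTS, ...raw };
  // Cosine similarity lies in [-1, 1]; this check also rejects NaN.
  if (!(cfg.threshold >= -1 && cfg.threshold <= 1)) {
    throw new Error("threshold must be between -1 and 1");
  }
  if (cfg.model === "provider-diarization" && !cfg.provider) {
    throw new Error('provider is required when model is "provider-diarization"');
  }
  return cfg;
}
```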

VAD Integration

Speaker recognition should run on top of the existing VAD pipeline
Only accepted segments (matching enrolled speaker) should trigger the wake-word/silence-send logic
Rejected segments are dropped silently
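
The gating described above might look roughly like this. `VadSegment`, `SpeakerVerifier`, and `gateSegments` are hypothetical names. The sketch assumes the VAD stage hands over segments that already carry an embedding, and that the wake-word/silence-send logic can be reached through a callback.

```typescript
// Hypothetical shapes for illustration; the real VAD pipeline types are not known here.
interface VadSegment {
  id: number;
  embedding: number[]; // speaker embedding computed for this speech segment
}

// Returns true when the segment matches the enrolled speaker.
type SpeakerVerifier = (embedding: number[]) => boolean;

/**
 * Forward accepted segments to the downstream wake-word/silence-send logic
 * and silently drop the rest. Returns the number of dropped segments,
 * which could be useful for diagnostics.
 */
function gateSegments(
  segments: VadSegment[],
  isEnrolled: SpeakerVerifier,
  onAccepted: (segment: VadSegment) => void
): number {
  let dropped = 0;
  for (const segment of segments) {
    if (isEnrolled(segment.embedding)) {
      onAccepted(segment);
    } else {
      dropped++; // rejected: no callback, no user-visible signal
    }
  }
  return dropped;
}
```

Passing the verifier in as a function keeps the gate independent of whether matching is done on-device or through a provider.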

Fallback Behavior

If enrollment is not complete or model is unavailable, fall back to existing behavior (wake word / silence timeout only)
Show a warning in the Voice Wake overlay if speaker recognition is enabled but no voice has been enrolled yet
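
The fallback rules could be expressed as a small decision function, sketched here under assumptions: `SpeakerRecognitionState`, `VoiceGateDecision`, and `selectVoiceGate` are hypothetical names, and the warning text is a placeholder.

```typescript
// Hypothetical state and decision types for illustration.
interface SpeakerRecognitionState {
  enabled: boolean;        // voice.speakerRecognition.enabled
  enrolled: boolean;       // an enrollment sample exists
  modelAvailable: boolean; // the embedding model or provider can be used
}

interface VoiceGateDecision {
  mode: "speaker-filter" | "existing-behavior";
  warning?: string; // text for the Voice Wake overlay, if any
}

/** Decide whether to filter by speaker or fall back to wake word / silence timeout only. */
function selectVoiceGate(state: SpeakerRecognitionState): VoiceGateDecision {
  if (!state.enabled) {
    return { mode: "existing-behavior" };
  }
  if (!state.enrolled) {
    return {
      mode: "existing-behavior",
      warning: "Speaker recognition is enabled, but no voice has been enrolled yet.",
    };
  }
  if (!state.modelAvailable) {
    return { mode: "existing-behavior" };
  }
  return { mode: "speaker-filter" };
}
```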

Environment

macOS (Voice Wake + Push-to-Talk + Talk mode — primary use case)
iOS / Android (Talk mode) — native OS voice frameworks may offer built-in solutions
OpenClaw version: 2026.5.27+

Priority

Medium — valuable for real-world voice usability, but existing workarounds (push-to-talk, wake-word silence window) mitigate the most urgent cases.

Requested by: iqbalbhawana / Commander Iqbal Bhawana
