openclaw: [Bug] Voice STT: empty moonshine transcripts passed as raw JSON to LLM, clogging serialized processing queue



Summary

When using moonshine-tiny-en for Discord voice STT, empty/noisy transcripts are passed as raw JSON strings to the LLM instead of being filtered out. This wastes ~4 seconds and ~24k input tokens per empty segment and clogs the serialized processing queue, making the bot appear unresponsive in voice.

Reproduction

  1. Configure OpenClaw with voice.mode = "stt-tts" and moonshine-tiny-en as the STT model
  2. Join a voice channel with background noise or short utterances
  3. Observe that short/noisy audio segments produce empty transcripts: {"lang": "", "emotion": "", "event": "", "text": "", "timestamps": [], "durations": [], "tokens":[], "ys_log_probs": [], "words": []}
  4. These empty JSON strings are sent to the LLM as "transcripts" instead of being filtered
  5. The LLM returns NO_REPLY (correct behavior), but each call wastes ~4s and ~24k tokens
  6. The serialized processing queue (entry.processingQueue) blocks until each call completes
  7. With ~35% of segments being empty JSON, the pipeline appears to "stop" responding

Root Cause

In manager.runtime, transcribeVoiceAudio() calls normalizeOptionalString() on the STT result, which returns undefined for empty strings. However, the sherpa-onnx CLI output includes the entire JSON object on the last line, and the mediaUnderstanding.transcribeAudioFile() result appears to include the full JSON string as text even when the "text" field within it is empty.

The check at line ~1441 (if (!transcript)) catches undefined but NOT the full JSON string with an empty "text" field. So {"text": "", ...} passes through as a non-empty string transcript.
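
The gap described above could be closed by unwrapping the JSON before the emptiness check. The following is a minimal sketch, not the actual OpenClaw code: `extractTranscriptText` is a hypothetical helper name, and it assumes the STT result arrives either as plain text or as a single JSON object with a `text` field, as in the log sample in this report.

```typescript
// Hypothetical helper (not part of OpenClaw): returns usable transcript text,
// or undefined when the segment should be skipped.
function extractTranscriptText(raw: string | undefined): string | undefined {
  const trimmed = raw?.trim();
  if (!trimmed) return undefined;

  // Plain-text transcripts pass through unchanged.
  if (!trimmed.startsWith("{")) return trimmed;

  let parsed: unknown;
  try {
    parsed = JSON.parse(trimmed);
  } catch {
    // Starts with "{" but is not valid JSON: treat it as literal text.
    return trimmed;
  }

  if (typeof parsed !== "object" || parsed === null) return undefined;

  // Assumption: a JSON object is never speech by itself, so only its
  // "text" field counts. A missing or blank field means "skip".
  const text = (parsed as Record<string, unknown>).text;
  if (typeof text !== "string") return undefined;
  const clean = text.trim();
  return clean.length > 0 ? clean : undefined;
}
```

With a helper like this in front of the existing `if (!transcript)` check, the empty moonshine payload would resolve to `undefined` and be caught by that check, while real transcripts would reach the LLM as plain text instead of raw JSON.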

Evidence

Session logs show:

Voice transcript from speaker "[CK] Alex the 'guin":
{"lang": "", "emotion": "", "event": "", "text": "", "timestamps": [], "durations": [], "tokens":[], "ys_log_probs": [], "words": []}

100% of NO_REPLY responses (8 out of 8 in a recent session) were triggered by these empty JSON transcripts. The bot responded correctly to all real transcripts but was blocked during empty JSON processing.

52 segment files accumulated in 10 minutes. Only 10 TTS outputs were generated. The pipeline was processing empty JSON ~35% of the time.

Expected Behavior

  1. When the STT model returns "text": "" (or equivalent empty transcript), the segment should be skipped entirely — no LLM call needed
  2. The serialized processing queue should have a max depth or stale-segment discard mechanism to prevent pipeline stalls
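
The second point could look roughly like the sketch below. It is an illustration only: `BoundedSegmentQueue`, `maxDepth`, `maxAgeMs`, and `onIdle` are hypothetical names, and nothing here reflects how `entry.processingQueue` is actually implemented. The queue runs one job at a time, drops the oldest waiting job when too many pile up, and skips jobs that waited longer than a configured age.

```typescript
interface QueuedSegment {
  capturedAt: number; // timestamp (ms) when the segment was enqueued
  run: () => Promise<void>;
}

// Hypothetical serialized queue with a depth cap and stale-segment discard.
class BoundedSegmentQueue {
  private pending: QueuedSegment[] = [];
  private draining = false;
  private active: Promise<void> = Promise.resolve();

  constructor(
    private readonly maxDepth: number,
    private readonly maxAgeMs: number,
    private readonly now: () => number = Date.now,
  ) {}

  enqueue(run: () => Promise<void>): void {
    this.pending.push({ capturedAt: this.now(), run });
    // Over the cap: drop the oldest waiting segments first.
    while (this.pending.length > this.maxDepth) this.pending.shift();
    if (!this.draining) this.active = this.drain();
  }

  // Resolves once the current drain pass has finished.
  onIdle(): Promise<void> {
    return this.active;
  }

  private async drain(): Promise<void> {
    this.draining = true;
    try {
      let next: QueuedSegment | undefined;
      while ((next = this.pending.shift()) !== undefined) {
        // Skip segments that waited too long to still be relevant.
        if (this.now() - next.capturedAt > this.maxAgeMs) continue;
        try {
          await next.run();
        } catch {
          // A failed segment must not stall the segments behind it.
        }
      }
    } finally {
      this.draining = false;
    }
  }
}
```

Dropping the oldest waiting segment rather than the newest is a design choice for live voice, where a reply to recent speech is usually worth more than a late reply to old speech. The right values for the depth cap and maximum age would need tuning against real sessions.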

Environment

  • OpenClaw 2026.5.18
  • sherpa-onnx moonshine-tiny-en (int8)
  • Discord voice mode: stt-tts
  • Platform: Linode 4 vCPU, 8GB RAM

Workaround

Reducing captureSilenceGraceMs (from 1500 to 1000) and timeoutSeconds (from 300 to 120) helps marginally, as does periodically cleaning up stale /tmp/openclaw/discord-voice-*/segment.wav files. The core issue remains that empty transcripts should be filtered before reaching the LLM.
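
The periodic cleanup part of the workaround can be scripted. The sketch below assumes the directory layout quoted above (`discord-voice-*` directories, each holding a `segment.wav`); the function name `cleanStaleSegments` and the age threshold are illustrative choices, not OpenClaw settings.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical cleanup helper: deletes segment.wav files older than maxAgeMs
// from discord-voice-* directories under rootDir and returns the deleted paths.
function cleanStaleSegments(
  rootDir: string,
  maxAgeMs: number,
  now: number = Date.now(),
): string[] {
  const removed: string[] = [];
  if (!fs.existsSync(rootDir)) return removed;

  for (const entry of fs.readdirSync(rootDir, { withFileTypes: true })) {
    if (!entry.isDirectory() || !entry.name.startsWith("discord-voice-")) continue;

    const file = path.join(rootDir, entry.name, "segment.wav");
    let stat: fs.Stats;
    try {
      stat = fs.statSync(file);
    } catch {
      continue; // no segment.wav in this directory
    }

    // Only remove files whose last modification is older than the threshold.
    if (now - stat.mtimeMs > maxAgeMs) {
      fs.rmSync(file, { force: true });
      removed.push(file);
    }
  }
  return removed;
}
```

Calling `cleanStaleSegments("/tmp/openclaw", 10 * 60 * 1000)` from a timer or cron-style job would remove segments untouched for ten minutes. The threshold should stay comfortably above timeoutSeconds so that segments still being processed are not deleted.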
