openclaw - 💡(How to fix) Fix [Bug] Voice STT: empty moonshine transcripts passed as raw JSON to LLM, clogging serialized processing queue

openclaw · 2026-05-20 17:50:48

When using moonshine-tiny-en for Discord voice STT, empty/noisy transcripts are passed as raw JSON strings to the LLM instead of being filtered out. This wastes ~4 seconds and ~24k input tokens per empty segment and clogs the serialized processing queue, making the bot appear unresponsive in voice.

Root Cause

In manager.runtime, transcribeVoiceAudio() calls normalizeOptionalString() on the STT result, which returns undefined for empty strings. However, the sherpa-onnx CLI output includes the entire JSON object on the last line, and the mediaUnderstanding.transcribeAudioFile() result appears to include the full JSON string as text even when the "text" field within it is empty.

The check at line ~1441 (if (!transcript)) catches undefined but NOT the full JSON string with an empty "text" field. So {"text": "", ...} passes through as a non-empty string transcript.
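
One way the missing check could be sketched: before the `if (!transcript)` test, unwrap anything that looks like the sherpa-onnx JSON object and keep only its "text" field. The helper name `extractTranscriptText` is hypothetical and not part of OpenClaw; the sketch also assumes the JSON arrives as a single string, as the report describes.

```typescript
// Hypothetical helper, not OpenClaw's actual API.
// Returns the spoken text from an STT result, or undefined when there is
// nothing worth sending to the LLM.
function extractTranscriptText(raw: string | undefined): string | undefined {
  const trimmed = raw?.trim();
  if (!trimmed) return undefined;

  // Plain-text results pass through unchanged.
  if (!trimmed.startsWith("{")) return trimmed;

  try {
    const parsed: unknown = JSON.parse(trimmed);
    if (typeof parsed === "object" && parsed !== null && "text" in parsed) {
      const text = (parsed as { text?: unknown }).text;
      const cleaned = typeof text === "string" ? text.trim() : "";
      // An empty "text" field means the segment carried no speech.
      return cleaned.length > 0 ? cleaned : undefined;
    }
  } catch {
    // Not valid JSON after all: treat it as ordinary transcript text.
  }
  return trimmed;
}
```

With this in place, the empty moonshine object shown in the logs would resolve to `undefined`, so the existing `if (!transcript)` guard would skip the segment.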

Fix Action

Workaround

Reducing captureSilenceGraceMs (from 1500 to 1000) and timeoutSeconds (from 300 to 120) helps marginally, as does periodically cleaning up stale /tmp/openclaw/discord-voice-*/segment.wav files. The core issue remains: empty transcripts should be filtered before they reach the LLM.
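
The periodic cleanup could be scripted roughly as follows. This is a minimal sketch: `removeStaleSegments` and its parameters are hypothetical, and the only detail taken from the report is the /tmp/openclaw/discord-voice-*/segment.wav path pattern.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Hypothetical maintenance helper, not part of OpenClaw.
// Deletes <rootDir>/discord-voice-*/segment.wav files older than maxAgeMs
// and returns the paths it removed.
function removeStaleSegments(
  rootDir: string,
  maxAgeMs: number,
  now: number = Date.now(),
): string[] {
  const removed: string[] = [];
  if (!fs.existsSync(rootDir)) return removed;

  for (const entry of fs.readdirSync(rootDir, { withFileTypes: true })) {
    if (!entry.isDirectory() || !entry.name.startsWith("discord-voice-")) continue;

    const segmentPath = path.join(rootDir, entry.name, "segment.wav");
    if (!fs.existsSync(segmentPath)) continue;

    // Age is measured from the file's last modification time.
    const ageMs = now - fs.statSync(segmentPath).mtimeMs;
    if (ageMs > maxAgeMs) {
      fs.rmSync(segmentPath, { force: true });
      removed.push(segmentPath);
    }
  }
  return removed;
}
```

For example, `removeStaleSegments("/tmp/openclaw", 10 * 60 * 1000)` run from a timer or cron job would drop segment files untouched for more than ten minutes. The ten-minute threshold is an arbitrary illustration, not a recommended value.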

Code Example

Voice transcript from speaker "[CK] Alex the 'guin":
{"lang": "", "emotion": "", "event": "", "text": "", "timestamps": [], "durations": [], "tokens":[], "ys_log_probs": [], "words": []}

Reproduction

1. Configure OpenClaw with voice.mode = "stt-tts" and moonshine-tiny-en as the STT model
2. Join a voice channel with background noise or short utterances
3. Observe that short/noisy audio segments produce empty transcripts: {"lang": "", "emotion": "", "event": "", "text": "", "timestamps": [], "durations": [], "tokens":[], "ys_log_probs": [], "words": []}
4. These empty JSON strings are sent to the LLM as "transcripts" instead of being filtered
5. The LLM returns NO_REPLY (correct behavior), but each call wastes ~4s and ~24k tokens
6. The serialized processing queue (entry.processingQueue) blocks until each call completes
7. With ~35% of segments being empty JSON, the pipeline appears to "stop" responding

Evidence

Session logs show:

Voice transcript from speaker "[CK] Alex the 'guin":
{"lang": "", "emotion": "", "event": "", "text": "", "timestamps": [], "durations": [], "tokens":[], "ys_log_probs": [], "words": []}

100% of NO_REPLY responses (8 out of 8 in a recent session) were triggered by these empty JSON transcripts. The bot responded correctly to all real transcripts but was blocked during empty JSON processing.

52 segment files accumulated in 10 minutes. Only 10 TTS outputs were generated. The pipeline was processing empty JSON ~35% of the time.

Expected Behavior

When the STT model returns "text": "" (or equivalent empty transcript), the segment should be skipped entirely — no LLM call needed
The serialized processing queue should have a max depth or stale-segment discard mechanism to prevent pipeline stalls
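
The second expectation could be met with a pruning step that runs before the next segment is dequeued. The sketch below is only an illustration: `PendingSegment`, `prunePending`, and both limits are hypothetical names, and nothing here reflects how entry.processingQueue is actually implemented.

```typescript
// Hypothetical shape for a queued voice segment; field names are illustrative.
interface PendingSegment {
  id: string;
  enqueuedAtMs: number;
}

// Drops segments that waited longer than maxAgeMs, then keeps only the
// newest maxDepth of the remainder, preserving their original order.
function prunePending(
  pending: readonly PendingSegment[],
  nowMs: number,
  maxDepth: number,
  maxAgeMs: number,
): PendingSegment[] {
  const fresh = pending.filter((s) => nowMs - s.enqueuedAtMs <= maxAgeMs);
  // slice(-0) would return the whole array, so guard the zero case explicitly.
  return maxDepth > 0 ? fresh.slice(-maxDepth) : [];
}
```

Keeping the newest segments rather than the oldest is a design choice for a live voice channel, where a reply to speech from several seconds ago is usually less useful than a reply to what was just said.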

Environment

OpenClaw 2026.5.18
sherpa-onnx moonshine-tiny-en (int8)
Discord voice mode: stt-tts
Platform: Linode 4 vCPU, 8GB RAM
