hermes - ✅(Solved) Fix [Feature]: Expose OpenAI TTS `instructions` field on the text_to_speech tool [1 pull request, 1 participant]

hermes • 2026-04-22 23:29:04


GitHub stats

NousResearch/hermes-agent#14196•Fetched 2026-04-23 07:46:10


Author

0xAlcibiades

Participants

0xAlcibiades

Timeline (top)

labeled ×3, cross-referenced ×1, referenced ×1

Root Cause

The default OpenAI TTS model Hermes already uses for the openai provider, gpt-4o-mini-tts, accepts style direction (tone, emotion, pacing, accent, whispering) through an instructions parameter on audio.speech.create() (OpenAI voice-design guide). That parameter is the primary mechanism for getting expressive speech out of the model. Today Hermes runs the model in its least-capable mode because _generate_openai_tts hard-codes the request kwargs and silently drops any style direction.

Fix Action

Fix proposed (PR open, not merged at fetch time)

Proposed in PR: feat(tools): forward OpenAI TTS instructions field through text_to_speech (https://github.com/NousResearch/hermes-agent/pull/14205)

PR fix notes

PR #14205: feat(tools): forward OpenAI TTS `instructions` field through text_to_speech

Repository: NousResearch/hermes-agent
Author: 0xAlcibiades
State: open | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/14205

Description (problem / solution / changelog)

Closes #14196.

Summary

Adds an optional instructions parameter to the text_to_speech tool schema and threads it through text_to_speech_tool() → _generate_openai_tts() → client.audio.speech.create().
The kwarg is forwarded only when truthy, so tts-1 / tts-1-hd and strict OpenAI-compatible servers that reject unknown kwargs are unaffected.
Unlocks gpt-4o-mini-tts's voice-design capability (tone, emotion, pacing, accent, whispering) — already supported by the model but previously unreachable through the tool.
Same passthrough enables self-hosted OpenAI-compatible voice-design servers (Qwen3-TTS-VoiceDesign via oMLX, etc.) that are already wired in through tts.openai.base_url — the established convention per #9004.
No new provider, no new toolset, no new config key.

Docs: voice-mode.md updated with a short note under the tts.openai block and a line in the TTS Provider Comparison.

Why

gpt-4o-mini-tts is Hermes's default model on the openai TTS provider, and its instructions field (OpenAI docs) is the headline quality lever on that model. The hard-coded kwarg list in _generate_openai_tts silently dropped any style direction, so every reply in voice mode got the same flat default regardless of what was being said. This PR is the minimum change to expose that capability.

Test plan

New file tests/tools/test_tts_instructions.py (6 tests) covers:

_generate_openai_tts forwards instructions to audio.speech.create when provided
instructions key absent from create kwargs when not provided (regression guard for tts-1 / strict servers)
Empty-string instructions omitted (treated as absent)
Tool-level threading: text_to_speech_tool(instructions=...) → backend sees it
Tool-level omission when arg not supplied
Schema declares instructions as optional string, not in required

Existing tests/tools/test_tts_max_text_length.py fake helper widened to accept the new kwarg.
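The cases listed above could be sketched roughly as follows. This is a minimal illustration, not the contents of tests/tools/test_tts_instructions.py: generate_speech is a simplified stand-in for _generate_openai_tts, and TOOL_SCHEMA, the "alloy" voice, and the required list are assumptions made only so the example runs on its own.

```python
from unittest.mock import MagicMock


def generate_speech(client, text, instructions=None):
    """Simplified, hypothetical stand-in for _generate_openai_tts."""
    create_kwargs = {"model": "gpt-4o-mini-tts", "voice": "alloy", "input": text}
    # Forward the kwarg only when truthy, mirroring the PR's guard
    if instructions:
        create_kwargs["instructions"] = instructions
    return client.audio.speech.create(**create_kwargs)


# Hypothetical, trimmed-down tool schema used only by the last test
TOOL_SCHEMA = {
    "type": "object",
    "properties": {
        "text": {"type": "string"},
        "output_path": {"type": "string"},
        "instructions": {"type": "string"},
    },
    "required": ["text"],
}


def test_forwards_instructions():
    client = MagicMock()
    generate_speech(client, "hello", instructions="Whisper conspiratorially")
    kwargs = client.audio.speech.create.call_args.kwargs
    assert kwargs["instructions"] == "Whisper conspiratorially"


def test_omits_key_when_not_provided():
    client = MagicMock()
    generate_speech(client, "hello")
    assert "instructions" not in client.audio.speech.create.call_args.kwargs


def test_omits_empty_string():
    client = MagicMock()
    generate_speech(client, "hello", instructions="")
    assert "instructions" not in client.audio.speech.create.call_args.kwargs


def test_schema_marks_instructions_optional():
    assert TOOL_SCHEMA["properties"]["instructions"]["type"] == "string"
    assert "instructions" not in TOOL_SCHEMA["required"]
```

Asserting on the absence of the key, rather than on a None value, is what guards tts-1 / tts-1-hd and strict servers: they never see the kwarg at all.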

Run locally:

uv run scripts/run_tests.sh tests/tools/test_tts_instructions.py tests/tools/test_tts_speed.py tests/tools/test_tts_max_text_length.py
# 50 passed

Manual verification against gpt-4o-mini-tts: same text with vs. without instructions="Whisper conspiratorially" produces audibly different output. Also verified against a self-hosted oMLX server serving Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16 via tts.openai.base_url override.

Platforms tested

macOS (Darwin 25.2)

No platform-specific code touched; change is pure kwarg passthrough in the existing OpenAI client call.

Changed files

tests/tools/test_tts_instructions.py (added, +124/-0)
tests/tools/test_tts_max_text_length.py (modified, +2/-2)
tools/tts_tool.py (modified, +31/-3)
website/docs/user-guide/features/voice-mode.md (modified, +11/-0)


Problem or Use Case

Hermes's text_to_speech tool schema currently accepts only text and output_path. There is no way — per call or per config — for the agent to control tone, emotion, accent, pacing, or whispering on spoken replies. In voice mode (CLI, Telegram, Discord VC), every reply gets the same flat default voice regardless of whether the content is a somber correction, an excited announcement, a spooky story, or a calming instruction.

Secondarily, the same fix benefits users running local OpenAI-compatible TTS servers (Kokoro, StyleTTS2, oMLX hosting Qwen3-TTS-VoiceDesign, etc.) through the openai backend via tts.openai.base_url override — the established convention per #9004 and the config docs. Several of these servers accept the same instructions field; some (Qwen3-TTS-VoiceDesign) require it.

Verified wire shape

Verified against a self-hosted oMLX server serving Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16. The field name is the plural instructions, identical to OpenAI's own API:

# 200 OK → WAV audio returned
curl -X POST "$BASE_URL/v1/audio/speech" \
  -H "Authorization: Bearer $KEY" -H "Content-Type: application/json" \
  -d '{
    "model":"Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16",
    "voice":"default",
    "input":"hello",
    "instructions":"Speak in a cheerful and positive tone."
  }'
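For readers who prefer Python, the same wire shape can be assembled with the standard library alone. This is a sketch under assumptions: build_speech_request is a hypothetical helper, and the base URL and API key below are placeholders rather than values from the issue.

```python
import json
import urllib.request
from typing import Optional


def build_speech_request(
    base_url: str,
    api_key: str,
    model: str,
    text: str,
    voice: str = "default",
    instructions: Optional[str] = None,
) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/audio/speech."""
    payload = {"model": model, "voice": voice, "input": text}
    # Same field name as the curl call: plural "instructions"
    if instructions:
        payload["instructions"] = instructions
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/audio/speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder endpoint and key; substitute your own server details
req = build_speech_request(
    "http://localhost:8000",
    "sk-local",
    "Qwen3-TTS-12Hz-1.7B-VoiceDesign-bf16",
    "hello",
    instructions="Speak in a cheerful and positive tone.",
)
body = json.loads(req.data)
print(req.full_url)
print(body)
```

Passing req to urllib.request.urlopen would send it; the response body is the audio payload.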

Proposed Solution

A small change scoped to the openai TTS backend that works with any OpenAI-compatible server behind it. The OpenAI Python SDK already accepts instructions= on audio.speech.create(); Hermes just needs to expose it.

1. Add the parameter to the text_to_speech tool schema:

"instructions": {
    "type": "string",
    "description": (
        "Optional voice-design guidance: tone, emotion, pacing, accent, "
        "whispering, impressions. Forwarded to the OpenAI TTS backend "
        "(gpt-4o-mini-tts and OpenAI-compatible servers). Silently "
        "ignored by backends that don't support it."
    ),
},

2. Forward the value in _generate_openai_tts, omitting the key when empty so tts-1 / tts-1-hd and strict servers are unaffected:

if instructions:
    create_kwargs["instructions"] = instructions

3. Thread the arg through text_to_speech_tool() and the registry handler lambda. No new provider, no new toolset, no new config key.

4. Document the new argument in the TTS section of voice-mode.md with a pointer to OpenAI's voice-design guidance.

This follows the same shape as #6926 (forwarding a native OpenAI TTS kwarg through the same code path).
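Steps 2 and 3 can be pictured end to end with a small sketch. The function names match the issue, but the signatures here are simplified assumptions (the real functions take more arguments and build their own client), RecordingSpeech is a hypothetical fake of client.audio.speech, and the "alloy" voice is only illustrative.

```python
from typing import Any, Dict, Optional


class RecordingSpeech:
    """Hypothetical fake of client.audio.speech that records its kwargs."""

    def __init__(self) -> None:
        self.last_kwargs: Dict[str, Any] = {}

    def create(self, **kwargs: Any) -> bytes:
        self.last_kwargs = kwargs
        return b"fake-audio"


def _generate_openai_tts(
    speech: RecordingSpeech, text: str, instructions: Optional[str] = None
) -> bytes:
    create_kwargs: Dict[str, Any] = {
        "model": "gpt-4o-mini-tts",
        "voice": "alloy",
        "input": text,
    }
    # Omit the key entirely when instructions is None or ""
    if instructions:
        create_kwargs["instructions"] = instructions
    return speech.create(**create_kwargs)


def text_to_speech_tool(
    speech: RecordingSpeech, text: str, instructions: Optional[str] = None
) -> bytes:
    # Tool-level entry point: passes the optional arg straight through
    return _generate_openai_tts(speech, text, instructions=instructions)


speech = RecordingSpeech()
# Registry-style handler: reads the optional arg from the tool-call payload
handler = lambda args: text_to_speech_tool(
    speech, args["text"], instructions=args.get("instructions")
)

handler({"text": "Deploy failed.", "instructions": "Somber, measured pace"})
styled = dict(speech.last_kwargs)
handler({"text": "Deploy failed."})
plain = dict(speech.last_kwargs)
```

Using args.get("instructions") in the handler means a tool call that never mentions the field reaches the backend as None, which the truthiness guard then drops.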

Why it matters

Expressive voice output. The agent matches delivery to content — somber when reporting a failure, excited when announcing completion, whispered for asides. Today all of this flattens to the same voice.
Zero configuration burden for users. The model picks per-utterance style from what it's about to say; no user config required.
Unlocks capability already being paid for. gpt-4o-mini-tts is Hermes's default for the openai provider; instructions is its headline feature.
Self-hosted voice-design models become first-class. Qwen3-TTS-VoiceDesign on oMLX, and any other OpenAI-compatible server that implements instructions, work via the existing tts.openai.base_url override with no new provider code.

Test Plan

Mirrors the mock pattern in tests/tools/test_tts_speed.py:

Tool called with instructions="..." → mocked client.audio.speech.create receives instructions=... in kwargs
Tool called without instructions → key is absent from kwargs (preserves behavior on tts-1, tts-1-hd, strict servers)
Non-openai backends (edge, elevenlabs, gemini, xai, minimax, mistral, neutts, kittentts) silently ignore the arg
Manual verification: same text through gpt-4o-mini-tts with vs. without instructions="Whisper conspiratorially" produces audibly different output
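The "silently ignore" case for non-openai backends might look like the sketch below. The dispatch table and backend functions are hypothetical and return plain dicts instead of calling any real provider; how Hermes actually routes between providers is not shown in the issue.

```python
from typing import Callable, Dict, Optional


def openai_backend(text: str, instructions: Optional[str] = None) -> Dict[str, str]:
    request = {"provider": "openai", "input": text}
    if instructions:
        request["instructions"] = instructions
    return request


def edge_backend(text: str, instructions: Optional[str] = None) -> Dict[str, str]:
    # Accepts the argument for a uniform signature, then drops it
    return {"provider": "edge", "input": text}


# Hypothetical provider registry
BACKENDS: Dict[str, Callable[..., Dict[str, str]]] = {
    "openai": openai_backend,
    "edge": edge_backend,
}


def synthesize(
    provider: str, text: str, instructions: Optional[str] = None
) -> Dict[str, str]:
    return BACKENDS[provider](text, instructions=instructions)


openai_req = synthesize("openai", "All done!", instructions="Excited")
edge_req = synthesize("edge", "All done!", instructions="Excited")
```

Keeping one signature across backends lets the tool pass instructions unconditionally, so no provider-specific branching is needed at the call site.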

Out of scope

Mapping instructions onto other backends' native style controls (Gemini inline prompt prefix, ElevenLabs voice_settings, Mistral). Separate follow-ups.
tts.openai.instructions config-level default. Layerable on top of the tool arg later.

Related

#6926 — sibling: wires another native OpenAI TTS kwarg (speed) through the same backend
#9004 — precedent: Kokoro/StyleTTS2 treated as OpenAI-compatible via the existing openai backend
#6566 — tangential: tts.openai.base_url validation, same config path this feature relies on

extent analysis

TL;DR

Add an instructions parameter to the text_to_speech tool schema and forward its value to the OpenAI TTS backend to enable expressive voice output.

Guidance

Update the text_to_speech tool schema to include the instructions parameter with a string type and a description of its purpose.
Modify the _generate_openai_tts function to forward the instructions value to the OpenAI TTS backend, omitting the key when empty.
Thread the instructions argument through the text_to_speech_tool function and the registry handler lambda.
Document the new instructions argument in the TTS section of the voice-mode.md documentation.

Example

"instructions": {
    "type": "string",
    "description": (
        "Optional voice-design guidance: tone, emotion, pacing, accent, "
        "whispering, impressions. Forwarded to the OpenAI TTS backend "
        "(gpt-4o-mini-tts and OpenAI-compatible servers). Silently "
        "ignored by backends that don't support it."
    ),
}

Notes

This solution only applies to the openai TTS backend and OpenAI-compatible servers. Other backends will silently ignore the instructions argument.

Recommendation

Apply the proposed solution to add the instructions parameter to the text_to_speech tool schema and forward its value to the OpenAI TTS backend. This will enable expressive voice output without requiring any additional configuration from users.
