hermes - ✅(Solved) Fix Hindsight tool calls freeze session when internal LLM errors out [1 pull request, 1 participant]

GitHub stats

NousResearch/hermes-agent#17403 (fetched 2026-04-30 06:47:50)

  • Comments: 0
  • Participants: 1
  • Timeline events: 7
  • Reactions: 0
  • Timeline (top): labeled ×4, referenced ×2, cross-referenced ×1

Fix Action

Fixed

PR fix notes

PR #17416: fix(hindsight): add circuit breaker to prevent session freeze on daemon failure

Description (problem / solution / changelog)

Summary

Fixes #17403.

When the embedded Hindsight daemon fails to start (e.g., database migration error), every tool call (hindsight_retain, hindsight_recall, hindsight_reflect) and every background sync_turn blocks for ~177 seconds waiting for the daemon startup timeout. With no circuit breaker, this repeats on every interaction, making the session appear frozen.

Changes

Circuit breaker in _run_hindsight_operation() (plugins/memory/hindsight/__init__.py):

  • Tracks consecutive operation failures
  • After 3 consecutive failures (_CIRCUIT_BREAKER_THRESHOLD), raises immediately with a descriptive error instead of blocking
  • Resets after a 60s cooldown (_CIRCUIT_BREAKER_COOLDOWN) — half-open state allows one retry
  • Resets to 0 on any successful operation
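The breaker described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the `CircuitBreaker` class, `guarded_call`, and the injectable `clock` are hypothetical names, and only the threshold (3) and cooldown (60s) come from the PR notes. The PR itself tracks this state inside `_run_hindsight_operation()`.

```python
import time


class CircuitBreaker:
    """Hypothetical helper; the PR keeps equivalent state on the provider."""

    def __init__(self, threshold=3, cooldown=60.0, clock=time.monotonic):
        self.threshold = threshold      # mirrors _CIRCUIT_BREAKER_THRESHOLD
        self.cooldown = cooldown        # mirrors _CIRCUIT_BREAKER_COOLDOWN (seconds)
        self._clock = clock             # injectable so tests need not sleep
        self._failures = 0
        self._opened_at = None

    def allow(self):
        """Return True if a call may proceed, False if it should fail fast."""
        if self._failures < self.threshold:
            return True                 # closed: normal operation
        if self._clock() - self._opened_at >= self.cooldown:
            return True                 # half-open: let a probe through
        return False                    # open: fail fast

    def record_failure(self):
        self._failures += 1
        if self._failures >= self.threshold:
            self._opened_at = self._clock()   # (re)start the cooldown window

    def record_success(self):
        self._failures = 0
        self._opened_at = None


def guarded_call(breaker, operation):
    """Run operation() unless the circuit is open; record the outcome."""
    if not breaker.allow():
        raise RuntimeError("Hindsight circuit open: daemon failed repeatedly")
    try:
        result = operation()
    except Exception:
        breaker.record_failure()
        raise
    breaker.record_success()
    return result
```

In this sketch a failed probe after the cooldown reopens the circuit immediately instead of allowing three more slow attempts; whether the PR behaves the same way is not stated in the notes. The class is also not thread-safe, so a real provider would likely need a lock around the counter.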

Expanded _is_retriable_embedded_connection_error():

  • Added 'failed to start daemon' and 'after it has been closed' markers
  • This allows a single client-recreation retry before the circuit opens (matching existing idle-shutdown recovery pattern)
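A marker check of this kind might look like the sketch below. The two new markers are quoted from the PR notes; the `"connection refused"` entry is only a placeholder for the pre-existing connection-level markers, which are not listed in the issue, and the function name is shortened for illustration.

```python
# Substrings matched against the lowercased exception message.
_RETRIABLE_MARKERS = (
    "connection refused",          # placeholder for the existing markers (assumption)
    "failed to start daemon",      # added by the PR
    "after it has been closed",    # added by the PR
)


def is_retriable_embedded_error(exc):
    """Return True if the exception message contains a known retriable marker."""
    message = str(exc).lower()
    return any(marker in message for marker in _RETRIABLE_MARKERS)
```

Matching on lowercase substrings keeps the check tolerant of profile names and capitalisation, at the cost of coupling it to the exact wording of the error messages.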

sync_turn protection:

  • Checks circuit breaker before spawning retain thread
  • Prevents accumulation of blocked daemon threads
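The sync_turn guard can be illustrated with a small stand-in class. `SyncSketch`, its constructor arguments, and the boolean return value are all hypothetical; only the join-then-spawn pattern is taken from the issue's snippet.

```python
import threading


class SyncSketch:
    """Hypothetical stand-in for the provider; names are illustrative."""

    def __init__(self, retain, circuit_is_open):
        self._retain = retain                    # callable doing the blocking retain
        self._circuit_is_open = circuit_is_open  # callable returning a bool
        self._sync_thread = None

    def sync_turn(self, turn):
        # Check the breaker first so no thread is created for a known-broken daemon.
        if self._circuit_is_open():
            return False
        if self._sync_thread and self._sync_thread.is_alive():
            self._sync_thread.join(timeout=5.0)
        self._sync_thread = threading.Thread(
            target=self._retain, args=(turn,), daemon=True
        )
        self._sync_thread.start()
        return True
```

Returning a flag here is only a convenience for the sketch; it makes the skipped case observable without inspecting threads.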

Testing

  • 5 new tests in tests/plugins/memory/test_hindsight_circuit_breaker.py:
    • Circuit opens after threshold failures
    • Circuit resets after cooldown (half-open)
    • Success resets failure counter
    • sync_turn skips when circuit is open
    • New retriable error markers recognized
  • All 132 existing tests/plugins/memory/ tests pass

Behaviour

| State | Before | After |
|---|---|---|
| Daemon broken, 1st call | Blocks ~177s, raises | Blocks ~177s, raises |
| Daemon broken, 2nd call | Blocks ~177s, raises | Blocks ~177s, raises |
| Daemon broken, 3rd call | Blocks ~177s, raises | Blocks ~177s, raises (opens circuit) |
| Daemon broken, 4th+ call | Blocks ~177s, raises | Raises immediately (~0ms) |
| After 60s cooldown | Blocks ~177s (repeat) | Allows one retry (half-open) |
| Daemon recovers | N/A | Counter resets to 0, normal operation |

Changed files

  • plugins/memory/hindsight/__init__.py (modified, +35/-4)
  • tests/plugins/memory/test_hindsight_circuit_breaker.py (added, +91/-0)


Bug Description

When Hindsight's embedded daemon fails to start (e.g., database migration failure), every tool call (hindsight_retain, hindsight_recall, hindsight_reflect) and every background sync_turn blocks for ~177 seconds waiting for the daemon startup timeout. The session appears frozen during this time.

The error handling catches the exception and returns it, but the blocking timeout is so long that the user experience is a complete freeze.

Environment

  • OS: Ubuntu 22.04.5 LTS (x86_64)
  • Python: 3.11.14
  • Hermes Agent: v0.11.0 (2026.4.23)
  • Hindsight mode: local_embedded
  • Hindsight LLM provider: openai_compatible (https://opencode.ai/zen/go/v1, model: deepseek-v4-flash)
  • hindsight-client/hindsight-all: NOT installed (only embedded runtime available)
  • Config: ~/.hermes/hindsight/config.json

Steps to Reproduce

  1. Configure Hindsight in local_embedded mode with a broken daemon state (e.g., corrupted DB causing RuntimeError: Database migration failed)
  2. Start a Hermes session
  3. Send any message — the session will appear frozen for ~177 seconds while the daemon startup times out
  4. After the timeout, the error surfaces: "Failed to start daemon for profile 'hermes'"
  5. Every subsequent message triggers the same 177-second block

Error Traceback

From ~/.hermes/logs/hindsight-embed.log:

╭────────────────────── Starting Daemon (hermes @ :9177) ──────────────────────╮
│        | RuntimeError: Database migration failed                             │
│        +------------------------------------                                 │
│  ERROR:    Application startup failed. Exiting.                              │
│  ⏳ Waiting for daemon... (177s elapsed)                                     │
╰──────────────────────────────────────────────────────────────────────────────╯
╭───────────────── ✗ Daemon Failed (Timeout) (hermes @ :9177) ─────────────────╮
│  ✗ Daemon failed to start (timeout)                                          │
│  See full log: /home/dev/.hindsight/profiles/hermes.log                      │
╰──────────────────────────────────────────────────────────────────────────────╯

=== Daemon startup failed: Failed to start daemon for profile 'hermes' ===
Traceback (most recent call last):
  File ".../plugins/memory/hindsight/__init__.py", line 1025, in _start_daemon
    client._ensure_started()
  File ".../hindsight/embedded.py", line 186, in _ensure_started
    raise RuntimeError("Failed to start daemon for profile 'hermes'")

From ~/.hermes/logs/errors.log (repeating every ~3 minutes, 30+ occurrences today):

2026-04-29 08:31:11,146 WARNING plugins.memory.hindsight: Hindsight sync failed: Cannot use HindsightEmbedded after it has been closed
2026-04-29 08:34:11,414 WARNING plugins.memory.hindsight: Hindsight sync failed: Failed to start daemon for profile 'hermes'
2026-04-29 08:37:11,582 WARNING plugins.memory.hindsight: Hindsight sync failed: Cannot use HindsightEmbedded after it has been closed
... (continues alternating between the two errors)
2026-04-29 09:31:17,721 WARNING plugins.memory.hindsight: hindsight_retain failed: Failed to start daemon for profile 'hermes'

Root Cause

In plugins/memory/hindsight/__init__.py:

  1. _run_hindsight_operation (lines 820-835) calls _run_sync(), which blocks on future.result(timeout=self._timeout) (default 120s). But the daemon startup itself waits 177s before timing out, so the effective block time exceeds the configured timeout.

  2. sync_turn (lines 1222-1252) spawns daemon threads that call client.aretain_batch() → _run_hindsight_operation() → _run_sync(). When the daemon is broken, every turn creates a new thread that blocks for ~177s. The 5-second join at line 1250 is insufficient:

    if self._sync_thread and self._sync_thread.is_alive():
        self._sync_thread.join(timeout=5.0)  # previous thread still blocked at 177s
    self._sync_thread = threading.Thread(target=_sync, daemon=True)
    self._sync_thread.start()

  3. No circuit breaker: There is no mechanism to detect repeated daemon failures and fast-fail subsequent attempts. Every call retries the full startup sequence.

  4. _is_retriable_embedded_connection_error (lines 805-818) only matches connection-level errors, not "Cannot use HindsightEmbedded after it has been closed" or "Failed to start daemon", so _run_hindsight_operation does not retry; it just propagates the exception after the full timeout.
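The thread accumulation in item 2 can be reproduced in miniature. The sketch below is an illustration only: an `Event` stands in for the ~177s startup wait and the join timeout is scaled down from 5.0s to 0.01s so it runs instantly. None of the names come from the plugin.

```python
import threading

release = threading.Event()          # stands in for the ~177s startup wait


def blocked_sync():
    release.wait()                   # blocks until the "daemon startup" gives up


threads = []
for _ in range(3):                   # three turns against a broken daemon
    if threads and threads[-1].is_alive():
        threads[-1].join(timeout=0.01)   # scaled-down join(timeout=5.0)
    t = threading.Thread(target=blocked_sync, daemon=True)
    t.start()
    threads.append(t)

# join() timed out each time, so every earlier thread is still blocked.
alive_while_blocked = sum(t.is_alive() for t in threads)

release.set()                        # let the workers finish
for t in threads:
    t.join()
```

Because `join(timeout=...)` only stops waiting and does not stop the thread, each turn adds one more blocked worker until the underlying wait ends.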

Expected Behavior

  • When the daemon fails to start, subsequent tool calls should fail fast (not block for 177s)
  • A circuit breaker should prevent repeated daemon startup attempts after consecutive failures
  • The user should see an error message within a few seconds, not after a 2-3 minute freeze

Proposed Fix

  1. Circuit breaker in _get_client(): Track consecutive startup failures. After N failures (e.g., 3), return an error immediately without attempting daemon startup. Reset after a cooldown period.

  2. Cap daemon startup timeout: The 177s daemon startup wait should be capped to match _DEFAULT_TIMEOUT (120s) or a shorter value (e.g., 30s for tool calls).

  3. In sync_turn: Check circuit breaker state before spawning a new thread. If the daemon is known-broken, skip the retain silently (log a warning once, not every turn).

  4. Fast-fail for known-broken state: When _run_hindsight_operation catches "Cannot use HindsightEmbedded after it has been closed", it should reset self._client = None and check the circuit breaker before retrying.
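Item 2 (capping the startup wait) could be approached as in the sketch below. `start_with_deadline` and its parameters are hypothetical, and this is not how the embedded client is known to work; it only shows the caller giving up after a deadline using the standard library.

```python
import concurrent.futures
import time


def start_with_deadline(start_fn, deadline_s):
    """Run start_fn in a worker thread and stop waiting after deadline_s seconds.

    The worker is not cancelled; only the caller stops waiting for it.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(start_fn)
    try:
        return future.result(timeout=deadline_s)
    except concurrent.futures.TimeoutError:
        raise RuntimeError(f"daemon did not start within {deadline_s}s") from None
    finally:
        pool.shutdown(wait=False)    # do not block the caller on the stuck worker
```

One caveat with this approach: the abandoned worker keeps running in the background, and executor threads are joined at interpreter exit, so a startup routine that never returns would still need its own internal timeout.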

Related Issues

  • #14950 (Per-operation timeout configuration for Hindsight memory provider)
  • #17226 (RuntimeError during aretain_batch when background event loop is replaced)
  • #13125 (local_embedded causes infinite daemon crash-restart loop)

extent analysis

TL;DR

Implement a circuit breaker in _get_client() to track consecutive daemon startup failures and return an error immediately after a specified number of failures.

Guidance

  • Introduce a circuit breaker mechanism to prevent repeated daemon startup attempts after consecutive failures.
  • Cap the daemon startup timeout to match _DEFAULT_TIMEOUT (120s) or a shorter value (e.g., 30s) to prevent long blocking times.
  • Modify sync_turn to check the circuit breaker state before spawning a new thread and skip the retain operation if the daemon is known to be broken.
  • Update _run_hindsight_operation to fast-fail when catching "Cannot use HindsightEmbedded after it has been closed" and reset self._client = None before retrying.

Example

import logging

logger = logging.getLogger(__name__)

_FAILURE_THRESHOLD = 3  # illustrative value taken from the proposed fix

def _get_client(self):
    # Circuit breaker: fail fast once startup has failed repeatedly
    if self._consecutive_failures >= _FAILURE_THRESHOLD:
        raise RuntimeError("Daemon startup failed after consecutive attempts")
    # ... existing code ...

def sync_turn(self):
    # Use the same counter as _get_client() before spawning a new thread
    if self._consecutive_failures >= _FAILURE_THRESHOLD:
        # Log a warning and skip the retain operation
        logger.warning("Daemon is known to be broken, skipping retain operation")
        return
    # ... existing code ...

Notes

The proposed fix touches _get_client(), sync_turn, and _run_hindsight_operation. PR #17416 implements the circuit breaker in _run_hindsight_operation() rather than _get_client(), and its change list does not mention capping the daemon startup timeout. Per the PR's Behaviour table, the first three calls against a broken daemon therefore still block for ~177s each before the circuit opens.

Recommendation

Apply the proposed fix, including the circuit breaker mechanism and updates to sync_turn and _run_hindsight_operation, to prevent repeated daemon startup attempts and long blocking times. This should improve the overall user experience by providing faster error handling and preventing freezes.
