hermes - ✅(Solved) Fix Gateway turn can silently stop after session_search timeout / initial tools while active_agents=0 [2 pull requests, 1 comments, 1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
NousResearch/hermes-agent#23081Fetched 2026-05-11 03:31:25
View on GitHub
Comments
1
Participants
1
Timeline
7
Reactions
0
Author
Participants
Timeline (top)
labeled ×4cross-referenced ×2commented ×1

A Telegram gateway turn can appear to stall/stop after session_search auxiliary timeouts or after initial tool calls. The gateway stays healthy (running, Telegram connected), but the active chat has no response, active_agents returns to 0, and no tool/model subprocess is running.

Observed locally on Hermes Agent v0.13.0 (2026.5.7), commit 498bfc7, running from source on macOS.

Error Message

WARNING agent.auxiliary_client: Auxiliary session_search: connection error on auto and no fallback available (tried: openrouter, nous, local/custom, api-key) WARNING root: Session summarization failed after 3 attempts: Codex auxiliary Responses stream exceeded 30.0s total timeout Traceback (most recent call last): File ".../tools/session_search_tool.py", line 228, in _summarize_session response = await async_call_llm( File ".../agent/auxiliary_client.py", line 4018, in async_call_llm await client.chat.completions.create(**kwargs), task) File ".../agent/auxiliary_client.py", line 853, in create return await asyncio.to_thread(self._sync.create, **kwargs) File ".../asyncio/threads.py", line 25, in to_thread return await loop.run_in_executor(None, func_call) TimeoutError: Codex auxiliary Responses stream exceeded 30.0s total timeout

Root Cause

Possible root cause area

Fix Action

Fix / Workaround

WARNING gateway.platforms.base: [Telegram] Cancelled task for agent:main:telegram:dm:378478304 did not exit within 5s; unblocking dispatch and letting the task unwind in the background

PR fix notes

PR #2: fix(gateway): surface incomplete tool turns

Description (problem / solution / changelog)

What does this PR do?

Fixes a gateway silent-stop class where a turn can become inactive after tool activity without delivering a final user-visible response. The gateway now turns incomplete, non-interrupted tool-tail results into an explicit failure message so the user is not left with active_agents=0 and no reply.

It also makes session_search summarization bridge timeouts degrade to raw transcript previews instead of returning a failed tool result. That keeps recall best-effort and lets the main agent continue when auxiliary summarization is slow or unavailable.

Related Issue

Related: https://github.com/NousResearch/hermes-agent/issues/23081

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • gateway/run.py: normalize incomplete, non-interrupted tool-tail turns into a visible gateway response.
  • tools/session_search_tool.py: return successful degraded raw-preview results when summarization bridge execution times out.
  • tests/gateway/test_duplicate_reply_suppression.py: cover incomplete tool-tail response normalization and preserve interrupted stop flow behavior.
  • tests/tools/test_session_search.py: cover session_search timeout degradation to raw previews.

How to Test

  1. Run focused gateway/session-search regression tests:
    scripts/run_tests.sh tests/gateway/test_duplicate_reply_suppression.py tests/tools/test_session_search.py
  2. Compile changed Python files:
    python3 -m py_compile gateway/run.py tools/session_search_tool.py tests/gateway/test_duplicate_reply_suppression.py tests/tools/test_session_search.py
  3. Check whitespace and Windows footguns for this diff:
    git diff --check
    python3 scripts/check-windows-footguns.py --diff origin/main

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS Darwin arm64

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

For New Skills

N/A

Screenshots / Logs

Focused validation passed:

scripts/run_tests.sh tests/gateway/test_duplicate_reply_suppression.py tests/tools/test_session_search.py
63 passed

Additional checks passed:

python3 -m py_compile gateway/run.py tools/session_search_tool.py tests/gateway/test_duplicate_reply_suppression.py tests/tools/test_session_search.py
git diff --check
python3 scripts/check-windows-footguns.py --diff origin/main

Changed files

  • gateway/run.py (modified, +27/-0)
  • tests/gateway/test_duplicate_reply_suppression.py (modified, +46/-0)
  • tests/tools/test_session_search.py (modified, +43/-0)
  • tools/session_search_tool.py (modified, +17/-7)

PR #23214: fix(gateway): surface incomplete tool turns

Description (problem / solution / changelog)

What does this PR do?

Fixes a gateway silent-stop class where a turn can become inactive after tool activity without delivering a final user-visible response. The gateway now turns incomplete, non-interrupted tool-tail results into an explicit failure message so the user is not left with active_agents=0 and no reply.

It also makes session_search summarization bridge timeouts degrade to raw transcript previews instead of returning a failed tool result. That keeps recall best-effort and lets the main agent continue when auxiliary summarization is slow or unavailable.

Related Issue

Related: https://github.com/NousResearch/hermes-agent/issues/23081

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 🔒 Security fix
  • 📝 Documentation update
  • ✅ Tests (adding or improving test coverage)
  • ♻️ Refactor (no behavior change)
  • 🎯 New skill (bundled or hub)

Changes Made

  • gateway/run.py: normalize incomplete, non-interrupted tool-tail turns into a visible gateway response.
  • tools/session_search_tool.py: return successful degraded raw-preview results when summarization bridge execution times out.
  • tests/gateway/test_duplicate_reply_suppression.py: cover incomplete tool-tail response normalization and preserve interrupted stop flow behavior.
  • tests/tools/test_session_search.py: cover session_search timeout degradation to raw previews.

How to Test

  1. Run focused gateway/session-search regression tests:
    scripts/run_tests.sh tests/gateway/test_duplicate_reply_suppression.py tests/tools/test_session_search.py
  2. Compile changed Python files:
    python3 -m py_compile gateway/run.py tools/session_search_tool.py tests/gateway/test_duplicate_reply_suppression.py tests/tools/test_session_search.py
  3. Check whitespace and Windows footguns for this diff:
    git diff --check
    python3 scripts/check-windows-footguns.py --diff origin/main

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits (fix(scope):, feat(scope):, etc.)
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix/feature (no unrelated commits)
  • I've run pytest tests/ -q and all tests pass
  • I've added tests for my changes (required for bug fixes, strongly encouraged for features)
  • I've tested on my platform: macOS Darwin arm64

Documentation & Housekeeping

  • I've updated relevant documentation (README, docs/, docstrings) — or N/A
  • I've updated cli-config.yaml.example if I added/changed config keys — or N/A
  • I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows — or N/A
  • I've considered cross-platform impact (Windows, macOS) per the compatibility guide — or N/A
  • I've updated tool descriptions/schemas if I changed tool behavior — or N/A

For New Skills

N/A

Screenshots / Logs

Focused validation passed:

scripts/run_tests.sh tests/gateway/test_duplicate_reply_suppression.py tests/tools/test_session_search.py
63 passed

Additional checks passed:

python3 -m py_compile gateway/run.py tools/session_search_tool.py tests/gateway/test_duplicate_reply_suppression.py tests/tools/test_session_search.py
git diff --check
python3 scripts/check-windows-footguns.py --diff origin/main

Changed files

  • gateway/run.py (modified, +27/-0)
  • tests/gateway/test_duplicate_reply_suppression.py (modified, +46/-0)
  • tests/tools/test_session_search.py (modified, +43/-0)
  • tools/session_search_tool.py (modified, +17/-7)

Code Example

2026-05-10 05:09:53 inbound msg='Olá'
2026-05-10 05:09:57 response ready ... time=3.8s api_calls=1

---

2026-05-10 05:10:26 inbound msg='Quero que você corrija/finalize/teste/valide o plugin+engine+documentação do con...'

---

gateway_state=running
active_agents=0
telegram=connected
session_id=20260510_045801_bc737063
updated_at=2026-05-10T05:10:26.576363
last_prompt_tokens=18435

---

WARNING agent.auxiliary_client: Auxiliary session_search: connection error on auto and no fallback available (tried: openrouter, nous, local/custom, api-key)
WARNING root: Session summarization failed after 3 attempts: Codex auxiliary Responses stream exceeded 30.0s total timeout
Traceback (most recent call last):
  File ".../tools/session_search_tool.py", line 228, in _summarize_session
    response = await async_call_llm(
  File ".../agent/auxiliary_client.py", line 4018, in async_call_llm
    await client.chat.completions.create(**kwargs), task)
  File ".../agent/auxiliary_client.py", line 853, in create
    return await asyncio.to_thread(self._sync.create, **kwargs)
  File ".../asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
TimeoutError: Codex auxiliary Responses stream exceeded 30.0s total timeout

---

WARNING gateway.platforms.base: [Telegram] Cancelled task for agent:main:telegram:dm:378478304 did not exit within 5s; unblocking dispatch and letting the task unwind in the background

---

assistant: Vou fazer isso de ponta a ponta sem restart inicialmente...
tool_calls: todo(...)
tool: todo -> 7 tasks
tool_calls: skill_view(...), search_files(...)
tool: skill_view context-replace-live-debugging.md
tool: skill_view context-replace-customization.md
tool: search_files(...)
tool_calls: read_file(...), read_file(...), search_files(...), search_files(...)
RAW_BUFFERClick to expand / collapse

Summary

A Telegram gateway turn can appear to stall/stop after session_search auxiliary timeouts or after initial tool calls. The gateway stays healthy (running, Telegram connected), but the active chat has no response, active_agents returns to 0, and no tool/model subprocess is running.

Observed locally on Hermes Agent v0.13.0 (2026.5.7), commit 498bfc7, running from source on macOS.

Environment

  • Hermes Agent: v0.13.0 (2026.5.7)
  • Commit: 498bfc7
  • OS: macOS Darwin arm64
  • Gateway mode: hermes gateway run --replace --accept-hooks under LaunchAgent
  • Platform: Telegram polling
  • Model/provider: openai-codex/gpt-5.5
  • Relevant config:
    • agent.max_turns: 1000
    • model.context_length: 272000
    • compression.threshold: 0.65

What happened

After a gateway restart, a short sanity message worked:

2026-05-10 05:09:53 inbound msg='Olá'
2026-05-10 05:09:57 response ready ... time=3.8s api_calls=1

Then a larger task was sent:

2026-05-10 05:10:26 inbound msg='Quero que você corrija/finalize/teste/valide o plugin+engine+documentação do con...'

After that, the gateway remained healthy but the turn did not continue to completion:

gateway_state=running
active_agents=0
telegram=connected
session_id=20260510_045801_bc737063
updated_at=2026-05-10T05:10:26.576363
last_prompt_tokens=18435

No relevant codex, acpx, execute_code, terminal, or tool subprocess was still running.

Earlier related symptom in same session

Before the restart, asking for context triggered session_search repeatedly and it timed out:

WARNING agent.auxiliary_client: Auxiliary session_search: connection error on auto and no fallback available (tried: openrouter, nous, local/custom, api-key)
WARNING root: Session summarization failed after 3 attempts: Codex auxiliary Responses stream exceeded 30.0s total timeout
Traceback (most recent call last):
  File ".../tools/session_search_tool.py", line 228, in _summarize_session
    response = await async_call_llm(
  File ".../agent/auxiliary_client.py", line 4018, in async_call_llm
    await client.chat.completions.create(**kwargs), task)
  File ".../agent/auxiliary_client.py", line 853, in create
    return await asyncio.to_thread(self._sync.create, **kwargs)
  File ".../asyncio/threads.py", line 25, in to_thread
    return await loop.run_in_executor(None, func_call)
TimeoutError: Codex auxiliary Responses stream exceeded 30.0s total timeout

There was also a visible warning after cancelling/stopping a previous turn:

WARNING gateway.platforms.base: [Telegram] Cancelled task for agent:main:telegram:dm:378478304 did not exit within 5s; unblocking dispatch and letting the task unwind in the background

Session transcript evidence

The JSON session store showed that the agent began the larger task, loaded skills, created a todo, and started reading files/searching. Then it stopped without completing/reporting, while active_agents was already 0:

assistant: Vou fazer isso de ponta a ponta sem restart inicialmente...
tool_calls: todo(...)
tool: todo -> 7 tasks
tool_calls: skill_view(...), search_files(...)
tool: skill_view context-replace-live-debugging.md
tool: skill_view context-replace-customization.md
tool: search_files(...)
tool_calls: read_file(...), read_file(...), search_files(...), search_files(...)

The compact JSONL transcript lagged behind the richer session_*.json store: the .jsonl only contained the earlier short exchange and did not include the larger task's tool sequence, while session_*.json did. This may or may not be related.

Expected behavior

One of these should happen reliably:

  1. The turn continues until response/tool failure is delivered to the user; or
  2. A failed auxiliary session_search returns a bounded tool error to the model and the agent can continue without search summaries; or
  3. The gateway marks the run failed/cancelled visibly instead of leaving the user with no response while active_agents=0.

Actual behavior

The user sees no final answer. Operational state suggests no work is running:

  • active_agents=0
  • no tool/model subprocess
  • gateway and Telegram healthy
  • session updated at inbound time, then no visible completion

Possible root cause area

Likely around one or more of:

  • tools/session_search_tool.py handling of auxiliary timeout/fallback
  • cancellation/unwind path after Cancelled task ... did not exit within 5s
  • gateway turn lifecycle marking run inactive before a terminal user-visible response/failure is persisted
  • mismatch between session_*.json and .jsonl persistence when a turn stops mid-tool sequence

Privacy

Logs above are redacted to omit message bodies beyond short non-sensitive excerpts and local user identifiers.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

FAQ

Expected behavior

One of these should happen reliably:

  1. The turn continues until response/tool failure is delivered to the user; or
  2. A failed auxiliary session_search returns a bounded tool error to the model and the agent can continue without search summaries; or
  3. The gateway marks the run failed/cancelled visibly instead of leaving the user with no response while active_agents=0.

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

hermes - ✅(Solved) Fix Gateway turn can silently stop after session_search timeout / initial tools while active_agents=0 [2 pull requests, 1 comments, 1 participants]