hermes - ✅(Solved) Fix [Bug]: Webhook gateway — skill auto-loader returns 'Failed to load skill' stub instead of None, silently dropping user prompt after extended uptime [1 pull requests, 3 comments, 2 participants]

hermes2026-04-29 05:20:34

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

NousResearch/hermes-agent#17283•Fetched 2026-04-30 06:48:40

View on GitHub

Comments

Participants

Timeline

Reactions

Author

DUMPDUMPY

Participants

alt-glitch

DUMPDUMPY

Timeline (top)

labeled ×5commented ×3closed ×1cross-referenced ×1

In a long-running gateway process (hermes gateway run for ~7+ hours), webhook routes configured with skills: start producing a 42-character stub ([Failed to load skill: <name>]) that silently overwrites the user's webhook payload. Reviewer agents receive the stub instead of the real prompt, lose all context, and either give up or fabricate responses from memory. The gateway logs no warning. Restarting the gateway cures the issue immediately.

Error Message

agent/skill_commands.py:_load_skill_payload

try: loaded_skill = ... except Exception as e: logger.warning("_load_skill_payload failed: %s", e, exc_info=True) return None

if not loaded_skill.get("success"): logger.warning("_load_skill_payload skill_view returned success=False: %r", loaded_skill) return None

Root Cause

Suspected root cause (not verified)

Fix Action

Fix / Workaround

Module-level cache in agent/skill_commands.py (or one of its callees in the dispatch chain) accumulates state across long-running aiohttp request handler invocations and eventually breaks _load_skill_payload. The bare except Exception: return None at lines 147-148 swallows the underlying exception silently.

Project-side mitigation already deployed

PR fix notes

PR #17330: fix(skills): return None instead of truthy stub when skill load fails (#17283)

Repository: NousResearch/hermes-agent
Author: Bartok9
State: open | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/17330

Description (problem / solution / changelog)

Summary

Fixes #17283 — webhook gateway skill auto-loader returns truthy stub string instead of None on load failure, silently overwriting the user's prompt.

Root Cause

build_skill_invocation_message() returned "[Failed to load skill: <name>]" when _load_skill_payload() failed. The webhook handler at gateway/platforms/webhook.py:401-405 checks if skill_content: and substitutes the prompt — a truthy stub passes that check, replacing the real user payload with a 42-character stub.

Additionally, _load_skill_payload() caught all exceptions with a bare except Exception: return None and logged nothing, making the failure invisible in gateway logs.

Fix

Return None from build_skill_invocation_message() when payload loading fails (consistent with the unknown-command path that already returns None)
Add logger.warning() calls in _load_skill_payload() for both exception and success=false paths so failures are visible in logs
Regression test verifying None is returned (not a truthy string) when payload load fails

Why this works

The webhook caller already handles None correctly — if skill_content: is False for None, so the original user prompt is preserved. The fix aligns the failure path with what callers expect.

Testing

$ python3 -m pytest tests/agent/test_skill_commands.py -xvs -o 'addopts='
======================= 36 passed in 1.58s =======================

All 36 existing tests pass + 1 new regression test.

Changed files

agent/skill_commands.py (modified, +4/-1)
tests/agent/test_skill_commands.py (modified, +19/-0)

Code Example

| Timestamp UTC                  | Reviewer1  | Reviewer2  | Reviewer3  |
|--------------------------------|------------|------------|------------|
| 2026-04-28 09:35:36.713        | len=42     | len=42     | len=42     |
| 2026-04-28 10:23:10.451        | len=42     | len=42     | len=42     |
| 2026-04-28 11:36:30.558        | len=42     | len=42     | len=42     |
| 2026-04-29 01:04:34.315        | len=42     | len=42     | len=42     |
| 2026-04-29 03:26:49.629        | len=42     | len=42     | len=42     |
| 2026-04-29 04:13:37.353        | len=42     | len=42     | len=42     |

---

cd ~/.hermes/hermes-agent && python3 -c "
from tools.skills_tool import skill_view
import json
result = json.loads(skill_view('multi-model-review'))
print(result.get('success'))    # → True
print(len(result.get('content')))  # → 17172
"

---

# agent/skill_commands.py:_load_skill_payload
try:
    loaded_skill = ...
except Exception as e:
    logger.warning("_load_skill_payload failed: %s", e, exc_info=True)
    return None

if not loaded_skill.get("success"):
    logger.warning("_load_skill_payload skill_view returned success=False: %r", loaded_skill)
    return None

---

# agent/skill_commands.py:425-427
if skill_content is None:
    return None  # Don't return truthy stub that overwrites user_instruction

RAW_BUFFERClick to expand / collapse

Summary

Reproduction

Configure a webhook route in ~/.hermes/profiles/<profile>/config.yaml with skills: [<any-skill>] and prompt: '{message}'
Start the gateway: hermes gateway run --profile <profile> (or via systemd)
Run for 24+ hours (or possibly less — ~7h empirical onset)
POST a webhook → observe inbound message becomes [Failed to load skill: <name>] instead of the user's payload
Restart gateway → next POST succeeds, real payload arrives, skill loads normally

Source-code path (Hermes-agent v0.11.0)

gateway/platforms/webhook.py:401-405 — calls build_skill_invocation_message(cmd_key, user_instruction=prompt) synchronously inside aiohttp request handler
agent/skill_commands.py:425-427 — when _load_skill_payload returns None, returns truthy stub string '[Failed to load skill: <name>]' (exactly 42 chars for skill name multi-model-review, mathematically verified)
webhook.py:404-405 — if skill_content: accepts stub as success → prompt = skill_content overwrites user's payload
agent/skill_commands.py:_load_skill_payload :147-148 — bare except Exception: return None swallows all exceptions, NO log line
agent/skill_commands.py:_load_skill_payload :150-151 — if not loaded_skill.get("success"): return None — NO log line

The outer webhook.py:412 except Exception as e: logger.warning("[webhook] Skill loading failed: %s", e) would fire if build_skill_invocation_message raised — it never does (verified via grep across all gateway log files: zero Skill loading failed warnings).

Empirical evidence

Lockstep failure across 3 independent reviewer profiles

| Timestamp UTC                  | Reviewer1  | Reviewer2  | Reviewer3  |
|--------------------------------|------------|------------|------------|
| 2026-04-28 09:35:36.713        | len=42     | len=42     | len=42     |
| 2026-04-28 10:23:10.451        | len=42     | len=42     | len=42     |
| 2026-04-28 11:36:30.558        | len=42     | len=42     | len=42     |
| 2026-04-29 01:04:34.315        | len=42     | len=42     | len=42     |
| 2026-04-29 03:26:49.629        | len=42     | len=42     | len=42     |
| 2026-04-29 04:13:37.353        | len=42     | len=42     | len=42     |

Identical millisecond timestamps + 6/6 identical events across 3 separate Python processes — kills hypotheses involving per-process state divergence (cache races, SQLite contention, profile session resume).

Restart-recovery correlation

2026-04-28 01:55:17 — reviewer1 last successful delivery (prompt_len=21493), uptime ~25h since prior restart
2026-04-28 09:35:36 — reviewer1 first failure (~7.5h additional uptime later)
All subsequent webhooks fail until...
2026-04-29 04:25:18 — reviewer1 SIGTERM/restart (manual, via systemctl --user restart hermes-gateway-reviewer1.service)
2026-04-29 04:25:57 — reviewer1 succeeds (prompt_len=23899), 39 seconds post-restart
2026-04-29 04:30:06 — reviewer1 succeeds again post-second-restart

Process state at investigation

Profile	Uptime	RSS	FD count
reviewer2	2d 1h 26m	315 MB	< 32
reviewer3	2d 1h 26m	302 MB	< 32
reviewer1	4 min (just restarted)	84 MB	< 32
main gateway	2d 1h 26m	441 MB	< 32

No OOM, no FD exhaustion. Memory growth ~150 → 441 MB on main gateway across 2 days (within reasonable bounds).

Direct CLI test proves filesystem + skill content are NOT at fault

cd ~/.hermes/hermes-agent && python3 -c "
from tools.skills_tool import skill_view
import json
result = json.loads(skill_view('multi-model-review'))
print(result.get('success'))    # → True
print(len(result.get('content')))  # → 17172
"

Works perfectly from a fresh Python process while the long-running gateway returns the stub. Confirms the regression is runtime-deferred state inside the long-running aiohttp request handler context, not file-level or skill-content-level.

Self-recovery within session

Reviewer log 2026-04-29 01:04:48 (after the webhook handler injected the failure stub):

"The skill loaded successfully. It appears the initial 'Failed to load skill' message was a transient error. The multi-model-review skill is now available..."

The model recovers via in-conversation skill_view tool call AFTER the webhook handler failed to inject content. Two calls in the same Python process produce divergent results — confirms deferred-state corruption, not file-level.

Suspected root cause (not verified)

Suggested upstream fix (minimum)

Add diagnostic logging to surface the silent failure even before root cause is identified:

# agent/skill_commands.py:_load_skill_payload
try:
    loaded_skill = ...
except Exception as e:
    logger.warning("_load_skill_payload failed: %s", e, exc_info=True)
    return None

if not loaded_skill.get("success"):
    logger.warning("_load_skill_payload skill_view returned success=False: %r", loaded_skill)
    return None

And in the caller (build_skill_invocation_message):

# agent/skill_commands.py:425-427
if skill_content is None:
    return None  # Don't return truthy stub that overwrites user_instruction

With the caller returning None, webhook.py:404 if skill_content: would correctly fall through to user's prompt. Truthy-stub-as-success is the silent failure mode that masks the underlying bug.

Project-side mitigation already deployed

Track B (S131 commit 7e39ced): strict CUID validation at API boundary (/api/agent/multi-model-review) — rejects fabricated IDs from reviewers that hallucinate from holographic memory after losing webhook context. Prevents orphan AgentLearning rows.
Track A (S131): scripts/refire-panel-review.ts for manual recovery of stuck rows.
Q1 stall monitor (S131 commit pending): /api/panel-review-health endpoint + Hermes cron */15min — alerts via Telegram when multi_model_pending rows exceed 30 min.

These mitigate symptoms but not the root cause. Upstream fix is the only durable path.

Environment

Hermes version: 0.11.0
Python: 3.11.14
OS: Linux (long-running production server, systemd-managed)
Project: motherfish-ai-bot (XAUUSD trading agent, 3-reviewer panel pattern)
Reviewer profiles: 3 isolated systemd services (hermes-gateway-reviewer{1,2,3}.service)
Models per profile: kimi-k2.5/bailian (R1), glm-5/bailian (R2), glm-5.1/z.ai (R3)

Cross-reference

Project's S131 session memory (full investigation): .claude/agent-memory/opus-4-7/sessions/131-track-d-skill-load-failure.md
Project's Hermes troubleshooting doc: hermes-doc.md § Skill Auto-Load Failure: prompt_len=42 (S131)
Track B implementation: src/app/api/agent/multi-model-review/route.ts:240-303 (commit 7e39ced)
Track A recovery script: scripts/refire-panel-review.ts

extent analysis

TL;DR

The most likely fix for the issue is to add diagnostic logging to surface the silent failure in the _load_skill_payload function and modify the build_skill_invocation_message function to return None instead of a truthy stub when the skill content is not loaded.

Guidance

Add diagnostic logging to the _load_skill_payload function to log any exceptions that occur during skill loading.
Modify the build_skill_invocation_message function to return None instead of a truthy stub when the skill content is not loaded.
Verify that the changes fix the issue by running the gateway for an extended period and checking that the webhook payload is not overwritten with the stub.
Consider adding additional logging or monitoring to detect and alert on similar issues in the future.

Example

# agent/skill_commands.py:_load_skill_payload
try:
    loaded_skill = ...
except Exception as e:
    logger.warning("_load_skill_payload failed: %s", e, exc_info=True)
    return None

if not loaded_skill.get("success"):
    logger.warning("_load_skill_payload skill_view returned success=False: %r", loaded_skill)
    return None

Notes

The changes suggested are based on the information provided in the issue and may not be the only possible solution. Additional debugging and testing may be necessary to fully resolve the issue.

Recommendation

Apply the suggested workaround by adding diagnostic logging and modifying the build_skill_invocation_message function to return None instead of a truthy stub. This should help to surface the underlying issue and prevent the webhook payload from being overwritten with the stub.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #tokenizer error #prompt formatting #chain error #conversation history

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

hermes - ✅(Solved) Fix [Bug]: Webhook gateway — skill auto-loader returns 'Failed to load skill' stub instead of None, silently dropping user prompt after extended uptime [1 pull requests, 3 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

agent/skill_commands.py:_load_skill_payload

Root Cause

Suspected root cause (not verified)

Fix Action

Fix / Workaround

Project-side mitigation already deployed

PR fix notes

PR #17330: fix(skills): return None instead of truthy stub when skill load fails (#17283)

Description (problem / solution / changelog)

Summary

Root Cause

Fix

Why this works

Testing

Changed files

Code Example

Summary

Reproduction

Source-code path (Hermes-agent v0.11.0)

Empirical evidence

Lockstep failure across 3 independent reviewer profiles

Restart-recovery correlation

Process state at investigation

Direct CLI test proves filesystem + skill content are NOT at fault

Self-recovery within session

Suspected root cause (not verified)

Suggested upstream fix (minimum)

Project-side mitigation already deployed

Environment

Cross-reference

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

RELATED_DISCOVERY

TRENDING