hermes - ✅(Solved) Fix [Bug]: Webhook gateway — skill auto-loader returns 'Failed to load skill' stub instead of None, silently dropping user prompt after extended uptime [1 pull requests, 3 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
NousResearch/hermes-agent#17283Fetched 2026-04-30 06:48:40
View on GitHub
Comments
3
Participants
2
Timeline
13
Reactions
0
Author
Timeline (top)
labeled ×5commented ×3closed ×1cross-referenced ×1

In a long-running gateway process (hermes gateway run for ~7+ hours), webhook routes configured with skills: start producing a 42-character stub ([Failed to load skill: <name>]) that silently overwrites the user's webhook payload. Reviewer agents receive the stub instead of the real prompt, lose all context, and either give up or fabricate responses from memory. The gateway logs no warning. Restarting the gateway cures the issue immediately.

Error Message

agent/skill_commands.py:_load_skill_payload

try: loaded_skill = ... except Exception as e: logger.warning("_load_skill_payload failed: %s", e, exc_info=True) return None

if not loaded_skill.get("success"): logger.warning("_load_skill_payload skill_view returned success=False: %r", loaded_skill) return None

Root Cause

Suspected root cause (not verified)

Fix Action

Fix / Workaround

Module-level cache in agent/skill_commands.py (or one of its callees in the dispatch chain) accumulates state across long-running aiohttp request handler invocations and eventually breaks _load_skill_payload. The bare except Exception: return None at lines 147-148 swallows the underlying exception silently.

Project-side mitigation already deployed

PR fix notes

PR #17330: fix(skills): return None instead of truthy stub when skill load fails (#17283)

Description (problem / solution / changelog)

Summary

Fixes #17283 — webhook gateway skill auto-loader returns truthy stub string instead of None on load failure, silently overwriting the user's prompt.

Root Cause

build_skill_invocation_message() returned "[Failed to load skill: <name>]" when _load_skill_payload() failed. The webhook handler at gateway/platforms/webhook.py:401-405 checks if skill_content: and substitutes the prompt — a truthy stub passes that check, replacing the real user payload with a 42-character stub.

Additionally, _load_skill_payload() caught all exceptions with a bare except Exception: return None and logged nothing, making the failure invisible in gateway logs.

Fix

  1. Return None from build_skill_invocation_message() when payload loading fails (consistent with the unknown-command path that already returns None)
  2. Add logger.warning() calls in _load_skill_payload() for both exception and success=false paths so failures are visible in logs
  3. Regression test verifying None is returned (not a truthy string) when payload load fails

Why this works

The webhook caller already handles None correctly — if skill_content: is False for None, so the original user prompt is preserved. The fix aligns the failure path with what callers expect.

Testing

$ python3 -m pytest tests/agent/test_skill_commands.py -xvs -o 'addopts='
======================= 36 passed in 1.58s =======================

All 36 existing tests pass + 1 new regression test.

Changed files

  • agent/skill_commands.py (modified, +4/-1)
  • tests/agent/test_skill_commands.py (modified, +19/-0)

Code Example

| Timestamp UTC                  | Reviewer1  | Reviewer2  | Reviewer3  |
|--------------------------------|------------|------------|------------|
| 2026-04-28 09:35:36.713        | len=42     | len=42     | len=42     |
| 2026-04-28 10:23:10.451        | len=42     | len=42     | len=42     |
| 2026-04-28 11:36:30.558        | len=42     | len=42     | len=42     |
| 2026-04-29 01:04:34.315        | len=42     | len=42     | len=42     |
| 2026-04-29 03:26:49.629        | len=42     | len=42     | len=42     |
| 2026-04-29 04:13:37.353        | len=42     | len=42     | len=42     |

---

cd ~/.hermes/hermes-agent && python3 -c "
from tools.skills_tool import skill_view
import json
result = json.loads(skill_view('multi-model-review'))
print(result.get('success'))    # → True
print(len(result.get('content')))  # → 17172
"

---

# agent/skill_commands.py:_load_skill_payload
try:
    loaded_skill = ...
except Exception as e:
    logger.warning("_load_skill_payload failed: %s", e, exc_info=True)
    return None

if not loaded_skill.get("success"):
    logger.warning("_load_skill_payload skill_view returned success=False: %r", loaded_skill)
    return None

---

# agent/skill_commands.py:425-427
if skill_content is None:
    return None  # Don't return truthy stub that overwrites user_instruction
RAW_BUFFERClick to expand / collapse

Summary

In a long-running gateway process (hermes gateway run for ~7+ hours), webhook routes configured with skills: start producing a 42-character stub ([Failed to load skill: <name>]) that silently overwrites the user's webhook payload. Reviewer agents receive the stub instead of the real prompt, lose all context, and either give up or fabricate responses from memory. The gateway logs no warning. Restarting the gateway cures the issue immediately.

Reproduction

  1. Configure a webhook route in ~/.hermes/profiles/<profile>/config.yaml with skills: [<any-skill>] and prompt: '{message}'
  2. Start the gateway: hermes gateway run --profile <profile> (or via systemd)
  3. Run for 24+ hours (or possibly less — ~7h empirical onset)
  4. POST a webhook → observe inbound message becomes [Failed to load skill: <name>] instead of the user's payload
  5. Restart gateway → next POST succeeds, real payload arrives, skill loads normally

Source-code path (Hermes-agent v0.11.0)

  1. gateway/platforms/webhook.py:401-405 — calls build_skill_invocation_message(cmd_key, user_instruction=prompt) synchronously inside aiohttp request handler
  2. agent/skill_commands.py:425-427 — when _load_skill_payload returns None, returns truthy stub string '[Failed to load skill: <name>]' (exactly 42 chars for skill name multi-model-review, mathematically verified)
  3. webhook.py:404-405if skill_content: accepts stub as success → prompt = skill_content overwrites user's payload
  4. agent/skill_commands.py:_load_skill_payload :147-148 — bare except Exception: return None swallows all exceptions, NO log line
  5. agent/skill_commands.py:_load_skill_payload :150-151if not loaded_skill.get("success"): return NoneNO log line

The outer webhook.py:412 except Exception as e: logger.warning("[webhook] Skill loading failed: %s", e) would fire if build_skill_invocation_message raised — it never does (verified via grep across all gateway log files: zero Skill loading failed warnings).

Empirical evidence

Lockstep failure across 3 independent reviewer profiles

| Timestamp UTC                  | Reviewer1  | Reviewer2  | Reviewer3  |
|--------------------------------|------------|------------|------------|
| 2026-04-28 09:35:36.713        | len=42     | len=42     | len=42     |
| 2026-04-28 10:23:10.451        | len=42     | len=42     | len=42     |
| 2026-04-28 11:36:30.558        | len=42     | len=42     | len=42     |
| 2026-04-29 01:04:34.315        | len=42     | len=42     | len=42     |
| 2026-04-29 03:26:49.629        | len=42     | len=42     | len=42     |
| 2026-04-29 04:13:37.353        | len=42     | len=42     | len=42     |

Identical millisecond timestamps + 6/6 identical events across 3 separate Python processes — kills hypotheses involving per-process state divergence (cache races, SQLite contention, profile session resume).

Restart-recovery correlation

  • 2026-04-28 01:55:17 — reviewer1 last successful delivery (prompt_len=21493), uptime ~25h since prior restart
  • 2026-04-28 09:35:36 — reviewer1 first failure (~7.5h additional uptime later)
  • All subsequent webhooks fail until...
  • 2026-04-29 04:25:18 — reviewer1 SIGTERM/restart (manual, via systemctl --user restart hermes-gateway-reviewer1.service)
  • 2026-04-29 04:25:57 — reviewer1 succeeds (prompt_len=23899), 39 seconds post-restart
  • 2026-04-29 04:30:06 — reviewer1 succeeds again post-second-restart

Process state at investigation

ProfileUptimeRSSFD count
reviewer22d 1h 26m315 MB< 32
reviewer32d 1h 26m302 MB< 32
reviewer14 min (just restarted)84 MB< 32
main gateway2d 1h 26m441 MB< 32

No OOM, no FD exhaustion. Memory growth ~150 → 441 MB on main gateway across 2 days (within reasonable bounds).

Direct CLI test proves filesystem + skill content are NOT at fault

cd ~/.hermes/hermes-agent && python3 -c "
from tools.skills_tool import skill_view
import json
result = json.loads(skill_view('multi-model-review'))
print(result.get('success'))    # → True
print(len(result.get('content')))  # → 17172
"

Works perfectly from a fresh Python process while the long-running gateway returns the stub. Confirms the regression is runtime-deferred state inside the long-running aiohttp request handler context, not file-level or skill-content-level.

Self-recovery within session

Reviewer log 2026-04-29 01:04:48 (after the webhook handler injected the failure stub):

"The skill loaded successfully. It appears the initial 'Failed to load skill' message was a transient error. The multi-model-review skill is now available..."

The model recovers via in-conversation skill_view tool call AFTER the webhook handler failed to inject content. Two calls in the same Python process produce divergent results — confirms deferred-state corruption, not file-level.

Suspected root cause (not verified)

Module-level cache in agent/skill_commands.py (or one of its callees in the dispatch chain) accumulates state across long-running aiohttp request handler invocations and eventually breaks _load_skill_payload. The bare except Exception: return None at lines 147-148 swallows the underlying exception silently.

Suggested upstream fix (minimum)

Add diagnostic logging to surface the silent failure even before root cause is identified:

# agent/skill_commands.py:_load_skill_payload
try:
    loaded_skill = ...
except Exception as e:
    logger.warning("_load_skill_payload failed: %s", e, exc_info=True)
    return None

if not loaded_skill.get("success"):
    logger.warning("_load_skill_payload skill_view returned success=False: %r", loaded_skill)
    return None

And in the caller (build_skill_invocation_message):

# agent/skill_commands.py:425-427
if skill_content is None:
    return None  # Don't return truthy stub that overwrites user_instruction

With the caller returning None, webhook.py:404 if skill_content: would correctly fall through to user's prompt. Truthy-stub-as-success is the silent failure mode that masks the underlying bug.

Project-side mitigation already deployed

  • Track B (S131 commit 7e39ced): strict CUID validation at API boundary (/api/agent/multi-model-review) — rejects fabricated IDs from reviewers that hallucinate from holographic memory after losing webhook context. Prevents orphan AgentLearning rows.
  • Track A (S131): scripts/refire-panel-review.ts for manual recovery of stuck rows.
  • Q1 stall monitor (S131 commit pending): /api/panel-review-health endpoint + Hermes cron */15min — alerts via Telegram when multi_model_pending rows exceed 30 min.

These mitigate symptoms but not the root cause. Upstream fix is the only durable path.

Environment

  • Hermes version: 0.11.0
  • Python: 3.11.14
  • OS: Linux (long-running production server, systemd-managed)
  • Project: motherfish-ai-bot (XAUUSD trading agent, 3-reviewer panel pattern)
  • Reviewer profiles: 3 isolated systemd services (hermes-gateway-reviewer{1,2,3}.service)
  • Models per profile: kimi-k2.5/bailian (R1), glm-5/bailian (R2), glm-5.1/z.ai (R3)

Cross-reference

  • Project's S131 session memory (full investigation): .claude/agent-memory/opus-4-7/sessions/131-track-d-skill-load-failure.md
  • Project's Hermes troubleshooting doc: hermes-doc.md § Skill Auto-Load Failure: prompt_len=42 (S131)
  • Track B implementation: src/app/api/agent/multi-model-review/route.ts:240-303 (commit 7e39ced)
  • Track A recovery script: scripts/refire-panel-review.ts

extent analysis

TL;DR

The most likely fix for the issue is to add diagnostic logging to surface the silent failure in the _load_skill_payload function and modify the build_skill_invocation_message function to return None instead of a truthy stub when the skill content is not loaded.

Guidance

  • Add diagnostic logging to the _load_skill_payload function to log any exceptions that occur during skill loading.
  • Modify the build_skill_invocation_message function to return None instead of a truthy stub when the skill content is not loaded.
  • Verify that the changes fix the issue by running the gateway for an extended period and checking that the webhook payload is not overwritten with the stub.
  • Consider adding additional logging or monitoring to detect and alert on similar issues in the future.

Example

# agent/skill_commands.py:_load_skill_payload
try:
    loaded_skill = ...
except Exception as e:
    logger.warning("_load_skill_payload failed: %s", e, exc_info=True)
    return None

if not loaded_skill.get("success"):
    logger.warning("_load_skill_payload skill_view returned success=False: %r", loaded_skill)
    return None

Notes

The changes suggested are based on the information provided in the issue and may not be the only possible solution. Additional debugging and testing may be necessary to fully resolve the issue.

Recommendation

Apply the suggested workaround by adding diagnostic logging and modifying the build_skill_invocation_message function to return None instead of a truthy stub. This should help to surface the underlying issue and prevent the webhook payload from being overwritten with the stub.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

hermes - ✅(Solved) Fix [Bug]: Webhook gateway — skill auto-loader returns 'Failed to load skill' stub instead of None, silently dropping user prompt after extended uptime [1 pull requests, 3 comments, 2 participants]