hermes - ✅(Solved) Fix [Bug]: scan_skill_commands unconditionally clears _skill_commands before try block, silently loses all skills on scan failure [6 pull requests, 2 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
NousResearch/hermes-agent#18659Fetched 2026-05-03 04:55:07
View on GitHub
Comments
2
Participants
2
Timeline
13
Reactions
0
Timeline (top)
cross-referenced ×7labeled ×4commented ×2

Error Message

agent/skill_commands.py:221-227

global _skill_commands _skill_commands = {} # ← cleared BEFORE try — this is the problem try: from tools.skills_tool import SKILLS_DIR, _parse_frontmatter, ... from agent.skill_utils import get_external_skills_dirs, iter_skill_index_files disabled = _get_disabled_skill_names() ... except Exception: pass # ← exception swallowed, _skill_commands stays empty return _skill_commands

Root Cause

  1. Make the skills directory unreadable: chmod 000 ~/.hermes/skills
  2. Start a new hermes session: hermes
  3. Observe that all skill slash commands are unavailable (e.g. /apple-notes, /claude-code, etc.)
  4. (Optionally restore permissions: chmod 755 ~/.hermes/skills, then run /reload-skills — skills will reappear because reload_skills() snapshots before={} and the rescan now succeeds, showing all 90 skills as "added")

Fix Action

Fixed

PR fix notes

PR #18668: fix(skills): preserve command cache on scan setup failure

Description (problem / solution / changelog)

Summary

  • build skill slash commands in a local map before replacing the global cache
  • keep the previous _skill_commands mapping available when scan setup or directory traversal fails
  • add a regression test for preserving cached commands when scan setup raises

Fixes #18659.

Verification

  • scripts/run_tests.sh tests/agent/test_skill_commands.py::TestScanSkillCommands::test_preserves_cached_commands_when_scan_setup_fails tests/agent/test_skill_commands.py
  • git diff --check origin/main...HEAD

Overlap check

  • Checked open PR #2868, which adds logging around swallowed exceptions but does not remove the unconditional cache clear. This PR keeps the scope to the cache-preservation bug called out in #18659.

Changed files

  • agent/skill_commands.py (modified, +7/-5)
  • tests/agent/test_skill_commands.py (modified, +23/-0)

PR #18669: fix(agent): preserve skill commands on scan failure + log diagnostic warning

Description (problem / solution / changelog)

Closes #18659.

Symptom

Any exception during scan_skill_commands() silently blanks all 90+ skill slash commands (/apple-notes, /claude-code, etc.) with zero user-facing error. The user sees their skills disappear but has no log line to diagnose it. Reporter's reproducer: chmod 000 ~/.hermes/skills; hermes → all /skill-name autocompletions gone.

Root cause

agent/skill_commands.py:222 unconditionally clears the global before entering the try block:

```python _skill_commands = {} # ← cleared FIRST try: from tools.skills_tool import ... ... populate _skill_commands ... except Exception: pass # ← swallowed, global stays {} return _skill_commands ```

Any failure in the imports, outer scan loop, or get_external_skills_dirs() leaves the global at {}, and the except: pass swallows the reason.

Fix

Atomic swap — build into local new_commands, only assign to the global on success. Failure returns the last-known-good mapping.

Diagnostic log — replace except Exception: pass with logger.warning(..., exc_info=True), gated by a module-level _scan_error_logged flag so get_skill_commands() hot-loop callers don't flood the log. First failure emits a full traceback; subsequent failures stay silent until a successful scan resets the flag.

Before / after

ScenarioBeforeAfter
First scan hits unreadable skills dirglobal = {}, all slash commands gone, no logglobal unchanged at initial {}, one WARNING traceback
Scan starts healthy then dir breaks mid-sessionglobal wiped to {}, 90 skills goneglobal retains last-known-good 90 entries, one WARNING traceback
Perms restored + /reload-skillsshows 90 as "added" from empty before-snapshot (misleading diff)shows correct no-op diff when state was preserved
Persistent failure, 1000 calls to get_skill_commands()1000× traceback log flood1× traceback, then silent

Tests

5 new, 10 existing → 15/15 pass on TestScanSkillCommands (40/40 on full file):

  • test_preserves_previous_mapping_on_import_failure
  • test_logs_warning_on_scan_failure
  • test_repeated_scan_failures_do_not_flood_logs
  • test_scan_error_flag_resets_on_success
  • test_successful_scan_still_replaces_mapping

Cross-refs

@alt-glitch noted #18656 (closed, same report) and #2868 (open PR addressing the logging dimension). This PR addresses the core atomic-swap bug plus the diagnostic log dimension together with hot-loop-safe guarding; happy to rebase if #2868's logging approach lands first.

Files

  • agent/skill_commands.py: +17/-3
  • tests/agent/test_skill_commands.py: +132

Changed files

  • agent/skill_commands.py (modified, +25/-4)
  • tests/agent/test_skill_commands.py (modified, +114/-0)

PR #18717: fix(agent): preserve _skill_commands cache on scan failure; cap rglob fallback (#18659, #18675)

Description (problem / solution / changelog)

Summary

Fixes two related bugs in agent/skill_commands.py:

Fix 1 — #18659: Cache cleared on scan failure (P2)

Root cause: scan_skill_commands() set _skill_commands = {} unconditionally before the try block that populated it. Any exception (broken import, unreadable skills dir, etc.) left the global permanently empty with no user-visible error.

Fix: Build results into a local fresh dict inside the try block. On success, atomically replace the global. On failure, log a WARNING and leave the previous cache intact so all previously-registered slash commands keep working.

# Before
_skill_commands = {}    # ← clears BEFORE try — data loss on exception
try:
    ...populate...
except Exception:
    pass                # ← silently swallowed, global stays empty

# After
fresh = {}
try:
    ...populate fresh...
except Exception as exc:
    logger.warning("scan_skill_commands: scan failed, preserving %d cached command(s). Error: %s", ...)
    return _skill_commands   # ← previous cache preserved
_skill_commands = fresh      # ← atomic replace only on success

Fix 2 — #18675: rglob fallback injects node_modules (P2)

Root cause: The _build_skill_message fallback path used subdir.rglob("*") with no directory exclusions or file count cap. Skills with scripts/node_modules/ could inject thousands of file paths into the model context.

Fix: Mirror skill_view()'s intent by:

  1. Skipping any path whose components include common package-manager dirs: node_modules, .venv, venv, __pycache__, .git, .tox, dist, build, .mypy_cache
  2. Capping at _MAX_SUPPORTING_FILES = 50

Tests Added

  • test_preserves_cache_on_total_scan_failure — RuntimeError inside scan leaves cache intact
  • test_preserves_cache_on_import_failure — ImportError inside scan leaves cache intact
  • TestBuildSkillMessageRglobFallback.test_node_modules_excluded_from_fallback_scan — node_modules paths don't appear in skill message
  • TestBuildSkillMessageRglobFallback.test_fallback_caps_at_max_files — 200 reference files → ≤50 listed

Checklist

  • Fixes root cause, not symptom
  • Backward compatible — successful scans behave identically
  • Tests cover both failure modes
  • Logging added for scan failure (warning, not silent)

Changed files

  • agent/skill_commands.py (modified, +48/-9)
  • tests/agent/test_skill_commands.py (modified, +112/-0)

PR #18720: fix: preserve skill command cache on scan failure in scan_skill_commands()

Description (problem / solution / changelog)

Summary

Fixes #18659

scan_skill_commands() unconditionally cleared the module-level _skill_commands dict to {} before entering the try block. When scanning failed (e.g. unreadable directory, broken import chain), the exception was silently swallowed by except Exception: pass, leaving the global empty — all 90+ skill slash commands were lost with zero user-facing error.

Changes

agent/skill_commands.pyscan_skill_commands():

  • Build the new command dict in a local variable (new_commands) instead of writing directly to the global
  • Assign to _skill_commands only on success (via else clause on the try/except)
  • On failure, log a warning (logger.warning("skill scan failed; preserving previous command cache", exc_info=True)) instead of silently discarding the error

tests/agent/test_skill_commands.pyTestScanPreservesCacheOnFailure:

  • test_outer_exception_preserves_existing_commands — verifies previously-cached commands survive a failed scan
  • test_outer_exception_logs_warning — verifies the warning is logged (no more silent failure)
  • test_empty_scan_replaces_cache — verifies successful empty scans still correctly replace the cache (no false preservation)
  • test_reload_skills_preserves_commands_on_scan_failure — verifies reload_skills() diff does not report skills as "removed" when the scan fails

Testing

All 45 existing + 4 new tests pass:

tests/agent/test_skill_commands.py  (39 tests — 35 existing + 4 new)
tests/agent/test_skill_commands_reload.py  (6 tests)

Zero regressions.

Design rationale

The fix follows the pattern suggested in the issue: build locally, assign on success. This is the same atomicity pattern used by reload_skills() (which snapshots before state and diffs against it). The else clause ensures the global is only touched when the scan completes without raising — if any outer exception occurs, the previous cache is preserved intact.

Changed files

  • agent/skill_commands.py (modified, +5/-3)
  • tests/agent/test_skill_commands.py (modified, +81/-0)

PR #18735: fix: resolve 7 identified issues [automated]

Description (problem / solution / changelog)

Summary

This automated PR resolves 7 identified upstream issues focusing on reliability, cross-platform behavior, and security hardening.

Resolved issues

  1. #18722 — cron jobs with next_run_at: null now recover for recurring schedules; scheduler now tolerates non-dict origin values.

    • Files: cron/jobs.py, cron/scheduler.py, tests/cron/test_jobs.py, tests/cron/test_scheduler.py
  2. #18705 — dotenv loading no longer overrides runtime-injected environment variables.

    • Files: hermes_cli/env_loader.py, tests/hermes_cli/test_env_loader.py
  3. #18659scan_skill_commands no longer clears cached commands before a successful rescan.

    • Files: agent/skill_commands.py
  4. #18675 — skill fallback file scan now skips heavy dependency directories and enforces a file cap.

    • Files: agent/skill_commands.py
  5. #18617 — context compressor now synchronizes threshold_percent correctly across model switch, fallback activation, and primary restoration.

    • Files: run_agent.py
  6. #18681 — custom provider /model path now correctly carries provider credentials during model verification path in this branch baseline (already included in upstream branch state; preserved in final branch history).

    • Files: gateway/run.py (resolved in branch baseline)
  7. #18707 — request debug dumps are now redacted before writing to disk/stdout to avoid plaintext secret leakage.

    • Files: run_agent.py

Validation

  • python3 -m py_compile run_agent.py cron/jobs.py cron/scheduler.py hermes_cli/env_loader.py agent/skill_commands.py gateway/run.py
  • pytest -n 0 tests/hermes_cli/test_env_loader.py tests/gateway/test_model_command_custom_providers.py tests/cron/test_jobs.py::TestGetDueJobs::test_broken_cron_without_next_run_is_recovered tests/cron/test_scheduler.py::TestResolveOrigin::test_non_dict_origin_tolerated tests/agent/test_skill_commands.py tests/agent/test_skill_commands_reload.py

Changed files

  • Dockerfile (modified, +3/-2)
  • acp_adapter/session.py (modified, +12/-0)
  • agent/auxiliary_client.py (modified, +280/-28)
  • agent/context_compressor.py (modified, +496/-52)
  • agent/skill_commands.py (modified, +18/-4)
  • agent/title_generator.py (modified, +2/-2)
  • agent/transports/chat_completions.py (modified, +14/-0)
  • agent/usage_pricing.py (modified, +4/-0)
  • cli-config.yaml.example (modified, +5/-0)
  • cli.py (modified, +27/-3)
  • cron/jobs.py (modified, +13/-2)
  • cron/scheduler.py (modified, +14/-4)
  • docker/entrypoint.sh (modified, +9/-1)
  • gateway/channel_directory.py (modified, +14/-4)
  • gateway/platforms/discord.py (modified, +33/-7)
  • gateway/platforms/email.py (modified, +12/-2)
  • gateway/platforms/feishu.py (modified, +34/-1)
  • gateway/platforms/qqbot/adapter.py (modified, +8/-2)
  • gateway/platforms/telegram_network.py (modified, +7/-2)
  • gateway/platforms/weixin.py (modified, +10/-1)
  • gateway/run.py (modified, +129/-32)
  • gateway/status.py (modified, +8/-1)
  • hermes_cli/auth.py (modified, +2/-2)
  • hermes_cli/commands.py (modified, +1/-1)
  • hermes_cli/config.py (modified, +271/-40)
  • hermes_cli/copilot_auth.py (modified, +1/-1)
  • hermes_cli/doctor.py (modified, +6/-1)
  • hermes_cli/env_loader.py (modified, +5/-4)
  • hermes_cli/gateway.py (modified, +16/-13)
  • hermes_cli/main.py (modified, +69/-3)
  • hermes_cli/memory_setup.py (modified, +1/-1)
  • hermes_cli/model_switch.py (modified, +6/-1)
  • hermes_cli/models.py (modified, +60/-2)
  • hermes_cli/profiles.py (modified, +16/-3)
  • hermes_cli/runtime_provider.py (modified, +16/-13)
  • hermes_cli/setup.py (modified, +8/-2)
  • hermes_cli/slack_cli.py (modified, +1/-2)
  • hermes_cli/status.py (modified, +17/-2)
  • hermes_cli/web_server.py (modified, +1/-1)
  • hermes_constants.py (modified, +16/-3)
  • model_tools.py (modified, +44/-13)
  • run_agent.py (modified, +408/-84)
  • setup-hermes.sh (modified, +23/-12)
  • skills/red-teaming/godmode/scripts/load_godmode.py (modified, +9/-8)
  • tests/agent/test_context_compressor.py (modified, +389/-0)
  • tests/agent/transports/test_chat_completions.py (modified, +11/-0)
  • tests/cron/test_jobs.py (modified, +26/-0)
  • tests/cron/test_scheduler.py (modified, +4/-0)
  • tests/gateway/test_compress_command.py (modified, +49/-0)
  • tests/hermes_cli/test_api_key_providers.py (modified, +5/-5)
  • tests/hermes_cli/test_config.py (modified, +17/-0)
  • tests/hermes_cli/test_env_loader.py (modified, +6/-6)
  • tests/run_agent/test_413_compression.py (modified, +81/-1)
  • tests/run_agent/test_compression_boundary_hook.py (modified, +42/-0)
  • tests/run_agent/test_run_agent.py (modified, +100/-13)
  • tests/tools/test_skill_manager_tool.py (modified, +270/-0)
  • tools/approval.py (modified, +1/-1)
  • tools/delegate_tool.py (modified, +4/-1)
  • tools/environments/docker.py (modified, +36/-5)
  • tools/environments/local.py (modified, +8/-1)
  • tools/file_operations.py (modified, +70/-67)
  • tools/file_tools.py (modified, +13/-2)
  • tools/send_message_tool.py (modified, +72/-2)
  • tools/session_search_tool.py (modified, +2/-2)
  • tools/skill_manager_tool.py (modified, +82/-21)
  • tools/skills_tool.py (modified, +13/-1)
  • tools/terminal_tool.py (modified, +6/-0)
  • tools/tool_backend_helpers.py (modified, +15/-5)
  • tools/tts_tool.py (modified, +27/-16)
  • tools/voice_mode.py (modified, +23/-10)
  • toolsets.py (modified, +14/-1)
  • tui_gateway/server.py (modified, +5/-3)
  • ui-tui/src/app/turnController.ts (modified, +1/-1)
  • ui-tui/src/app/useInputHandlers.ts (modified, +8/-3)
  • ui-tui/src/app/useSessionLifecycle.ts (modified, +1/-1)
  • ui-tui/src/gatewayTypes.ts (modified, +1/-0)
  • utils.py (modified, +9/-0)
  • uv.lock (modified, +161/-2)
  • website/docs/reference/environment-variables.md (modified, +1/-1)

PR #2868: fix(skills): log warnings instead of silently swallowing exceptions in skill_commands.py

Description (problem / solution / changelog)

What does this PR do?

Fixes five silent exception blocks in agent/skill_commands.py where bare except Exception: was discarding errors without any logging. Users had no way to diagnose why skills failed to load, scan, or resolve paths.

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)

Changes Made

scan_skill_commands()

  • Inner loop (except Exception: continue): Now logs logger.debug() with the skill path and error before continuing — helps identify broken individual skill files without flooding logs.
  • Outer block (except Exception: pass): Now logs logger.warning() when the entire skill scanning system fails — critical for diagnosing import errors or missing SKILLS_DIR.

_load_skill_payload()

  • Path resolution (except Exception: normalized = raw_identifier): Now logs logger.debug() when path normalization fails.
  • Skill loading (except Exception: return None): Now logs logger.warning() when a skill fails to load — users can see which skill and why.
  • Directory resolution (except Exception: skill_dir = None): Now logs logger.debug() when skill directory path computation fails.

No new imports needed — logger = logging.getLogger(__name__) was already defined at the top of the file.

How to Test

  1. Introduce a syntax error in any ~/.hermes/skills/*/SKILL.md file
  2. Run hermes and invoke a skill slash command
  3. Check logs — you should now see warnings/debug messages identifying the failure
  4. Run tests: python3 -m pytest tests/ -q

Checklist

Code

  • I've read the Contributing Guide
  • My commit messages follow Conventional Commits
  • I searched for existing PRs to make sure this isn't a duplicate
  • My PR contains only changes related to this fix
  • I've tested on my platform: Ubuntu 22.04 (WSL2)

Documentation & Housekeeping

  • No documentation changes needed (logging-only change)
  • No config key changes
  • No architecture changes
  • Cross-platform safe (logging module is stdlib)

Screenshots / Logs

N/A — logging change, no UI impact.

Changed files

  • agent/skill_commands.py (modified, +10/-6)

Code Example

# agent/skill_commands.py:221-227
global _skill_commands
_skill_commands = {}           # ← cleared BEFORE trythis is the problem
try:
    from tools.skills_tool import SKILLS_DIR, _parse_frontmatter, ...
    from agent.skill_utils import get_external_skills_dirs, iter_skill_index_files
    disabled = _get_disabled_skill_names()
    ...
except Exception:
    pass                        # ← exception swallowed, _skill_commands stays empty
return _skill_commands

---

def scan_skill_commands() -> Dict[str, Dict[str, Any]]:
    global _skill_commands
    new_commands: Dict[str, Dict[str, Any]] = {}
    try:
        ...
        # build new_commands instead of _skill_commands throughout
    except Exception:
        pass  # keep old _skill_commands intact; log the error
    else:
        _skill_commands = new_commands
    return _skill_commands
RAW_BUFFERClick to expand / collapse

Bug Description

agent/skill_commands.py:222 unconditionally sets _skill_commands = {} before entering the try block that populates it. If any exception occurs between the clear and the first successful skill addition, all 90+ skill slash commands are silently lost with zero user-facing error.

# agent/skill_commands.py:221-227
global _skill_commands
_skill_commands = {}           # ← cleared BEFORE try — this is the problem
try:
    from tools.skills_tool import SKILLS_DIR, _parse_frontmatter, ...
    from agent.skill_utils import get_external_skills_dirs, iter_skill_index_files
    disabled = _get_disabled_skill_names()
    ...
except Exception:
    pass                        # ← exception swallowed, _skill_commands stays empty
return _skill_commands

The _skill_commands = {} assignment at line 222 executes unconditionally — then the entire population logic lives inside a try/except Exception: pass that silently discards any failure. When the exception path is taken, the global stays empty.

Steps to Reproduce

  1. Make the skills directory unreadable: chmod 000 ~/.hermes/skills
  2. Start a new hermes session: hermes
  3. Observe that all skill slash commands are unavailable (e.g. /apple-notes, /claude-code, etc.)
  4. (Optionally restore permissions: chmod 755 ~/.hermes/skills, then run /reload-skills — skills will reappear because reload_skills() snapshots before={} and the rescan now succeeds, showing all 90 skills as "added")

Alternatively, insert a temporary raise ImportError between lines 222 and 223 to simulate a broken import chain, then trigger /reload-skills.

Expected Behavior

If scanning fails, the previously cached _skill_commands should be preserved. The user should still be able to use all previously loaded skill slash commands. An error should be surfaced (to the log, and ideally to the user) so the failure doesn't go unnoticed.

Actual Behavior

  • _skill_commands is unconditionally cleared to {} before the scan attempt.
  • If scanning fails, the exception is silently swallowed (except Exception: pass).
  • The user sees No new skills detected. 📚 0 skill(s) available or an alarming "90 skills removed" diff, with no indication that an error occurred.
  • All skill slash commands become unavailable.

Affected Component

agent/skill_commands.pyscan_skill_commands() function (line 221-277), and by extension reload_skills() (line 287-349) and get_skill_commands() (line 280-284).

Also affected: cli.py:1868 (module-level _skill_commands = scan_skill_commands()), cli.py:6490 (/reload-skills handler), gateway/run.py:4890 (gateway /reload-skills handler), tui_gateway/server.py:4008 (TUI skill command listing).

Debug Report

N/A — no crash or stack trace. The bug manifests as silent data loss due to the except Exception: pass on line 275-276. Log inspection at ~/.hermes/logs/agent.log would show no error.

Proposed Fix

Build the new dict in a local variable, assign to the global only on success:

def scan_skill_commands() -> Dict[str, Dict[str, Any]]:
    global _skill_commands
    new_commands: Dict[str, Dict[str, Any]] = {}
    try:
        ...
        # build new_commands instead of _skill_commands throughout
    except Exception:
        pass  # keep old _skill_commands intact; log the error
    else:
        _skill_commands = new_commands
    return _skill_commands

Operating System

macOS

extent analysis

TL;DR

The issue can be fixed by building the new skill commands dictionary in a local variable and assigning it to the global variable only on successful population.

Guidance

  • Identify the scan_skill_commands function in agent/skill_commands.py and modify it to use a local variable new_commands to build the skill commands dictionary.
  • Move the population logic inside the try block to populate new_commands instead of the global _skill_commands.
  • Assign new_commands to _skill_commands only in the else block of the try-except statement, ensuring that the global variable is updated only on successful population.
  • Consider logging the error in the except block to notify the user of any issues.

Example

def scan_skill_commands() -> Dict[str, Dict[str, Any]]:
    global _skill_commands
    new_commands: Dict[str, Dict[str, Any]] = {}
    try:
        # build new_commands instead of _skill_commands throughout
        # ...
    except Exception as e:
        # log the error
        print(f"Error scanning skill commands: {e}")
    else:
        _skill_commands = new_commands
    return _skill_commands

Notes

The proposed fix ensures that the global _skill_commands variable is not cleared unconditionally, preventing silent data loss in case of an exception. However, it is still important to handle the exception properly, such as logging the error, to notify the user of any issues.

Recommendation

Apply the proposed fix to build the new skill commands dictionary in a local variable and assign it to the global variable only on successful population. This ensures that the global variable is updated only when the population is successful, preventing silent data loss.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

hermes - ✅(Solved) Fix [Bug]: scan_skill_commands unconditionally clears _skill_commands before try block, silently loses all skills on scan failure [6 pull requests, 2 comments, 2 participants]