hermes - 💡(How to fix) Fix [Feature]: Executable reflexes and confidence scoring for skills [1 participants]

GitHub stats
NousResearch/hermes-agent#17042 · Fetched 2026-04-29 06:37:39
Comments: 0 · Participants: 1 · Timeline: 5 · Reactions: 1
Timeline (top): labeled ×4, subscribed ×1


Feature request: executable reflexes and confidence scoring for Hermes skills

Summary

Hermes already has a strong skill system for preserving reusable procedures across sessions. I would like to propose a next step: evolve skills from static procedural documents into an observable, confidence-aware system that can identify repeated task patterns, suggest executable “reflexes”, and safely route recurring work to deterministic implementations when appropriate.

This is not a request for immediate fully autonomous code generation. The safer initial goal is to add instrumentation, candidate generation, and user-controlled promotion paths so Hermes can learn which skills actually work, which ones fail, and which repeated workflows deserve to become executable helpers.

Motivation

Hermes skills are already one of the most valuable parts of the project: they let the agent persist workflows, environment-specific procedures, and lessons learned. However, most skills today are still loaded as context and interpreted by the model each time. That means recurring task classes still consume model tokens and reasoning time even when the same workflow has succeeded many times before.

A useful next step would be to distinguish between:

  1. Documentation-like skills — procedural guidance for the model.
  2. Verified skills — skills with observable success/failure history.
  3. Executable reflexes — deterministic scripts or wrappers that implement a repeated workflow and can be invoked directly when confidence is high.

This would make Hermes more self-improving while keeping safety and user control at the center.

Goals

A possible staged implementation could look like this:

Stage 1: Skill usage and outcome telemetry

Add first-class insight into whether skills are working, not just whether they were loaded.

Potential metrics:

  • skill view/manage count
  • last used time
  • associated task/session IDs
  • inferred success/failure outcome
  • failure reasons when available
  • tool calls commonly associated with the skill
  • average number of API calls / tool calls after loading the skill

This could extend the existing /insights direction and should ideally avoid a database schema migration if current session/tool call records are sufficient.
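
As a rough illustration of Stage 1, the metrics above could be folded out of existing records in memory, without a new table. This is only a sketch: `SkillEvent`, `SkillStats`, and `summarize_skill_usage` are hypothetical names, and it assumes session/tool-call logs can already be mapped to a skill name, a timestamp, and an inferred outcome.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SkillEvent:
    """Hypothetical record derived from existing session/tool-call logs."""
    skill: str
    session_id: str
    timestamp: float            # seconds since epoch
    succeeded: Optional[bool]   # None when the outcome could not be inferred
    tool_calls: int = 0
    failure_reason: Optional[str] = None


@dataclass
class SkillStats:
    uses: int = 0
    successes: int = 0
    failures: int = 0
    last_used: float = 0.0
    total_tool_calls: int = 0
    session_ids: set = field(default_factory=set)
    failure_reasons: list = field(default_factory=list)

    @property
    def avg_tool_calls(self) -> float:
        return self.total_tool_calls / self.uses if self.uses else 0.0


def summarize_skill_usage(events):
    """Fold a stream of events into per-skill statistics, computed in memory."""
    stats = defaultdict(SkillStats)
    for ev in events:
        s = stats[ev.skill]
        s.uses += 1
        s.last_used = max(s.last_used, ev.timestamp)
        s.total_tool_calls += ev.tool_calls
        s.session_ids.add(ev.session_id)
        if ev.succeeded is True:
            s.successes += 1
        elif ev.succeeded is False:
            s.failures += 1
            if ev.failure_reason:
                s.failure_reasons.append(ev.failure_reason)
    return dict(stats)
```

Events with an unknown outcome still count as a use but affect neither the success nor the failure tally, which keeps "loaded" and "worked" separate.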

Stage 2: Confidence scoring for skills

Introduce a lightweight confidence model for skills, based on observed outcomes.

Example fields:

  • confidence_score
  • success_count
  • failure_count
  • last_success_at
  • last_failure_at
  • known_failure_modes

The score should be advisory at first. It should help users and agents understand which skills are reliable and which ones need maintenance.
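
One possible shape for these fields is sketched below. The field names come from the list above; the scoring formula (a Laplace-smoothed success rate) is an assumption chosen for simplicity, not something the proposal prescribes, and `SkillConfidence` is a hypothetical class.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SkillConfidence:
    """Field names follow the proposal; the scoring formula is an assumption."""
    success_count: int = 0
    failure_count: int = 0
    last_success_at: Optional[float] = None
    last_failure_at: Optional[float] = None
    known_failure_modes: List[str] = field(default_factory=list)

    @property
    def confidence_score(self) -> float:
        # Laplace-smoothed success rate: 0.5 with no data, moving toward
        # the observed rate as evidence accumulates.
        total = self.success_count + self.failure_count
        return (self.success_count + 1) / (total + 2)

    def record(self, succeeded: bool, at: float,
               failure_mode: Optional[str] = None) -> None:
        """Update counters and timestamps after one observed outcome."""
        if succeeded:
            self.success_count += 1
            self.last_success_at = at
        else:
            self.failure_count += 1
            self.last_failure_at = at
            if failure_mode and failure_mode not in self.known_failure_modes:
                self.known_failure_modes.append(failure_mode)
```

Smoothing keeps a skill with a single success from immediately scoring 1.0, which fits the advisory intent. A recency weighting based on `last_success_at` and `last_failure_at` could be layered on later.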

Stage 3: Reflex candidate detection

Detect repeated task patterns where Hermes repeatedly follows a similar sequence successfully.

Possible signals:

  • similar user intent across sessions
  • same skill loaded repeatedly
  • similar tool call sequence
  • similar files/commands involved
  • high success rate
  • low variance in execution steps

Hermes could then suggest:

  • “This workflow appears repeatable. Consider turning it into a script-backed skill.”
  • “This skill has repeated failures. Consider patching the skill.”
  • “This task pattern may be suitable for a deterministic helper.”
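
A deliberately simple sketch of candidate detection follows. It groups sessions by skill and exact tool-call sequence, which is the strictest reading of "low variance in execution steps"; a real detector would likely need fuzzy matching on intent and steps. `Trace` and `find_reflex_candidates` are hypothetical names and the thresholds are arbitrary.

```python
from collections import defaultdict
from typing import NamedTuple, Tuple


class Trace(NamedTuple):
    """Hypothetical per-session summary; not an existing Hermes type."""
    session_id: str
    skill: str
    tool_sequence: Tuple[str, ...]
    succeeded: bool


def find_reflex_candidates(traces, min_runs=3, min_success_rate=0.9):
    """Return (skill, tool_sequence, runs, success_rate) for repeated patterns."""
    groups = defaultdict(list)
    for t in traces:
        # Identical skill + identical tool sequence counts as one pattern.
        groups[(t.skill, t.tool_sequence)].append(t)

    candidates = []
    for (skill, seq), runs in groups.items():
        rate = sum(r.succeeded for r in runs) / len(runs)
        if len(runs) >= min_runs and rate >= min_success_rate:
            candidates.append((skill, seq, len(runs), rate))

    # Most frequently repeated patterns first, then by skill name.
    candidates.sort(key=lambda c: (-c[2], c[0]))
    return candidates
```

The output is only a list of suggestions; nothing in this sketch generates or runs code.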

Stage 4: User-approved executable reflexes

Only after user approval, allow a skill to point to an executable helper such as a Python script, shell-safe wrapper, or tool implementation.

Important safety constraints:

  • never auto-enable generated code without user review
  • default to dry-run or advisory mode for new reflexes
  • sandbox or isolate execution where practical
  • record success/failure after every reflex execution
  • automatically lower confidence on runtime errors
  • fall back to normal LLM reasoning when confidence is low or a reflex fails
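
The constraints above could be combined in a small wrapper like the one below. It is a sketch under stated assumptions: `Reflex`, `execute_with_fallback`, the threshold, and the halving penalty are all hypothetical, `llm_fallback` stands in for the normal agent loop, and sandboxing is left out entirely.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Reflex:
    """Hypothetical wrapper around a user-reviewed helper."""
    name: str
    run: Callable[[str], str]
    approved: bool = False    # set only after explicit user review
    dry_run: bool = True      # new reflexes start in advisory mode
    confidence: float = 0.5


def execute_with_fallback(reflex, task, llm_fallback,
                          threshold=0.8, penalty=0.5, log=None):
    """Return (route, result); every decision is appended to `log`."""
    log = log if log is not None else []

    # Unapproved or low-confidence reflexes never run.
    if not reflex.approved or reflex.confidence < threshold:
        log.append((reflex.name, "skipped"))
        return "llm", llm_fallback(task)

    # Advisory mode: note that the reflex would have run, but do not run it.
    if reflex.dry_run:
        log.append((reflex.name, "dry_run"))
        return "advisory", llm_fallback(task)

    try:
        result = reflex.run(task)
    except Exception as exc:
        # Runtime errors lower confidence and fall back to normal reasoning.
        reflex.confidence *= penalty
        log.append((reflex.name, f"failed: {exc}"))
        return "fallback", llm_fallback(task)

    log.append((reflex.name, "succeeded"))
    return "reflex", result
```

Because `approved` defaults to `False` and `dry_run` to `True`, a newly registered reflex cannot execute until a user changes both.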

Desired behavior

For a recurring task, Hermes could choose among several routes:

  1. REFLEX — high-confidence deterministic helper, no LLM reasoning needed except final explanation if desired.
  2. FAST — exact or near-exact previous solution retrieval.
  3. HYBRID — use previous successful traces or skill context, but still reason with the model.
  4. SLOW — normal exploratory agent loop for novel tasks.

The initial implementation does not need to implement all routing modes. Even just exposing skill outcome metrics and candidate suggestions would be valuable.
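
The four routes can be read as a priority order, sketched below. The route names come from the list above; the `choose_route` function, its inputs, and the thresholds are assumptions for illustration only.

```python
from enum import Enum
from typing import Optional


class Route(Enum):
    REFLEX = "reflex"
    FAST = "fast"
    HYBRID = "hybrid"
    SLOW = "slow"


def choose_route(reflex_confidence: Optional[float],
                 best_match_similarity: float,
                 has_skill_context: bool,
                 reflex_threshold: float = 0.9,
                 exact_threshold: float = 0.95) -> Route:
    """Pick the cheapest route whose precondition holds (hypothetical policy)."""
    # A reflex exists and is trusted enough to run directly.
    if reflex_confidence is not None and reflex_confidence >= reflex_threshold:
        return Route.REFLEX
    # A previous solution matches the task exactly or nearly so.
    if best_match_similarity >= exact_threshold:
        return Route.FAST
    # Some prior traces or skill context exist, but the model still reasons.
    if has_skill_context:
        return Route.HYBRID
    # Novel task: normal exploratory agent loop.
    return Route.SLOW
```

An MVP could ship with only `HYBRID` and `SLOW` reachable and still log which route would have been chosen.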

Why this matters

This would improve Hermes in several ways:

  • reduce repeated token spend on workflows that have become routine
  • make the skill ecosystem measurable rather than anecdotal
  • help users identify stale or failing skills
  • create a safe path from “Markdown procedure” to “tested executable helper”
  • preserve Hermes' current self-improving philosophy while adding feedback loops
  • support future skill marketplaces by making reliability visible

Related and inspiring projects / ideas

NARE — Neuro-Adaptive Reasoning Engine

Repository: https://github.com/starface77/Neuro-Adaptive-Reasoning-Engine

Relevant idea: amortizing repeated reasoning into executable reflexes.

NARE proposes a routing model where repeated reasoning traces can be consolidated into deterministic Python-based “reflexes”. It distinguishes between slower exploratory reasoning and faster local execution for recurring logical patterns. The most relevant concepts for Hermes are:

  • reasoning amortization
  • executable reflexes
  • confidence-gated skill registry
  • fallback to inference when a reflex fails or confidence is low

Hermes does not need to adopt NARE wholesale, but the general idea maps naturally onto Hermes skills.

HeLa-Mem: Hebbian Learning and Associative Memory for LLM Agents

Paper: https://arxiv.org/abs/2604.16839

Relevant idea: memory should capture associations and consolidate repeated patterns, not just retrieve unstructured embeddings.

The paper proposes episodic memory graphs, consolidation, and semantic distillation. For Hermes, this suggests that repeated successful sessions could be distilled into more structured skill knowledge or executable candidates.

Related existing issues

I found a few related discussions that seem complementary rather than exact duplicates:

  • #12981 — Skills System Architecture Redesign: From Static Files to Living Knowledge
  • #10666 — On-demand skill installation during setup + skill quality lifecycle
  • #6191 — Skill Auto-Detection System
  • #625 — Structured Temporal Memory with Confidence-Gated Facts

This proposal focuses specifically on per-skill outcome telemetry, confidence scoring, and a conservative path from repeated successful skill usage to user-approved executable reflexes.

Safety considerations

This feature should be conservative by default:

  • No generated executable code should run automatically without explicit user approval.
  • Reflexes should have a confidence score and a fallback path.
  • Runtime failures should reduce confidence.
  • Users should be able to inspect why a reflex was selected.
  • Users should be able to disable reflex routing globally or per skill.
  • Any executable reflex should respect existing Hermes tool permissions, platform constraints, and approval flows.
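
To make the "inspect" and "disable" points concrete, a gate could return its reason alongside its decision. The sketch below is hypothetical throughout (`ReflexPolicy`, `explain_selection`, and the default threshold are not existing Hermes settings), and it does not model Hermes tool permissions or approval flows.

```python
from dataclasses import dataclass, field


@dataclass
class ReflexPolicy:
    """Hypothetical user-facing settings; reflex routing is off by default."""
    enabled: bool = False
    disabled_skills: set = field(default_factory=set)
    min_confidence: float = 0.9


def explain_selection(policy, skill, approved, confidence):
    """Return (allowed, reason) so the decision can be shown to the user."""
    if not policy.enabled:
        return False, "reflex routing is disabled globally"
    if skill in policy.disabled_skills:
        return False, f"reflex routing is disabled for skill '{skill}'"
    if not approved:
        return False, f"reflex for '{skill}' has not been approved by the user"
    if confidence < policy.min_confidence:
        return False, (f"confidence {confidence:.2f} is below "
                       f"the minimum {policy.min_confidence:.2f}")
    return True, f"approved reflex for '{skill}' with confidence {confidence:.2f}"
```

Returning the reason as plain text means the same string can be logged and surfaced in the UI.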

Possible acceptance criteria for an MVP

A minimal first version could be considered successful if it provides:

  • /insights or equivalent output showing per-skill usage and inferred success/failure stats.
  • A way to identify skills with repeated failures.
  • A way to identify repeated successful task patterns as reflex candidates.
  • No automatic code execution or routing changes by default.
  • A documented design path for future user-approved executable reflexes.

Closing note

Hermes already has the foundation for this: persistent sessions, skill loading, tool call records, memory providers, and a plugin-friendly architecture. This proposal is about adding feedback loops and confidence-aware evolution so skills can gradually move from static instructions toward verified, reusable, and eventually executable capabilities.

extent analysis

TL;DR

Implement a staged approach to evolve Hermes skills into an observable, confidence-aware system by adding skill usage and outcome telemetry, confidence scoring, reflex candidate detection, and user-approved executable reflexes.

Guidance

  • Introduce skill usage and outcome telemetry to gather metrics such as skill view/manage count, last used time, and inferred success/failure outcome.
  • Develop a lightweight confidence model for skills based on observed outcomes, including fields like confidence_score, success_count, and failure_count.
  • Detect repeated task patterns and suggest reflex candidates, considering signals like similar user intent, tool call sequences, and high success rates.
  • Design a user-approved executable reflex system with safety constraints, such as never auto-enabling generated code without user review and defaulting to dry-run or advisory mode for new reflexes.

Example

No code snippet is provided as the issue focuses on proposing a feature and its design, rather than implementing specific code changes.

Notes

The proposed feature requires careful consideration of safety constraints to ensure that generated executable code is handled responsibly and with user oversight. The implementation should prioritize transparency, user control, and fallback mechanisms to maintain the reliability and trustworthiness of the Hermes system.

Recommendation

Apply a staged implementation approach, starting with skill usage and outcome telemetry, followed by confidence scoring, reflex candidate detection, and finally user-approved executable reflexes, to ensure a safe and effective evolution of Hermes skills.
