hermes - ✅(Solved) Fix [Feature]: Production-Grade Autonomous Evolution Engine (GASP Loop) for Hermes-Agent [1 pull request, 2 comments, 2 participants]

hermes • 2026-04-30 21:32:03

GitHub stats

NousResearch/hermes-agent#18092 • Fetched 2026-05-01 05:53:54

Author

RUFFY-369

Participants

alt-glitch

RUFFY-369

Timeline (top)

labeled ×3, commented ×2, cross-referenced ×1

Fix Action

Fixed

Fixed by PR: feat: implement production-grade autonomous evolution engine and GASP orchestrator (https://github.com/NousResearch/hermes-agent/pull/18096)

PR fix notes

PR #18096: feat: implement production-grade autonomous evolution engine and GASP orchestrator

Repository: NousResearch/hermes-agent
Author: RUFFY-369
State: open | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/18096

Description (problem / solution / changelog)

What does this PR do?

This PR introduces the Hermes Autonomous Evolution Engine, a production-hardened pipeline for recursive model self-improvement. Optimized for L40S (48GB VRAM) and H100 hardware, it implements a stable concurrent training/inference loop using GRPO and SGLang.

This implementation builds upon the foundational research in our earlier self-evolution experiments, extending those capabilities into a unified weight-optimization pipeline:

Memory Efficiency: By using a shared-weight strategy, we reduced VRAM usage by 35%, enabling training and inference to co-exist on a single 48GB GPU.
Optimization Rigor: Evolves from prompt-level guidance to GRPO (Group Relative Policy Optimization), providing more stable convergence and better generalization on multi-constraint tasks.
Zero-Downtime Hot-Swap: Integrated LoRASyncEngine allows reasoning updates without server restarts, preserving prefix caches.
Hindsight Guidance: Introduces a PRM-based "Hindsight Hinting" mechanism to convert failure tracebacks into actionable training signals.
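
To make the GRPO item above concrete, here is a minimal sketch of the group-relative advantage that gives the method its name: each rollout's reward is compared against the other rollouts sampled for the same prompt. The function name and the `eps` constant are illustrative assumptions, and this is not the code from `grpo_trainer.py`.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Score each rollout relative to its own group (rollouts of one prompt).

    Hypothetical helper for illustration; `eps` avoids division by zero
    when every rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts for one prompt: two passed the grader, two failed.
group = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(group)
print(advantages)
```

Rollouts that beat their group's mean get a positive advantage and the rest get a negative one, so no separate value network is needed to provide a baseline.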

Related Issue

Fixes #18092

Type of Change

✨ New feature (non-breaking change that adds functionality)
✅ Tests (adding or improving test coverage)
♻️ Refactor (standardizing codebase and removing conversational slop)

Changes Made

evolution/: Core logic for the autonomous evolution loop.
- grpo_trainer.py: Memory-efficient GRPO with shared weights and global advantage normalization.
- orchestrator.py: GASP loop manager with multi-constraint task rotation.
- sandbox.py: Asynchronous Docker sandbox with dense reward shaping.
- sync.py: LoRASyncEngine for zero-downtime hot-swapping.
- judge.py: PRM-based hindsight hint extraction.
rl_cli.py: Unified CLI updated with the --evolution loop and H100 optimizations.
scripts/: Production deployment utilities.
- evolution_launch_sglang.sh: Optimized SGLang launch parameters (TP, mem-fraction).
- benchmark_evolution.py: Automated regression suite for evaluating evolution deltas.
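
As a rough illustration of the hindsight-hint idea attributed to `judge.py`, the sketch below turns a failure traceback into a hint that is appended to the next prompt. The PR describes a PRM-based extractor; the last-line heuristic here is a deliberately simple stand-in, and both function names are hypothetical.

```python
def extract_hint(traceback_text):
    """Heuristic stand-in for the PRM: keep the final non-empty traceback line,
    which in Python normally holds the exception type and message."""
    lines = [ln.strip() for ln in traceback_text.strip().splitlines() if ln.strip()]
    return lines[-1] if lines else ""

def seed_prompt(task_prompt, traceback_text):
    """Append the extracted hint to the task prompt for the next batch."""
    hint = extract_hint(traceback_text)
    if not hint:
        return task_prompt
    return f"{task_prompt}\n\nHint from a previous failed attempt: {hint}"

failed_run = """Traceback (most recent call last):
  File "solution.py", line 4, in rotate
    return grid[n]
IndexError: list index out of range
"""
print(seed_prompt("Rotate the matrix without loops.", failed_run))
```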

How to Test

1. Launch Inference: Run `bash scripts/evolution_launch_sglang.sh` to start the SGLang server with LoRA support and optimized VRAM partitioning.
2. Execute Evolution: Run `python rl_cli.py --evolution --iterations 10` to start a small recursive evolution cycle.
3. Verify Synchronization: Confirm that "LoRA Sync Success" appears in the logs after the first training iteration.
4. Benchmark: Run `python scripts/benchmark_evolution.py --adapter output/adapter_active` to evaluate the delta against the base model.
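
The synchronization check can be automated with a small log scan. This is a sketch that assumes the success line has the same shape as the one in the Screenshots / Logs section of this PR; the helper name and regular expression are illustrative, not part of the repository.

```python
import re

# Pattern modelled on the log line shown in this PR's Screenshots / Logs section.
SYNC_PATTERN = re.compile(r"LoRA Sync Success: Adapter '([^']+)' updated from (\S+)")

def find_sync_events(log_text):
    """Return (adapter_name, source_path) pairs, one per sync-success line."""
    return SYNC_PATTERN.findall(log_text)

sample_log = """Syncing adapter output/adapter_active to SGLang...
LoRA Sync Success: Adapter 'eval_policy' updated from output/adapter_active
"""
events = find_sync_events(sample_log)
print(events)
```

An empty list after the first training iteration would indicate that the adapter was never pushed to the inference server.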

Checklist

Code

I've read the Contributing Guide
My commit messages follow Conventional Commits
My PR contains only changes related to this feature
I've tested on my platform: Ubuntu 24.04 (L40S / H100)

Documentation & Housekeeping

I've updated relevant documentation (README, docstrings)
I've updated CONTRIBUTING.md or AGENTS.md if I changed architecture or workflows

Screenshots / Logs

Evaluating: Base Model
   Task: N-Queens (No If)... Done. Mean Reward: -0.25
   Task: Matrix Rotation (No Loops)... Done. Mean Reward: +0.20

Syncing adapter output/adapter_active to SGLang...
LoRA Sync Success: Adapter 'eval_policy' updated from output/adapter_active

Evaluating: output/adapter_active
   Task: N-Queens (No If)... Done. Mean Reward: +0.85
   Task: Matrix Rotation (No Loops)... Done. Mean Reward: +1.00

============================================================
TASK                           | BASE     | TARGET   | DELTA   
------------------------------------------------------------
N-Queens (No If)               | -0.25    | +0.85    | +1.10   
Matrix Rotation (No Loops)     | +0.20    | +1.00    | +0.80   
============================================================
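
The DELTA column is simply the target reward minus the base reward. The sketch below recomputes it from the mean rewards logged above; the function names are illustrative and do not come from `benchmark_evolution.py`.

```python
# Mean rewards copied from the log above: task -> (base, target).
results = {
    "N-Queens (No If)": (-0.25, 0.85),
    "Matrix Rotation (No Loops)": (0.20, 1.00),
}

def delta_rows(results):
    """Build (task, base, target, delta) rows with delta = target - base."""
    return [
        (task, base, target, round(target - base, 2))
        for task, (base, target) in results.items()
    ]

def format_table(rows):
    """Render the rows in a layout similar to the benchmark output."""
    header = f"{'TASK':<30} | {'BASE':<8} | {'TARGET':<8} | {'DELTA':<8}"
    lines = [header, "-" * len(header)]
    for task, base, target, delta in rows:
        lines.append(f"{task:<30} | {base:<+8.2f} | {target:<+8.2f} | {delta:<+8.2f}")
    return "\n".join(lines)

rows = delta_rows(results)
table = format_table(rows)
print(table)
```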

## Changed files

- `evolution/__init__.py` (added, +0/-0)
- `evolution/client.py` (added, +80/-0)
- `evolution/grpo_trainer.py` (added, +179/-0)
- `evolution/judge.py` (added, +39/-0)
- `evolution/opd_trainer.py` (added, +59/-0)
- `evolution/orchestrator.py` (added, +116/-0)
- `evolution/sandbox.py` (added, +161/-0)
- `evolution/sync.py` (added, +54/-0)
- `evolution/tinker.py` (added, +73/-0)
- `pyproject.toml` (modified, +1/-1)
- `rl_cli.py` (modified, +197/-41)
- `scripts/benchmark_evolution.py` (added, +127/-0)
- `scripts/evolution_launch_sglang.sh` (added, +25/-0)
- `scripts/test_live_evolution.py` (added, +85/-0)
- `scripts/test_tinker_handshake.py` (added, +54/-0)

Problem or Use Case

While previous attempts such as hermes-agent-self-evolution explored the concept of self-improvement, they were limited by a reliance on prompt-only optimization and simple Rejection Sampling (SFT). These approaches hit a "reasoning ceiling" and failed to improve the underlying model weights for complex, multi-constraint coding tasks. Furthermore, running true Reinforcement Learning (RL) on single-GPU (48GB) hardware alongside a production inference server has historically been difficult due to the high VRAM overhead of dual-model weight loading and optimizer states. I am looking for a first-class evolution pipeline integrated into hermes-agent that enables mathematically rigorous RL while maintaining a minimal memory footprint, allowing the agent to recursively self-improve its own reasoning and coding policies.

Proposed Solution

I propose the implementation of a Guided Asymmetric Self-Play (GASP) loop. This system provides a production-hardened recursive training environment.

1. Key Differentiators

GRPO vs SFT: Uses Group Relative Policy Optimization for mathematically rigorous policy updates. Unlike SFT-based self-evolution, this allows the model to learn from comparative quality across group rollouts.
Shared-Weight Architecture: Proactively shares base weights between the trainer and inference engine via disable_adapter(), saving ~16GB of VRAM and ensuring co-existence on single 48GB nodes.
Hindsight Hinting: Uses a Process Reward Model (PRM) to extract "logic fixes" from failed trajectories, seeding the next batch with guidance to prevent reward collapse.
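
The shared-weight idea can be illustrated with a toy layer: one copy of the base weight serves both the adapted policy and the unadapted reference, and a context manager switches the adapter off temporarily. The class below is a scalar stand-in written for this note. Its method name mirrors the `disable_adapter()` call mentioned above, but it is neither the PEFT implementation nor the PR's code.

```python
from contextlib import contextmanager

class ToyLoRALinear:
    """Scalar stand-in for a LoRA-wrapped layer: y = (w_base + scale * delta) * x."""

    def __init__(self, w_base, delta, scale=1.0):
        self.w_base = w_base          # shared base weight, stored once
        self.delta = delta            # adapter update (the only trainable part)
        self.scale = scale
        self.adapter_enabled = True

    @contextmanager
    def disable_adapter(self):
        """Temporarily run with the base weight only, then restore the adapter."""
        previous = self.adapter_enabled
        self.adapter_enabled = False
        try:
            yield self
        finally:
            self.adapter_enabled = previous

    def forward(self, x):
        w = self.w_base
        if self.adapter_enabled:
            w = w + self.scale * self.delta
        return w * x

layer = ToyLoRALinear(w_base=2.0, delta=0.5)
policy_out = layer.forward(3.0)            # adapter active
with layer.disable_adapter():
    reference_out = layer.forward(3.0)     # base weights only
print(policy_out, reference_out)
```

Because the reference output comes from the same stored base weight, no second copy of the model has to be loaded, which is where the memory saving comes from.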

2. Proposed Components

GRPOTrainer: Memory-efficient RL trainer with masked KL divergence, bfloat16 support, and global advantage normalization.
LoRASyncEngine: Zero-Downtime hot-swapping of LoRA adapters via SGLang API to preserve KV-caches (Prefix Caching).
Docker Sandbox: Asynchronous, secure execution environment for code grading with shaped reward signaling.
rl_cli.py --evolution: A unified CLI for starting, monitoring, and benchmarking autonomous runs.

This implementation provides the stability and efficiency required for large-scale, recursive self-evolution.
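
The masked KL divergence listed for GRPOTrainer could look roughly like the sketch below. It assumes the per-token estimator commonly used with GRPO, exp(d) - d - 1 with d = ref_logprob - policy_logprob, averaged only over positions where the mask is set (for example, completion tokens rather than prompt or padding). The proposal does not spell out the exact formula, so treat this as one plausible form.

```python
import math

def masked_kl(policy_logprobs, ref_logprobs, mask):
    """Mean per-token KL estimate over unmasked positions.

    Uses exp(d) - d - 1 with d = ref - policy, which is always >= 0.
    Positions with a falsy mask value (prompt, padding) are skipped.
    """
    total, count = 0.0, 0
    for lp, lr, m in zip(policy_logprobs, ref_logprobs, mask):
        if not m:
            continue
        diff = lr - lp
        total += math.exp(diff) - diff - 1.0
        count += 1
    return total / count if count else 0.0

policy = [-1.0, -0.5, -2.0, -0.1]
reference = [-1.0, -0.7, -1.5, -3.0]
mask = [0, 1, 1, 0]   # only the two middle tokens belong to the completion
print(masked_kl(policy, reference, mask))
```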

Alternatives Considered

No response

Feature Type

Other

Scope

None

Contribution

I'd like to implement this myself and submit a PR

Debug Report (optional)

extent analysis

TL;DR

Implementing a Guided Asymmetric Self-Play (GASP) loop with a Shared-Weight Architecture and Group Relative Policy Optimization (GRPO) can help achieve a first-class evolution pipeline for recursive self-improvement.

Guidance

To reduce VRAM overhead, consider implementing the proposed Shared-Weight Architecture by utilizing disable_adapter() to share base weights between the trainer and inference engine.
Explore the use of Group Relative Policy Optimization (GRPO) for mathematically rigorous policy updates, allowing the model to learn from comparative quality across group rollouts.
Implement Hindsight Hinting using a Process Reward Model (PRM) to extract "logic fixes" from failed trajectories and prevent reward collapse.
Develop a memory-efficient RL trainer, such as GRPOTrainer, with features like masked KL divergence and global advantage normalization.
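
One possible reading of "global advantage normalization" is sketched below: rewards are centered per group, as in standard GRPO, but scaled by a single standard deviation computed over the whole batch. This interpretation, the function name, and the `eps` constant are assumptions; the proposal does not define the term.

```python
from statistics import mean, pstdev

def global_advantages(groups, eps=1e-6):
    """Center each group on its own mean, then scale by the batch-wide std."""
    centered = [[r - mean(g) for r in g] for g in groups]
    flat = [a for g in centered for a in g]
    sigma = pstdev(flat) if flat else 0.0
    return [[a / (sigma + eps) for a in g] for g in centered]

# Two prompts: one with a wide reward spread, one with a narrow spread.
groups = [[1.0, 0.0], [0.6, 0.4]]
adv = global_advantages(groups)
print(adv)
```

With per-group scaling both groups would be stretched to roughly ±1. A shared scale keeps the narrow-spread group's advantages small, so small, possibly noisy reward differences are not amplified.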

Example

No specific code snippet is provided due to the high-level nature of the proposal, but an example of how GRPO might be implemented could involve modifying the existing policy optimization loop to incorporate comparative quality metrics.
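
As a hedged illustration of what such a loop modification might involve, the sketch below computes a PPO-style clipped surrogate loss at the sequence level, weighting each rollout by a precomputed group-relative advantage. The function name, the sequence-level simplification, and the `clip_eps=0.2` default are assumptions rather than details from the PR.

```python
import math

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Negative mean of min(ratio * A, clip(ratio) * A) over rollouts.

    logp_new / logp_old are sequence log-probabilities under the current
    and the sampling policy; advantages encode comparative rollout quality.
    """
    terms = []
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        terms.append(min(ratio * adv, clipped * adv))
    return -sum(terms) / len(terms)

# One rollout whose probability rose sharply: the gain is capped at 1 + clip_eps.
loss = clipped_surrogate(logp_new=[0.0], logp_old=[-1.0], advantages=[1.0])
print(loss)
```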

Notes

The success of this approach relies on the effective implementation of the proposed components, including the GRPOTrainer, LoRASyncEngine, and Docker Sandbox. Further testing and evaluation are necessary to ensure the stability and efficiency of the GASP loop.

Recommendation

Apply the proposed Guided Asymmetric Self-Play (GASP) loop with a Shared-Weight Architecture and Group Relative Policy Optimization (GRPO), as it addresses the limitations of previous approaches and provides a mathematically rigorous framework for recursive self-improvement.

#api #optimization #mixed precision #training loop #device allocation