hermes - 💡(How to fix) Fix Bundled red-team skill descriptions are sent to every fallback provider. OpenAI Cyber Abuse warning triggered.

hermes2026-06-05 17:47:45

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

Code Example

grep -E 'jailbreak|GODMODE|abliterate' ~/.hermes/.skills_prompt_snapshot.json

RAW_BUFFERClick to expand / collapse

Bundled red-team skill descriptions are sent to every fallback provider

Problem

The skill manifest (.skills_prompt_snapshot.json, built by agent/prompt_builder.py) is injected into the system prompt on every turn, so it goes to whatever provider serves that turn — including third-party APIs reached via fallback.

Two default-seeded builtins carry content-moderation trigger phrases in their description: frontmatter:

Skill	Path	Description
`godmode`	`skills/red-teaming/godmode`	`Jailbreak LLMs: Parseltongue, GODMODE, ULTRAPLINIAN.`
`obliteratus`	`skills/mlops/inference/obliteratus`	`OBLITERATUS: abliterate LLM refusals (diff-in-means).`

When the agent falls back to a provider like OpenAI, those phrases ship in the system prompt. Moderation classifiers score on input, so "Jailbreak LLMs" / "abliterate LLM refusals" can read as policy-violating content with no skill ever invoked.

Why it matters

This ships by default. Any install with a moderated third-party API anywhere in its fallback chain sends these phrases to that API on every fallback turn — no user action, no skill use. The realistic downside is an account warning or suspension on the fallback provider.

Reproduce

grep -E 'jailbreak|GODMODE|abliterate' ~/.hermes/.skills_prompt_snapshot.json

Returns matches on a default install — confirming the phrases are in the system prompt sent to whichever provider handles the turn.

Possible fixes (your call)

Make godmode / obliteratus opt-in instead of default-seeded. Narrowest, and questions whether red-team tooling should auto-install into every profile at all.
Reword the descriptions.
Strip or rewrite flagged descriptions when the target provider is a moderated third-party API. Covers future descriptions too, but most invasive.

Happy to PR whichever you prefer.

Note

This surfaced from a real OpenAI "Cyber Abuse" warning on an account using OpenAI subscription fallback. OpenAI later reversed it as incorrectly issued, and I can't 100% prove these phrases were the cause though it's the only reasonable option. This OpenAI Business subscription has only been used as a Hermes fallback. This fallback instance was the first time my OpenAI had been used in the past 3 days (confirmed). It also only used 3% of my 5 hour usage, 1% weekly. There was not many logs to comb through, and confidence is high here; However, this is still filed as a latent exposure, not a confirmed incident. The phrases shipping to a moderated API/OAuth by default is the issue regardless of whether they triggered that particular warning.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering