hermes - 💡(How to fix) Fix [Resilience] REASONING_SCRATCHPAD false positive detection triggers retry loop [2 comments, 2 participants]

hermes · 2026-04-17 13:26:34

GitHub stats

NousResearch/hermes-agent#11620 • Fetched 2026-04-18 05:59:48

Author: MuBeiGe

Participants: gongli0929, MuBeiGe

Timeline (top): commented ×2

Problem

The has_incomplete_scratchpad check in trajectory.py uses a simple string-contains check for <REASONING_SCRATCHPAD>. This causes false positives in several scenarios:

Context contamination: When the conversation context contains code examples, error logs, or tool outputs that include the literal string <REASONING_SCRATCHPAD> (e.g., from searching source code), the model may reference this text in its response, triggering the detector.
Retry loop: After the first false positive triggers a retry, the context does not change, so the model produces the same output, triggering another false positive. After verification_gate_retries attempts, the entire response is dropped.
No XML structure validation: The check only looks for the opening tag, not whether it is actually an incomplete XML block at the root level of the response.
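
The retry loop described above can be sketched as follows. This is a minimal illustration, not the code in trajectory.py: `naive_has_incomplete_scratchpad`, `run_with_retries`, and `quoting_model` are hypothetical names, the substring check is assumed from the issue's description, and the default of 3 retries is an arbitrary placeholder for `verification_gate_retries`.

```python
OPEN_TAG = "<REASONING_SCRATCHPAD>"


def naive_has_incomplete_scratchpad(text: str) -> bool:
    # Assumed stand-in for the current check: any mention of the opening tag is flagged.
    return OPEN_TAG in text


def run_with_retries(generate, verification_gate_retries: int = 3):
    """Hypothetical retry gate: return the first response that passes, else None."""
    for _ in range(verification_gate_retries):
        response = generate()
        if not naive_has_incomplete_scratchpad(response):
            return response
    return None  # every attempt was flagged, so the response is dropped


def quoting_model() -> str:
    # Deterministic stand-in for a model that quotes the tag from its context.
    return "The detector searches for `<REASONING_SCRATCHPAD>` in trajectory.py."


# The context never changes, so every attempt yields the same flagged text.
result = run_with_retries(quoting_model)
print(result)  # None
```

Because the stand-in model returns the same text on every call, each retry is flagged again and the response is eventually dropped, which mirrors the behavior reported in the issue.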

Suggested Fix

Strip code blocks before checking, preventing false positives from quoted source code:

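import re

def has_incomplete_scratchpad(text: str) -> bool:
    # Drop fenced code blocks first so quoted source code is ignored.
    cleaned = re.sub(r'```.*?```', '', text, flags=re.DOTALL)
    # Then drop inline code spans.
    cleaned = re.sub(r'`[^`]*`', '', cleaned)
    # Incomplete means more opening tags than closing tags remain.
    opens = len(re.findall(r'<REASONING_SCRATCHPAD>', cleaned))
    closes = len(re.findall(r'</REASONING_SCRATCHPAD>', cleaned))
    return opens > closes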
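
To see how the suggested fix behaves, the sketch below runs it against four sample responses. The sample strings are made up for illustration; the fence marker is built with `"`" * 3` only to avoid nesting fences inside this example.

```python
import re


def has_incomplete_scratchpad(text: str) -> bool:
    # Same logic as the suggested fix above.
    cleaned = re.sub(r"```.*?```", "", text, flags=re.DOTALL)
    cleaned = re.sub(r"`[^`]*`", "", cleaned)
    opens = len(re.findall(r"<REASONING_SCRATCHPAD>", cleaned))
    closes = len(re.findall(r"</REASONING_SCRATCHPAD>", cleaned))
    return opens > closes


FENCE = "`" * 3  # three backticks, built at runtime

# Illustrative inputs (not taken from the issue).
quoted_inline = "The check looks for `<REASONING_SCRATCHPAD>` in the text."
quoted_fenced = f"Source:\n{FENCE}python\nTAG = '<REASONING_SCRATCHPAD>'\n{FENCE}\nDone."
truncated = "<REASONING_SCRATCHPAD>\nplanning the next step"
complete = "<REASONING_SCRATCHPAD>plan</REASONING_SCRATCHPAD>\nFinal answer."

for name, sample in [
    ("quoted_inline", quoted_inline),
    ("quoted_fenced", quoted_fenced),
    ("truncated", truncated),
    ("complete", complete),
]:
    print(name, has_incomplete_scratchpad(sample))
```

Only the truncated sample is reported as incomplete; the two quoted samples pass because the tag sits inside code that is stripped before counting.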

Environment

Hermes Agent (latest main)
Providers affected: zai (glm-5.1), MiniMax M2.7

Impact

Medium - causes intermittent response failures when context contains the literal string.

Extent analysis

TL;DR

Replace the substring check in has_incomplete_scratchpad with one that strips fenced and inline code and then compares the counts of opening and closing <REASONING_SCRATCHPAD> tags. This removes false positives caused by quoted tags and, with them, the retry loop that ends in a dropped response.

Guidance

Modify the has_incomplete_scratchpad function to strip code blocks and compare opening and closing tag counts, as suggested, to reduce false positives.
Verify the effectiveness of the fix by testing scenarios that previously triggered false positives, such as conversations with code examples or error logs containing the literal string <REASONING_SCRATCHPAD>.
Consider adding additional logging or monitoring to track the frequency of false positives before and after implementing the fix to measure its impact.
Review the verification_gate_retries mechanism to ensure it is appropriately configured to handle legitimate retries without dropping responses unnecessarily.
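
One way to track how often the detector fires, before and after the fix, is a thin wrapper around it. This is a sketch under assumptions: `checked`, `detection_counts`, `naive_detector`, and the logger name `scratchpad_gate` are hypothetical and do not come from Hermes Agent.

```python
import logging
from collections import Counter
from typing import Callable

logger = logging.getLogger("scratchpad_gate")  # hypothetical logger name
detection_counts: Counter = Counter()  # keyed by (detector label, flagged)


def checked(detector: Callable[[str], bool], text: str, label: str) -> bool:
    """Run a detector and record whether it flagged the response."""
    flagged = detector(text)
    detection_counts[(label, flagged)] += 1
    if flagged:
        logger.warning("%s flagged a response (%d chars)", label, len(text))
    return flagged


def naive_detector(text: str) -> bool:
    # Assumed stand-in for the substring check described in the issue.
    return "<REASONING_SCRATCHPAD>" in text


sample = "The gate searches for `<REASONING_SCRATCHPAD>` in each response."
checked(naive_detector, sample, "naive")
checked(naive_detector, "A plain answer.", "naive")
print(dict(detection_counts))
```

Passing the fixed detector through the same wrapper under a different label would allow a side-by-side comparison of flag rates on identical inputs.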

Example

The provided code snippet in the issue body demonstrates a potential solution:

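import re

def has_incomplete_scratchpad(text: str) -> bool:
    # Drop fenced code blocks first so quoted source code is ignored.
    cleaned = re.sub(r'```.*?```', '', text, flags=re.DOTALL)
    # Then drop inline code spans.
    cleaned = re.sub(r'`[^`]*`', '', cleaned)
    # Incomplete means more opening tags than closing tags remain.
    opens = len(re.findall(r'<REASONING_SCRATCHPAD>', cleaned))
    closes = len(re.findall(r'</REASONING_SCRATCHPAD>', cleaned))
    return opens > closes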

This example strips fenced and inline code, then counts opening and closing tags; a scratchpad is treated as incomplete when opening tags outnumber closing tags.

Notes

The suggested fix assumes that the <REASONING_SCRATCHPAD> tag is always properly formatted and that the issue is primarily caused by context contamination and lack of XML structure validation. Additional testing may be necessary to ensure the fix does not introduce new issues or affect other parts of the system.
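
Counting tags outside code spans still flags a response that mentions the opening tag in plain prose without backticks. A stricter, hypothetical variant is sketched below: it treats a scratchpad as incomplete only when the response begins with the opening tag and never closes it. The function name and the reading of "root level" as "start of the response" are assumptions, not part of the issue's proposal.

```python
OPEN_TAG = "<REASONING_SCRATCHPAD>"
CLOSE_TAG = "</REASONING_SCRATCHPAD>"


def has_incomplete_root_scratchpad(text: str) -> bool:
    # Hypothetical stricter variant: only a response that begins with the
    # opening tag (ignoring leading whitespace) and never closes it counts.
    stripped = text.lstrip()
    return stripped.startswith(OPEN_TAG) and CLOSE_TAG not in stripped


# Illustrative inputs (not taken from the issue).
prose_mention = "The parser looks for <REASONING_SCRATCHPAD> tags in the output."
truncated = "  <REASONING_SCRATCHPAD>\nplanning the next step"
complete = "<REASONING_SCRATCHPAD>plan</REASONING_SCRATCHPAD>\nFinal answer."

print(has_incomplete_root_scratchpad(prose_mention))  # False
print(has_incomplete_root_scratchpad(truncated))      # True
print(has_incomplete_root_scratchpad(complete))       # False
```

The trade-off is that this variant would miss an unclosed scratchpad that follows any preamble text, so it would need to be tested against real model outputs before being adopted.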

Recommendation

Apply the suggested fix by modifying the has_incomplete_scratchpad function as described. It addresses context contamination directly and, by removing that false positive, avoids the retry loop it triggers. It does not check whether the tag sits at the root level of the response, so a tag mentioned in plain prose outside code formatting can still be flagged.
