hermes - ✅(Solved) Fix Compressed sessions with corrupted tool_calls.arguments JSON brick chats with HTTP 400 (invalid_tool_call_format) [2 pull requests, 1 comments, 2 participants]

hermes2026-04-24 16:17:57

ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

GitHub issue URL

Helpful · Quick feedback

GitHub stats

NousResearch/hermes-agent#15236•Fetched 2026-04-25 06:23:32

View on GitHub

Comments

Participants

Timeline

Reactions

Author

luyao618

Participants

alt-glitch

luyao618

Timeline (top)

labeled ×6referenced ×3cross-referenced ×2closed ×1

After context compression splits a long-running session, the newly-created child session can contain assistant messages whose tool_calls[*].function.arguments field is a string that is not valid JSON (typically truncated mid-string). Every subsequent API call replays this poisoned history, and strict-validating providers (Copilot endpoint, https://api.githubcopilot.com) reject the entire request with:

HTTP 400: {'error': {'message': 'Invalid JSON format in tool call arguments', 'code': 'invalid_tool_call_format'}}

The session is now permanently broken — every inbound user message produces the same 400, and the gateway falls back to a canned error reply (⚠️ Non-retryable error (HTTP 400) — trying fallback...). The only recovery is manual: stop the gateway, quarantine the session JSONL, delete the session-store mapping, and restart. I have hit this 4 times in 24 hours on the same Feishu DM session.

Error Message

Non-retryable client error: Error code: 400 - {'error': {'message': 'Invalid JSON format in tool call arguments', 'code': 'invalid_tool_call_format'}}

Root Cause

Root cause (analysis)

Fix Action

Fixed

Fixed by PR: fix(run_agent): repair corrupted tool_call arguments before sending to provider (https://github.com/NousResearch/hermes-agent/pull/15241)
Fixed by PR: fix(run_agent): persist tool_call argument repairs into session history (https://github.com/NousResearch/hermes-agent/pull/15348)

PR fix notes

PR #15241: fix(run_agent): repair corrupted tool_call arguments before sending to provider

Repository: NousResearch/hermes-agent
Author: luyao618
State: closed | merged: False
Link: https://github.com/NousResearch/hermes-agent/pull/15241

Description (problem / solution / changelog)

Summary

Fixes #15236.

When a session is split by context compression mid-tool-call, an assistant message can end up with truncated/invalid JSON in tool_calls[*].function.arguments. On the next turn this is replayed verbatim and the provider rejects the entire request with HTTP 400 invalid_tool_call_format, bricking the conversation in a loop that cannot recover without manual session quarantine.

In real usage this hit the same DM 4 times in a single day; each recurrence required stopping the gateway, quarantining session files, scrubbing sessions.json, and restarting — clearly a symptom-only fix.

Root cause

run_agent.py :: AIAgent.run_conversation() builds api_messages from the in-memory transcript and forwards tool_calls[*].function.arguments straight to client.chat.completions.create(). Strict providers (Copilot, OpenAI, etc.) parse those arguments as JSON and 400 the entire request when any of them is malformed. There is no defensive validation between the in-memory state and the wire.

Fix

Adds a defensive sanitizer that runs immediately before the provider call:

Iterates assistant messages with tool_calls
json.loads-validates each function.arguments
Replaces invalid / empty / None arguments with "{}"
Injects a synthetic role="tool" response (or prepends a marker to the existing matching one) so tool_call_id pairing stays valid for strict APIs
Logs each repair with session_id / message_index / tool_call_id / function / preview for observability
Returns repair count; an INFO-level summary line is logged when > 0

This is defense in depth — corruption can originate from compression splits, manual edits, plugin bugs, or partial streaming writes. Sanitizing at the single send chokepoint catches all sources without invasive changes.

Tests

Adds tests/run_agent/test_tool_call_args_sanitizer.py with 7 unit tests:

truncated JSON repaired and synthetic tool response injected
existing matching tool response gets marker prepended (no duplicate insertion)
multiple bad calls in one assistant message
valid JSON left untouched
None arguments normalized to "{}" (silent — not corruption)
empty string arguments normalized to "{}" (silent — not corruption)
non-assistant messages and non-dict entries ignored

pytest tests/run_agent/test_tool_call_args_sanitizer.py -v
  → 7 passed, 0 failed

pytest tests/run_agent/ -q --ignore=tests/integration --ignore=tests/e2e
  → 989 passed, 7 skipped, 0 failed

Risk

Low — pure additive code on the request egress path:

No change to compression, session storage, or message construction
Idempotent: re-running on already-clean messages is a no-op
Strict equivalence for valid input (only mutates when JSON parse fails or value is None/empty)
No new dependencies

Files

run_agent.py — +127 / -0
tests/run_agent/test_tool_call_args_sanitizer.py — +157 / -0 (new)

Changed files

run_agent.py (modified, +127/-0)
tests/run_agent/test_tool_call_args_sanitizer.py (added, +157/-0)

PR #15348: fix(run_agent): persist tool_call argument repairs into session history

Repository: NousResearch/hermes-agent
Author: teknium1
State: closed | merged: True
Link: https://github.com/NousResearch/hermes-agent/pull/15348

Description (problem / solution / changelog)

Salvage of #15241 by @luyao618 — cherry-picked unchanged onto current main.

Summary

Sanitizes assistant tool_calls[*].function.arguments in-place on messages (not just the ephemeral api_messages copy) right before each request. Malformed JSON args get replaced with "{}", a synthetic role="tool" response is injected for tool_call_id pairing, and the repair persists into session history so the next turn doesn't re-send the same bad state.

Why this isn't redundant with the existing sanitizer

run_agent.py already calls _repair_tool_call_arguments() at ~line 9468 but only on api_messages (the per-request copy). The in-memory messages keep the corruption, so next turn rebuilds api_messages with the same bad args and the loop repeats until manual reset. This PR fixes the persistence gap — the 400-loop root cause for #15236.

Changes

run_agent.py: new AIAgent._sanitize_tool_call_arguments() static method + invocation in run_conversation()
tests/run_agent/test_tool_call_args_sanitizer.py: 7 unit tests

Validation

tests/run_agent/test_tool_call_args_sanitizer.py → 7/7 passed

Closes #15236.

Changed files

run_agent.py (modified, +127/-0)
tests/run_agent/test_tool_call_args_sanitizer.py (added, +157/-0)

Code Example

HTTP 400: {'error': {'message': 'Invalid JSON format in tool call arguments', 'code': 'invalid_tool_call_format'}}

---

Session split detected: <parent_sid> → <child_sid> (compression)

---

Non-retryable client error: Error code: 400 - {'error': {'message': 'Invalid JSON format in tool call arguments', 'code': 'invalid_tool_call_format'}}

RAW_BUFFERClick to expand / collapse

Summary

HTTP 400: {'error': {'message': 'Invalid JSON format in tool call arguments', 'code': 'invalid_tool_call_format'}}

Reproduction

Run the gateway with a long-lived chat (DM via any platform).
Let the session grow until compression triggers a split (you will see in gateway.log):
```
Session split detected: <parent_sid> → <child_sid> (compression)
```
Send another message. If the compression happened to truncate a tool_calls[*].function.arguments JSON string in the new session's history, the next provider call returns:
```
Non-retryable client error: Error code: 400 - {'error': {'message': 'Invalid JSON format in tool call arguments', 'code': 'invalid_tool_call_format'}}
```
Every subsequent message in the same chat replays the bad history → same 400 forever.

Root cause (analysis)

The compression / session-split path can produce assistant messages where tool_calls[*].function.arguments is invalid JSON (the spec requires it to be a serialized JSON string). Nothing on the gateway / run_agent send-path validates this invariant before forwarding to the provider. Strict providers (Copilot) reject the whole request; lax providers might silently swallow it, but the data is still corrupt.

This is the "poison session" pattern — a single bad assistant message persists in the JSONL and bricks the chat indefinitely until manual surgery.

Proposed fix

Add a defensive JSON-arguments validator on the message-send path (run_agent.py / agent loop, before client.chat.completions.create):

Iterate the outbound messages; for every assistant message with tool_calls, run json.loads() on each tool_calls[*].function.arguments.
If any fail to parse, take a remediation action — preferred: log a WARNING with session id + tool name + offending message index, replace the offending arguments with "{}" and append a synthetic tool message indicating the call was dropped due to corruption, so the conversation can continue. Optional secondary action: emit a metric / structured warning so operators see this happening.
Continue with the (now valid) request instead of letting the provider 400 the whole conversation.

This stops the bleeding regardless of which upstream code path produces the bad JSON. A separate follow-up can then investigate why compression occasionally truncates arguments in the first place.

Environment

macOS (Darwin 25.4.0, Apple Silicon), Python 3.11
Provider: Copilot, model claude-opus-4.7, endpoint https://api.githubcopilot.com
Platform: Feishu DM gateway
Reproduced 4 times in <24 hours (same chat, fresh session each time after recovery)

extent analysis

TL;DR

Add a JSON-arguments validator on the message-send path to prevent invalid JSON from being sent to providers.

Guidance

Implement a defensive validation check in run_agent.py to ensure tool_calls[*].function.arguments is valid JSON before sending the request to the provider.
If invalid JSON is found, replace the offending arguments with a valid JSON string (e.g., "{}") and append a synthetic tool message to indicate the call was dropped due to corruption.
Consider emitting a metric or structured warning to alert operators of the issue.
Verify the fix by reproducing the scenario and checking that the conversation continues without errors.

Example

import json

# ...

for message in outbound_messages:
    if message['type'] == 'assistant' and 'tool_calls' in message:
        for tool_call in message['tool_calls']:
            try:
                json.loads(tool_call['function']['arguments'])
            except json.JSONDecodeError:
                # Replace offending arguments with valid JSON and append synthetic message
                tool_call['function']['arguments'] = '{}'
                message['tool_calls'].append({
                    'type': 'tool',
                    'name': 'corruption_handler',
                    'function': {
                        'name': 'drop_call',
                        'arguments': {}
                    }
                })

Notes

This fix addresses the immediate issue of invalid JSON being sent to providers, but a separate investigation should be conducted to determine why compression is occasionally truncating arguments in the first place.

Recommendation

Apply the proposed fix to add a defensive JSON-arguments validator on the message-send path, as it prevents the "poison session" pattern and allows conversations to continue without errors.

Vote matrix · Quick signals

Works

Did the solution work? Tap to confirm.

Easy Fix

Was it a quick fix?

Time Saver

Did it save you time?

Blocking

Was it severely blocking?

Common Issue

Are others likely hitting this too?

Flaky / Intermittent

Is it intermittent?

Verified / Reproducible

Can you reproduce it reliably?

#api #optimization #mixed precision #training loop #device allocation

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

hermes - ✅(Solved) Fix Compressed sessions with corrupted tool_calls.arguments JSON brick chats with HTTP 400 (invalid_tool_call_format) [2 pull requests, 1 comments, 2 participants]

Recommended Tools

GitHub issue graph ai analysis

Error Message

Root Cause

Root cause (analysis)

Fix Action

Fixed

PR fix notes

PR #15241: fix(run_agent): repair corrupted tool_call arguments before sending to provider

Description (problem / solution / changelog)

Summary

Root cause

Fix

Tests

Risk

Files

Changed files

PR #15348: fix(run_agent): persist tool_call argument repairs into session history

Description (problem / solution / changelog)

Summary

Why this isn't redundant with the existing sanitizer

Changes

Validation

Changed files

Code Example

Summary

Reproduction

Root cause (analysis)

Proposed fix

Environment

extent analysis

TL;DR

Guidance

Example

Notes

Recommendation

Still need to ship something?

RELATED_DISCOVERY

TRENDING