hermes - ✅(Solved) Fix Compressed sessions with corrupted tool_calls.arguments JSON brick chats with HTTP 400 (invalid_tool_call_format) [2 pull requests, 1 comments, 2 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
NousResearch/hermes-agent#15236Fetched 2026-04-25 06:23:32
View on GitHub
Comments
1
Participants
2
Timeline
13
Reactions
0
Author
Participants
Timeline (top)
labeled ×6referenced ×3cross-referenced ×2closed ×1

After context compression splits a long-running session, the newly-created child session can contain assistant messages whose tool_calls[*].function.arguments field is a string that is not valid JSON (typically truncated mid-string). Every subsequent API call replays this poisoned history, and strict-validating providers (Copilot endpoint, https://api.githubcopilot.com) reject the entire request with:

HTTP 400: {'error': {'message': 'Invalid JSON format in tool call arguments', 'code': 'invalid_tool_call_format'}}

The session is now permanently broken — every inbound user message produces the same 400, and the gateway falls back to a canned error reply (⚠️ Non-retryable error (HTTP 400) — trying fallback...). The only recovery is manual: stop the gateway, quarantine the session JSONL, delete the session-store mapping, and restart. I have hit this 4 times in 24 hours on the same Feishu DM session.

Error Message

Non-retryable client error: Error code: 400 - {'error': {'message': 'Invalid JSON format in tool call arguments', 'code': 'invalid_tool_call_format'}}

Root Cause

Root cause (analysis)

Fix Action

Fixed

PR fix notes

PR #15241: fix(run_agent): repair corrupted tool_call arguments before sending to provider

Description (problem / solution / changelog)

Summary

Fixes #15236.

When a session is split by context compression mid-tool-call, an assistant message can end up with truncated/invalid JSON in tool_calls[*].function.arguments. On the next turn this is replayed verbatim and the provider rejects the entire request with HTTP 400 invalid_tool_call_format, bricking the conversation in a loop that cannot recover without manual session quarantine.

In real usage this hit the same DM 4 times in a single day; each recurrence required stopping the gateway, quarantining session files, scrubbing sessions.json, and restarting — clearly a symptom-only fix.

Root cause

run_agent.py :: AIAgent.run_conversation() builds api_messages from the in-memory transcript and forwards tool_calls[*].function.arguments straight to client.chat.completions.create(). Strict providers (Copilot, OpenAI, etc.) parse those arguments as JSON and 400 the entire request when any of them is malformed. There is no defensive validation between the in-memory state and the wire.

Fix

Adds a defensive sanitizer that runs immediately before the provider call:

  • Iterates assistant messages with tool_calls
  • json.loads-validates each function.arguments
  • Replaces invalid / empty / None arguments with "{}"
  • Injects a synthetic role="tool" response (or prepends a marker to the existing matching one) so tool_call_id pairing stays valid for strict APIs
  • Logs each repair with session_id / message_index / tool_call_id / function / preview for observability
  • Returns repair count; an INFO-level summary line is logged when > 0

This is defense in depth — corruption can originate from compression splits, manual edits, plugin bugs, or partial streaming writes. Sanitizing at the single send chokepoint catches all sources without invasive changes.

Tests

Adds tests/run_agent/test_tool_call_args_sanitizer.py with 7 unit tests:

  • truncated JSON repaired and synthetic tool response injected
  • existing matching tool response gets marker prepended (no duplicate insertion)
  • multiple bad calls in one assistant message
  • valid JSON left untouched
  • None arguments normalized to "{}" (silent — not corruption)
  • empty string arguments normalized to "{}" (silent — not corruption)
  • non-assistant messages and non-dict entries ignored
pytest tests/run_agent/test_tool_call_args_sanitizer.py -v
  → 7 passed, 0 failed

pytest tests/run_agent/ -q --ignore=tests/integration --ignore=tests/e2e
  → 989 passed, 7 skipped, 0 failed

Risk

Low — pure additive code on the request egress path:

  • No change to compression, session storage, or message construction
  • Idempotent: re-running on already-clean messages is a no-op
  • Strict equivalence for valid input (only mutates when JSON parse fails or value is None/empty)
  • No new dependencies

Files

  • run_agent.py+127 / -0
  • tests/run_agent/test_tool_call_args_sanitizer.py+157 / -0 (new)

Changed files

  • run_agent.py (modified, +127/-0)
  • tests/run_agent/test_tool_call_args_sanitizer.py (added, +157/-0)

PR #15348: fix(run_agent): persist tool_call argument repairs into session history

Description (problem / solution / changelog)

Salvage of #15241 by @luyao618 — cherry-picked unchanged onto current main.

Summary

Sanitizes assistant tool_calls[*].function.arguments in-place on messages (not just the ephemeral api_messages copy) right before each request. Malformed JSON args get replaced with "{}", a synthetic role="tool" response is injected for tool_call_id pairing, and the repair persists into session history so the next turn doesn't re-send the same bad state.

Why this isn't redundant with the existing sanitizer

run_agent.py already calls _repair_tool_call_arguments() at ~line 9468 but only on api_messages (the per-request copy). The in-memory messages keep the corruption, so next turn rebuilds api_messages with the same bad args and the loop repeats until manual reset. This PR fixes the persistence gap — the 400-loop root cause for #15236.

Changes

  • run_agent.py: new AIAgent._sanitize_tool_call_arguments() static method + invocation in run_conversation()
  • tests/run_agent/test_tool_call_args_sanitizer.py: 7 unit tests

Validation

tests/run_agent/test_tool_call_args_sanitizer.py → 7/7 passed

Closes #15236.

Changed files

  • run_agent.py (modified, +127/-0)
  • tests/run_agent/test_tool_call_args_sanitizer.py (added, +157/-0)

Code Example

HTTP 400: {'error': {'message': 'Invalid JSON format in tool call arguments', 'code': 'invalid_tool_call_format'}}

---

Session split detected: <parent_sid><child_sid> (compression)

---

Non-retryable client error: Error code: 400 - {'error': {'message': 'Invalid JSON format in tool call arguments', 'code': 'invalid_tool_call_format'}}
RAW_BUFFERClick to expand / collapse

Summary

After context compression splits a long-running session, the newly-created child session can contain assistant messages whose tool_calls[*].function.arguments field is a string that is not valid JSON (typically truncated mid-string). Every subsequent API call replays this poisoned history, and strict-validating providers (Copilot endpoint, https://api.githubcopilot.com) reject the entire request with:

HTTP 400: {'error': {'message': 'Invalid JSON format in tool call arguments', 'code': 'invalid_tool_call_format'}}

The session is now permanently broken — every inbound user message produces the same 400, and the gateway falls back to a canned error reply (⚠️ Non-retryable error (HTTP 400) — trying fallback...). The only recovery is manual: stop the gateway, quarantine the session JSONL, delete the session-store mapping, and restart. I have hit this 4 times in 24 hours on the same Feishu DM session.

Reproduction

  1. Run the gateway with a long-lived chat (DM via any platform).
  2. Let the session grow until compression triggers a split (you will see in gateway.log):
    Session split detected: <parent_sid> → <child_sid> (compression)
  3. Send another message. If the compression happened to truncate a tool_calls[*].function.arguments JSON string in the new session's history, the next provider call returns:
    Non-retryable client error: Error code: 400 - {'error': {'message': 'Invalid JSON format in tool call arguments', 'code': 'invalid_tool_call_format'}}
  4. Every subsequent message in the same chat replays the bad history → same 400 forever.

Root cause (analysis)

The compression / session-split path can produce assistant messages where tool_calls[*].function.arguments is invalid JSON (the spec requires it to be a serialized JSON string). Nothing on the gateway / run_agent send-path validates this invariant before forwarding to the provider. Strict providers (Copilot) reject the whole request; lax providers might silently swallow it, but the data is still corrupt.

This is the "poison session" pattern — a single bad assistant message persists in the JSONL and bricks the chat indefinitely until manual surgery.

Proposed fix

Add a defensive JSON-arguments validator on the message-send path (run_agent.py / agent loop, before client.chat.completions.create):

  1. Iterate the outbound messages; for every assistant message with tool_calls, run json.loads() on each tool_calls[*].function.arguments.
  2. If any fail to parse, take a remediation action — preferred: log a WARNING with session id + tool name + offending message index, replace the offending arguments with "{}" and append a synthetic tool message indicating the call was dropped due to corruption, so the conversation can continue. Optional secondary action: emit a metric / structured warning so operators see this happening.
  3. Continue with the (now valid) request instead of letting the provider 400 the whole conversation.

This stops the bleeding regardless of which upstream code path produces the bad JSON. A separate follow-up can then investigate why compression occasionally truncates arguments in the first place.

Environment

  • macOS (Darwin 25.4.0, Apple Silicon), Python 3.11
  • Provider: Copilot, model claude-opus-4.7, endpoint https://api.githubcopilot.com
  • Platform: Feishu DM gateway
  • Reproduced 4 times in <24 hours (same chat, fresh session each time after recovery)

extent analysis

TL;DR

Add a JSON-arguments validator on the message-send path to prevent invalid JSON from being sent to providers.

Guidance

  • Implement a defensive validation check in run_agent.py to ensure tool_calls[*].function.arguments is valid JSON before sending the request to the provider.
  • If invalid JSON is found, replace the offending arguments with a valid JSON string (e.g., "{}") and append a synthetic tool message to indicate the call was dropped due to corruption.
  • Consider emitting a metric or structured warning to alert operators of the issue.
  • Verify the fix by reproducing the scenario and checking that the conversation continues without errors.

Example

import json

# ...

for message in outbound_messages:
    if message['type'] == 'assistant' and 'tool_calls' in message:
        for tool_call in message['tool_calls']:
            try:
                json.loads(tool_call['function']['arguments'])
            except json.JSONDecodeError:
                # Replace offending arguments with valid JSON and append synthetic message
                tool_call['function']['arguments'] = '{}'
                message['tool_calls'].append({
                    'type': 'tool',
                    'name': 'corruption_handler',
                    'function': {
                        'name': 'drop_call',
                        'arguments': {}
                    }
                })

Notes

This fix addresses the immediate issue of invalid JSON being sent to providers, but a separate investigation should be conducted to determine why compression is occasionally truncating arguments in the first place.

Recommendation

Apply the proposed fix to add a defensive JSON-arguments validator on the message-send path, as it prevents the "poison session" pattern and allows conversations to continue without errors.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

hermes - ✅(Solved) Fix Compressed sessions with corrupted tool_calls.arguments JSON brick chats with HTTP 400 (invalid_tool_call_format) [2 pull requests, 1 comments, 2 participants]