hermes - ✅(Solved) Fix Bug: Tool call ID not invalidated after 400 error — causes persistent failures across sessions [2 pull requests, 1 participants]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
NousResearch/hermes-agent#16472Fetched 2026-04-28 06:53:07
View on GitHub
Comments
0
Participants
1
Timeline
6
Reactions
0
Author
Participants
Timeline (top)
labeled ×4cross-referenced ×2

When a tool call fails with an HTTP 400 error from MiniMax API (minimax-cn provider), the returned tool_call_id is not properly invalidated. Subsequent new sessions continue to reuse the same invalidated tool_call_id, causing every request to fail with:

Error code: 400 - {'type': 'error', 'error': {'type': 'bad_request_error', 'message': 'invalid params, invalid function arguments json string, tool_call_id: call_function_XXXXX_1 (2013)', 'http_code': '400'}

Error Message

When a tool call fails with an HTTP 400 error from MiniMax API (minimax-cn provider), the returned tool_call_id is not properly invalidated. Subsequent new sessions continue to reuse the same invalidated tool_call_id, causing every request to fail with: Error code: 400 - {'type': 'error', 'error': {'type': 'bad_request_error', 'message': 'invalid params, invalid function arguments json string, tool_call_id: call_function_XXXXX_1 (2013)', 'http_code': '400'} 2. Trigger any tool call that produces a 400 error (e.g., malformed parameters) When a tool call returns a 400 error, Hermes should:

Root Cause

When a tool call fails with an HTTP 400 error from MiniMax API (minimax-cn provider), the returned tool_call_id is not properly invalidated. Subsequent new sessions continue to reuse the same invalidated tool_call_id, causing every request to fail with:

Error code: 400 - {'type': 'error', 'error': {'type': 'bad_request_error', 'message': 'invalid params, invalid function arguments json string, tool_call_id: call_function_XXXXX_1 (2013)', 'http_code': '400'}

Fix Action

Fixed

PR fix notes

PR #16602: fix(agent): invalidate failed tool_call_id on MiniMax 400 errors

Description (problem / solution / changelog)

Closes #16472


Problem

When MiniMax API (minimax-cn) returns HTTP 400 for invalid tool call arguments, the failed tool_call_id (format: call_function_XXXXX_1) is persisted to the session database. Subsequent sessions inherit these messages with stale tool_call_ids, causing persistent failures until Hermes is restarted.

Changes

  • File: run_agent.py
  • Added _invalidate_failed_tool_calls() helper method (lines 3401-3418)
    • Only activates for minimax-cn provider
    • Detects 400 errors containing tool_call_id or invalid function arguments
    • Removes tool_call_id fields from all tool-role messages before persistence
  • Modified 400 error handling path (line 11721-11722)
    • Calls helper before _persist_session() when status code is 400
    • Prevents failed tool_call_ids from being saved to session database

Impact

  • Fixes cross-session tool_call_id pollution for MiniMax provider
  • No impact on other providers or normal operation
  • Minimal, focused change — only affects 400 error persistence path

Changed files

  • run_agent.py (modified, +20/-0)

PR #16604: fix(error_classifier): purge stale tool_call_id on 400 to prevent cross-session reuse (#16472)

Description (problem / solution / changelog)

Summary

Fixes #16472 — when a provider rejects a tool_call_id with HTTP 400 (most often MiniMax invalid params, invalid function arguments json string, tool_call_id: call_function_X_1 (2013)), Hermes was persisting the broken assistant turn into state.db anyway. Every subsequent session that inherited the same conversation_history replayed the same id and hit the same 400 indefinitely — until the gateway was restarted.

Root cause

run_agent.py (around if status_code == 400 and (approx_tokens > 50000 or len(api_messages) > 80)) only skips persistence when the 400 is a context-overflow shaped failure. A small-session 400 (which is exactly what an invalid tool_call_id produces — short payload, structured error) falls through to _persist_session(messages, conversation_history) and writes the broken assistant turn to state.db.

That turn — assistant message with tool_calls=[{ id: "call_function_X_1", ... }] — is then loaded by every new session that inherits the same conversation history, so the provider keeps seeing the same id and keeps 400-ing. Restarting the gateway is the only way out today.

Reporter evidence:

  • 2026-04-26 → call_function_18464z1gskgy_1 reused across ~5 sessions
  • 2026-04-26 → call_function_bgfd4km3qnvm_1 reused across ~3 sessions
  • 2026-04-27 → call_function_41ojma24zxdp_1 reused across 4+ sessions

Fix

1. Classify the error

agent/error_classifier.py:

  • New FailoverReason.invalid_tool_call_id.
  • New ClassifiedError.purge_tool_call_id: Optional[str] carrying the offending id.
  • New _INVALID_TOOL_CALL_ID_PATTERNS (covers MiniMax invalid function arguments, OpenAI no tool call with id, etc.) plus a tolerant regex extractor extract_invalid_tool_call_id() that handles the various formats providers use (tool_call_id: X, tool_call_id X, 'tool_call_id': 'X', no tool call with id X found).
  • The branch runs before the generic format_error fallback inside _classify_400, so the cheaper recovery (purge + retry) wins over fallback to a different provider.
  • A case-preserved companion to error_msg is built once at the top of classify_api_error and threaded through, so case-sensitive ids (e.g. call_function_BgFD4kM3qNvm_1) survive the lowercasing the rest of the classifier relies on.

2. Purge the stale call

run_agent.py:

  • New static AIAgent._purge_invalid_tool_call_id(messages, invalid_id) walks the messages list and:
    • Drops every role=='tool' message whose tool_call_id matches.
    • Strips the matching entry from each assistant tool_calls list (handles both dict-style and SDK-object-style entries).
    • Drops an assistant turn outright when the bad call was its only call and it had no textual content; otherwise preserves the surviving content and surviving calls.

3. Retry without consuming retry_count

When classified.reason == invalid_tool_call_id and classified.purge_tool_call_id is set, the retry loop:

  1. Purges the id from both messages and api_messages.
  2. Persists the cleaned history immediately so a crash-restart cannot resurrect the bad id (this is the key step the original code missed).
  3. continues without consuming retry_count.

Behavior

ScenarioBeforeAfter
400 invalid function arguments... tool_call_id: XPersist broken turn → next session 400s on same idStrip id, persist clean state, retry succeeds
400 context overflowCompress + retry (unchanged)Compress + retry (unchanged)
400 generic format errorAbort / fallback (unchanged)Abort / fallback (unchanged)
400 mentions tool_call_id field missing (no id)Generic format_errorGeneric format_error (extractor returns None, falls through)

Tests

13 new tests, all green. Existing 2049 tests still pass; the 9 failures in test_bedrock_adapter.py and test_anthropic_adapter.py are pre-existing on main (unrelated to this PR — verified by git stash).

tests/agent/test_error_classifier.py — 6 new:

  • MiniMax body shape is classified as invalid_tool_call_id
  • Mixed-case id (call_function_BgFD4kM3qNvm_1) is extracted with original casing
  • Unrelated 400 (e.g. parameter 'temperature' out of range) keeps format_error
  • Pattern match without an extractable id falls through to format_error (so a vague mention doesn't cause an empty-purge spin)
  • OpenAI no tool call with id call_xyz_42 found is classified the same way
  • invalid_tool_call_id branch wins over context_overflow when the message could match both

tests/run_agent/test_agent_guardrails.py — 7 new:

  • Drops matching tool_result message
  • Strips only the bad call when assistant has multiple
  • Drops assistant turn when the bad call was its only call and content is empty
  • Keeps assistant turn when the bad call was its only call but content is present
  • No-op when id is missing from history
  • Handles SDK-object-style tool_calls (SimpleNamespace(id=..., function=...))
  • Empty / None inputs are safe
$ pytest tests/agent/test_error_classifier.py tests/run_agent/test_agent_guardrails.py -q
160 passed in 5.03s

Changed files

  • agent/error_classifier.py (modified, +117/-0)
  • run_agent.py (modified, +112/-0)
  • tests/agent/test_error_classifier.py (modified, +119/-0)
  • tests/run_agent/test_agent_guardrails.py (modified, +107/-0)

Code Example

Error code: 400 - {'type': 'error', 'error': {'type': 'bad_request_error', 'message': 'invalid params, invalid function arguments json string, tool_call_id: call_function_XXXXX_1 (2013)', 'http_code': '400'}
RAW_BUFFERClick to expand / collapse

Description

When a tool call fails with an HTTP 400 error from MiniMax API (minimax-cn provider), the returned tool_call_id is not properly invalidated. Subsequent new sessions continue to reuse the same invalidated tool_call_id, causing every request to fail with:

Error code: 400 - {'type': 'error', 'error': {'type': 'bad_request_error', 'message': 'invalid params, invalid function arguments json string, tool_call_id: call_function_XXXXX_1 (2013)', 'http_code': '400'}

Evidence

The same pattern has reproduced multiple times across different dates:

  • 2026-04-26: call_function_18464z1gskgy_1 → 400, reused across ~5 sessions
  • 2026-04-26: call_function_bgfd4km3qnvm_1 → 400, reused across ~3 sessions
  • 2026-04-27: call_function_41ojma24zxdp_1 → 400, reused across 4+ sessions

Each time a new tool_call_id appears on first failure, it then gets "stuck" and reused in all subsequent sessions until the Hermes service is restarted.

Steps to Reproduce

  1. Use MiniMax provider (minimax-cn) with tools enabled
  2. Trigger any tool call that produces a 400 error (e.g., malformed parameters)
  3. Observe that all subsequent sessions also fail with the same tool_call_id

Expected Behavior

When a tool call returns a 400 error, Hermes should:

  • Invalidate that tool_call_id
  • Generate a fresh tool_call_id for subsequent requests
  • Not persist or cache the failed ID across sessions

Environment

  • Provider: MiniMax (minimax-cn)
  • Model: MiniMax-M2.7
  • Base URL: https://api.minimaxi.com/v1
  • Platform: Docker

extent analysis

TL;DR

The issue can be mitigated by ensuring that the tool_call_id is properly invalidated and a new one is generated after a 400 error from the MiniMax API.

Guidance

  • Verify that the tool_call_id invalidation logic is correctly implemented and triggered upon receiving a 400 error from the MiniMax API.
  • Check the session management code to ensure that the tool_call_id is not being cached or persisted across sessions.
  • Review the error handling mechanism to guarantee that a fresh tool_call_id is generated for subsequent requests after an error.
  • Consider adding logging to track the tool_call_id usage and invalidation to better understand the issue.

Example

No code snippet can be provided without more context, but ensuring proper tool_call_id management might involve something like:

if response.status_code == 400:
    invalidate_tool_call_id(current_tool_call_id)
    current_tool_call_id = generate_new_tool_call_id()

Notes

The root cause seems to be related to the tool_call_id management, but without more information about the implementation, it's difficult to provide a definitive fix. The issue might be specific to the MiniMax provider or the Hermes service configuration.

Recommendation

Apply a workaround by manually invalidating the tool_call_id and generating a new one after a 400 error, as this seems to be the most direct way to address the issue given the provided information.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

hermes - ✅(Solved) Fix Bug: Tool call ID not invalidated after 400 error — causes persistent failures across sessions [2 pull requests, 1 participants]