hermes - ✅(Solved) Fix JSONDecodeError misclassified as local validation error causes non-retryable abort (HTTP None) [3 pull requests, 1 participants]

Official PRs (…)
ON THIS PAGE

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…
GitHub stats
NousResearch/hermes-agent#14271Fetched 2026-04-23 07:45:53
View on GitHub
Comments
0
Participants
1
Timeline
10
Reactions
0
Participants
Timeline (top)
cross-referenced ×4labeled ×3referenced ×3

Error Message

⚠️ API call failed (attempt 1/3): JSONDecodeError 📝 Error: Expecting value: line 1 column 1 (char 0) ⚠️ Non-retryable error (HTTP None) — trying fallback... ❌ Non-retryable client error (HTTP None). Aborting.

Root Cause

Because json.JSONDecodeError subclasses ValueError, current logic in run_agent.py classifies it as is_local_validation_error, which bypasses normal retry handling. This causes premature aborts for recoverable upstream/provider parse failures (e.g., empty or non-JSON transient responses).

Fix Action

Fix / Workaround

Deterministic (unit-style repro)

  1. Instantiate AIAgent.
  2. Patch _interruptible_api_call with side effects:
    • first call: json.JSONDecodeError("Expecting value", "", 0)
    • second call: valid completion response
  3. Call run_conversation("hello").

PR fix notes

PR #14293: fix(agent): retry provider JSON decode failures

Description (problem / solution / changelog)

Summary

  • stop treating provider-side JSONDecodeError exceptions as local validation failures
  • keep JSON decode failures on the normal retry path instead of aborting immediately
  • add a regression test that verifies a JSON decode failure can recover on retry

Testing

  • python3 -m pytest -o addopts='' tests/run_agent/test_anthropic_error_handling.py

Fixes #14271

Changed files

  • run_agent.py (modified, +4/-1)
  • tests/run_agent/test_anthropic_error_handling.py (modified, +34/-0)

PR #2: fix: apply P1 issue hardening (todo/registry/retry)

Description (problem / solution / changelog)

결론

GitHub P1 오픈 이슈 4건의 핵심 재현 케이스를 테스트로 먼저 추가하고, 최소 수정으로 런타임 방어 로직을 반영했습니다.

반영 이슈

변경 요약

  1. todo_tool 하드닝 (#14185)
  • todos가 문자열이면 JSON 파싱 시도
  • 파싱 실패 시 구조화 에러 반환
  • list 타입 검증 추가
  • _validate, _dedupe_by_id에서 non-dict 입력 방어
  1. registry dispatch 하드닝 (#14186)
  • _normalize_tool_name() 추가
  • CamelCase, _tool suffix, 구분자 드리프트를 정규화 fallback으로 처리
  • 예: TodoTool_tool -> todo, WriteFile_tool -> write_file
  1. run_agent 분류 수정 (#14271)
  • local validation fast-fail 분류에서 json.JSONDecodeError 제외
  • transient JSON 파싱 실패가 retry 경로를 타도록 보정
  1. error classifier 보강 (#14195)
  • _extract_error_body()__cause__/__context__ 체인을 따라 nested body를 추출하도록 수정
  • wrapped 402 오류에서 nested message("usage limit ... try again")를 잃지 않도록 보정
  • billing 오분류를 줄이고 transient rate_limit 분류 정확도 개선

테스트

  • scripts/run_tests.sh tests/tools/test_todo_tool.py tests/tools/test_registry.py tests/run_agent/test_run_agent.py::TestRetryExhaustion::test_jsondecode_error_is_retried_not_treated_as_local_validation
  • scripts/run_tests.sh tests/agent/test_error_classifier.py tests/tools/test_todo_tool.py tests/tools/test_registry.py tests/run_agent/test_run_agent.py::TestRetryExhaustion::test_jsondecode_error_is_retried_not_treated_as_local_validation

모두 통과했습니다.

비고

  • aideautomation/aide_hermes_agent 저장소는 코드 구조가 현재 hermes-agent와 상이하여 동일 커밋 체리픽이 불가했습니다.
  • 실행 가능한 반영은 aideautomation/hermes-agent 포크 기준으로 완료했습니다.

Changed files

  • agent/error_classifier.py (modified, +24/-13)
  • run_agent.py (modified, +1/-1)
  • tests/agent/test_error_classifier.py (modified, +23/-0)
  • tests/run_agent/test_run_agent.py (modified, +20/-0)
  • tests/tools/test_registry.py (modified, +24/-0)
  • tests/tools/test_todo_tool.py (modified, +23/-0)
  • tools/registry.py (modified, +43/-1)
  • tools/todo_tool.py (modified, +16/-2)

PR #14366: fix: exclude json.JSONDecodeError from local validation error check

Description (problem / solution / changelog)

Problem

Closes #14271

json.JSONDecodeError inherits from ValueError. When the OpenAI SDK fails to parse a provider response (empty body, truncated JSON, garbled bytes), it raises json.JSONDecodeError. The is_local_validation_error check in run_agent.py catches it as isinstance(api_error, ValueError) and triggers a non-retryable abort — but the error is transient (provider-side), not a programming bug.

The error classifier (agent/error_classifier.py) already correctly returns FailoverReason.unknown, retryable=True for these errors. The bug is the inline isinstance check that overrides the classifier.

Secondary bug found

UnicodeDecodeError (also a ValueError subclass) from garbled provider responses gets the same wrong treatment. The existing code only excluded UnicodeEncodeError but a garbled response body can also cause UnicodeDecodeError.

Fix

Two changes in run_agent.py line ~10681:

Before:

is_local_validation_error = (
    isinstance(api_error, (ValueError, TypeError))
    and not isinstance(api_error, UnicodeEncodeError)
)

After:

is_local_validation_error = (
    isinstance(api_error, (ValueError, TypeError))
    and not isinstance(api_error, (UnicodeError, json.JSONDecodeError))
)

Using UnicodeError (parent class) instead of just UnicodeEncodeError covers UnicodeDecodeError and UnicodeTranslateError as well — all three indicate provider-side transport issues, not local programming bugs.

Testing

7 new tests in tests/test_json_decode_error_misclassification.py:

  • 4 RED tests (would fail on old code): json.JSONDecodeError and UnicodeDecodeError must NOT be classified as local validation errors; classifier must return retryable=True
  • 3 GREEN tests (sanity): plain ValueError, TypeError, and UnicodeEncodeError still correctly flagged

All 111 existing classifier tests pass unchanged.

Impact

Low risk, high value. The change only affects the is_local_validation_error boolean — the classifier pipeline and all other recovery paths are untouched. Genuine ValueError/TypeError from coding bugs still trigger non-retryable abort. Only provider-originated ValueError subclasses now correctly fall through to the retryable recovery path.

Changed files

  • run_agent.py (modified, +14/-1)
  • tests/test_json_decode_error_misclassification.py (added, +119/-0)

Code Example

⚠️  API call failed (attempt 1/3): JSONDecodeError
   📝 Error: Expecting value: line 1 column 1 (char 0)
⚠️ Non-retryable error (HTTP None) — trying fallback...
Non-retryable client error (HTTP None). Aborting.

---

is_local_validation_error = (
    isinstance(api_error, (ValueError, TypeError))
    and not isinstance(api_error, UnicodeEncodeError)
)

---

is_local_validation_error = (
    isinstance(api_error, (ValueError, TypeError))
    and not isinstance(api_error, (UnicodeEncodeError, json.JSONDecodeError))
)
RAW_BUFFERClick to expand / collapse

Bug Description

Transient provider JSONDecodeError failures are being treated as local validation/programming errors and immediately routed to the non-retryable client-error path.

Because json.JSONDecodeError subclasses ValueError, current logic in run_agent.py classifies it as is_local_validation_error, which bypasses normal retry handling. This causes premature aborts for recoverable upstream/provider parse failures (e.g., empty or non-JSON transient responses).

Steps to Reproduce

Deterministic (unit-style repro)

  1. Instantiate AIAgent.
  2. Patch _interruptible_api_call with side effects:
    • first call: json.JSONDecodeError("Expecting value", "", 0)
    • second call: valid completion response
  3. Call run_conversation("hello").

Real-world repro seen in production

  1. Configure custom provider to https://api.llmgateway.io/v1 (OpenAI-compatible chat completions).
  2. Use model kimi-k2.6.
  3. Send a normal prompt during a period where upstream intermittently returns empty/non-JSON response.
  4. Observe first parse failure aborting as non-retryable instead of retrying.

Expected Behavior

  • JSONDecodeError from provider response parsing should go through normal API error classification/retry/fallback flow.
  • If next attempt succeeds, conversation should complete normally.

Actual Behavior

The first JSONDecodeError is treated as non-retryable local validation (HTTP None) and aborts early.

Environment

  • OS: Proxmox LXC guest (Debian-based) on Proxmox VE
  • Hermes repo: branch main, commit 402d048e (observed before local fix)
  • Python: Python 3.11.15
  • API mode: chat_completions
  • Provider: custom (LLMGateway), model kimi-k2.6

Error Output

⚠️  API call failed (attempt 1/3): JSONDecodeError
   📝 Error: Expecting value: line 1 column 1 (char 0)
⚠️ Non-retryable error (HTTP None) — trying fallback...
❌ Non-retryable client error (HTTP None). Aborting.

Request dump captured:

  • /root/.hermes/sessions/request_dump_20260423_011633_775f73_20260423_011738_719481.json
  • "error": {"type": "JSONDecodeError", "message": "Expecting value: line 1 column 1 (char 0)"}

Suspected Root Cause

In run_agent.py, around the client-error classification block (line ~10669 in current working tree):

is_local_validation_error = (
    isinstance(api_error, (ValueError, TypeError))
    and not isinstance(api_error, UnicodeEncodeError)
)

Since json.JSONDecodeError is a ValueError, it is incorrectly flagged as local validation.

Proposed Fix

Exclude json.JSONDecodeError from local-validation fast-fail classification so it can use normal retry logic:

is_local_validation_error = (
    isinstance(api_error, (ValueError, TypeError))
    and not isinstance(api_error, (UnicodeEncodeError, json.JSONDecodeError))
)

Regression Test Suggestion

Add test in tests/run_agent/test_run_agent.py asserting:

  • first API attempt raises JSONDecodeError
  • second attempt succeeds
  • run_conversation returns completed response (no immediate non-retryable abort path triggered)

Example test name:

  • test_jsondecode_error_is_retried_not_treated_as_local_validation

(Locally validated in working tree at tests/run_agent/test_run_agent.py line ~2822.)

extent analysis

TL;DR

Exclude json.JSONDecodeError from local-validation fast-fail classification to enable normal retry logic.

Guidance

  • Review the is_local_validation_error classification logic in run_agent.py to ensure it correctly handles json.JSONDecodeError as a retryable error.
  • Update the is_local_validation_error check to exclude json.JSONDecodeError, as proposed in the issue: is_local_validation_error = (isinstance(api_error, (ValueError, TypeError)) and not isinstance(api_error, (UnicodeEncodeError, json.JSONDecodeError))).
  • Verify the fix by running the suggested regression test in tests/run_agent/test_run_agent.py to ensure JSONDecodeError is retried and not treated as a local validation error.
  • Consider adding additional logging or monitoring to detect and handle similar issues in the future.

Example

is_local_validation_error = (
    isinstance(api_error, (ValueError, TypeError))
    and not isinstance(api_error, (UnicodeEncodeError, json.JSONDecodeError))
)

Notes

The proposed fix assumes that json.JSONDecodeError is the only ValueError subclass that should be retried. If other ValueError subclasses should also be retried, the is_local_validation_error check may need to be further updated.

Recommendation

Apply the proposed workaround by updating the is_local_validation_error check to exclude json.JSONDecodeError, as this will enable normal retry logic for this specific error type and prevent premature aborts.

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING

hermes - ✅(Solved) Fix JSONDecodeError misclassified as local validation error causes non-retryable abort (HTTP None) [3 pull requests, 1 participants]