hermes - ✅ (Solved) Fix generic 400/disconnect errors misclassified as context_overflow in 1M-context sessions [2 pull requests, 1 participant]

GitHub stats

NousResearch/hermes-agent#16351 (fetched 2026-04-28 06:53:52)

  • Comments: 0
  • Participants: 1
  • Timeline events: 11
  • Reactions: 0
  • Timeline (top): referenced ×6, labeled ×3, cross-referenced ×2

Error Message

from agent.error_classifier import classify_api_error

class FakeHTTP400(Exception):
    status_code = 400
    body = {"error": {"message": "Error"}}
    def __str__(self):
        return "Error"

result = classify_api_error(
    FakeHTTP400(),
    provider="openai-codex",
    model="gpt-5.5",
    approx_tokens=74320,
    context_length=1_000_000,
    num_messages=432,
)

print(result.reason, result.retryable, result.should_compress)

Root Cause

Current agent/error_classifier.py has heuristics equivalent to:

# server disconnect path
is_large = approx_tokens > context_length * 0.6 or approx_tokens > 120000 or num_messages > 200

# generic 400 path
is_large = approx_tokens > context_length * 0.4 or approx_tokens > 80000 or num_messages > 80

The absolute fallbacks are reasonable for ~128K/200K context windows, but they are too aggressive for 1M-context sessions. A long session can have hundreds of messages while still being well below the actual context budget.
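To make the false positive concrete, the sketch below evaluates each clause of the pre-fix generic-400 heuristic separately, using the numbers from the reproduction. It is a standalone illustration: the helper name `old_clauses_generic_400` is hypothetical and is not part of `agent/error_classifier.py`.

```python
def old_clauses_generic_400(approx_tokens, context_length, num_messages):
    """Return each clause of the pre-fix generic-400 heuristic on its own.

    Hypothetical helper for illustration; the real classifier combines
    these clauses into a single `is_large` expression.
    """
    return {
        "relative (> 40% of window)": approx_tokens > context_length * 0.4,
        "absolute tokens (> 80000)": approx_tokens > 80000,
        "absolute messages (> 80)": num_messages > 80,
    }


# Values taken from the reproduction in the issue.
clauses = old_clauses_generic_400(
    approx_tokens=74320, context_length=1_000_000, num_messages=432
)

for name, fired in clauses.items():
    print(f"{name}: {fired}")

# The old heuristic ORs the clauses together.
is_large = any(clauses.values())
print("is_large:", is_large)
print(f"window used: {74320 / 1_000_000:.1%}")
```

Only the message-count clause fires, even though the estimated prompt occupies well under a tenth of the window.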

Fix Action

Fixed

PR fix notes

PR #16352: fix(error_classifier): avoid large-context false overflow heuristics

Description (problem / solution / changelog)

Summary

Fixes a large-context false-positive in agent.error_classifier: generic HTTP 400 responses and server disconnects should not be classified as context_overflow solely because a 1M-context session has many messages.

The previous heuristic used absolute fallbacks:

approx_tokens > 80000 or num_messages > 80
approx_tokens > 120000 or num_messages > 200

Those thresholds are useful proxies for smaller context windows, but they are too aggressive for explicitly large windows. A 1M-context session can have 432 messages and ~74K estimated tokens while still being far below the real budget.

This patch keeps the relative pressure checks for all models, but gates the absolute token/message-count fallbacks to smaller context windows (<= 256000).

Behavior covered

  • Generic 400 with approx_tokens=74320, context_length=1_000_000, num_messages=432 is now format_error, not context_overflow.
  • Server disconnect with the same low-pressure 1M context shape is now timeout, not context_overflow.
  • Existing smaller-window behavior remains covered by existing tests.

Test plan

RED before fix:

/home/ubuntu/.hermes/hermes-agent/venv/bin/python -m pytest \
  tests/agent/test_error_classifier.py::TestClassifyApiError::test_400_generic_many_messages_below_large_context_pressure_is_format_error \
  tests/agent/test_error_classifier.py::TestClassifyApiError::test_disconnect_many_messages_below_large_context_pressure_is_timeout \
  -v -o 'addopts='

Both tests failed with FailoverReason.context_overflow.

GREEN after fix:

/home/ubuntu/.hermes/hermes-agent/venv/bin/python -m pytest tests/agent/test_error_classifier.py -q -o 'addopts='
/home/ubuntu/.hermes/hermes-agent/venv/bin/python -m py_compile agent/error_classifier.py tests/agent/test_error_classifier.py
git diff --check

Result:

120 passed

Manual reproduction after fix:

FakeHTTP400 FailoverReason.format_error False False
Exception FailoverReason.timeout True False

Fixes #16351

Related: #14499, #14858, #14953, #15844, #6751

Changed files

  • agent/error_classifier.py (modified, +12/-2)
  • tests/agent/test_error_classifier.py (modified, +32/-0)

PR #16380: fix(error_classifier): gate absolute msg/token heuristics to small context windows

Description (problem / solution / changelog)

Closes #16351.

Problem

agent/error_classifier.py flagged non-context errors as context_overflow in long-context (1M) Codex/GPT-5.x sessions, purely because num_messages > 80 (generic 400) or num_messages > 200 (disconnect) — even when approx_tokens was a fraction of the actual budget.

Repro from the issue:

classify_api_error(
    FakeHTTP400(),
    provider="openai-codex",
    model="gpt-5.5",
    approx_tokens=74320,
    context_length=1_000_000,
    num_messages=432,
)
# Before: FailoverReason.context_overflow (retryable=True, should_compress=True)
# After:  FailoverReason.format_error      (retryable=False, should_compress=False)

That sent format errors into the compression/probe-down path, causing unnecessary compaction and stale handoff pollution on 1M sessions.

Fix

Apply exactly the gate suggested in the issue body: scope absolute token/message-count fallbacks to context_length <= 256000. Relative pressure thresholds (> 0.6 for disconnect, > 0.4 for generic 400) still fire on any context size.

# server disconnect path
is_large = approx_tokens > context_length * 0.6 or (
    context_length <= 256000 and (approx_tokens > 120000 or num_messages > 200)
)

# generic 400 path
is_large = approx_tokens > context_length * 0.4 or (
    context_length <= 256000 and (approx_tokens > 80000 or num_messages > 80)
)

Existing behavior for ~128K/200K context windows is unchanged.
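The claim that smaller windows are unaffected can be spot-checked by comparing the old and gated expressions over a grid of inputs with `context_length <= 256000`. This is a sketch only: the four function names are hypothetical restatements of the expressions quoted above, and the grid values are arbitrary sample points, not the project's test data.

```python
from itertools import product


def old_generic_400(t, c, n):
    return t > c * 0.4 or t > 80000 or n > 80


def new_generic_400(t, c, n):
    return t > c * 0.4 or (c <= 256000 and (t > 80000 or n > 80))


def old_disconnect(t, c, n):
    return t > c * 0.6 or t > 120000 or n > 200


def new_disconnect(t, c, n):
    return t > c * 0.6 or (c <= 256000 and (t > 120000 or n > 200))


# Sample points chosen around the thresholds; all windows are <= 256000.
windows = [8_000, 32_000, 128_000, 200_000, 256_000]
tokens = [0, 10_000, 79_999, 80_000, 80_001, 120_000, 120_001, 250_000]
messages = [0, 80, 81, 200, 201, 500]

mismatches = [
    (t, c, n)
    for t, c, n in product(tokens, windows, messages)
    if old_generic_400(t, c, n) != new_generic_400(t, c, n)
    or old_disconnect(t, c, n) != new_disconnect(t, c, n)
]

print("mismatches:", len(mismatches))
```

Because the gate `context_length <= 256000` is always true on this grid, the gated expressions reduce to the old ones and no mismatch is expected.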

Tests

tests/agent/test_error_classifier.py — 4 new tests covering the 1M-context regime:

  • test_400_generic_1m_context_high_message_count_not_overflow — exact repro from issue (74K tokens, 432 msgs, 1M ctx) → format_error.
  • test_400_generic_1m_context_relative_pressure_still_overflow — 500K tokens / 1M ctx still → context_overflow.
  • test_disconnect_1m_context_high_message_count_is_timeout — 150K tokens, 300 msgs, 1M ctx → timeout.
  • test_disconnect_1m_context_relative_pressure_still_overflow — 700K tokens / 1M ctx still → context_overflow.

pytest tests/agent/test_error_classifier.py -q
122 passed (118 pre-existing + 4 new)
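The four scenarios can be restated against the gated `is_large` expressions alone, without importing the classifier. The sketch below is not the repository's test code: the function names are hypothetical, it checks only the `is_large` boolean rather than the resulting `FailoverReason`, and the message counts for the two relative-pressure cases are assumed, since the list above does not give them.

```python
def is_large_generic_400(approx_tokens, context_length, num_messages):
    return approx_tokens > context_length * 0.4 or (
        context_length <= 256000 and (approx_tokens > 80000 or num_messages > 80)
    )


def is_large_disconnect(approx_tokens, context_length, num_messages):
    return approx_tokens > context_length * 0.6 or (
        context_length <= 256000 and (approx_tokens > 120000 or num_messages > 200)
    )


# (label, heuristic, approx_tokens, context_length, num_messages, expected is_large)
cases = [
    ("generic 400, issue repro", is_large_generic_400, 74_320, 1_000_000, 432, False),
    # num_messages=10 is an assumed value for the relative-pressure cases.
    ("generic 400, relative pressure", is_large_generic_400, 500_000, 1_000_000, 10, True),
    ("disconnect, many messages", is_large_disconnect, 150_000, 1_000_000, 300, False),
    ("disconnect, relative pressure", is_large_disconnect, 700_000, 1_000_000, 10, True),
]

results = {}
for label, heuristic, tokens, window, messages, expected in cases:
    results[label] = heuristic(tokens, window, messages)
    assert results[label] == expected, label
    print(f"{label}: is_large={results[label]}")
```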

Changed files

  • agent/error_classifier.py (modified, +6/-2)
  • tests/agent/test_error_classifier.py (modified, +62/-0)


Bug description

agent.error_classifier.classify_api_error() can misclassify generic HTTP 400 errors and server disconnects as FailoverReason.context_overflow in explicitly large-context sessions (for example 1M-token Codex/GPT-5.x sessions), even when the prompt is far below the configured context window.

The problematic path is the absolute size/message-count heuristic. On current main, a generic 400 with many messages is classified as context overflow because num_messages > 80, even when approx_tokens is only ~74K against a 1M context window.

Minimal reproduction

from agent.error_classifier import classify_api_error

class FakeHTTP400(Exception):
    status_code = 400
    body = {"error": {"message": "Error"}}
    def __str__(self):
        return "Error"

result = classify_api_error(
    FakeHTTP400(),
    provider="openai-codex",
    model="gpt-5.5",
    approx_tokens=74320,
    context_length=1_000_000,
    num_messages=432,
)

print(result.reason, result.retryable, result.should_compress)

Current result:

FailoverReason.context_overflow True True

Expected result:

FailoverReason.format_error False False

A similar issue exists for server disconnect messages with the same low token pressure / high message count shape: the absolute num_messages > 200 branch classifies it as context_overflow instead of a transport/timeout condition.

Root cause

Current agent/error_classifier.py has heuristics equivalent to:

# server disconnect path
is_large = approx_tokens > context_length * 0.6 or approx_tokens > 120000 or num_messages > 200

# generic 400 path
is_large = approx_tokens > context_length * 0.4 or approx_tokens > 80000 or num_messages > 80

The absolute fallbacks are reasonable for ~128K/200K context windows, but they are too aggressive for 1M-context sessions. A long session can have hundreds of messages while still being well below the actual context budget.

User impact

This sends non-context errors into the context-overflow recovery path. In long-context Codex sessions, that can cause unnecessary compression and runtime context probe-down from an explicit 1M window to lower probe tiers (currently 256K/128K depending on branch/version), which can lead to repeated compaction and stale handoff pollution.

Suggested fix

Gate the absolute token/message-count heuristics to smaller context windows, and require relative pressure for large-context models. For example:

# server disconnect path
is_large = approx_tokens > context_length * 0.6 or (
    context_length <= 256000 and (approx_tokens > 120000 or num_messages > 200)
)

# generic 400 path
is_large = approx_tokens > context_length * 0.4 or (
    context_length <= 256000 and (approx_tokens > 80000 or num_messages > 80)
)

This preserves existing behavior for smaller context windows while preventing 1M sessions from being classified as overflow solely because they have many messages.
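One consequence of the suggested gate is a hard boundary at `context_length == 256000`. The sketch below shows the same prompt shape classified differently on either side of that boundary; the function name and the sample values (90,000 tokens, 50 messages) are illustrative assumptions, not taken from the issue.

```python
def is_large_generic_400(approx_tokens, context_length, num_messages):
    """Gated generic-400 heuristic, restated from the suggested fix."""
    return approx_tokens > context_length * 0.4 or (
        context_length <= 256000 and (approx_tokens > 80000 or num_messages > 80)
    )


# Same prompt shape, windows one token apart.
at_gate = is_large_generic_400(
    approx_tokens=90_000, context_length=256_000, num_messages=50
)
above_gate = is_large_generic_400(
    approx_tokens=90_000, context_length=256_001, num_messages=50
)

print("context_length=256000:", at_gate)
print("context_length=256001:", above_gate)
```

At 256,000 the absolute token fallback still applies (90,000 > 80,000); one token above, only the relative check remains, and 90,000 is below 40% of the window.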

Related work

Related but not identical:

  • #14499: prevents direct long-context probe collapse by changing probe tiers
  • #14858: guards untrusted probe shrink when the guessed tier is below the current prompt estimate
  • #14953: preserves explicit context window after generic overflow
  • #15844: merged context-length propagation/probe-tier changes
  • #6751: fixed one Codex 400-format-error compression loop by parsing flat 400 bodies

This issue is specifically about the classifier entering context_overflow too early for large context windows due to absolute message-count/token heuristics.

Extent analysis

TL;DR

Update the agent/error_classifier.py heuristics to gate absolute token/message-count checks based on context window size to prevent misclassification of large-context sessions.

Guidance

  • Review the current agent/error_classifier.py heuristics and update the conditions to include context window size checks, as suggested in the issue.
  • Verify the changes by running the minimal reproduction code with the updated heuristics and checking the classification result.
  • Test the updated classifier with various input scenarios, including large-context sessions with high message counts, to ensure correct classification.
  • Consider adding additional logging or monitoring to detect and report any potential misclassifications.
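The logging suggestion could look roughly like the following: a variant of the gated generic-400 check that also reports which clause fired. This is a hypothetical sketch, not code from the repository; the function name, logger name, and clause labels are all assumptions.

```python
import logging

logger = logging.getLogger("error_classifier_sketch")  # hypothetical logger name


def explain_is_large_generic_400(approx_tokens, context_length, num_messages):
    """Return (is_large, fired_clauses) for the gated generic-400 heuristic."""
    small_window = context_length <= 256000
    fired = []
    if approx_tokens > context_length * 0.4:
        fired.append("relative_pressure")
    if small_window and approx_tokens > 80000:
        fired.append("absolute_tokens")
    if small_window and num_messages > 80:
        fired.append("absolute_messages")

    is_large = bool(fired)
    if is_large:
        # Record why the error is about to be treated as context overflow.
        logger.info(
            "generic 400 treated as large: clauses=%s tokens=%d window=%d messages=%d",
            fired, approx_tokens, context_length, num_messages,
        )
    return is_large, fired


logging.basicConfig(level=logging.INFO)
large_window = explain_is_large_generic_400(74_320, 1_000_000, 432)
small_window = explain_is_large_generic_400(74_320, 128_000, 432)
print(large_window)
print(small_window)
```

Logging the fired clause makes a misclassification easier to trace, because an overflow decision driven only by `absolute_messages` stands out from one driven by real token pressure.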

Example

# server disconnect path
is_large = approx_tokens > context_length * 0.6 or (
    context_length <= 256000 and (approx_tokens > 120000 or num_messages > 200)
)

# generic 400 path
is_large = approx_tokens > context_length * 0.4 or (
    context_length <= 256000 and (approx_tokens > 80000 or num_messages > 80)
)

Notes

The suggested fix is specific to the agent/error_classifier.py file and may require additional testing and validation to ensure correct behavior in all scenarios.

Recommendation

Apply the suggested fix: gate the absolute token/message-count checks in agent/error_classifier.py on context window size (context_length <= 256000), keeping the relative pressure checks for all window sizes. This should stop large-context sessions from being classified as context_overflow solely because of message count.
