hermes - 💡(How to fix) Fix [Bug]: Hermes sends oversized prompts after switching to lower-context local model; token estimation undercounts and compression can increase prompt size [3 pull requests]

Official PRs (…)
ON THIS PAGE

Recommended Tools

×6

Utilities matched from this issue’s tags and category — try them while you read without losing context.

GitHub issue graph ai analysis

Paste a GitHub issue URL. We fetch that issue, discover linked issues from bodies/comments/timeline, collect linked pull requests, and produce a structured English report.

The report is written in English Markdown for sharing and archival.

Helpful · Quick feedback

Loading…

Error Message

Additional Logs / Traceback (optional)

2026-05-11 18:21:22,833 INFO run_agent: Streaming failed before delivery: Error code: 400 - {'error': {'message': 'Prompt too long: 65798 tokens exceeds max context window of 65536 tokens', 'type': 'invalid_request_error', 'param': None, 'code': None}} 2026-05-11 18:23:36,092 INFO run_agent: Streaming failed before delivery: Error code: 400 - {'error': {'message': 'Prompt too long: 78723 tokens exceeds max context window of 65536 tokens', 'type': 'invalid_request_error', 'param': None, 'code': None}} 2026-05-11 18:23:56,763 INFO run_agent: Streaming failed before delivery: Error code: 400 - {'error': {'message': 'Prompt too long: 78748 tokens exceeds max context window of 65536 tokens', 'type': 'invalid_request_error', 'param': None, 'code': None}} 2026-05-11 18:24:16,535 INFO run_agent: Streaming failed before delivery: Error code: 400 - {'error': {'message': 'Prompt too long: 78786 tokens exceeds max context window of 65536 tokens', 'type': 'invalid_request_error', 'param': None, 'code': None}}

  • Warn that the current session may exceed the new model limit.
  • Avoid treating forge.local and 127.0.0.1 as unrelated context-length identities if they point to the same logical configured provider/model, or at least warn clearly when metadata diverges.

Root Cause

Root Cause Analysis (optional)

Fix Action

Fixed

Code Example

API call #5: model=Qwen3.6-35B-A3B-Uncensored-Heretic-MLX-4bit provider=custom in=65101 out=202 total=65303 latency=7.5s cache=63488/65101 (98%)

---

tool mcp_firecrawl_firecrawl_search completed (7.74s, 279549 chars)
Inline-truncating large tool result: mcp_firecrawl_firecrawl_search (279549 chars, no sandbox write)

---

Prompt too long: 65798 tokens exceeds max context window of 65536 tokens

---

context compression started: session=... messages=41 tokens=~43,227

---

API call #7: model=Qwen3.6-35B-A3B-Uncensored-Heretic-MLX-4bit provider=custom in=64186 out=281 total=64467 latency=34.5s cache=47104/64186 (73%)

---

context compression started: session=... messages=14 tokens=~64,186
context compression done: session=... messages=14->14 tokens=~71,173

---

Prompt too long: 78723 tokens exceeds max context window of 65536 tokens
Prompt too long: 78748 tokens exceeds max context window of 65536 tokens
Prompt too long: 78786 tokens exceeds max context window of 65536 tokens

---

Debug report uploaded:
Report     https://paste.rs/lhQ9N
agent.log  https://dpaste.com/3G94ER5FD

---

Representative excerpts:


2026-05-11 18:21:14,949 INFO [20260511_180601_82f04a] run_agent: API call #5: model=Qwen3.6-35B-A3B-Uncensored-Heretic-MLX-4bit provider=custom in=65101 out=202 total=65303 latency=7.5s cache=63488/65101 (98%)



2026-05-11 18:21:22,698 INFO [20260511_180601_82f04a] tools.tool_result_storage: Inline-truncating large tool result: mcp_firecrawl_firecrawl_search (279549 chars, no sandbox write)



2026-05-11 18:21:22,833 INFO run_agent: Streaming failed before delivery: Error code: 400 - {'error': {'message': 'Prompt too long: 65798 tokens exceeds max context window of 65536 tokens', 'type': 'invalid_request_error', 'param': None, 'code': None}}



2026-05-11 18:21:22,836 INFO [20260511_180601_82f04a] run_agent: context compression started: session=20260511_180601_82f04a messages=41 tokens=~43,227 model=Qwen3.6-35B-A3B-Uncensored-Heretic-MLX-4bit focus=None



2026-05-11 18:23:06,611 INFO [20260511_180601_82f04a] run_agent: context compression started: session=20260511_182146_fdb904 messages=14 tokens=~64,186 model=Qwen3.6-35B-A3B-Uncensored-Heretic-MLX-4bit focus=None
2026-05-11 18:23:35,967 INFO [20260511_180601_82f04a] run_agent: context compression done: session=20260511_182335_35b1a4 messages=14->14 tokens=~71,173



2026-05-11 18:23:36,092 INFO run_agent: Streaming failed before delivery: Error code: 400 - {'error': {'message': 'Prompt too long: 78723 tokens exceeds max context window of 65536 tokens', 'type': 'invalid_request_error', 'param': None, 'code': None}}



2026-05-11 18:23:54,613 INFO [20260511_180601_82f04a] run_agent: context compression done: session=20260511_182354_a4cf52 messages=14->14 tokens=~71,199
2026-05-11 18:23:56,763 INFO run_agent: Streaming failed before delivery: Error code: 400 - {'error': {'message': 'Prompt too long: 78748 tokens exceeds max context window of 65536 tokens', 'type': 'invalid_request_error', 'param': None, 'code': None}}



2026-05-11 18:24:14,390 INFO [20260511_180601_82f04a] run_agent: context compression done: session=20260511_182414_9a04f4 messages=14->14 tokens=~71,245
2026-05-11 18:24:16,535 INFO run_agent: Streaming failed before delivery: Error code: 400 - {'error': {'message': 'Prompt too long: 78786 tokens exceeds max context window of 65536 tokens', 'type': 'invalid_request_error', 'param': None, 'code': None}}


Also relevant: the same model was cached with different context lengths depending on base URL:


Cached context length Qwen3.6-35B-A3B-Uncensored-Heretic-MLX-4bit@http://forge.local:7993/v1 -> 32,768 tokens
Cached context length Qwen3.6-35B-A3B-Uncensored-Heretic-MLX-4bit@http://127.0.0.1:7993/v1 -> 65,536 tokens
RAW_BUFFERClick to expand / collapse

Bug Description

This may be related to:

  • #20608
  • #20724
  • #23563
  • #9627
  • #20367

However, this report documents an additional failure mode where:

  • prompt size estimation diverges significantly from actual provider token count
  • compression can increase effective prompt size
  • and repeated oversized sends occur after switching from a higher-context provider to a lower-context custom local provider.

Hermes appears to send requests that exceed the active model’s context window after switching an already-large session from a higher-context provider to a lower-context local OpenAI-compatible provider.

Observed symptoms:

  • Hermes under-estimates prompt size by a large margin relative to the provider-reported count
  • Hermes attempts oversized requests before safe preflight compression
  • in repeated cases, “context compression” increases the resulting prompt size instead of reducing it
  • Hermes then retries and loops through repeated Prompt too long failures

Representative errors from the logs:

  • Prompt too long: 65798 tokens exceeds max context window of 65536 tokens
  • Prompt too long: 78723 tokens exceeds max context window of 65536 tokens
  • Prompt too long: 78748 tokens exceeds max context window of 65536 tokens
  • Prompt too long: 78786 tokens exceeds max context window of 65536 tokens

What happened?

  • I had an active Hermes session with substantial prior context built under a higher-context model/provider.
  • I switched to a lower-context custom local provider using oMLX at http://127.0.0.1:7993/v1 with model Qwen3.6-35B-A3B-Uncensored-Heretic-MLX-4bit.
  • Hermes continued operating in the same session, performed web/tool calls, then began sending prompts that exceeded the local model’s configured 65536 token limit.
  • Hermes’s own compression/token estimates did not match the provider’s actual rejection counts.
  • In repeated cases, compression appears to have made the effective prompt larger, after which Hermes retried oversized sends.

What I expected instead:

  • Hermes should re-evaluate the active session against the new model’s context limit immediately after a provider/model/base_url switch.
  • Hermes should not send any request larger than the active model’s context window.
  • If over limit, Hermes should compress before sending.
  • If compression fails to reduce size, Hermes should stop and tell the user to reset/start a fresh session.

Steps to Reproduce

  1. Start hermes chat using a higher-context provider/model.
  2. Build up a non-trivial session history and/or large tool results.
  3. Switch the active model/provider to a lower-context custom local OpenAI-compatible endpoint.
  4. In my case the local route was:
    • base URL: http://127.0.0.1:7993/v1
    • model: Qwen3.6-35B-A3B-Uncensored-Heretic-MLX-4bit
    • configured context length: 65536
  5. Run a task that pulls in large web/tool outputs (in my case Firecrawl search + scrape results for Bitcoin news).
  6. Hermes attempts a model call near the context ceiling, then receives a large tool result, then sends an oversized request.
  7. Hermes logs repeated failures like:
    • Prompt too long: 65798 tokens exceeds max context window of 65536 tokens
    • followed by compression passes that report larger resulting prompt sizes
    • followed by repeated oversized retries

Expected Behavior

Hermes should never send a request larger than the active model’s context window.

More specifically:

  • model/provider/base_url switches should invalidate stale context-size assumptions
  • Hermes should do a hard preflight token check before every outbound model request
  • if over limit, Hermes should compress before sending
  • compression should never produce a larger effective prompt than the input
  • if compression does not reduce enough, Hermes should stop and ask for a fresh session or /reset

Actual Behavior

Hermes sent oversized requests to the local provider and retried repeatedly.

Key evidence from the logs:

  1. Hermes was already very near the local model limit before a large tool result:
API call #5: model=Qwen3.6-35B-A3B-Uncensored-Heretic-MLX-4bit provider=custom in=65101 out=202 total=65303 latency=7.5s cache=63488/65101 (98%)
  1. A very large Firecrawl result was then added:
tool mcp_firecrawl_firecrawl_search completed (7.74s, 279549 chars)
Inline-truncating large tool result: mcp_firecrawl_firecrawl_search (279549 chars, no sandbox write)
  1. Hermes immediately sent an oversized request:
Prompt too long: 65798 tokens exceeds max context window of 65536 tokens
  1. Hermes’s own estimate right after that failure was far lower than the provider-reported actual size:
context compression started: session=... messages=41 tokens=~43,227
  1. Later, Hermes got back near the ceiling again:
API call #7: model=Qwen3.6-35B-A3B-Uncensored-Heretic-MLX-4bit provider=custom in=64186 out=281 total=64467 latency=34.5s cache=47104/64186 (73%)
  1. Then compression appears to expand the prompt instead of reducing it:
context compression started: session=... messages=14 tokens=~64,186
context compression done: session=... messages=14->14 tokens=~71,173
  1. Hermes then retries and fails with even larger oversized sends:
Prompt too long: 78723 tokens exceeds max context window of 65536 tokens
Prompt too long: 78748 tokens exceeds max context window of 65536 tokens
Prompt too long: 78786 tokens exceeds max context window of 65536 tokens

This strongly suggests prompt assembly / token accounting / compression safety is wrong in this path.

Primary affected component:

  • context management / context compression / token accounting / prompt assembly

Likely affected subsystems:

  • model switch handling
  • provider metadata / context-length caching
  • large tool result retention / truncation strategy
  • custom local OpenAI-compatible provider routing

Affected Component

Agent Core (conversation loop, context compression, memory)

Messaging Platform (if gateway-related)

N/A (CLI only)

Debug Report

Debug report uploaded:
Report     https://paste.rs/lhQ9N
agent.log  https://dpaste.com/3G94ER5FD

Operating System

macOS 15.7.4

Python Version

3.11.15

Hermes Version

2026.5.7

Additional Logs / Traceback (optional)

Representative excerpts:


2026-05-11 18:21:14,949 INFO [20260511_180601_82f04a] run_agent: API call #5: model=Qwen3.6-35B-A3B-Uncensored-Heretic-MLX-4bit provider=custom in=65101 out=202 total=65303 latency=7.5s cache=63488/65101 (98%)



2026-05-11 18:21:22,698 INFO [20260511_180601_82f04a] tools.tool_result_storage: Inline-truncating large tool result: mcp_firecrawl_firecrawl_search (279549 chars, no sandbox write)



2026-05-11 18:21:22,833 INFO run_agent: Streaming failed before delivery: Error code: 400 - {'error': {'message': 'Prompt too long: 65798 tokens exceeds max context window of 65536 tokens', 'type': 'invalid_request_error', 'param': None, 'code': None}}



2026-05-11 18:21:22,836 INFO [20260511_180601_82f04a] run_agent: context compression started: session=20260511_180601_82f04a messages=41 tokens=~43,227 model=Qwen3.6-35B-A3B-Uncensored-Heretic-MLX-4bit focus=None



2026-05-11 18:23:06,611 INFO [20260511_180601_82f04a] run_agent: context compression started: session=20260511_182146_fdb904 messages=14 tokens=~64,186 model=Qwen3.6-35B-A3B-Uncensored-Heretic-MLX-4bit focus=None
2026-05-11 18:23:35,967 INFO [20260511_180601_82f04a] run_agent: context compression done: session=20260511_182335_35b1a4 messages=14->14 tokens=~71,173



2026-05-11 18:23:36,092 INFO run_agent: Streaming failed before delivery: Error code: 400 - {'error': {'message': 'Prompt too long: 78723 tokens exceeds max context window of 65536 tokens', 'type': 'invalid_request_error', 'param': None, 'code': None}}



2026-05-11 18:23:54,613 INFO [20260511_180601_82f04a] run_agent: context compression done: session=20260511_182354_a4cf52 messages=14->14 tokens=~71,199
2026-05-11 18:23:56,763 INFO run_agent: Streaming failed before delivery: Error code: 400 - {'error': {'message': 'Prompt too long: 78748 tokens exceeds max context window of 65536 tokens', 'type': 'invalid_request_error', 'param': None, 'code': None}}



2026-05-11 18:24:14,390 INFO [20260511_180601_82f04a] run_agent: context compression done: session=20260511_182414_9a04f4 messages=14->14 tokens=~71,245
2026-05-11 18:24:16,535 INFO run_agent: Streaming failed before delivery: Error code: 400 - {'error': {'message': 'Prompt too long: 78786 tokens exceeds max context window of 65536 tokens', 'type': 'invalid_request_error', 'param': None, 'code': None}}


Also relevant: the same model was cached with different context lengths depending on base URL:


Cached context length Qwen3.6-35B-A3B-Uncensored-Heretic-MLX-4bit@http://forge.local:7993/v1 -> 32,768 tokens
Cached context length Qwen3.6-35B-A3B-Uncensored-Heretic-MLX-4bit@http://127.0.0.1:7993/v1 -> 65,536 tokens

Root Cause Analysis (optional)

The logs suggest a Hermes-side failure in one or more of these areas:

  1. Prompt-size estimation diverges materially from actual provider token count.

    • Example: provider rejects at 65,798 tokens while Hermes begins compression from an estimate of ~43,227.
  2. Compression is not guaranteed before an oversized send.

    • Hermes appears to attempt the request first, then compress only after the provider rejects it.
  3. Compression/rebuild can increase effective prompt size.

    • Example: ~64,186 before compression becomes ~71,173 after compression.
    • This suggests summary insertion, preserved-tail logic, tool-result retention, or prompt rebuild duplication may be expanding the prompt.
  4. Model/provider/base_url switching may leave stale assumptions in place.

    • A large session built under a higher-context provider is carried into a lower-context local provider.
    • Context-length metadata also appears sensitive to exact base_url, which may fragment cache identity for logically identical local routes.

Proposed Fix (optional)

  1. Add a hard preflight token guard before every outbound model request.

    • If the assembled request exceeds the active model context window, compress before sending.
    • Never send an already-over-budget request.
  2. Invalidate context-size assumptions on provider/model/base_url switch.

    • Recompute active limits, token budgets, and compression thresholds immediately.
  3. Enforce a compression safety invariant.

    • If “compressed” prompt size is greater than or equal to the original, treat compression as failed and stop instead of retrying.
  4. Be more aggressive with large tool outputs on lower-context models.

    • Offload, summarize, or reference tool results compactly rather than retaining large inline truncated payloads.
  5. Add a user-facing guardrail when switching from a high-context provider to a substantially lower-context provider.

    • Warn that the current session may exceed the new model limit.
    • Recommend /reset or starting a fresh session.
  6. Revisit metadata/cache identity for local providers.

    • Avoid treating forge.local and 127.0.0.1 as unrelated context-length identities if they point to the same logical configured provider/model, or at least warn clearly when metadata diverges.

Are you willing to submit a PR for this?

  • I'd like to fix this myself and submit a PR

Vote matrix · Quick signals

Works
Did the solution work? Tap to confirm.
Easy Fix
Was it a quick fix?
Time Saver
Did it save you time?
Blocking
Was it severely blocking?
Common Issue
Are others likely hitting this too?
Flaky / Intermittent
Is it intermittent?
Verified / Reproducible
Can you reproduce it reliably?
Loading…

Still need to ship something?

×6

Another batch ranked right after the header list — different links, same matching logic.

Back to top recommendations

TRENDING