hermes - 💡(How to fix) Fix [Bug]: Hermes sends oversized prompts after switching to lower-context local model; token estimation undercounts and compression can increase prompt size [3 pull requests]

StepCodex · 2026-05-11T12:45:31Z




Fix Action

Fixed

Fixed by PR: fix(agent): stop retrying regressive compression (https://github.com/NousResearch/hermes-agent/pull/23801)
Fixed by PR: fix(tool_result_storage): spill to local fs when no sandbox env (#23767) (https://github.com/NousResearch/hermes-agent/pull/23914)
Fixed by PR: feat(hooks): spill oversized hook-injected context to disk (https://github.com/NousResearch/hermes-agent/pull/20468)


Bug Description

This may be related to:

- #20608
- #20724
- #23563
- #9627
- #20367

However, this report documents an additional failure mode where:

- prompt size estimation diverges significantly from the actual provider token count
- compression can increase the effective prompt size
- repeated oversized sends occur after switching from a higher-context provider to a lower-context custom local provider

Hermes appears to send requests that exceed the active model’s context window after switching an already-large session from a higher-context provider to a lower-context local OpenAI-compatible provider.

Observed symptoms:

- Hermes under-estimates prompt size by a large margin relative to the provider-reported count
- Hermes attempts oversized requests before safe preflight compression
- in repeated cases, “context compression” increases the resulting prompt size instead of reducing it
- Hermes then retries and loops through repeated `Prompt too long` failures

Representative errors from the logs:

- `Prompt too long: 65798 tokens exceeds max context window of 65536 tokens`
- `Prompt too long: 78723 tokens exceeds max context window of 65536 tokens`
- `Prompt too long: 78748 tokens exceeds max context window of 65536 tokens`
- `Prompt too long: 78786 tokens exceeds max context window of 65536 tokens`
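
The rejection messages above share a fixed shape, so the provider-reported count and limit can be read back programmatically instead of relying on a local estimate. A minimal sketch, assuming the message text matches the examples in this report exactly; `parse_prompt_too_long` and `Overflow` are hypothetical names for illustration, not Hermes functions, and other providers may word the rejection differently.

```python
import re
from typing import NamedTuple, Optional

# Pattern copied from the error strings quoted in this report.
_PROMPT_TOO_LONG = re.compile(
    r"Prompt too long: (\d+) tokens exceeds max context window of (\d+) tokens"
)


class Overflow(NamedTuple):
    actual: int  # token count reported by the provider
    limit: int   # context window reported by the provider

    @property
    def excess(self) -> int:
        """Tokens that must be removed before the request can fit."""
        return self.actual - self.limit


def parse_prompt_too_long(message: str) -> Optional[Overflow]:
    """Return the provider-reported counts, or None if the text does not match."""
    match = _PROMPT_TOO_LONG.search(message)
    if match is None:
        return None
    return Overflow(actual=int(match.group(1)), limit=int(match.group(2)))


if __name__ == "__main__":
    err = "Prompt too long: 78723 tokens exceeds max context window of 65536 tokens"
    print(parse_prompt_too_long(err))
```

A caller could compare `excess` with the size reduction a compression pass claims to have achieved before deciding whether a retry is worthwhile.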

What happened?

- I had an active Hermes session with substantial prior context built under a higher-context model/provider.
- I switched to a lower-context custom local provider using oMLX at `http://127.0.0.1:7993/v1` with model `Qwen3.6-35B-A3B-Uncensored-Heretic-MLX-4bit`.
- Hermes continued operating in the same session, performed web/tool calls, then began sending prompts that exceeded the local model’s configured `65536` token limit.
- Hermes’s own compression/token estimates did not match the provider’s actual rejection counts.
- In repeated cases, compression appears to have made the effective prompt larger, after which Hermes retried oversized sends.

What I expected instead:

- Hermes should re-evaluate the active session against the new model’s context limit immediately after a provider/model/base_url switch.
- Hermes should not send any request larger than the active model’s context window.
- If over limit, Hermes should compress before sending.
- If compression fails to reduce size, Hermes should stop and tell the user to reset/start a fresh session.

Steps to Reproduce

1. Start `hermes chat` using a higher-context provider/model.
2. Build up a non-trivial session history and/or large tool results.
3. Switch the active model/provider to a lower-context custom local OpenAI-compatible endpoint.
4. In my case the local route was:
   - base URL: `http://127.0.0.1:7993/v1`
   - model: `Qwen3.6-35B-A3B-Uncensored-Heretic-MLX-4bit`
   - configured context length: `65536`
5. Run a task that pulls in large web/tool outputs (in my case Firecrawl search + scrape results for Bitcoin news).
6. Hermes attempts a model call near the context ceiling, then receives a large tool result, then sends an oversized request.
7. Hermes logs repeated failures like:
   - `Prompt too long: 65798 tokens exceeds max context window of 65536 tokens`
   - followed by compression passes that report larger resulting prompt sizes
   - followed by repeated oversized retries

Expected Behavior

Hermes should never send a request larger than the active model’s context window.

More specifically:

- model/provider/base_url switches should invalidate stale context-size assumptions
- Hermes should do a hard preflight token check before every outbound model request
- if over limit, Hermes should compress before sending
- compression should never produce a larger effective prompt than the input
- if compression does not reduce enough, Hermes should stop and ask for a fresh session or `/reset`
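
The behavior listed above can be sketched as a single guard that runs before every send. This is an illustration under stated assumptions, not Hermes's implementation: `preflight`, `ContextBudgetError`, and the `count_tokens` / `compress` callables are hypothetical stand-ins for whatever tokenizer and compressor the agent actually uses.

```python
from typing import Callable, List

Messages = List[dict]


class ContextBudgetError(RuntimeError):
    """Raised instead of sending a request that cannot fit the context window."""


def preflight(
    messages: Messages,
    count_tokens: Callable[[Messages], int],
    compress: Callable[[Messages], Messages],
    context_limit: int,
    reserve_for_output: int = 0,
) -> Messages:
    """Return messages that fit the budget, compressing once if needed."""
    budget = context_limit - reserve_for_output
    size = count_tokens(messages)
    if size <= budget:
        return messages  # already fits, send unchanged

    compressed = compress(messages)
    new_size = count_tokens(compressed)
    if new_size >= size:
        # Safety invariant: a pass that does not shrink the prompt has failed.
        raise ContextBudgetError(
            f"compression did not reduce the prompt ({size} -> {new_size} tokens)"
        )
    if new_size > budget:
        raise ContextBudgetError(
            f"prompt still over budget after compression ({new_size} > {budget}); "
            "start a fresh session or use /reset"
        )
    return compressed


if __name__ == "__main__":
    # Toy counter for the demo: one character counts as one token.
    count = lambda msgs: sum(len(m["content"]) for m in msgs)
    history = [{"content": "a" * 60}, {"content": "b" * 30}]
    kept = preflight(history, count, lambda msgs: msgs[-1:], context_limit=50)
    print(len(kept), "message(s) kept")
```

Raising rather than retrying is the point of the design: the caller sees one clear failure instead of a loop of rejected requests.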

Actual Behavior

Hermes sent oversized requests to the local provider and retried repeatedly.

Key evidence from the logs:

1. Hermes was already very near the local model limit before a large tool result:

       API call #5: model=Qwen3.6-35B-A3B-Uncensored-Heretic-MLX-4bit provider=custom in=65101 out=202 total=65303 latency=7.5s cache=63488/65101 (98%)

2. A very large Firecrawl result was then added:

       tool mcp_firecrawl_firecrawl_search completed (7.74s, 279549 chars)
       Inline-truncating large tool result: mcp_firecrawl_firecrawl_search (279549 chars, no sandbox write)

3. Hermes immediately sent an oversized request:

       Prompt too long: 65798 tokens exceeds max context window of 65536 tokens

4. Hermes’s own estimate right after that failure was far lower than the provider-reported actual size:

       context compression started: session=... messages=41 tokens=~43,227

5. Later, Hermes got back near the ceiling again:

       API call #7: model=Qwen3.6-35B-A3B-Uncensored-Heretic-MLX-4bit provider=custom in=64186 out=281 total=64467 latency=34.5s cache=47104/64186 (73%)

6. Then compression appears to expand the prompt instead of reducing it:

       context compression started: session=... messages=14 tokens=~64,186
       context compression done: session=... messages=14->14 tokens=~71,173

7. Hermes then retries and fails with even larger oversized sends:

       Prompt too long: 78723 tokens exceeds max context window of 65536 tokens
       Prompt too long: 78748 tokens exceeds max context window of 65536 tokens
       Prompt too long: 78786 tokens exceeds max context window of 65536 tokens

This strongly suggests prompt assembly / token accounting / compression safety is wrong in this path.
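
To put numbers on the evidence, the gaps can be worked out directly from the figures in the log excerpts. The helper names below are hypothetical; the inputs are the counts quoted in this report.

```python
LIMIT = 65_536  # configured context window from the report


def undercount(estimate: int, actual: int) -> float:
    """Fraction of the provider-reported tokens that the estimate missed."""
    return (actual - estimate) / actual


def growth(before: int, after: int) -> float:
    """Relative size change across a compression pass (positive means it grew)."""
    return (after - before) / before


# First rejection: provider counted 65,798; Hermes logged ~43,227.
first_excess = 65_798 - LIMIT            # 262 tokens over the window
first_gap = undercount(43_227, 65_798)   # about 0.343

# Compression pass: ~64,186 before, ~71,173 after (Hermes's own estimates).
pass_growth = growth(64_186, 71_173)     # about 0.109

# Retry after that pass: provider counted 78,723 against an estimate of ~71,173.
retry_excess = 78_723 - LIMIT            # 13,187 tokens over the window
retry_gap = undercount(71_173, 78_723)   # about 0.096

if __name__ == "__main__":
    print(f"first send: {first_excess} over, estimate missed {first_gap:.1%}")
    print(f"compression pass grew the estimate by {pass_growth:.1%}")
    print(f"retry: {retry_excess} over, estimate missed {retry_gap:.1%}")
```

On these figures the estimate missed roughly a third of the tokens on the first send, and the compression pass grew the estimated prompt by about 11% before the retries.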

Primary affected component:

- context management / context compression / token accounting / prompt assembly

Likely affected subsystems:

- model switch handling
- provider metadata / context-length caching
- large tool result retention / truncation strategy
- custom local OpenAI-compatible provider routing

Affected Component

Agent Core (conversation loop, context compression, memory)

Messaging Platform (if gateway-related)

N/A (CLI only)

Debug Report

Debug report uploaded:
Report     https://paste.rs/lhQ9N
agent.log  https://dpaste.com/3G94ER5FD

Operating System

macOS 15.7.4

Python Version

3.11.15

Hermes Version

2026.5.7

Additional Logs / Traceback (optional)

Representative excerpts:

2026-05-11 18:21:14,949 INFO [20260511_180601_82f04a] run_agent: API call #5: model=Qwen3.6-35B-A3B-Uncensored-Heretic-MLX-4bit provider=custom in=65101 out=202 total=65303 latency=7.5s cache=63488/65101 (98%)

2026-05-11 18:21:22,698 INFO [20260511_180601_82f04a] tools.tool_result_storage: Inline-truncating large tool result: mcp_firecrawl_firecrawl_search (279549 chars, no sandbox write)

2026-05-11 18:21:22,833 INFO run_agent: Streaming failed before delivery: Error code: 400 - {'error': {'message': 'Prompt too long: 65798 tokens exceeds max context window of 65536 tokens', 'type': 'invalid_request_error', 'param': None, 'code': None}}

2026-05-11 18:21:22,836 INFO [20260511_180601_82f04a] run_agent: context compression started: session=20260511_180601_82f04a messages=41 tokens=~43,227 model=Qwen3.6-35B-A3B-Uncensored-Heretic-MLX-4bit focus=None

2026-05-11 18:23:06,611 INFO [20260511_180601_82f04a] run_agent: context compression started: session=20260511_182146_fdb904 messages=14 tokens=~64,186 model=Qwen3.6-35B-A3B-Uncensored-Heretic-MLX-4bit focus=None
2026-05-11 18:23:35,967 INFO [20260511_180601_82f04a] run_agent: context compression done: session=20260511_182335_35b1a4 messages=14->14 tokens=~71,173

2026-05-11 18:23:36,092 INFO run_agent: Streaming failed before delivery: Error code: 400 - {'error': {'message': 'Prompt too long: 78723 tokens exceeds max context window of 65536 tokens', 'type': 'invalid_request_error', 'param': None, 'code': None}}

2026-05-11 18:23:54,613 INFO [20260511_180601_82f04a] run_agent: context compression done: session=20260511_182354_a4cf52 messages=14->14 tokens=~71,199
2026-05-11 18:23:56,763 INFO run_agent: Streaming failed before delivery: Error code: 400 - {'error': {'message': 'Prompt too long: 78748 tokens exceeds max context window of 65536 tokens', 'type': 'invalid_request_error', 'param': None, 'code': None}}

2026-05-11 18:24:14,390 INFO [20260511_180601_82f04a] run_agent: context compression done: session=20260511_182414_9a04f4 messages=14->14 tokens=~71,245
2026-05-11 18:24:16,535 INFO run_agent: Streaming failed before delivery: Error code: 400 - {'error': {'message': 'Prompt too long: 78786 tokens exceeds max context window of 65536 tokens', 'type': 'invalid_request_error', 'param': None, 'code': None}}

Also relevant: the same model was cached with different context lengths depending on base URL:

Cached context length Qwen3.6-35B-A3B-Uncensored-Heretic-MLX-4bit@http://forge.local:7993/v1 -> 32,768 tokens
Cached context length Qwen3.6-35B-A3B-Uncensored-Heretic-MLX-4bit@http://127.0.0.1:7993/v1 -> 65,536 tokens
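
The two cache entries above can be detected mechanically. The sketch below parses lines of that shape and reports models whose cached context length differs by base URL. It assumes the log format shown here and uses hypothetical helper names; it does not try to decide whether two URLs point at the same server, it only flags the divergence so a warning can be raised.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable

# Format taken from the two cache lines quoted in this report.
_CACHE_LINE = re.compile(
    r"Cached context length (?P<model>[^\s@]+)@(?P<url>\S+) -> (?P<tokens>[\d,]+) tokens"
)


def context_lengths_by_model(lines: Iterable[str]) -> Dict[str, Dict[str, int]]:
    """Map model name -> {base URL: cached context length}."""
    found: Dict[str, Dict[str, int]] = defaultdict(dict)
    for line in lines:
        match = _CACHE_LINE.search(line)
        if match:
            tokens = int(match["tokens"].replace(",", ""))
            found[match["model"]][match["url"]] = tokens
    return dict(found)


def divergent_models(lengths: Dict[str, Dict[str, int]]) -> Dict[str, Dict[str, int]]:
    """Keep only models whose cached length differs between base URLs."""
    return {
        model: per_url
        for model, per_url in lengths.items()
        if len(set(per_url.values())) > 1
    }


if __name__ == "__main__":
    log = [
        "Cached context length Qwen3.6-35B-A3B-Uncensored-Heretic-MLX-4bit@http://forge.local:7993/v1 -> 32,768 tokens",
        "Cached context length Qwen3.6-35B-A3B-Uncensored-Heretic-MLX-4bit@http://127.0.0.1:7993/v1 -> 65,536 tokens",
    ]
    for model, per_url in divergent_models(context_lengths_by_model(log)).items():
        print(f"warning: {model} has diverging context lengths: {per_url}")
```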

Root Cause Analysis (optional)

The logs suggest a Hermes-side failure in one or more of these areas:

1. Prompt-size estimation diverges materially from actual provider token count.
   - Example: provider rejects at 65,798 tokens while Hermes begins compression from an estimate of ~43,227.
2. Compression is not guaranteed before an oversized send.
   - Hermes appears to attempt the request first, then compress only after the provider rejects it.
3. Compression/rebuild can increase effective prompt size.
   - Example: ~64,186 before compression becomes ~71,173 after compression.
   - This suggests summary insertion, preserved-tail logic, tool-result retention, or prompt rebuild duplication may be expanding the prompt.
4. Model/provider/base_url switching may leave stale assumptions in place.
   - A large session built under a higher-context provider is carried into a lower-context local provider.
   - Context-length metadata also appears sensitive to exact base_url, which may fragment cache identity for logically identical local routes.
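
One way to narrow the estimation gap described in the first item is to calibrate the local estimate against counts the provider has already reported. The sketch below is an assumption-laden illustration, not a description of how Hermes estimates tokens: `TokenCalibrator` is a hypothetical class, and keeping the largest ratio seen so far is a deliberately conservative choice.

```python
import math


class TokenCalibrator:
    """Scale local token estimates by the worst under-count observed so far."""

    def __init__(self, safety_margin: float = 1.05) -> None:
        self.ratio = 1.0                    # provider count / local estimate
        self.safety_margin = safety_margin  # extra headroom on top of the ratio

    def observe(self, estimate: int, actual: int) -> None:
        """Record a pair of (local estimate, provider-reported count)."""
        if estimate > 0:
            # Only ever move the ratio up, so a lucky estimate cannot relax it.
            self.ratio = max(self.ratio, actual / estimate)

    def adjusted(self, estimate: int) -> int:
        """Return a pessimistic token count for budgeting decisions."""
        return math.ceil(estimate * self.ratio * self.safety_margin)


if __name__ == "__main__":
    calibrator = TokenCalibrator()
    # Figures from this report: estimate ~43,227 vs provider count 65,798.
    calibrator.observe(estimate=43_227, actual=65_798)
    print(calibrator.adjusted(43_227) > 65_536)  # would have tripped a 65,536 limit
```

The ratio should be reset whenever the model or provider changes, since different tokenizers produce different counts for the same text.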

Proposed Fix (optional)

1. Add a hard preflight token guard before every outbound model request.
   - If the assembled request exceeds the active model context window, compress before sending.
   - Never send an already-over-budget request.
2. Invalidate context-size assumptions on provider/model/base_url switch.
   - Recompute active limits, token budgets, and compression thresholds immediately.
3. Enforce a compression safety invariant.
   - If “compressed” prompt size is greater than or equal to the original, treat compression as failed and stop instead of retrying.
4. Be more aggressive with large tool outputs on lower-context models.
   - Offload, summarize, or reference tool results compactly rather than retaining large inline truncated payloads.
5. Add a user-facing guardrail when switching from a high-context provider to a substantially lower-context provider.
   - Warn that the current session may exceed the new model limit.
   - Recommend `/reset` or starting a fresh session.
6. Revisit metadata/cache identity for local providers.
   - Avoid treating `forge.local` and `127.0.0.1` as unrelated context-length identities if they point to the same logical configured provider/model, or at least warn clearly when metadata diverges.
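
For the fourth item, offloading can be as simple as writing the full tool output to a local file and keeping only a short reference in the conversation. The following is a minimal sketch under assumptions: `spill_if_large`, its thresholds, and the stub wording are hypothetical and are not taken from the linked pull requests.

```python
import os
import tempfile
from typing import Optional


def spill_if_large(
    tool_name: str,
    text: str,
    max_inline_chars: int = 8000,
    preview_chars: int = 500,
    spill_dir: Optional[str] = None,
) -> str:
    """Return text unchanged if small, else write it to disk and return a stub."""
    if len(text) <= max_inline_chars:
        return text

    # mkstemp creates the file securely and returns an open descriptor.
    fd, path = tempfile.mkstemp(
        prefix=f"{tool_name}_", suffix=".txt", dir=spill_dir, text=True
    )
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(text)

    return (
        f"[{tool_name}: {len(text)} chars written to {path}; "
        f"first {preview_chars} chars follow]\n{text[:preview_chars]}"
    )


if __name__ == "__main__":
    # Same size as the Firecrawl result in this report.
    stub = spill_if_large("mcp_firecrawl_firecrawl_search", "x" * 279_549)
    print(len(stub), "chars kept inline")
```

The stub keeps the prompt cost bounded by `preview_chars` plus a short header, regardless of how large the tool output was.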

Are you willing to submit a PR for this?

I'd like to fix this myself and submit a PR
