hermes - ✅(Solved) Fix [Bug]: Anthropic max_tokens fallback only fires for OpenRouter/Nous — silently skipped for Bedrock, NVIDIA, LiteLLM, and every other chat-completions proxy [1 pull requests, 1 comments, 2 participants]

GitHub stats
NousResearch/hermes-agent#12790 · Fetched 2026-04-20 12:16:57
Comments: 1 · Participants: 2 · Timeline events: 4 · Reactions: 0
Timeline (top): closed ×1, commented ×1, cross-referenced ×1, referenced ×1

PR fix notes

PR #12811: fix: Anthropic max_tokens + Feishu channel_prompt + Discord free_response

Description (problem / solution / changelog)

Fixes #12790: Anthropic max_tokens fallback for all proxies

The max_tokens fallback in _build_api_kwargs() only fired when the base URL was OpenRouter or Nous Portal. Any other chat-completions proxy (AWS Bedrock, NVIDIA, LiteLLM, vLLM, corporate gateways) serving Claude models would silently bypass the fallback, shipping requests without max_tokens set. Proxies like AWS Bedrock default to 4096 output tokens, which easily exhausts on thinking tokens + large tool calls like write_file or patch.

Fix: Changed the condition from URL-gated to model-gated. If the model is in _ANTHROPIC_OUTPUT_LIMITS (Claude, MiniMax), max_tokens is always set regardless of which proxy serves it.

Fixes #12805: Feishu adapter missing channel_prompt resolution

The Feishu platform adapter did not implement per-channel prompt resolution (_resolve_channel_prompt), unlike Discord and Slack which both support this feature. This means channel_prompts config in config.yaml was silently ignored for Feishu.

Fix: Added _resolve_channel_prompt() method matching the Discord/Slack pattern. Passes channel_prompt to all three MessageEvent construction sites (main message, reaction routing, card action routing).
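
A minimal sketch of what per-channel prompt resolution could look like. The `MessageEvent` stand-in, the `resolve_channel_prompt` and `build_event` helpers, and the `oc_...` chat IDs are hypothetical and only illustrate the pattern described above; the actual Feishu adapter code may differ.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class MessageEvent:
    """Hypothetical stand-in for the gateway's event type."""
    text: str
    chat_id: str
    channel_prompt: Optional[str] = None


def resolve_channel_prompt(channel_prompts: Mapping[str, str], chat_id: str) -> Optional[str]:
    """Return the prompt configured for chat_id, or None if none is set."""
    prompt = channel_prompts.get(str(chat_id))
    if prompt is None:
        return None
    prompt = prompt.strip()
    # Treat a blank entry the same as a missing one.
    return prompt or None


def build_event(text: str, chat_id: str, channel_prompts: Mapping[str, str]) -> MessageEvent:
    """Every construction site passes the resolved prompt along."""
    return MessageEvent(
        text=text,
        chat_id=chat_id,
        channel_prompt=resolve_channel_prompt(channel_prompts, chat_id),
    )
```

Routing all three construction sites through one helper such as `build_event` would keep the main message, reaction, and card action paths from drifting apart.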

Fixes #12750: Discord free_response_channels not working

The on_message handler checked DISCORD_IGNORE_NO_MENTION (default: true) before _handle_message was called. When a message had no @mention (human or bot), it returned early — never reaching the free_response_channels logic inside _handle_message. This meant free response channels could never receive unmentioned messages.

Fix: In the DISCORD_IGNORE_NO_MENTION gate at on_message, check the channel against DISCORD_FREE_RESPONSE_CHANNELS before returning early. Configured channels are exempted from the ignore-no-mention filter.
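
The gate described above can be sketched as a pure function. This is an illustration, not the adapter's code: `parse_channel_ids` and `should_drop_unmentioned` are hypothetical names, and the sketch assumes DISCORD_FREE_RESPONSE_CHANNELS holds a comma-separated list of channel IDs.

```python
from typing import Set


def parse_channel_ids(raw: str) -> Set[str]:
    """Split a comma-separated setting into a set of channel IDs (assumed format)."""
    return {part.strip() for part in raw.split(",") if part.strip()}


def should_drop_unmentioned(
    *,
    mentioned: bool,
    channel_id: str,
    ignore_no_mention: bool,
    free_response_channels: Set[str],
) -> bool:
    """Return True when on_message should return early without handling the message."""
    if mentioned:
        return False  # mentions always pass through
    if not ignore_no_mention:
        return False  # the filter is switched off
    # The fix: configured free-response channels are exempt from the filter.
    return channel_id not in free_response_channels
```

Before the fix, the last line was effectively `return True`, so an unmentioned message never reached the free-response logic in `_handle_message`.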


Discussed in: #12790, #12805, #12750

Changed files

  • gateway/platforms/discord.py (modified, +8/-2)
  • gateway/platforms/feishu.py (modified, +8/-0)
  • run_agent.py (modified, +12/-11)


Bug Description

AIAgent._build_api_kwargs() in run_agent.py only injects an explicit max_tokens for Anthropic-compatible models (Claude, MiniMax) when the request is served through OpenRouter or Nous Portal. Every other chat-completions proxy silently bypasses the fallback, shipping requests without max_tokens set, and lets the upstream proxy pick its own default.

AWS Bedrock historically defaults to 4096 output tokens. With a Claude model, reasoning tokens plus a single large tool call (e.g. write_file with a full file, patch with a multi-hunk diff) easily exceed that. The request comes back with finish_reason='length' → 3 continuation retries → rollback → the user sees:

⚠️ Response truncated due to output length limit

…on what should have been a simple response. Same class of problem applies to any non-OpenRouter / non-Nous proxy that serves Claude or other Anthropic-compatible models: NVIDIA inference API, self-hosted LiteLLM, vLLM, corporate gateways, etc.
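
The failure chain above can be modelled in a few lines. This is a toy simulation, not Hermes code: `run_turn`, its parameters, and the 6,000 and 32,000 token figures are illustrative assumptions. It also assumes every continuation retry runs into the same cap, since the oversized tool call has to be emitted again.

```python
from typing import Tuple


def run_turn(needed_tokens: int, output_cap: int, max_continuations: int = 3) -> Tuple[str, int]:
    """Simulate one agent turn against a proxy that caps output at output_cap tokens."""
    retries = 0
    while True:
        # The proxy truncates whenever the response would exceed its cap.
        finish_reason = "length" if needed_tokens > output_cap else "stop"
        if finish_reason == "stop":
            return "ok", retries
        if retries == max_continuations:
            # All continuation retries are used up: roll the turn back.
            return "rolled_back", retries
        retries += 1


# Proxy default of 4096 tokens: three retries, then a rollback.
print(run_turn(needed_tokens=6000, output_cap=4096))
# Explicit max_tokens well above the need: succeeds on the first attempt.
print(run_turn(needed_tokens=6000, output_cap=32000))
```

Retrying cannot help in this model because the cap, not the content, is the limiting factor. That is why the fix targets the request parameters rather than the retry loop.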

Steps to Reproduce

  1. Configure Hermes with a chat-completions provider whose base_url is not openrouter.ai and not nousresearch.com, serving an Anthropic-compatible model. Concrete example (my setup):

    # ~/.hermes/config.yaml
    model:
      model: aws/anthropic/bedrock-claude-opus-4-7
      provider: custom  # base_url: NVIDIA inference API (Bedrock-backed)

  2. Ask the agent to do anything that requires a large tool call, e.g. “write a 500-line Python script to do X” or “patch this 200-line function”.

  3. Observe repeated ⚠️ Response truncated due to output length limit warnings on what should be straightforward tool calls, followed by retries and rollbacks.

Expected Behavior

If the model is Anthropic-compatible (lookup in _ANTHROPIC_OUTPUT_LIMITS), Hermes should send its documented output limit as max_tokens regardless of which proxy URL is serving it. The proxy choice is orthogonal to the upstream API contract — the Messages API requires max_tokens as a mandatory field, period.

Actual Behavior

Hermes only injects max_tokens when _is_openrouter_url() or "nousresearch" in _base_url_lower. For any other proxy, max_tokens is left unset → proxy picks a default (Bedrock: 4096) → truncation and retry storm.

Relevant code (current main):

# run_agent.py, inside _build_api_kwargs
elif (self._is_openrouter_url() or "nousresearch" in self._base_url_lower) \
        and "claude" in (self.model or "").lower():
    # inject max_tokens...

The base_url gate silently excludes every other Anthropic-compat proxy. The "claude" in model substring also misses non-Claude Anthropic-compatible models (e.g. MiniMax) even when they're served through OpenRouter.
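
Both exclusions are easy to see once the condition is pulled out into a standalone function. The sketch below is an approximation for illustration: `old_gate` is a hypothetical name, the `"openrouter.ai"` substring check stands in for `_is_openrouter_url()` (whose real implementation is not shown in this issue), and the URLs and the MiniMax model name are made up.

```python
def old_gate(base_url: str, model: str) -> bool:
    """Approximation of the pre-fix condition guarding the max_tokens fallback."""
    url = (base_url or "").lower()
    is_openrouter = "openrouter.ai" in url  # assumed stand-in for _is_openrouter_url()
    return (is_openrouter or "nousresearch" in url) and "claude" in (model or "").lower()


cases = [
    ("https://openrouter.ai/api/v1", "anthropic/claude-opus-4"),                 # passes
    ("https://proxy.example.com/v1", "aws/anthropic/bedrock-claude-opus-4-7"),   # blocked by the URL gate
    ("https://openrouter.ai/api/v1", "minimax/example-model"),                   # blocked by the substring gate
]
for base_url, model in cases:
    print(f"{base_url:32} {model:40} fallback={old_gate(base_url, model)}")
```

Only the first case receives the fallback. The second is the scenario reported here, and the third shows that a non-Claude Anthropic-compatible model is skipped even on OpenRouter.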

Affected Component

  • Agent Core (conversation loop, build_api_kwargs)

Scope of impact

Anyone serving Anthropic-compatible models through a non-OpenRouter / non-Nous proxy:

  • AWS Bedrock (Claude family) — most common trigger
  • NVIDIA inference API with Anthropic models
  • Self-hosted LiteLLM / vLLM routing to Anthropic
  • Corporate gateways / reverse proxies
  • Any custom_providers entry hitting an Anthropic-compatible endpoint

Hermes' native Anthropic path (api_mode='anthropic_messages') is unaffected — it doesn't reach this code branch.

Proposed Fix

Gate on the model instead of the base_url. If the model matches a known entry in _ANTHROPIC_OUTPUT_LIMITS, inject max_tokens regardless of the proxy URL. This is table-driven, not substring-driven, so it correctly covers non-Claude Anthropic-compatible models without hardcoding family names.

PR with the fix + regression tests: #12767 (fix: apply Anthropic max_tokens fallback to all chat_completions proxies).

The PR adds:

  • _matches_anthropic_compatible_model() helper (proxy-agnostic lookup)
  • 5 new TestBuildApiKwargs test cases guarding:
    • existing OpenRouter+Claude behaviour (regression guard)
    • custom proxy + Claude (NVIDIA Bedrock scenario)
    • custom proxy + MiniMax (non-Claude Anthropic-compat)
    • unknown models must NOT receive the fallback (negative test)
    • meta-test: the condition must not hardcode the "claude" substring
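
A rough sketch of how the first four cases might be expressed follows. It does not reproduce the PR's tests: `OUTPUT_LIMITS`, `fallback_max_tokens`, and `build_kwargs` are hypothetical stand-ins, the limit values and URLs are made up, and the meta-test is left out.

```python
from typing import Dict, Optional

# Hypothetical stand-in table; the real _ANTHROPIC_OUTPUT_LIMITS keys and values may differ.
OUTPUT_LIMITS: Dict[str, int] = {"claude-opus-4": 32_000, "minimax": 16_000}


def fallback_max_tokens(model: Optional[str]) -> Optional[int]:
    """Table-driven lookup: the longest key contained in the model name wins."""
    name = (model or "").lower()
    matches = [key for key in OUTPUT_LIMITS if key in name]
    return OUTPUT_LIMITS[max(matches, key=len)] if matches else None


def build_kwargs(model: str, base_url: str) -> dict:
    """base_url is accepted but deliberately ignored: the gate is on the model only."""
    kwargs = {"model": model}
    limit = fallback_max_tokens(model)
    if limit is not None:
        kwargs["max_tokens"] = limit
    return kwargs


def test_openrouter_claude_regression_guard():
    assert build_kwargs("anthropic/claude-opus-4", "https://openrouter.ai/api/v1")["max_tokens"] == 32_000


def test_custom_proxy_claude():
    kwargs = build_kwargs("aws/anthropic/bedrock-claude-opus-4-7", "https://proxy.example.com/v1")
    assert kwargs["max_tokens"] == 32_000


def test_custom_proxy_minimax():
    assert build_kwargs("minimax/example-model", "https://proxy.example.com/v1")["max_tokens"] == 16_000


def test_unknown_model_gets_no_fallback():
    assert "max_tokens" not in build_kwargs("example/unknown-model", "https://proxy.example.com/v1")
```

The negative test matters as much as the positive ones: sending a large `max_tokens` to a model with a smaller limit could turn a silent truncation into a rejected request.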

Environment

  • Hermes main model: aws/anthropic/bedrock-claude-opus-4-7 via custom provider (NVIDIA inference API / Bedrock-backed)
  • Observed in both CLI and Feishu gateway sessions
  • Python 3.11, Linux

Extent analysis

TL;DR

The issue can be fixed by modifying the _build_api_kwargs method to inject max_tokens based on the model instead of the proxy URL.

Guidance

  • Identify the models that are Anthropic-compatible by checking the _ANTHROPIC_OUTPUT_LIMITS dictionary.
  • Create a helper function _matches_anthropic_compatible_model() to perform a proxy-agnostic lookup of the model.
  • Modify the _build_api_kwargs method to inject max_tokens if the model matches a known entry in _ANTHROPIC_OUTPUT_LIMITS, regardless of the proxy URL.
  • Add regression tests to ensure the fix works correctly for different scenarios, including custom proxies and non-Claude Anthropic-compatible models.

Example

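# Sketch only: assumes _ANTHROPIC_OUTPUT_LIMITS maps a model-name fragment to its output limit.
def _matches_anthropic_compatible_model(self):
    # Substring match, so proxy-prefixed names such as "aws/anthropic/..." still hit.
    model = (self.model or "").lower()
    return next((key for key in _ANTHROPIC_OUTPUT_LIMITS if key in model), None)

def _build_api_kwargs(self):
    kwargs = {}
    # ...
    key = self._matches_anthropic_compatible_model()
    if key is not None:
        # inject max_tokens regardless of which proxy serves the model
        kwargs['max_tokens'] = _ANTHROPIC_OUTPUT_LIMITS[key]
    # ...
    return kwargs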

Notes

The proposed fix is to gate on the model instead of the base URL, which should correctly cover non-Claude Anthropic-compatible models without hardcoding family names. The fix should be applied to the run_agent.py file, and regression tests should be added to ensure the fix works correctly.

Recommendation

Apply the workaround by modifying the _build_api_kwargs method to inject max_tokens based on the model, as described in the proposed fix. This should resolve the issue for users serving Anthropic-compatible models through non-OpenRouter/non-Nous proxies.
