hermes - 💡(How to fix) Fix Auxiliary context compression sends max_tokens to GitHub Copilot GPT-5 models [3 pull requests]


Error Message

⚠ Compression summary failed: Error code: 400 - {'error': {'message': "Unsupported parameter: 'max_tokens' is not supported with this model. Use 'max_completion_tokens' instead.", 'code': 'invalid_request_body'}}. Inserted a fallback context marker.

Root Cause

When Hermes performs automatic context compression using auxiliary.compression with the GitHub Copilot provider and a GPT-5 series model, the compression summary request fails with HTTP 400 because Hermes sends max_tokens, which the endpoint rejects in favor of max_completion_tokens.

Fix Action

Fixed


Bug Description

When Hermes performs automatic context compression using auxiliary.compression with the GitHub Copilot provider and a GPT-5 series model, the compression summary request fails with HTTP 400 because Hermes sends max_tokens.

The Copilot/OpenAI-compatible endpoint rejects this parameter for newer OpenAI models and requires max_completion_tokens instead.

As a result, context compression falls back to inserting a static fallback context marker, and middle conversation turns are removed without a semantic summary.

Error

⚠ Compression summary failed: Error code: 400 - {'error': {'message': "Unsupported parameter: 'max_tokens' is not supported with this model. Use 'max_completion_tokens' instead.", 'code': 'invalid_request_body'}}. Inserted a fallback context marker.

Environment / Config

model:
  default: gpt-5.5
  provider: github-copilot
  base_url: https://api.githubcopilot.com
  api_mode: chat_completions

auxiliary:
  compression:
    provider: copilot
    model: gpt-5.4

Hermes source checkout observed at:

300140e00

Hermes version observed:

Hermes Agent v0.15.1 (2026.5.29)
Python: 3.11.14
OpenAI SDK: 2.24.0

Steps to Reproduce

  1. Configure the main model to use GitHub Copilot with a GPT-5 series model.

  2. Configure auxiliary compression to use Copilot with a GPT-5 series model, for example:

    auxiliary:
      compression:
        provider: copilot
        model: gpt-5.4

  3. Let a conversation grow until automatic context compression triggers.

  4. Observe that compression summary generation fails with a 400 error about max_tokens.

Expected Behavior

Auxiliary LLM calls should use the correct completion-token parameter for the resolved provider/model/backend.

For GitHub Copilot / https://api.githubcopilot.com and newer OpenAI-family models, Hermes should send:

{
  "max_completion_tokens": 1234
}

instead of:

{
  "max_tokens": 1234
}

Compression summary generation should succeed and preserve the middle conversation turns as a semantic summary.

Actual Behavior

Hermes sends max_tokens from the auxiliary compression path. The provider rejects it:

Unsupported parameter: 'max_tokens' is not supported with this model. Use 'max_completion_tokens' instead.

Then Hermes inserts a fallback context marker and drops the middle compression window without a generated summary.

Impact

This causes context loss during compression:

  • recent tail messages are preserved;
  • some earlier/middle conversation turns are removed;
  • no semantic summary is generated for the removed turns;
  • the assistant may lose details about prior actions, file paths, command outputs, decisions, or resolved questions.

The filesystem/session state is not rolled back, but conversational continuity degrades.
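The effect described above can be illustrated with a minimal sketch. It is not Hermes code: the function name, the marker text, and the head/tail split are all hypothetical, chosen only to show how a failed summary call turns the middle window into a static marker.

```python
# Hypothetical marker text; the real fallback marker used by Hermes is not shown in this report.
FALLBACK_MARKER = "[earlier context removed; summary unavailable]"


def compress(messages, keep_head, keep_tail, summarize):
    """Replace the middle of a conversation with a summary, or a marker on failure.

    `summarize` is any callable that takes the middle messages and returns a
    summary string. It stands in for the auxiliary LLM call.
    """
    if len(messages) <= keep_head + keep_tail:
        return list(messages)  # nothing to compress

    split = len(messages) - keep_tail
    head, middle, tail = messages[:keep_head], messages[keep_head:split], messages[split:]

    try:
        note = summarize(middle)
    except Exception:
        # This is the path hit when the provider answers with HTTP 400:
        # the middle turns are dropped and only a static marker remains.
        note = FALLBACK_MARKER

    return head + [note] + tail
```

With a summarizer that raises, the tail survives but the content of the middle turns does not, which matches the context loss listed above.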

Suspected Root Cause

The main model path appears to handle this correctly through logic similar to _max_tokens_param() in run_agent.py, which returns max_completion_tokens for direct OpenAI/Azure/Copilot-compatible endpoints.

However, the auxiliary path in agent/auxiliary_client.py appears to still default to max_tokens for non-custom providers:

elif provider == "custom":
    custom_base = base_url or _current_custom_base_url()
    if base_url_hostname(custom_base) == "api.openai.com":
        kwargs["max_completion_tokens"] = max_tokens
    else:
        kwargs["max_tokens"] = max_tokens
else:
    kwargs["max_tokens"] = max_tokens

This means a request with provider: copilot and base_url: https://api.githubcopilot.com falls through to the final else branch and still sends max_tokens.
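The branch can be exercised in isolation with a simplified stand-in. This is a sketch under assumptions: it mirrors only the quoted lines, omits whatever branches precede the elif in the real file, and uses urllib.parse in place of the base_url_hostname helper. The function name is hypothetical.

```python
from urllib.parse import urlparse


def build_call_kwargs_as_quoted(provider, max_tokens, base_url=None):
    """Stand-in that mirrors only the quoted branch, not the real Hermes function."""
    kwargs = {}
    if provider == "custom":
        host = urlparse(base_url).hostname if base_url else None
        if host == "api.openai.com":
            kwargs["max_completion_tokens"] = max_tokens
        else:
            kwargs["max_tokens"] = max_tokens
    else:
        # Every non-custom provider, including copilot, lands here.
        kwargs["max_tokens"] = max_tokens
    return kwargs


# The configuration from this report ends up with the rejected parameter.
print(build_call_kwargs_as_quoted("copilot", 1234, "https://api.githubcopilot.com"))
```

Only a custom provider pointed at api.openai.com reaches the max_completion_tokens assignment in this sketch, so the base URL is never consulted for copilot.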

Suggested Fix

Update the auxiliary LLM call kwargs builder to use max_completion_tokens for GitHub Copilot / OpenAI-compatible GPT-5 endpoints, matching the main-agent behavior.

For example, centralize the max-token-parameter selection logic so both main calls and auxiliary calls use the same provider/backend-aware function.

At minimum, _build_call_kwargs() should detect Copilot / api.githubcopilot.com and use:

kwargs["max_completion_tokens"] = max_tokens

instead of max_tokens.
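One possible shape for a shared selection helper is sketched below. All names here are hypothetical and the host and provider sets are assumptions drawn from this report; the real _max_tokens_param() logic in run_agent.py may differ, and Azure handling is deliberately left out.

```python
from urllib.parse import urlparse

# Assumed values, taken from the endpoints and provider names mentioned in this report.
COMPLETION_TOKEN_HOSTS = {"api.openai.com", "api.githubcopilot.com"}
COPILOT_PROVIDERS = {"copilot", "github-copilot"}


def max_tokens_param_name(provider, base_url=None):
    """Return the request field that should carry the output-token limit."""
    host = urlparse(base_url).hostname if base_url else None
    if provider in COPILOT_PROVIDERS or host in COMPLETION_TOKEN_HOSTS:
        return "max_completion_tokens"
    return "max_tokens"


def apply_max_tokens(kwargs, provider, max_tokens, base_url=None):
    """Set the token limit on kwargs under the name the backend accepts."""
    if max_tokens is not None:
        kwargs[max_tokens_param_name(provider, base_url)] = max_tokens
    return kwargs
```

If both the main-agent path and the auxiliary path called one function like this, a new backend rule would only need to be added in one place.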

Related Issues

Possibly related in theme, but not duplicates:

  • #15916 — auxiliary memory flush can send unsupported temperature to ChatGPT Codex backend
  • #23975 — context compression can be interrupted by gateway messages, causing fallback summary marker
