hermes - 💡(How to fix) Fix Auxiliary context compression sends max_tokens to GitHub Copilot GPT-5 models [3 pull requests]

StepCodex · 2026-05-29T09:45:32Z

[hermes] Bug Description When Hermes performs automatic context compression using auxiliary.compression with the GitHub Copilot provider and a GPT-5 series mod… ## Fixed - Fixed by PR: fix(auxiliary): use max_completion_tokens for Copilot GPT-5 compression (https://github.com/NousResearch/hermes-agent/pull/34532) - Fixed by PR: fix(auxiliary): send max_completion_tokens to GitHub Copilot for GPT-5 models (https://github.com/NousResearch/hermes-agent/pull/34534) - Fixed by PR: fix(auxiliary): stop capping output with max_tokens by default (#34530) (https://github.com/NousResearch/hermes-agent/pull/34845) ## Bug Description When Hermes performs automatic context compression using `auxiliary.compression` with the GitHub Copilot provider and a GPT-5 series model, the compression summary request fails with HTTP 400 because Hermes sends `max_tokens`. The Copilot/OpenAI-compatible endpoint rejects this parameter for newer OpenAI models and requires `max_completion_tokens` instead. As a result, context compression falls back to inserting a static fallback context marker, and middle conversation turns are removed without a semantic summary. ## Error ```text ⚠ Compression summary failed: Error code: 400 - {'error': {'message': "Unsupported parameter: 'max_tokens' is not supported with this model. Use 'max_completion_tokens' instead.", 'code': 'invalid_request_body'}}. Inserted a fallback context marker. ``` ## Environment / Config ```yaml model: default: gpt-5.5 provider: github-copilot base_url: https://api.githubcopilot.com api_mode: chat_completions auxiliary: compression: provider: copilot model: gpt-5.4 ``` Hermes source checkout observed at: ```text 300140e00 ``` Hermes version observed: ```text Hermes Agent v0.15.1 (2026.5.29) Python: 3.11.14 OpenAI SDK: 2.24.0 ``` ## Steps to Reproduce 1. Configure the main model to use GitHub Copilot with a GPT-5 series model. 2. Configure auxiliary compression to use Copilot with a GPT-5 series model, for example: ```yaml auxiliary: compression: provider: copilot model: gpt-5.4 ``` 3. Let a conversation grow until automatic context compression triggers. 4. Observe that compression summary generation fails with a 400 error about `max_tokens`. ## Expected Behavior Auxiliary LLM calls should use the correct completion-token parameter for the resolved provider/model/backend. For GitHub Copilot / `https://api.githubcopilot.com` and newer OpenAI-family models, Hermes should send: ```json { "max_completion_tokens": 1234 } ``` instead of: ```json { "max_tokens": 1234 } ``` Compression summary generation should succeed and preserve the middle conversation turns as a semantic summary. ## Actual Behavior Hermes sends `max_tokens` from the auxiliary compression path. The provider rejects it: ```text Unsupported parameter: 'max_tokens' is not supported with this model. Use 'max_completion_tokens' instead. ``` Then Hermes inserts a fallback context marker and drops the middle compression window without a generated summary. ## Impact This causes context loss during compression: - recent tail messages are preserved; - some earlier/middle conversation turns are removed; - no semantic summary is generated for the removed turns; - the assistant may lose details about prior actions, file paths, command outputs, decisions, or resolved questions. The filesystem/session state is not rolled back, but conversational continuity degrades. ## Suspected Root Cause The main model path appears to handle this correctly through logic similar to `_max_tokens_param()` in `run_agent.py`, which returns `max_completion_tokens` for direct OpenAI/Azure/Copilot-compatible endpoints. However, the auxiliary path in `agent/auxiliary_client.py` appears to still default to `max_tokens` for non-custom providers: ```python elif provider == "custom": custom_base = base_url or _current_custom_base_url() if base_url_hostname(custom_base) == "api.openai.com": kwargs["max_completion_tokens"] = max_tokens else: kwargs["max_tokens"] = max_tokens else: kwargs["max_tokens"] = max_tokens ``` This means `provider: copilot` + `base_url: https://api.githubcopilot.com` still sends `max_tokens`. ## Suggested Fix Update the auxiliary LLM call kwargs builder to use `max_completion_tokens` for GitHub Copilot / OpenAI-compatible GPT-5 endpoints, matching the main-agent behavior. For example, centralize the max-token-parameter selection logic so both main calls and auxiliary calls use the same provider/backend-aware function. At minimum, `_build_call_kwargs()` should detect Copilot / `api.githubcopilot.com` and use: ```python kwargs["max_completion_tokens"] = max_tokens ``` instead of `max_tokens`. ## Related Issues Possibly related in theme, but not duplicate: - #15916 — auxiliary memory flush can send unsupported temperature to ChatGPT Codex backend - #23975 — context compression can be interrupted by gateway messages, causing fallback summary marker

⚠ Compression summary failed: Error code: 400 - {'error': {'message': "Unsupported parameter: 'max_tokens' is not supported with this model. Use 'max_completion_tokens' instead.", 'code': 'invalid_request_body'}}. Inserted a fallback context marker.

Fix Action

Fixed

Fixed by PR: fix(auxiliary): use max_completion_tokens for Copilot GPT-5 compression (https://github.com/NousResearch/hermes-agent/pull/34532)
Fixed by PR: fix(auxiliary): send max_completion_tokens to GitHub Copilot for GPT-5 models (https://github.com/NousResearch/hermes-agent/pull/34534)
Fixed by PR: fix(auxiliary): stop capping output with max_tokens by default (#34530) (https://github.com/NousResearch/hermes-agent/pull/34845)

Code Example

⚠ Compression summary failed: Error code: 400 - {'error': {'message': "Unsupported parameter: 'max_tokens' is not supported with this model. Use 'max_completion_tokens' instead.", 'code': 'invalid_request_body'}}. Inserted a fallback context marker.

---

model:
  default: gpt-5.5
  provider: github-copilot
  base_url: https://api.githubcopilot.com
  api_mode: chat_completions

auxiliary:
  compression:
    provider: copilot
    model: gpt-5.4

---

300140e00

---

Hermes Agent v0.15.1 (2026.5.29)
Python: 3.11.14
OpenAI SDK: 2.24.0

---

auxiliary:
     compression:
       provider: copilot
       model: gpt-5.4

---

{
  "max_completion_tokens": 1234
}

---

{
  "max_tokens": 1234
}

---

Unsupported parameter: 'max_tokens' is not supported with this model. Use 'max_completion_tokens' instead.

---

elif provider == "custom":
    custom_base = base_url or _current_custom_base_url()
    if base_url_hostname(custom_base) == "api.openai.com":
        kwargs["max_completion_tokens"] = max_tokens
    else:
        kwargs["max_tokens"] = max_tokens
else:
    kwargs["max_tokens"] = max_tokens

---

kwargs["max_completion_tokens"] = max_tokens

Bug Description

When Hermes performs automatic context compression using auxiliary.compression with the GitHub Copilot provider and a GPT-5 series model, the compression summary request fails with HTTP 400 because Hermes sends max_tokens.

The Copilot/OpenAI-compatible endpoint rejects this parameter for newer OpenAI models and requires max_completion_tokens instead.

As a result, context compression falls back to inserting a static fallback context marker, and middle conversation turns are removed without a semantic summary.

Error

⚠ Compression summary failed: Error code: 400 - {'error': {'message': "Unsupported parameter: 'max_tokens' is not supported with this model. Use 'max_completion_tokens' instead.", 'code': 'invalid_request_body'}}. Inserted a fallback context marker.

Environment / Config

model:
  default: gpt-5.5
  provider: github-copilot
  base_url: https://api.githubcopilot.com
  api_mode: chat_completions

auxiliary:
  compression:
    provider: copilot
    model: gpt-5.4

Hermes source checkout observed at:

300140e00

Hermes version observed:

Hermes Agent v0.15.1 (2026.5.29)
Python: 3.11.14
OpenAI SDK: 2.24.0

Steps to Reproduce

Configure the main model to use GitHub Copilot with a GPT-5 series model.
Configure auxiliary compression to use Copilot with a GPT-5 series model, for example:
```
auxiliary:
  compression:
    provider: copilot
    model: gpt-5.4
```
Let a conversation grow until automatic context compression triggers.
Observe that compression summary generation fails with a 400 error about max_tokens.

Expected Behavior

Auxiliary LLM calls should use the correct completion-token parameter for the resolved provider/model/backend.

For GitHub Copilot / https://api.githubcopilot.com and newer OpenAI-family models, Hermes should send:

{
  "max_completion_tokens": 1234
}

instead of:

{
  "max_tokens": 1234
}

Compression summary generation should succeed and preserve the middle conversation turns as a semantic summary.

Actual Behavior

Hermes sends max_tokens from the auxiliary compression path. The provider rejects it:

Unsupported parameter: 'max_tokens' is not supported with this model. Use 'max_completion_tokens' instead.

Then Hermes inserts a fallback context marker and drops the middle compression window without a generated summary.

Impact

This causes context loss during compression:

recent tail messages are preserved;
some earlier/middle conversation turns are removed;
no semantic summary is generated for the removed turns;
the assistant may lose details about prior actions, file paths, command outputs, decisions, or resolved questions.

The filesystem/session state is not rolled back, but conversational continuity degrades.

Suspected Root Cause

The main model path appears to handle this correctly through logic similar to _max_tokens_param() in run_agent.py, which returns max_completion_tokens for direct OpenAI/Azure/Copilot-compatible endpoints.

However, the auxiliary path in agent/auxiliary_client.py appears to still default to max_tokens for non-custom providers:

elif provider == "custom":
    custom_base = base_url or _current_custom_base_url()
    if base_url_hostname(custom_base) == "api.openai.com":
        kwargs["max_completion_tokens"] = max_tokens
    else:
        kwargs["max_tokens"] = max_tokens
else:
    kwargs["max_tokens"] = max_tokens

This means provider: copilot + base_url: https://api.githubcopilot.com still sends max_tokens.

Suggested Fix

Update the auxiliary LLM call kwargs builder to use max_completion_tokens for GitHub Copilot / OpenAI-compatible GPT-5 endpoints, matching the main-agent behavior.

For example, centralize the max-token-parameter selection logic so both main calls and auxiliary calls use the same provider/backend-aware function.

At minimum, _build_call_kwargs() should detect Copilot / api.githubcopilot.com and use:

kwargs["max_completion_tokens"] = max_tokens

instead of max_tokens.

Related Issues

Possibly related in theme, but not duplicate:

#15916 — auxiliary memory flush can send unsupported temperature to ChatGPT Codex backend
#23975 — context compression can be interrupted by gateway messages, causing fallback summary marker

Data

Security

Network

Code

UI/UX

Text

System

Multimedia

Protocol

API

Engineering