openclaw - ✅(Solved) Fix Embedded Pi agent enters compaction loop on repeated 400 errors with no response body (openai-completions API) [2 pull requests, 1 participant]

GitHub stats
openclaw/openclaw#66462 · Fetched 2026-04-15 06:26:08
Comments: 0 · Participants: 1 · Timeline: 6 · Reactions: 0
Timeline (top): referenced ×2, closed ×1, cross-referenced ×1, renamed ×1

Error Message

error="LLM request failed: provider rejected the request schema or tool payload. rawError=400 status code (no body)"

The 400 has no response body, so the failover system cannot classify the actual error and defaults to format classification. The format error triggers compaction, which retries with a modified conversation state, hits the same 400, and loops until the compaction safeguard intervenes.

Suggested improvement (better error classification for 400 with no body): When the message is empty or too short to classify, return null (unknown) instead of "format", so the failover decision can surface the error rather than entering a compaction loop. If compaction is triggered but the underlying error persists (same status code, same provider), the safeguard should activate sooner rather than looping multiple times.

Root Cause

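The 400 error from the proxy may be caused by the tools + thinking combination, by idle timeout retry payload corruption, or by a large system prompt plus tools exceeding a proxy limit (see Possible Causes in the issue description).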

Fix Action

Workaround

Disable active-memory (enabled: false) or route the affected channel to a non-Claude model.

PR fix notes

PR #66473: fix: don't classify 400/422 with no body as format error

Description (problem / solution / changelog)

Problem

When a provider behind a proxy returns 400 or 422 with no response body, the failover system defaults to "format" classification. This triggers a compaction loop:

400 no-body → classified as "format" → compaction → retry → 400 again → compaction → loop

See issue #66462 for the full error log pattern.

Fix

In src/agents/pi-embedded-helpers/errors.ts:

  • 400/422 with no body → return null (unknown), don't default to "format"
  • 400/422 with unclassifiable body → still return "format" (preserves existing behavior for actual schema errors)

This prevents the compaction loop while keeping the format error classification for cases where the provider actually returns a meaningful error message.
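The rule described in this PR can be sketched as follows. This is a minimal illustration rather than the code from the PR: `FailoverReason`, `classifyClientErrorStatus`, and the `rawMessage` parameter are hypothetical names, and the real errors.ts may structure the check differently.

```typescript
// Hypothetical sketch: names and types are assumptions, not the project's actual API.
type FailoverReason = "format" | "timeout" | "billing" | "rate_limit";

// Returns a reason when one can be inferred, or null when the error is unknown.
function classifyClientErrorStatus(
  status: number,
  rawMessage: string | undefined,
  messageClassification: FailoverReason | null,
): FailoverReason | null {
  // Other status codes are out of scope for this sketch.
  if (status !== 400 && status !== 422) return null;
  // A classification derived from the message text always wins.
  if (messageClassification) return messageClassification;
  // No body at all: nothing points to a schema problem, so report unknown.
  if (rawMessage === undefined || rawMessage.trim().length === 0) return null;
  // A body exists but could not be classified: keep the existing "format" default.
  return "format";
}
```

Returning null here leaves the decision to the caller, which can surface the error instead of starting compaction.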

Changes

File | Change
src/agents/pi-embedded-helpers/errors.ts | Add empty-body check before defaulting to "format"
src/agents/failover-error.test.ts | Update test expectations for no-body 400/422

Tests

51 passed (failover-error.test.ts)
12 passed (failover-matches.test.ts)

Changed files


PR #67024: fix: don't classify 400/422 with no body as format error

Description (problem / solution / changelog)

Summary:

  • stop body-less HTTP 400/422 proxy failures from defaulting to "format"
  • keep the fix scoped to failover classification and tests
  • handle explicit wrapper-only shapes like 400 status code (no body)

Credit:

  • original fix and report path came from @HongzhuLiu in #66473; this PR is the cleaned, rebased maintainer replacement

Changes:

  • return null for empty or explicit no-body 400/422 wrappers in failover classification
  • update failover classifier regressions for raw and structured no-body shapes
  • keep the changelog note in Unreleased > Fixes
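The "explicit wrapper-only shapes" mentioned above could be detected along these lines. This is a sketch under assumptions: `isBodylessClientError` and the regular expression are hypothetical, modeled only on the `400 status code (no body)` string seen in the logs, and the merged classifier may recognize other shapes.

```typescript
// Hypothetical helper: the pattern is an assumption based on the log line
// "400 status code (no body)"; it is not taken from the PR diff.
const NO_BODY_WRAPPER = /^\s*(400|422)\s+status code\s+\(no body\)\s*$/i;

// True when the raw error carries no provider-supplied detail.
function isBodylessClientError(rawError: string | null | undefined): boolean {
  if (rawError === null || rawError === undefined) return true;
  const text = rawError.trim();
  if (text.length === 0) return true;
  // The wrapper text alone says nothing about why the request was rejected.
  return NO_BODY_WRAPPER.test(text);
}
```

A message that contains anything beyond the wrapper, such as a schema complaint from the provider, is left for the regular message classification.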

Validation:

  • pnpm test src/agents/failover-error.test.ts src/agents/pi-embedded-helpers.isbillingerrormessage.test.ts src/agents/pi-embedded-runner/run/failover-policy.test.ts

Linked Issues:

  • supersedes #66473
  • fixes #66462

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/agents/failover-error.test.ts (modified, +97/-2)
  • src/agents/failover-error.ts (modified, +134/-31)
  • src/agents/pi-embedded-helpers.isbillingerrormessage.test.ts (modified, +20/-2)
  • src/agents/pi-embedded-helpers/errors.ts (modified, +35/-0)

Code Example

[agent/embedded] embedded run failover decision: runId=active-memory-mnxtzi9n-5b61fdfd 
  stage=assistant decision=surface_error reason=timeout 
  provider=custom-provider/claude-opus-4-6 profile=-

[agent/embedded] embedded run agent end: runId=13e4ecb9-bc1b-44d3-bb92-375feddf2bb5 
  isError=true model=claude-opus-4-6 provider=custom-provider 
  error="LLM request failed: provider rejected the request schema or tool payload. rawError=400 status code (no body)"

[agent/embedded] embedded run failover decision: runId=13e4ecb9-bc1b-44d3-bb92-375feddf2bb5 
  stage=assistant decision=surface_error reason=format provider=custom-provider/claude-opus-4-6 profile=-

[llm-idle-timeout] custom-provider/claude-opus-4-6 produced no reply before the idle watchdog; retrying same model

[compaction-safeguard] Compaction safeguard: no real conversation messages to summarize; 
  writing compaction boundary to suppress re-trigger loop.

---

curl -X POST "https://<proxy>/v1/chat/completions" \
     -H "Authorization: Bearer sk-xxx" \
     -d '{"model":"claude-opus-4-6","messages":[{"role":"user","content":"hi"}],"max_tokens":100}'

---

if (status === 400 || status === 422) {
  if (messageClassification) return messageClassification;
  return toReasonClassification("format");  // ← defaults to format even with no body
}

Issue Description

When using a custom provider with api: "openai-completions" that proxies to Anthropic Claude (or other reasoning-capable models), the embedded Pi agent (used by active-memory, compaction, and other sub-agent flows) can enter a compaction loop when the provider returns a 400 status code with no response body.

This is a general issue affecting any openai-completions provider that sits behind a proxy or gateway, where:

  1. The initial request times out or hits an idle timeout
  2. The retry attempt receives a 400 with no body
  3. OpenClaw classifies this as a format error → triggers compaction → retries → hits the same 400 again → loop

Reproduction Steps

  1. Configure a provider with api: "openai-completions" pointing to a proxy/gateway that supports Claude or other reasoning-capable models
  2. Enable active-memory plugin (default timeout 8000ms)
  3. Send messages that trigger active-memory recall
  4. The embedded Pi agent times out on the initial call, retries the same model, and receives 400 status code (no body) on retry
  5. Framework classifies as format error → triggers compaction → retries → loops

Observed Behavior

Error Log Pattern (repeated):

[agent/embedded] embedded run failover decision: runId=active-memory-mnxtzi9n-5b61fdfd 
  stage=assistant decision=surface_error reason=timeout 
  provider=custom-provider/claude-opus-4-6 profile=-

[agent/embedded] embedded run agent end: runId=13e4ecb9-bc1b-44d3-bb92-375feddf2bb5 
  isError=true model=claude-opus-4-6 provider=custom-provider 
  error="LLM request failed: provider rejected the request schema or tool payload. rawError=400 status code (no body)"

[agent/embedded] embedded run failover decision: runId=13e4ecb9-bc1b-44d3-bb92-375feddf2bb5 
  stage=assistant decision=surface_error reason=format provider=custom-provider/claude-opus-4-6 profile=-

[llm-idle-timeout] custom-provider/claude-opus-4-6 produced no reply before the idle watchdog; retrying same model

[compaction-safeguard] Compaction safeguard: no real conversation messages to summarize; 
  writing compaction boundary to suppress re-trigger loop.

Key Observations:

  1. Simple curl requests succeed (200 OK):

    curl -X POST "https://<proxy>/v1/chat/completions" \
      -H "Authorization: Bearer sk-xxx" \
      -d '{"model":"claude-opus-4-6","messages":[{"role":"user","content":"hi"}],"max_tokens":100}'
  2. Complex Pi agent payload fails on retry (400 with no body): The embedded Pi agent sends:

    • Long system prompt (full agent bootstrap context, bootstrapContextMode: "lightweight")
    • Tools array (memory_search, memory_get)
    • thinking parameter (mapped from thinkLevel: "adaptive")
    • stream: true, max_tokens: 32000
  3. The 400 has no response body, so the failover system cannot classify the actual error and defaults to format classification (the original report names failover-policy.ts here; the same snippet is cited from errors.ts under Suggested Improvements):

    if (status === 400 || status === 422) {
      if (messageClassification) return messageClassification;
      return toReasonClassification("format");
    }
  4. Compaction loop: The format error triggers compaction, which retries with a modified conversation state, hits the same 400, and loops until the compaction safeguard intervenes.

Possible Causes

The 400 error from the proxy may be caused by:

  1. tools + thinking combination: When openai-completions format is used with a Claude model, the proxy may not correctly translate the combination of tools array + thinking parameter to the underlying Anthropic Messages API format.

  2. Idle timeout retry payload corruption: The retry after idle timeout may replay a malformed conversation state — for example, thinkingSignature: "reasoning_text" blocks that Claude rejects on replay (noted in attempt.ts line 1137: "Anthropic Claude endpoints can reject replayed thinking blocks on any follow-up provider").

  3. Large system prompt + tools limit: The embedded Pi agent lightweight bootstrap context includes a full system prompt. Combined with tools, this may exceed some proxy limit.
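One way to narrow down which of these causes applies is to send the same request with each combination of features switched on or off and note which variants return 400. The sketch below only builds the variants; sending them is left to the caller. `ProbePayload`, `buildProbePayloads`, and the `extras` argument are hypothetical, and the shape of the `thinking` value is deliberately left as `unknown` because the issue does not show it.

```typescript
// Hypothetical sketch: field names follow the payload described in the issue,
// but the exact shape the embedded Pi agent sends is an assumption.
interface ProbePayload {
  model: string;
  messages: { role: string; content: string }[];
  max_tokens: number;
  stream?: boolean;
  tools?: unknown[];
  thinking?: unknown;
}

type ProbeFeature = "tools" | "thinking" | "stream";

interface ProbeVariant {
  features: ProbeFeature[];
  payload: ProbePayload;
}

// Builds all 8 on/off combinations of tools, thinking, and stream.
function buildProbePayloads(
  base: ProbePayload,
  extras: { tools: unknown[]; thinking: unknown },
): ProbeVariant[] {
  const variants: ProbeVariant[] = [];
  for (let mask = 0; mask < 8; mask++) {
    const features: ProbeFeature[] = [];
    const payload: ProbePayload = { ...base };
    if ((mask & 1) !== 0) {
      features.push("tools");
      payload.tools = extras.tools;
    }
    if ((mask & 2) !== 0) {
      features.push("thinking");
      payload.thinking = extras.thinking;
    }
    if ((mask & 4) !== 0) {
      features.push("stream");
      payload.stream = true;
    }
    variants.push({ features, payload });
  }
  return variants;
}
```

If only the variants containing both tools and thinking fail, the first cause is the likely one; if every variant fails once the long system prompt is included in `base`, the third is more plausible.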

Environment

  • OpenClaw version: 2026.4.14-beta.1 (6823a6f)
  • OS: Darwin 25.3.0 (arm64)
  • Provider: Custom proxy with api: "openai-completions" → Claude Opus 4.6
  • Config: active-memory enabled, timeoutMs: 8000

Suggested Improvements

1. Better error classification for 400 with no body

When the provider returns 400 with no body, the failover system should NOT default to format classification, which triggers compaction. A 400 with no body could be a transient proxy issue, auth problem, or payload size limit — compaction is unlikely to help.

Current code (errors.ts ~line 610):

if (status === 400 || status === 422) {
  if (messageClassification) return messageClassification;
  return toReasonClassification("format");  // ← defaults to format even with no body
}

Suggestion: When message is empty or too short to classify, return null (unknown) instead of "format", so the failover decision can surface the error rather than entering a compaction loop.

2. Limit compaction retries on repeated identical errors

If compaction is triggered but the underlying error persists (same status code, same provider), the safeguard should activate sooner rather than looping multiple times.
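A guard of this kind could look roughly like the sketch below. `CompactionRetryGuard`, `FailureSignature`, and the default threshold of 2 are all hypothetical; the issue does not describe how the existing compaction safeguard is implemented.

```typescript
// Hypothetical sketch: class, fields, and threshold are illustrative, not project code.
interface FailureSignature {
  provider: string;
  status: number;
}

class CompactionRetryGuard {
  private lastKey: string | null = null;
  private repeats = 0;
  private readonly maxIdenticalFailures: number;

  constructor(maxIdenticalFailures = 2) {
    this.maxIdenticalFailures = maxIdenticalFailures;
  }

  // Records a failure and reports whether another compaction attempt is allowed.
  shouldCompactAgain(failure: FailureSignature): boolean {
    const key = `${failure.provider}:${failure.status}`;
    if (key === this.lastKey) {
      this.repeats += 1;
    } else {
      // A different failure resets the streak.
      this.lastKey = key;
      this.repeats = 1;
    }
    return this.repeats < this.maxIdenticalFailures;
  }
}
```

With the default threshold, the first 400 from a provider still allows one compaction attempt, and a second identical 400 stops the cycle so the error can be surfaced.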

3. Idle timeout retry should validate conversation state

Before retrying after idle timeout, validate that the conversation state is replayable — specifically check for incompatible thinkingSignature blocks that downstream providers may reject.
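A pre-retry check along these lines might look like the following sketch. The message and block types are hypothetical, since the issue does not show the real conversation types, and treating only `thinkingSignature: "reasoning_text"` as unreplayable is an assumption taken from the example in Possible Causes.

```typescript
// Hypothetical message shapes: the real conversation types are not shown in the issue.
interface ContentBlock {
  type: "text" | "thinking" | "tool_use";
  text?: string;
  thinkingSignature?: string;
}

interface ConversationMessage {
  role: "system" | "user" | "assistant";
  content: ContentBlock[];
}

// Assumption: only this signature value is known to be rejected on replay.
const UNREPLAYABLE_SIGNATURES = ["reasoning_text"];

function isUnreplayableBlock(block: ContentBlock): boolean {
  return (
    block.type === "thinking" &&
    block.thinkingSignature !== undefined &&
    UNREPLAYABLE_SIGNATURES.indexOf(block.thinkingSignature) !== -1
  );
}

// Returns the indexes of messages that contain at least one unreplayable block.
function findUnreplayableMessages(messages: ConversationMessage[]): number[] {
  const flagged: number[] = [];
  messages.forEach((message, index) => {
    if (message.content.some(isUnreplayableBlock)) flagged.push(index);
  });
  return flagged;
}

// Returns a copy of the conversation with unreplayable blocks removed.
function stripUnreplayableBlocks(messages: ConversationMessage[]): ConversationMessage[] {
  return messages.map((message) => ({
    ...message,
    content: message.content.filter((block) => !isUnreplayableBlock(block)),
  }));
}
```

Whether to strip such blocks or abort the retry is a design choice the issue leaves open; the sketch supports either by separating detection from removal.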

4. Documentation for openai-completions + Claude

Clarify which payload features are supported when using openai-completions API with Anthropic Claude models behind a proxy (tools, thinking, system prompt length limits, etc.).

Workaround

Disable active-memory (enabled: false) or route the affected channel to a non-Claude model.

extent analysis

TL;DR

Modify the error classification for 400 status codes with no response body to prevent defaulting to "format" errors, which trigger compaction loops.

Guidance

  1. Update error classification: Change the errors.ts file to return null (unknown) instead of "format" when the provider returns a 400 status code with no response body.
  2. Implement compaction retry limits: Introduce a limit on the number of compaction retries for repeated identical errors to prevent infinite loops.
  3. Validate conversation state on retry: Before retrying after an idle timeout, validate the conversation state to ensure it's replayable and compatible with the downstream provider.
  4. Review and adjust payload features: Verify which payload features are supported when using the openai-completions API with Anthropic Claude models behind a proxy.

Example

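// Updated error classification in errors.ts (sketch)
// responseBody is a hypothetical name for the raw provider message; the merged fix may use a different variable.
if (status === 400 || status === 422) {
  if (messageClassification) return messageClassification;
  if (!responseBody) return null; // No body: report unknown instead of "format"
  return toReasonClassification("format");
}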

Notes

The provided suggestions focus on addressing the immediate issue of compaction loops caused by 400 status codes with no response body. Further investigation into the root cause of the 400 errors (e.g., payload size limits, tool combinations, or thinking parameter issues) may be necessary to fully resolve the problem.

Recommendation

Apply the workaround by disabling active-memory or routing the affected channel to a non-Claude model until the suggested changes can be implemented and tested.
