openclaw - 💡(How to fix) ACP diagnostics blur Gemini capacity stalls with dead-session signals [2 comments, 1 participant]

GitHub stats
openclaw/openclaw#55028 (fetched 2026-04-08 01:33:31)
Comments: 2 · Participants: 1 · Timeline: 2 · Reactions: 0


Summary

When an ACP-backed Gemini run is slow or capacity-stressed, OpenClaw can surface a confusing mix of symptoms that make a recoverable provider/backoff situation look like local ACP/session failure.

In one real sessions_spawn(runtime="acp", agentId="gemini") smoke test on OpenClaw 2026.3.24, the same run emitted all of the following in sequence:

  1. a no-output warning:
    • gemini has produced no output for 60s. It may be waiting for interactive input.
  2. ACP session-health warnings suggesting the session was dead/unavailable:
    • acpx ensureSession replacing dead named session ... summary=queue owner unavailable
  3. and then later the run resumed output and completed successfully.

That makes it hard to tell whether the system is dealing with:

  • a truly dead ACP/session path
  • a slow-but-recovering child run
  • or upstream Gemini quota/capacity/backoff behavior

Environment

  • OpenClaw: 2026.3.24
  • Install kind: global npm/pnpm package install
  • OS: Ubuntu 22.04 LTS (arm64)
  • ACP agent under test: Gemini

Why this seems like a real diagnostics/surfacing gap

The underlying Gemini/ACP path does not appear broadly broken:

  • normal shell-side update verification was healthy
  • the ACP smoke test eventually completed successfully

At the same time, direct Gemini/ACP testing can fail with more specific upstream errors such as:

  • daily quota exhausted
  • RESOURCE_EXHAUSTED
  • MODEL_CAPACITY_EXHAUSTED
  • provider-side capacity exhaustion messages

So the current operator-facing surface appears to collapse several distinct states into very generic symptoms:

  • no output for 60s
  • may be waiting for interactive input
  • queue owner unavailable
  • dead named session
  • generic acpx exited with code 1

Expected behavior

OpenClaw should distinguish these cases more clearly during ACP-backed runs:

  1. no output yet / still starting
  2. provider capacity or quota exhaustion
  3. retry / fallback in progress
  4. session actually dead / unrecoverable
  5. output resumed after temporary stall
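
As a rough illustration of the five cases above, the sketch below models them as an enum and picks one from a few observations about a run. The names `RunState` and `classify_run`, and all of the input flags, are hypothetical and are not taken from OpenClaw; the precedence order is also an assumption.

```python
from enum import Enum
from typing import Optional


class RunState(Enum):
    STARTING = "no output yet / still starting"
    PROVIDER_LIMITED = "provider capacity or quota exhaustion"
    RETRYING = "retry / fallback in progress"
    DEAD = "session actually dead / unrecoverable"
    RESUMED = "output resumed after temporary stall"


def classify_run(
    has_output: bool,
    stalled_before: bool,
    provider_error: Optional[str],
    retry_in_progress: bool,
    session_alive: bool,
) -> Optional[RunState]:
    """Return the most specific state, or None for a run that is simply producing output."""
    # Assumed precedence: a dead session outranks every other signal.
    if not session_alive:
        return RunState.DEAD
    if retry_in_progress:
        return RunState.RETRYING
    if provider_error is not None:
        return RunState.PROVIDER_LIMITED
    if not has_output:
        # Silence alone is reported as "still starting", not as a dead session.
        return RunState.STARTING
    if stalled_before:
        return RunState.RESUMED
    return None
```

Under this ordering, a run that is silent for 60s while its session is still alive is reported as `STARTING`, and the same run is reported as `RESUMED` once output returns.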

Actual behavior

A single capacity-stressed or slow Gemini ACP run can look like a local ACP/session failure before later succeeding.

Why this matters

The current messages can push operators toward the wrong conclusion:

  • treating provider capacity stress like local ACP breakage
  • treating a recoverable run like a dead session
  • or treating a generic exit code as a local regression

Suggested direction

Instead of only surfacing generic stall/dead-session symptoms, propagate more structured ACP child/runtime state upward when available, for example:

  • capacity_exhausted
  • quota_exhausted
  • retrying
  • falling_back
  • all_lanes_exhausted
  • output_resumed

Even if the exact provider error cannot always be normalized perfectly, a clearer distinction between slow/stalled, capacity-limited, and actually dead would make ACP debugging much easier.
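
One way to picture the structured state suggested above is a small event record that the ACP child could emit and the parent could log or forward. This is only a sketch: `AcpStateEvent`, its fields, and the JSON shape are assumptions made for illustration, and only the six state names come from the issue.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

# State names as listed in the issue's suggested direction.
KNOWN_STATES = {
    "capacity_exhausted",
    "quota_exhausted",
    "retrying",
    "falling_back",
    "all_lanes_exhausted",
    "output_resumed",
}


@dataclass
class AcpStateEvent:
    state: str
    agent_id: str
    detail: Optional[str] = None  # raw provider text, when it is available

    def __post_init__(self) -> None:
        # Reject unknown states early so typos do not reach operators.
        if self.state not in KNOWN_STATES:
            raise ValueError(f"unknown ACP state: {self.state!r}")

    def to_json(self) -> str:
        """Serialize to one JSON line with stable key order."""
        return json.dumps(asdict(self), sort_keys=True)
```

Keeping the raw provider text in `detail` means the original message is still visible even when it cannot be normalized into one of the known states' finer causes.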

Related / nearby issues

These may be adjacent but do not seem to cover this exact diagnostics framing:

  • #15287
  • #37869
  • #43496

If useful, I can provide a tighter sanitized repro packet with the exact user-visible message sequence and the eventual-success outcome.

extent analysis

Fix Plan

To make ACP-backed Gemini runs easier to diagnose, OpenClaw would need to propagate more structured ACP child/runtime state upward. One possible plan:

  • Modify the sessions_spawn path to catch specific provider errors from the Gemini ACP run, such as RESOURCE_EXHAUSTED and MODEL_CAPACITY_EXHAUSTED, instead of letting them collapse into a generic exit code.
  • Introduce state codes that distinguish the different situations (not all of them are errors), such as:
    • capacity_exhausted
    • quota_exhausted
    • retrying
    • falling_back
    • all_lanes_exhausted
    • output_resumed
  • Update the logging mechanism to surface these state codes and their messages to the operator.

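Example code snippet (an illustrative Python sketch; run_child stands in for the existing spawn logic, which is not shown in the issue):

import logging

def classify_provider_error(message):
    # Assumed mapping: MODEL_CAPACITY_EXHAUSTED means provider capacity,
    # RESOURCE_EXHAUSTED or a quota message means quota.
    if "MODEL_CAPACITY_EXHAUSTED" in message:
        return "capacity_exhausted"
    if "RESOURCE_EXHAUSTED" in message or "quota" in message.lower():
        return "quota_exhausted"
    return None

def sessions_spawn(run_child, runtime="acp", agentId="gemini"):
    try:
        return run_child(runtime, agentId)  # existing code
    except Exception as e:
        state = classify_provider_error(str(e))
        if state is None:
            raise  # unrecognized errors keep their existing handling
        logging.error("ACP run for %s failed: %s", agentId, state)
        return state

def report_progress(retrying=False, falling_back=False):
    # Retry and fallback are in-progress states, so log them at info level.
    if retrying:
        logging.info("Retrying")
        return "retrying"
    if falling_back:
        logging.info("Falling back")
        return "falling_back"
    return None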

Verification

To verify the fix, run sessions_spawn(runtime="acp", agentId="gemini") against a Gemini ACP run that is expected to hit capacity or quota exhaustion. Check the logs for the new state codes and messages, and confirm that they reach the operator. Also confirm that a run which stalls and then recovers is reported as output_resumed rather than as a dead session.
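
A check along these lines can be sketched without a live Gemini run by simulating the provider error and capturing what gets logged. Everything here is a stand-in: `fake_gemini_run`, `run_and_report`, the logger name, and the `acp state=...` message format are hypothetical and do not reflect OpenClaw's real output.

```python
import logging


class ListHandler(logging.Handler):
    """Collects formatted log messages in memory so they can be asserted on."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def emit(self, record):
        self.messages.append(record.getMessage())


def fake_gemini_run():
    # Simulated provider failure; the text is made up for this sketch.
    raise RuntimeError("429 MODEL_CAPACITY_EXHAUSTED: no capacity available")


def run_and_report(run, logger):
    """Stand-in for a patched spawn path: map the error to a state and log it."""
    try:
        run()
        return "ok"
    except RuntimeError as exc:
        state = "capacity_exhausted" if "MODEL_CAPACITY_EXHAUSTED" in str(exc) else "unknown_error"
        logger.error("acp state=%s", state)
        return state


def verify():
    logger = logging.getLogger("acp.verify")
    logger.propagate = False  # keep the check quiet on stderr
    handler = ListHandler()
    logger.addHandler(handler)
    try:
        state = run_and_report(fake_gemini_run, logger)
    finally:
        logger.removeHandler(handler)
    return state, handler.messages
```

Calling `verify()` returns both the state code and the captured log lines, so a test can assert on what the operator would actually see.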

Extra Tips

  • Make sure to handle the new error codes and messages in the operator-facing UI to provide clear and actionable information.
  • Consider introducing a retry mechanism with exponential backoff to handle temporary capacity or quota exhaustion.
  • Review the related issues (#15287, #37869, #43496) to ensure that this fix does not introduce any regressions.
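
The exponential-backoff tip above could look roughly like the following. This is a minimal sketch under stated assumptions: `CapacityError` is a hypothetical exception for temporary capacity or quota failures, and the attempt count, base delay, cap, and jitter range are arbitrary illustration values.

```python
import random
import time


class CapacityError(Exception):
    """Hypothetical error for temporary provider capacity or quota exhaustion."""


def run_with_backoff(run, attempts=4, base=1.0, cap=60.0, sleep=time.sleep):
    """Call run() until it succeeds, doubling the wait after each capacity failure."""
    for attempt in range(attempts):
        try:
            return run()
        except CapacityError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            delay = min(cap, base * 2 ** attempt)
            # Add up to 50% jitter so parallel runs do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
```

The `sleep` parameter is injectable so the retry schedule can be tested without real waiting. Only `CapacityError` is retried, so a truly dead session would still fail immediately.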
