openclaw - ✅(Solved) Fix TUI: WebSocket connection silently stalls — messages stuck in client, no disconnect detected [1 pull request, 1 participant]

GitHub stats

openclaw/openclaw#59627 (fetched 2026-04-08 02:42:18)
Comments: 0 · Participants: 1 · Timeline: 1 · Reactions: 0
Timeline (top): cross-referenced ×1

Root Cause

Examined the source code in method-scopes-0x4tgmV6.js. The current tick-watch mechanism is passive only:

startTickWatch() {
    // ...
    this.tickTimer = setInterval(() => {
        if (Date.now() - this.lastTick > this.tickIntervalMs * 2)
            this.ws?.close(4e3, "tick timeout");
    }, interval);
}

This only checks whether the gateway is still sending ticks to the client. It does not detect:

  1. Client-side event loop stalls — if the Node.js event loop is blocked, incoming ticks may sit in the buffer and lastTick never updates, but setInterval also does not fire, so the stall goes undetected.
  2. Half-open connections — the TCP connection may be alive (ticks arriving at OS level) but the application layer is not processing them.
  3. Send-side failures — there is no mechanism to detect that outbound chat.send requests are failing or timing out silently.

Fix Action

Fixed

PR fix notes

PR #59800: fix(tui): preserve pending sends and busy-state visibility

Description (problem / solution / changelog)

Summary

  • Problem: the TUI could lose track of optimistic local sends during history reload/reconnect paths, show confusing busy/error state during fallback/terminal-error transitions, and waste horizontal width on long links and paths.
  • Why it matters: users could see prompts disappear and later reappear, get stuck in unclear run state, and struggle to read or copy long terminal output.
  • What changed: pending local user turns are preserved and reconciled through transcript rebuilds, active-run/error cleanup is more coherent, Esc/editor handling is covered more directly, and chat rendering reclaims width for long links and paths.
  • What did NOT change (scope boundary): this PR does not add full Pi-style runtime-owned steer/follow-up queues or a new pending queue panel; it stays focused on stabilizing the existing TUI state model.

Change Type (select all)

  • Bug fix
  • Feature
  • Refactor required for the fix
  • Docs
  • Security hardening
  • Chore/infra

Scope (select all touched areas)

  • Gateway / orchestration
  • Skills / tool execution
  • Auth / tokens
  • Memory / storage
  • Integrations
  • API / contracts
  • UI / DX
  • CI/CD / infra

Linked Issue/PR

  • Related #59014
  • Related #59627
  • Related #59570
  • Related #55300
  • This PR fixes a bug or regression

Root Cause / Regression History (if applicable)

  • Root cause: the TUI relied on optimistic local-send state that was not durably reconciled with history rebuilds and run lifecycle transitions, so reload/error paths could desynchronize visible user turns from real run state.
  • Missing detection / guardrail: there was no focused coverage for pending-send reconciliation across history rebuilds and not enough direct tests around the TUI error/final cleanup paths.
  • Prior context (git blame, prior PR, issue, or refactor if known): issue reports in #59014, #59627, #59570, and #55300 all point at state-coherence failures in the TUI.
  • Why this regressed now: the optimistic-send path assumed fast run attribution and simple finalization, which breaks under reconnect/history replay and fallback/error timing.
  • If unknown, what was ruled out: not just markdown rendering; the disappearing-send behavior came from TUI transcript/state handling.

Regression Test Plan (if applicable)

  • Coverage level that should have caught this:
    • Unit test
    • Seam / integration test
    • End-to-end test
    • Existing coverage already sufficient
  • Target test or file: src/tui/components/chat-log.test.ts, src/tui/tui-command-handlers.test.ts, src/tui/tui-event-handlers.test.ts, src/tui/tui-session-actions.test.ts, src/tui/tui.test.ts, src/tui/components/custom-editor.test.ts
  • Scenario the test should lock in: pending local sends survive history rebuilds until a matching run is anchored or dropped, run/error cleanup returns the TUI to a coherent state, and the editor key handling stays stable.
  • Why this is the smallest reliable guardrail: these are TUI-local state-machine bugs, so focused unit coverage hits the failure path directly without network flake.
  • Existing test that already covers this (if any): N/A
  • If no new test is added, why not: N/A

User-visible / Behavior Changes

  • Pending local sends no longer disappear during history reload/reconnect windows.
  • Busy/error state is less likely to get stuck or look idle at the wrong time.
  • Long links and paths get more usable width in chat rendering.
  • TUI editor key handling now has direct regression coverage.

Diagram (if applicable)

Before:
[user send] -> [optimistic local state] -> [history reload or error transition] -> [message/status can disappear or desync]

After:
[user send] -> [tracked pending local state] -> [history rebuild reconciles it] -> [run anchors or drops explicitly] -> [status stays coherent]
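
The "After" flow can be illustrated with a small sketch. It is not the PR's implementation: the class name `PendingSends`, its methods, and the `runId` matching rule are all hypothetical, chosen only to show how a locally tracked send might survive a history rebuild until a run anchors it or it is dropped.

```javascript
// Hypothetical sketch of pending-send reconciliation; names are illustrative
// and not taken from the PR.
class PendingSends {
  constructor() {
    this.pending = new Map(); // localId -> { text, runId }
  }

  get size() {
    return this.pending.size;
  }

  // Track a user turn as soon as it is sent locally.
  add(localId, text) {
    this.pending.set(localId, { text, runId: null });
  }

  // Anchor a pending send to the run that picked it up.
  anchor(localId, runId) {
    const entry = this.pending.get(localId);
    if (entry) entry.runId = runId;
  }

  // Drop a pending send explicitly (for example, after an abort).
  drop(localId) {
    this.pending.delete(localId);
  }

  // Rebuild the visible transcript from server history. Anchored sends that
  // now appear in history are retired; everything else stays visible.
  reconcile(history) {
    const seenRunIds = new Set(history.map((turn) => turn.runId).filter(Boolean));
    for (const [localId, entry] of this.pending) {
      if (entry.runId && seenRunIds.has(entry.runId)) this.pending.delete(localId);
    }
    const stillPending = [...this.pending.values()].map((entry) => ({
      role: "user",
      text: entry.text,
      pending: true,
    }));
    return [...history, ...stillPending];
  }
}
```

In this sketch a history rebuild never removes a send on its own. A send leaves the pending set only when its anchored run shows up in history or when `drop` is called, which mirrors the "anchors or drops explicitly" step in the diagram.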

Security Impact (required)

  • New permissions/capabilities? (Yes/No): No
  • Secrets/tokens handling changed? (Yes/No): No
  • New/changed network calls? (Yes/No): No
  • Command/tool execution surface changed? (Yes/No): No
  • Data access scope changed? (Yes/No): No
  • If any Yes, explain risk + mitigation: N/A

Repro + Verification

Environment

  • OS: macOS
  • Runtime/container: Node 22 / pnpm worktree
  • Model/provider: N/A
  • Integration/channel (if any): TUI + gateway chat client
  • Relevant config (redacted): default TUI config

Steps

  1. Send a local prompt, then trigger a session reload/reconnect or history rebuild path.
  2. Exercise error/final cleanup paths for the active run.
  3. Render long links/paths in the TUI transcript.

Expected

  • Pending local sends remain visible and reconcile cleanly.
  • Busy/error state returns to a sensible status.
  • Long terminal-style text keeps more horizontal width.

Actual

  • Before this change, pending sends could disappear and later reappear, busy/error handling could become misleading, and transcript padding wasted horizontal space.

Evidence

  • Failing test/log before + passing after
  • Trace/log snippets
  • Screenshot/recording
  • Perf numbers (if relevant)

Human Verification (required)

  • Verified scenarios: focused TUI tests passed, pnpm build passed, and the rebased branch was run locally for interactive TUI validation.
  • Edge cases checked: pending-send reconciliation on history rebuild, no-active-run Esc/abort handling, and error/final cleanup paths.
  • What you did not verify: full Pi-style queued steer/follow-up runtime semantics; that remains follow-up work.

Review Conversations

  • I replied to or resolved every bot review conversation I addressed in this PR.
  • I left unresolved only the conversations that still need reviewer or maintainer judgment.

If a bot review conversation is addressed by this PR, resolve that conversation yourself. Do not leave bot review conversation cleanup for maintainers.

Compatibility / Migration

  • Backward compatible? (Yes/No): Yes
  • Config/env changes? (Yes/No): No
  • Migration needed? (Yes/No): No
  • If yes, exact upgrade steps: N/A

Risks and Mitigations

  • Risk: TUI state handling remains easy to regress because optimistic sends, history replay, and run lifecycle events are loosely coupled.
    • Mitigation: added focused tests around chat-log reconciliation, editor handling, and event/session cleanup.
  • Risk: full Pi-style queue/runtime semantics are still not present.
    • Mitigation: call that scope boundary out explicitly rather than implying this PR solves the larger parity effort.

Changed files

  • CHANGELOG.md (modified, +1/-0)
  • src/tui/components/assistant-message.ts (modified, +1/-1)
  • src/tui/components/chat-log.test.ts (modified, +57/-0)
  • src/tui/components/chat-log.ts (modified, +104/-1)
  • src/tui/components/custom-editor.test.ts (added, +32/-0)
  • src/tui/components/custom-editor.ts (modified, +5/-0)
  • src/tui/components/markdown-message.ts (modified, +5/-4)
  • src/tui/components/pending-messages.test.ts (added, +25/-0)
  • src/tui/components/pending-messages.ts (added, +35/-0)
  • src/tui/tui-types.ts (modified, +9/-0)


Bug Description

When using openclaw tui, the WebSocket connection can silently stall. The TUI appears connected (no disconnect message shown), but user-typed messages are not delivered to the gateway. The user has to close and reopen the TUI window to restore communication.

Steps to Reproduce

  1. Open openclaw tui and chat normally
  2. Wait for a period (observed after ~5-10 minutes of the agent processing long tasks)
  3. Type a message — it appears in the TUI input but never reaches the gateway
  4. Type more messages — same result, no response from the agent
  5. Open a new terminal, run openclaw tui again
  6. All previously stuck messages are delivered at once, and the agent responds to all of them

Evidence from Gateway Logs

Gateway logs show the original TUI connection (conn=26932be5…5dd4) had its last activity at 18:12:48, then went completely silent. No chat.send requests were received between 18:12 and 18:16, despite the user actively typing messages during that window. A new connection (conn=bffeaa43…37ba) appeared at 18:16:40 when the user reopened TUI, and all queued messages arrived.

Root Cause Analysis

Examined the source code in method-scopes-0x4tgmV6.js. The current tick-watch mechanism is passive only:

startTickWatch() {
    // ...
    this.tickTimer = setInterval(() => {
        if (Date.now() - this.lastTick > this.tickIntervalMs * 2)
            this.ws?.close(4e3, "tick timeout");
    }, interval);
}

This only checks whether the gateway is still sending ticks to the client. It does not detect:

  1. Client-side event loop stalls — if the Node.js event loop is blocked, incoming ticks may sit in the buffer and lastTick never updates, but setInterval also does not fire, so the stall goes undetected.
  2. Half-open connections — the TCP connection may be alive (ticks arriving at OS level) but the application layer is not processing them.
  3. Send-side failures — there is no mechanism to detect that outbound chat.send requests are failing or timing out silently.
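
The first blind spot can be made visible by measuring how late a periodic timer fires. The sketch below is an assumption-laden illustration, not code from OpenClaw: `createDriftDetector` and its parameters are hypothetical names, and the clock is injectable only to make the logic easy to check.

```javascript
// Hypothetical helper: reports how late a periodic callback ran.
// A large drift means the event loop was blocked between two checks.
function createDriftDetector(intervalMs, thresholdMs, now = Date.now) {
  let last = now();
  return function check() {
    const current = now();
    // Time beyond the expected interval is time the loop spent blocked or busy.
    const driftMs = current - last - intervalMs;
    last = current;
    return { driftMs, stalled: driftMs > thresholdMs };
  };
}

// Intended use (not started here): call check() from the same timer that
// runs the tick watch, and log or reconnect when `stalled` is true.
//
//   const check = createDriftDetector(1000, 500);
//   setInterval(() => {
//     const { stalled, driftMs } = check();
//     if (stalled) console.warn(`event loop stalled for ~${driftMs} ms`);
//   }, 1000);
```

A detector like this can only report a stall after the loop resumes, because the check itself runs on the same loop. It shows that a stall happened and roughly how long it lasted, which the passive tick watch alone does not reveal.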

Suggested Fix

  1. Active client-side ping: Send periodic WebSocket pings from the TUI client and expect pongs within a timeout. If no pong is received, force-close and reconnect.
  2. Send timeout detection: If a chat.send request does not receive a response within N seconds, show a warning in the TUI and attempt reconnection.
  3. Input watchdog: If the user types a message and no gateway acknowledgment arrives within a reasonable timeout, display a "connection may be stale" warning.
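
Send timeout detection (item 2) can be sketched with a generic promise wrapper. This assumes the client exposes each outbound request as a promise, which the issue does not confirm; `withTimeout` and the commented `client.request` call are hypothetical names used for illustration.

```javascript
// Hypothetical helper: rejects if `promise` does not settle within `ms`.
function withTimeout(promise, ms, label = "request") {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`${label} timed out after ${ms} ms`)),
      ms
    );
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Intended use, assuming a promise-returning request API (hypothetical):
//
//   withTimeout(client.request("chat.send", payload), 15000, "chat.send")
//     .catch((err) => {
//       showWarning("connection may be stale: " + err.message);
//       reconnect();
//     });
```

The wrapper does not cancel the underlying request. It only guarantees that the caller learns about the missing response, so the TUI can warn the user and reconnect instead of waiting indefinitely.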

Environment

  • OpenClaw version: 2026.3.23-1
  • OS: macOS (Apple Silicon)
  • Connection: local WebSocket (ws://127.0.0.1:18789)
  • TUI client: openclaw tui

extent analysis

TL;DR

Implement an active client-side ping mechanism to detect and recover from silent WebSocket connection stalls.

Guidance

  • Introduce a periodic WebSocket ping from the TUI client to the gateway, expecting a pong response within a specified timeout.
  • Implement send timeout detection for chat.send requests, displaying a warning and attempting reconnection if no response is received within a reasonable timeframe.
  • Consider adding an input watchdog to detect stale connections when user-typed messages do not receive gateway acknowledgments.

Example

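// Example of active client-side ping (assumes the Node.js `ws` library,
// which provides ping(), terminate(), and the 'pong' event)
const PING_INTERVAL_MS = 10000; // send a ping every 10 seconds
const PONG_TIMEOUT_MS = 5000;   // must be shorter than the ping interval
let pongTimer = null;

// Register the pong handler once; adding it on every ping leaks listeners.
ws.on('pong', () => clearTimeout(pongTimer));

const pingTimer = setInterval(() => {
    if (ws.readyState !== 1) return; // 1 === OPEN; skip if not connected
    ws.ping();
    pongTimer = setTimeout(() => {
        console.log('Ping timeout, forcing reconnect');
        ws.terminate(); // drop the socket at once; close() may wait on a dead peer
    }, PONG_TIMEOUT_MS);
}, PING_INTERVAL_MS);

// Stop pinging once the socket closes.
ws.on('close', () => {
    clearInterval(pingTimer);
    clearTimeout(pongTimer);
});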

Notes

The provided example is a basic illustration and may require adjustments based on the specific WebSocket library and implementation details.

Recommendation

Implement the active client-side ping so that the TUI can detect and recover from silent connection stalls. This directly addresses the reported problem: the client does not notice when the WebSocket connection has become unresponsive.
