openclaw - [Bug]: Agents stuck after session reset - showing "writing" indicator but no output


The gateway is experiencing severe Node.js event loop starvation. Agents appear frozen because gateway operations are taking 3-7 minutes instead of milliseconds. This is the same symptom pattern as issue #74404 (Gateway CPU-Saturated, Agents Stop Responding), but the system is running a beta version after the supposedly fixed stable release.




Bug type

Regression (worked before, now fails)

Beta release blocker

Yes

Summary

The gateway is experiencing severe Node.js event loop starvation. Agents appear frozen because gateway operations are taking 3-7 minutes instead of milliseconds. This is the same symptom pattern as issue #74404 (Gateway CPU-Saturated, Agents Stop Responding), but the system is running a beta version after the supposedly fixed stable release.

Steps to reproduce

Reset a few agents; at least one will hit this issue. In my case every agent was affected and I had to downgrade.

Expected behavior

After a session reset, agents should resume and produce output instead of showing the "writing" indicator indefinitely.

Actual behavior

The gateway is experiencing severe Node.js event loop starvation. Agents appear frozen because gateway operations are taking 3-7 minutes instead of milliseconds. This is the same symptom pattern as issue #74404 (Gateway CPU-Saturated, Agents Stop Responding), but the system is running a beta version after the supposedly fixed stable release.

OpenClaw version

v2026.5.12-beta.1

Operating system

Ubuntu

Install method

NPM

Model

Minimax

Provider / routing chain

Minimax

Additional provider/model setup details

openclaw-2026-05-13.log

Logs, screenshots, and evidence

## Evidence

### Issue #74404 — Gateway CPU-Saturated, Agents Stop Responding
- **Severity:** HIGH
- **Status:** CLOSED in 2026.5.2 stable (supposedly fixed)
- **Current version:** 2026.5.12-beta.1 (AFTER the stable fix release)
- **Key metric - Event Loop Delays:**
  - 21:06: 28,145ms delay
  - 21:10: 52,126ms delay
  - 21:17: 120,807ms delay (2 minutes)
  - 21:22: 145,221ms delay
  - 21:25: 196,270ms delay
  - 21:29: 211,730ms delay (3.5 minutes)
  - 21:34: 277,594ms delay (4.6 minutes)
  - 21:40: 346,821ms delay (5.8 minutes)
  - **21:47: 420,420ms delay (7 minutes)** ← User reported Marcus stuck here
  - 21:53: stability check FAILED with 10s timeout

### Gateway Response Times (sessions.list)
- 21:38: **1,816,340ms** (30 minutes) for a single sessions.list call
- 21:53: **2,731,729ms** (45 minutes) for sessions.list

### Telegram Polling Stalls (multiple)

21:19:50 - Polling stall detected (142.88s stuck)
21:25:51 - Polling stall detected (206.45s stuck)
21:34:21 - Polling stall detected (287.77s stuck)


### Gateway Stability Check

21:53:35 - Gateway stability failed: GatewayTransportError: gateway timeout after 10000ms


---

## System Info

| Component | Value |
|-----------|-------|
| OS | Linux, NVMe disk |
| Node.js | v24.14.1 |
| Memory Total | 31 GiB |
| Memory Available | 26 GiB |
| Disk Usage | 70% used (68 GiB available) |
| Gateway | loopback (127.0.0.1:18789) |
| Service | systemd user (pid 1582, state active) |

---

## Analysis

The blocklist entry for #74404 states it was "fixed in 2026.5.2 stable". However:

1. **User is running 2026.5.12-beta.1** - a beta version released AFTER the stable fix
2. **The fix appears incomplete or regressed** - event loop delays are back to 400+ seconds
3. **Telegram polling is a major contributor** - getUpdates calls are timing out and blocking

This suggests either:
- The #74404 fix was incomplete
- A regression was introduced after 2026.5.2 stable
- The Telegram polling issue (#73432 - QMD Embed Timer) is exacerbating the problem

---

## Relevant Blocklist Entries

| Issue | Status | Relevance |
|-------|--------|-----------|
| #74404 | CLOSED (2026.5.2 stable) | SAME SYMPTOM - but fix regressed? |
| #73432 | OPEN | Telegram polling stalls |
| #75501 | OPEN | Too many open files (v4.29 regression) |

Impact and severity

This version is useless for me; I needed to downgrade.

Additional information

Marcus Stalled Session - Root Cause Analysis

Date: 2026-05-12 23:47 UTC
Agent: Marcus (Developer)
Version: OpenClaw 2026.5.12-beta.1


Executive Summary

Marcus appeared "stuck" with a writing indicator but no output because the entire gateway event loop was blocked by a long-running Telegram channel startup operation. The Telegram startAccount phase took 7,332,155ms (122 minutes) due to blocking HTTP calls to the Telegram Bot API.


Timeline of Events (Using Marcus as Example)

21:53:xx - Marcus's Session Queued

work=[active=agent:marcus:main(processing,q=1,age=186s) queued=agent:marcus:main(processing,q=1,age=186s)]
  • Marcus's session was queued, waiting 186 seconds for the event loop

21:56:55 - Gateway Liveness Warning

liveness warning: reasons=event_loop_utilization,cpu interval=567s
phase=channels.telegram.start-account
recentPhases=channels.telegram.runtime:0ms,channels.telegram.approval-bootstrap:0ms,
             channels.telegram.start-account:7332155ms,...
  • CRITICAL: the channels.telegram.start-account phase took 7,332,155ms (122 minutes)
  • Normal: should take < 5 seconds

21:47:28 - Event Loop Starvation Intensifies

warn fetch-timeout {"timeoutMs":10000,"elapsedMs":430420,"timerDelayMs":420420,
                    "eventLoopDelayHint":"timer delayed 420420ms, likely event-loop starvation"}
  • Timer delays of 420 seconds (7 minutes) indicate severe event loop starvation
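
The fields in the fetch-timeout warning are related by simple arithmetic: the timer delay is the elapsed time minus the requested timeout. The sketch below reproduces that relationship for the logged values. The helper name `describeTimerDelay` and the 1,000ms starvation threshold are assumptions made for illustration, not OpenClaw code.

```javascript
// Hypothetical helper (not part of OpenClaw): derive the timer delay that a
// fetch-timeout warning reports from the requested timeout and the time that
// actually elapsed before the timer fired.
function describeTimerDelay(timeoutMs, elapsedMs, starvationThresholdMs = 1000) {
  const timerDelayMs = Math.max(0, elapsedMs - timeoutMs);
  return {
    timeoutMs,
    elapsedMs,
    timerDelayMs,
    // Assumed threshold: a timer that fires more than 1s late is suspicious.
    likelyStarved: timerDelayMs > starvationThresholdMs,
  };
}

// Values taken from the warning logged at 21:47:28.
const report = describeTimerDelay(10000, 430420);
console.log(report.timerDelayMs, report.likelyStarved); // 420420 true
```

A timer that fires on schedule gives a delay near zero, so a delay of 420,420ms on a 10,000ms timeout points at the event loop rather than at the remote API.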

Root Cause Analysis

Primary Cause: Blocking HTTP Calls in Telegram startAccount

The Telegram channel startAccount phase makes a sequence of blocking HTTP calls to:

  1. getMe - Bot API probe (line 571 in probe-DuPRVUmp.js)
  2. deleteWebhook - Cleanup before polling
  3. getWebhookInfo - Check webhook state

Code location: server-channels-CrJ7hZRA.js:402

const trackedPromise = Promise.resolve().then(() =>
  measureStartup(`channels.${channelId}.start-account`, () =>
    startAccount({ cfg, accountId: id, account, runtime, ... })
  )
);

The Problem

When Telegram's Bot API is slow or experiencing issues:

  1. fetchWithTimeout() calls block the Node.js event loop
  2. The entire gateway becomes unresponsive
  3. Agent sessions like Marcus appear "stuck" - they can't process because the event loop is blocked

Why This Is Critical

The fetchWithTimeout() function in fetch-timeout-BsLaC-cZ.js:

async function fetchWithTimeout(url, init, timeoutMs, fetchFn = fetch) {
  const { signal, cleanup } = buildTimeoutAbortSignal({...});
  try {
    return await fetchFn(url, { ...init, signal });  // <-- BLOCKING
  } finally {
    cleanup();
  }
}

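Even though it uses AbortController, the abort is triggered by a timer, and a timer can only fire when the event loop is free. The fetch-timeout warning above shows this: a 10,000ms timeout took 430,420ms to fire (timerDelayMs 420,420), so the call ran roughly 7 minutes past its budget.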
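
To illustrate why an abort timer offers no protection while the loop is busy, here is a minimal sketch. It is not OpenClaw code: `measureAbortDelay` is a hypothetical name, and the busy loop stands in for whatever synchronous work is occupying the main thread.

```javascript
// Minimal demonstration (assumed names, not OpenClaw code): an abort timer is
// scheduled for `timeoutMs`, then the main thread is kept busy for `busyMs`.
// The timer callback cannot run until the synchronous work ends.
function measureAbortDelay(timeoutMs, busyMs) {
  return new Promise((resolve) => {
    const controller = new AbortController();
    const scheduledAt = Date.now();

    controller.signal.addEventListener('abort', () => {
      const elapsedMs = Date.now() - scheduledAt;
      resolve({ timeoutMs, elapsedMs, timerDelayMs: elapsedMs - timeoutMs });
    });
    setTimeout(() => controller.abort(), timeoutMs);

    // Simulated synchronous work that starves the event loop.
    const busyUntil = Date.now() + busyMs;
    while (Date.now() < busyUntil) {
      // spin
    }
  });
}

measureAbortDelay(20, 200).then((result) => {
  // elapsedMs is about 200 or more, even though the timeout was 20.
  console.log(result);
});
```

The abort fires only after the busy loop ends, which matches the shape of the logged warning: a short timeout, a long elapsed time, and the difference reported as timer delay.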

Evidence from Logs

  1. Timer delays > 400 seconds: Event loop completely blocked
  2. sessions.list taking 2,731,729ms (45 minutes): Gateway couldn't process basic operations
  3. Telegram polling stalls: getUpdates calls timing out after 287 seconds
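
For anyone trying to reproduce these measurements, event-loop delay can be sampled with Node's built-in `perf_hooks.monitorEventLoopDelay`. The sketch below is an assumption about how such numbers could be collected, not a description of how the gateway produces its own metrics. `blockFor` and `sampleEventLoopDelay` are illustrative names.

```javascript
const { monitorEventLoopDelay } = require('node:perf_hooks');
const { setTimeout: sleep } = require('node:timers/promises');

// Stand-in for whatever synchronous work is starving the loop.
function blockFor(ms) {
  const until = Date.now() + ms;
  while (Date.now() < until) {
    // spin
  }
}

async function sampleEventLoopDelay(blockMs) {
  const histogram = monitorEventLoopDelay({ resolution: 10 });
  histogram.enable();
  await sleep(50);    // let the sampler take a few baseline readings
  blockFor(blockMs);  // starve the loop
  await sleep(50);    // let the delayed sample be recorded
  histogram.disable();
  // Histogram values are in nanoseconds; convert to milliseconds.
  return { maxMs: histogram.max / 1e6, meanMs: histogram.mean / 1e6 };
}

sampleEventLoopDelay(200).then((stats) => console.log(stats));
```

On a healthy process the maximum stays close to the sampling resolution; a maximum in the hundreds of thousands of milliseconds, as reported here, means the loop was not turning at all for minutes at a time.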

Why #74404 Fix Appears Regressed

The blocklist states #74404 was "fixed in 2026.5.2 stable". However:

  1. I am running 2026.5.12-beta.1 - a later beta version
  2. Telegram startup blocking is not addressed in the #74404 fix (which focused on sessions.list performance)
  3. The Telegram API calls are inherently blocking - no concurrent execution

Fix Suggestions

Fix 1: Non-Blocking Telegram Startup (Recommended)

File: server-channels-CrJ7hZRA.js

Move Telegram's startAccount to run in a separate task queue or worker thread:

// Current (BLOCKING):
const trackedPromise = Promise.resolve().then(() =>
  measureStartup(`channels.${channelId}.start-account`, () =>
    startAccount({ cfg, accountId: id, account, runtime, ... })
  )
);

// Fixed (NON-BLOCKING):
// setImmediate schedules startAccount as a separate macrotask, so the gateway
// does not wait for it. This only defers the work; it still runs on the main thread.
const trackedPromise = Promise.resolve().then(async () => {
  if (channelId === 'telegram') {
    // Detached: trackedPromise resolves before startAccount finishes,
    // so failures must be handled here instead of by the caller.
    setImmediate(() => {
      Promise.resolve()
        .then(() =>
          measureStartup(`channels.${channelId}.start-account`, () =>
            startAccount({ cfg, accountId: id, account, runtime, ... })
          )
        )
        .catch((err) => console.error('telegram start-account failed', err));
    });
    return;
  }
  return measureStartup(`channels.${channelId}.start-account`, () =>
    startAccount({ cfg, accountId: id, account, runtime, ... })
  );
});
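
The detached-start idea can be sketched in isolation. The helper below is hypothetical (`startDetached` is not an OpenClaw function): it defers a start function with `setImmediate` and reports the outcome through a promise, so the caller continues immediately but failures are still observable. Deferring only changes when the work begins. If the start function itself performs long synchronous work, the loop will still stall once it runs.

```javascript
// Hypothetical helper: run `startFn` on a later turn of the event loop and
// report success or failure without throwing into the caller.
function startDetached(startFn) {
  return new Promise((resolve) => {
    setImmediate(() => {
      Promise.resolve()
        .then(startFn)
        .then(
          (value) => resolve({ ok: true, value }),
          (error) => resolve({ ok: false, error })
        );
    });
  });
}

// Usage: the caller keeps going before the start function has run.
const order = [];
startDetached(() => {
  order.push('startAccount ran');
  return 'started';
}).then((outcome) => console.log(order, outcome));
order.push('gateway continued');
```

Returning an outcome object instead of rejecting is a design choice for the sketch: a detached task has no caller left to catch a rejection, so the result is made explicit.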

Fix 2: Implement Circuit Breaker for Telegram API

File: probe-DuPRVUmp.js

Add a circuit breaker that fails fast when Telegram is unresponsive:

const circuitBreaker = {
  failures: 0,
  maxFailures: 3,
  resetTimeout: 30000, // 30 seconds

  async call(fn) {
    if (this.failures >= this.maxFailures) {
      throw new Error('Circuit breaker open - Telegram API unavailable');
    }
    try {
      const result = await fn();
      this.failures = 0; // a success resets the consecutive-failure count
      return result;
    } catch (err) {
      this.failures++;
      if (this.failures >= this.maxFailures) {
        // Close the breaker again after the cool-down period
        setTimeout(() => { this.failures = 0; }, this.resetTimeout);
      }
      throw err;
    }
  }
};

// Usage:
meRes = await circuitBreaker.call(() =>
  fetchWithTimeout(`${base}/getMe`, {}, timeoutBudgetMs, fetcher)
);
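
One caveat with a timer-based reset is that it depends on timers firing on schedule, which is exactly what fails under event-loop starvation. A variant that compares timestamps instead is sketched below. All names (`createCircuitBreaker`, `recordFailure`, the injectable `now` clock) are assumptions for illustration, and the reset is a plain cool-down without a half-open probe.

```javascript
// Sketch of a circuit breaker that closes again based on timestamps rather
// than a setTimeout. Names are illustrative, not OpenClaw APIs.
function createCircuitBreaker({ maxFailures = 3, resetTimeoutMs = 30000, now = Date.now } = {}) {
  let failures = 0;
  let openedAt = null;

  function isOpen() {
    if (openedAt === null) return false;
    if (now() - openedAt >= resetTimeoutMs) {
      // Cool-down elapsed: allow calls again.
      openedAt = null;
      failures = 0;
      return false;
    }
    return true;
  }

  function recordSuccess() {
    failures = 0; // only consecutive failures count
  }

  function recordFailure() {
    failures += 1;
    if (failures >= maxFailures) openedAt = now();
  }

  async function call(fn) {
    if (isOpen()) {
      throw new Error('Circuit breaker open - Telegram API unavailable');
    }
    try {
      const result = await fn();
      recordSuccess();
      return result;
    } catch (err) {
      recordFailure();
      throw err;
    }
  }

  return { call, isOpen, recordSuccess, recordFailure };
}
```

Because the open state is checked lazily on each call, the breaker recovers as soon as the next call arrives after the cool-down, regardless of how late any timers are running.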

Fix 3: Use Worker Threads for Blocking HTTP

For Node.js, wrap blocking HTTP calls in a Worker thread:

const { Worker } = require('worker_threads');

// Runs the Telegram startup in a worker thread with its own event loop.
// Note: workerData is structured-cloned, so params must be plain data (no functions).
function startAccountInWorker(params) {
  return new Promise((resolve, reject) => {
    const worker = new Worker('./telegram-start-worker.js', {
      workerData: params
    });
    worker.on('message', resolve);
    worker.on('error', reject);
    worker.on('exit', (code) => {
      if (code !== 0) reject(new Error(`Worker exited with code ${code}`));
    });
  });
}
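
A self-contained version of the worker approach is sketched below so it can be run without a separate file. The inline worker source stands in for the hypothetical `telegram-start-worker.js` and only simulates a slow probe with a timer; `simulatedDelayMs` and the reply shape are assumptions, and no Telegram call is made.

```javascript
const { Worker } = require('node:worker_threads');

// Worker body kept inline (eval: true) so the sketch is self-contained.
// It stands in for a hypothetical telegram-start-worker.js.
const workerSource = `
  const { parentPort, workerData } = require('node:worker_threads');
  setTimeout(() => {
    parentPort.postMessage({ ok: true, accountId: workerData.accountId });
  }, workerData.simulatedDelayMs);
`;

function startAccountInWorker(params) {
  return new Promise((resolve, reject) => {
    // workerData must be structured-cloneable: plain data only, no functions.
    const worker = new Worker(workerSource, { eval: true, workerData: params });
    let settled = false;
    worker.once('message', (message) => {
      settled = true;
      resolve(message);
    });
    worker.once('error', reject);
    worker.once('exit', (code) => {
      if (!settled) reject(new Error(`Worker exited with code ${code} before replying`));
    });
  });
}

startAccountInWorker({ accountId: 'demo', simulatedDelayMs: 50 })
  .then((result) => console.log(result));
```

The worker has its own event loop, so its timers and I/O keep running while the main thread is busy. The main thread still has to be free to receive the reply, so this isolates the startup work but does not cure starvation caused elsewhere.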

Fix 4: Timeout with Aggressive Retry Limits

File: probe-DuPRVUmp.js:571

Reduce the timeout budget and fail faster:

// Current: timeoutBudgetMs can be very large
meRes = await fetchWithTimeout(`${base}/getMe`, {}, timeoutBudgetMs, fetcher);

// Fixed: Cap at 5 seconds, fail fast
const TELEGRAM_START_TIMEOUT = 5000; // 5 seconds max
meRes = await fetchWithTimeout(`${base}/getMe`, {}, TELEGRAM_START_TIMEOUT, fetcher);
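
Replacing the budget with a constant ignores callers that pass a smaller remaining budget. A cap that takes the minimum of the two keeps both limits. The helper below is a hypothetical sketch (`capTimeout` is not an OpenClaw function), and the 5,000ms value is the one suggested in this fix.

```javascript
// Hypothetical helper: bound a caller-supplied timeout budget by a hard cap.
const TELEGRAM_START_TIMEOUT_MS = 5000; // cap suggested in Fix 4

function capTimeout(timeoutBudgetMs, capMs = TELEGRAM_START_TIMEOUT_MS) {
  // Fall back to the cap when the budget is missing or not a positive number.
  if (!Number.isFinite(timeoutBudgetMs) || timeoutBudgetMs <= 0) return capMs;
  return Math.min(timeoutBudgetMs, capMs);
}

console.log(capTimeout(120000)); // 5000
console.log(capTimeout(1500));   // 1500
```

As the fetch-timeout log shows, any timeout is only as reliable as the timer behind it, so a smaller cap helps when Telegram is slow but not when the event loop itself is starved.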

Files Involved

| File | Issue |
|------|-------|
| fetch-timeout-BsLaC-cZ.js | Blocking fetch implementation |
| probe-DuPRVUmp.js:571 | Telegram getMe probe |
| server-channels-CrJ7hZRA.js:402 | startAccount invocation |
| extensions/telegram/* | Telegram channel implementation |
