openclaw - ✅(Solved) Fix [Bug]: Gateway runtime degradation: pricing fetch 60s timeouts, Telegram polling stalls, slow RPC — chronic across 4.23/4.25/4.26 on Windows 11 + Node 24 [2 pull requests, 9 comments, 8 participants]

GitHub stats
openclaw/openclaw#73323 (fetched 2026-04-29 06:21:02)
Comments: 9 · Participants: 8 · Timeline events: 24 · Reactions: 0
Timeline (top): commented ×9, cross-referenced ×5, subscribed ×5, mentioned ×4

Gateway long-running Node process exhibits multi-subsystem network/timer degradation (model-pricing fetch 60s timeouts, Telegram polling stalls 127–266s, RPC slowdowns 8–83s) reproducible across 2026.4.23, 2026.4.25, and 2026.4.26 on Windows 11 build 26100.8115 + Node 24.14.1. From a standalone Node process on the same machine, fetch() to the same endpoints completes in 100–800ms.

Error Message

gateway/model-pricing      | OpenRouter pricing fetch failed (timeout 60s): TimeoutError
gateway/model-pricing      | LiteLLM pricing fetch failed (timeout 60s): TimeoutError
gateway/channels/telegram  | [telegram] Polling stall detected (active getUpdates stuck for 127.45s); forcing restart.
gateway/channels/telegram  | polling cycle finished reason=polling stall detected ... durationMs=127457 error=Network request for 'getUpdates' failed!
gateway/channels/telegram  | telegram sendMessage failed: Network request for 'sendMessage' failed!
gateway/channels/telegram  | telegram message processing failed: HttpError: Network request for 'sendMessage' failed!
gateway/ws                 | res ✓ models.list 55798ms      (normally <500ms)
gateway/ws                 | res ✓ models.list 83581ms
gateway/ws                 | res ✓ doctor.memory.status 35988ms
diagnostic                 | stuck session: state=processing age=282s queueDepth=1

Fix Action

Fix / Workaround

What does NOT explain it (each tested):

| Hypothesis | Evidence against |
|---|---|
| Bot token / Telegram API issue | `curl https://api.telegram.org/bot<token>/getMe` returns ok=true in 0.1s, consistently |
| Public network slow | Standalone `node -e "fetch(...)"` hits api.telegram.org and openrouter.ai/api/v1/models in 100–800ms |
| IPv6 vs IPv4 | Both `--dns-result-order=ipv4first` and default IPv6-first succeed via standalone Node fetch in <120ms; DNS resolves both A and AAAA cleanly |
| Bundled plugin runtime deps missing | `openclaw doctor --fix` reports all deps installed |
| `fetchWithSsrFGuard` connection pool | Verified in dist/fetch-guard-C10MVwBt.js the SSRF guard creates a per-call dispatcher and disposes on completion. Pricing code (dist/usage-format-ZhKID6__.js) uses raw fetch + AbortSignal.timeout(60000), not SSRF wrapper, and still times out |
| OS-level network state corruption | Full Windows reboot (cold boot to gateway start) reproduces chronic within ~30 minutes |
| 4.25 / 4.26 regression | Identical signatures on 2026.4.23 (a979721) before any 4.25/4.26 install |
| Node 24 specific | Same Node 24 binary fetches fine from a standalone process; only the long-running gateway process degrades |

Workaround attempts that did NOT help:

  • openclaw doctor --fix (3 cycles)
  • openclaw gateway restart (10+ cycles)
  • Hard kill (Stop-Process -Force on PID owning :18789 + tray) → clean restart
  • Full Windows 11 reboot
  • Downgrade 4.25 → 4.23 → back to 4.25 → 4.26
  • channels.telegram.pollTimeoutMs: 5000 (vs default 30000)
  • Force IPv4 via NODE_OPTIONS=--dns-result-order=ipv4first
  • Removed unused providers (arcee/openrouter)
  • openclaw sessions cleanup --enforce --fix-missing

Hypotheses (ranked) for maintainers:

  1. Shared global undici dispatcher / Agent state degrades over time. Multiple subsystems (model-pricing, Telegram grammyjs runner, doctor.memory.status) all use shared global undici and all start failing together. Hand-off / keep-alive socket reaping appears to break — getUpdates requests sit 127–266s past their AbortSignal timeout, suggesting the abort/timer layer is no longer firing as expected.
  2. Telegram grammyjs polling runner long-poll keep-alive sockets go stale; runner's stall detector only catches it after 127–197s. Plausibly correlates with pricing-fetch / RPC slowdown if all three share the same global dispatcher.
  3. Event-loop starvation during channels-and-sidecars phase — models.list 55–83s, node.list 8.9s, doctor.memory.status 35s suggests a long-running synchronous task is blocking the loop, which would also explain pricing-fetch timers not firing.

PR fix notes

PR #73486: fix(gateway): defer pricing refresh until ready

Description (problem / solution / changelog)

Summary

  • Move the model-pricing refresh out of the initial Gateway runtime setup path.
  • Start the pricing refresh only after sidecars/channels have reached the ready path and scheduled services are activated.
  • Keep pricing enabled by default; this does not add another disable switch or change OpenRouter/LiteLLM fetch timeouts.

Why

Issue #73323 reports Windows Gateway startup degradation where OpenRouter/LiteLLM pricing fetches time out and Telegram/channel startup is delayed. The remote pricing catalogs are optional cost enrichment, so they should not be able to compete with channel readiness during startup.

This is intentionally narrower than the broader runtime/network degradation investigation. It preserves the pricing feature while ensuring the refresh cannot run before Gateway/channel readiness.
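The ordering described above can be illustrated with a minimal sketch. It is not the PR's code: `createStartup`, `startChannels`, `refreshPricing`, and `log` are hypothetical names chosen for illustration. The idea shown is only that the optional pricing refresh starts after readiness and that its failure never blocks startup.

```javascript
// Hypothetical sketch; names do not mirror OpenClaw's source.
function createStartup({ startChannels, refreshPricing, log = () => {} }) {
  return async function start() {
    // Critical path: channels/sidecars must be ready before optional work begins.
    await startChannels();
    log('ready');

    // Optional enrichment: started only now, fire-and-forget.
    // A failure is logged but never rejects the startup promise.
    const pricing = Promise.resolve()
      .then(refreshPricing)
      .catch((err) => log(`pricing refresh failed: ${err.name}`));

    // The promise is returned so callers (or tests) can observe completion.
    return { pricing };
  };
}
```

With this shape, a 60s pricing timeout can delay only the pricing cache itself, not the "ready" log line or channel startup.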

Validation

  • pnpm test src/gateway/server-runtime-services.test.ts
  • pnpm exec oxfmt --check --threads=1 src/gateway/server-runtime-services.ts src/gateway/server-runtime-services.test.ts src/gateway/server.impl.ts
  • pnpm tsgo:core
  • pnpm tsgo:core:test
  • git diff --check

Related: #73323

Changed files

  • src/gateway/server-runtime-services.test.ts (modified, +33/-24)
  • src/gateway/server-runtime-services.ts (modified, +12/-12)
  • src/gateway/server.impl.ts (modified, +30/-12)

PR #72033: feat(gateway): add diagnostics.pricing method for pricing cache visibility

Description (problem / solution / changelog)

Summary

  • Adds a diagnostics.pricing gateway method that returns model pricing cache state: cachedAt, age, ttlMs, size
  • Registered under READ_SCOPE alongside diagnostics.stability — no new auth surface
  • Wires the existing but previously unexposed getGatewayModelPricingCacheMeta() into the diagnostics handler pattern

Operators currently have no way to tell whether the pricing cache is populated, stale, or empty after a startup timeout. This came up across multiple issues where [model-pricing] pricing bootstrap failed: TimeoutError left users unable to determine cache state (#53639, #59348, #67653).
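As a rough illustration of what such a method could return, the sketch below derives the four fields named in the PR summary (cachedAt, age, ttlMs, size) from a simple in-memory cache. The cache shape, the millisecond units, and the extra `state` label are assumptions made for this example; the real `getGatewayModelPricingCacheMeta()` may differ.

```javascript
// Hypothetical cache shape: { cachedAt: number | null, ttlMs: number, entries: Map }.
// `state` is an illustrative addition, not a field listed in the PR.
function pricingCacheMeta(cache, now = Date.now()) {
  const populated = cache.cachedAt !== null;
  const age = populated ? now - cache.cachedAt : null; // assumed to be milliseconds
  let state = 'empty';
  if (populated) {
    state = age > cache.ttlMs ? 'stale' : 'fresh';
  }
  return {
    cachedAt: cache.cachedAt,
    age,
    ttlMs: cache.ttlMs,
    size: cache.entries.size,
    state,
  };
}
```

An operator could then tell an empty cache (startup fetch timed out) apart from a stale one (a later refresh failed) without reading logs.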

Test plan

  • pnpm vitest run src/gateway/server-methods/diagnostics.test.ts — 8 tests pass (2 new pricing + 4 existing stability + 2 pricing-cache)
  • pnpm run tsgo:core — typecheck clean

🤖 Generated with Claude Code

Changed files

  • src/gateway/method-scopes.ts (modified, +1/-0)
  • src/gateway/server-methods-list.ts (modified, +1/-0)
  • src/gateway/server-methods/diagnostics.test.ts (modified, +45/-0)
  • src/gateway/server-methods/diagnostics.ts (modified, +14/-0)

Raw issue body

Bug type

Regression (worked before, now fails)

Beta release blocker

No

Summary

Gateway long-running Node process exhibits multi-subsystem network/timer degradation (model-pricing fetch 60s timeouts, Telegram polling stalls 127–266s, RPC slowdowns 8–83s) reproducible across 2026.4.23, 2026.4.25, and 2026.4.26 on Windows 11 build 26100.8115 + Node 24.14.1. From a standalone Node process on the same machine, fetch() to the same endpoints completes in 100–800ms.

Steps to reproduce

  1. npm i -g openclaw@<version> --omit=optional (tested versions: 2026.4.23, 2026.4.25, 2026.4.26)
  2. openclaw doctor --fix (gateway auto-restarts; bundled deps installed cleanly)
  3. Configure Telegram channel: channels.telegram.enabled=true, valid botToken, dmPolicy: allowlist, plugins.entries.telegram.enabled=true
  4. openclaw gateway start → /health returns 200 within ~30s, log shows "ready (2 plugins: memory-core, telegram)"
  5. Wait 2–5 minutes
  6. The first "Polling stall detected" and "pricing fetch failed (timeout 60s)" log lines appear
  7. The cycle recurs every 2–3 minutes thereafter; getUpdates and sendMessage calls fail with a bare "Network request for '...' failed!"

Expected behavior

Gateway RPC and outbound HTTP fetches complete in <1s consistently, matching the timing observed when the same Node 24.14.1 binary issues fetch() to the same endpoints from a standalone process on the same host:

  • fetch('https://api.telegram.org/bot<token>/getMe') → 106ms (IPv6-first), 116ms (IPv4-first)
  • PowerShell curl to api.telegram.org → 0.1s
  • PowerShell curl to openrouter.ai/api/v1/models → 0.8s

Telegram polling and sendMessage should run continuously without 100s+ stalls.

Actual behavior

Multi-subsystem network/timer degradation observed simultaneously inside the long-running gateway process:

gateway/model-pricing      | OpenRouter pricing fetch failed (timeout 60s): TimeoutError
gateway/model-pricing      | LiteLLM pricing fetch failed (timeout 60s): TimeoutError
gateway/channels/telegram  | [telegram] Polling stall detected (active getUpdates stuck for 127.45s); forcing restart.
gateway/channels/telegram  | polling cycle finished reason=polling stall detected ... durationMs=127457 error=Network request for 'getUpdates' failed!
gateway/channels/telegram  | telegram sendMessage failed: Network request for 'sendMessage' failed!
gateway/channels/telegram  | telegram message processing failed: HttpError: Network request for 'sendMessage' failed!
gateway/ws                 | res ✓ models.list 55798ms      (normally <500ms)
gateway/ws                 | res ✓ models.list 83581ms
gateway/ws                 | res ✓ doctor.memory.status 35988ms
diagnostic                 | stuck session: state=processing age=282s queueDepth=1

In a single 1-hour observation window: 6 polling stalls, 4 sendMessage failures, 14 pricing-fetch 60s timeouts, plus multiple models.list / doctor.memory.status / node.list RPCs clocking 8–83s where they normally finish in <500ms.

Direct probes from PowerShell curl https://api.telegram.org/bot<token>/getMe and from a separate node -e "fetch(...)" to the SAME endpoints succeed in 0.1–0.8s consistently throughout these gateway-internal stalls.
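A standalone probe of this kind can be sketched as follows. This is not the reporter's script: `timedFetch` is a hypothetical helper, and the injectable `fetchImpl` parameter exists only so the probe can be exercised without network access. It relies on the global `fetch`, `AbortSignal.timeout`, and `performance` available in recent Node versions.

```javascript
// Hypothetical helper: time one request and enforce a deadline.
async function timedFetch(url, timeoutMs = 5000, fetchImpl = fetch) {
  const started = performance.now();
  try {
    const res = await fetchImpl(url, { signal: AbortSignal.timeout(timeoutMs) });
    return { ok: res.ok, status: res.status, ms: performance.now() - started };
  } catch (err) {
    // A deadline hit surfaces as an error named "TimeoutError".
    return { ok: false, error: err.name, ms: performance.now() - started };
  }
}

// Example usage (commented out so the block has no side effects):
// timedFetch('https://openrouter.ai/api/v1/models').then(console.log);
```

Running the same helper both standalone and from inside the long-running process, at the same moment, is one way to confirm that the slowdown is process-internal rather than network-side.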

OpenClaw version

2026.4.26 (be8c246) — also reproduced on 2026.4.25 (aa36ee6) and 2026.4.23 (a979721)

Operating system

Windows 11 build 26100.8115

Install method

npm global (--omit=optional); Node v24.14.1; PowerShell 5.1

Model

xiaomi/mimo-v2.5-pro (primary); reproduces regardless of model — pricing fetch + Telegram getUpdates stall independent of LLM choice

Provider / routing chain

openclaw -> Telegram polling (bundled grammyjs runner) -> api.telegram.org; openclaw -> xiaomi (mimo via api.xiaomimimo.com); openclaw -> openrouter.ai/api/v1/models + LiteLLM public pricing JSON (gateway-internal hardcoded fetches)

Additional provider/model setup details

No proxy configured (no HTTP_PROXY / HTTPS_PROXY / ALL_PROXY env vars). UK home broadband, no VPN, no corporate firewall. Fallback chain: zai/glm-5.1, xiaomi/mimo-v2.5, minimax/MiniMax-M2.7. All providers reachable when probed from a standalone Node process; degradation is gateway-internal only.

Logs, screenshots, and evidence

**What does NOT explain it (each tested):**

| Hypothesis | Evidence against |
|---|---|
| Bot token / Telegram API issue | `curl https://api.telegram.org/bot<token>/getMe` returns ok=true in 0.1s, consistently |
| Public network slow | Standalone `node -e "fetch(...)"` hits api.telegram.org and openrouter.ai/api/v1/models in 100–800ms |
| IPv6 vs IPv4 | Both `--dns-result-order=ipv4first` and default IPv6-first succeed via standalone Node fetch in <120ms; DNS resolves both A and AAAA cleanly |
| Bundled plugin runtime deps missing | `openclaw doctor --fix` reports all deps installed |
| `fetchWithSsrFGuard` connection pool | Verified in dist/fetch-guard-C10MVwBt.js the SSRF guard creates a per-call dispatcher and disposes on completion. Pricing code (dist/usage-format-ZhKID6__.js) uses raw fetch + AbortSignal.timeout(60000), not SSRF wrapper, and still times out |
| OS-level network state corruption | Full Windows reboot (cold boot to gateway start) reproduces chronic within ~30 minutes |
| 4.25 / 4.26 regression | Identical signatures on 2026.4.23 (a979721) before any 4.25/4.26 install |
| Node 24 specific | Same Node 24 binary fetches fine from a standalone process — only the long-running gateway process degrades |

**Process resource snapshot at degradation point (PID 16776):**
- working set 616 MB / private 811 MB
- 45 threads
- **3337 handles** (notably high)
- 25 min uptime

**Workaround attempts that did NOT help:**
- `openclaw doctor --fix` (3 cycles)
- `openclaw gateway restart` (10+ cycles)
- Hard kill (`Stop-Process -Force` on PID owning :18789 + tray) → clean restart
- Full Windows 11 reboot
- Downgrade 4.25 → 4.23 → back to 4.25 → 4.26
- `channels.telegram.pollTimeoutMs: 5000` (vs default 30000)
- Force IPv4 via `NODE_OPTIONS=--dns-result-order=ipv4first`
- Removed unused providers (arcee/openrouter)
- `openclaw sessions cleanup --enforce --fix-missing`

Logs are sanitized of bot tokens / API keys; happy to share unredacted logs privately.

Impact and severity

  • Affected: all gateway-internal outbound HTTP, including Telegram polling/sendMessage and model-pricing fetch, plus in-process gateway RPC (models.list, doctor.memory.status, node.list).
  • Severity: High. Telegram bot replies are blocked or delayed by 5+ minutes; gateway RPC is slow enough that openclaw-sweep tools fail or return partial results. The user has to fall back to Tray UI / webchat for any reliable use.
  • Frequency: Always. The degradation recurs every 2–3 minutes once the gateway has been up for more than 5 minutes, on this Windows 11 + Node 24.14.1 host across 2026.4.23, 2026.4.25, and 2026.4.26.
  • Consequence: Telegram channel effectively unusable; missed or delayed messages; gateway needs constant restarts; /health flickers between 200 and timeout.

Additional information

Hypotheses (ranked) for maintainers:

  1. Shared global undici dispatcher / Agent state degrades over time. Multiple subsystems (model-pricing, Telegram grammyjs runner, doctor.memory.status) all use shared global undici and all start failing together. Hand-off / keep-alive socket reaping appears to break — getUpdates requests sit 127–266s past their AbortSignal timeout, suggesting the abort/timer layer is no longer firing as expected.
  2. Telegram grammyjs polling runner long-poll keep-alive sockets go stale; runner's stall detector only catches it after 127–197s. Plausibly correlates with pricing-fetch / RPC slowdown if all three share the same global dispatcher.
  3. Event-loop starvation during channels-and-sidecars phase — models.list 55–83s, node.list 8.9s, doctor.memory.status 35s suggests a long-running synchronous task is blocking the loop, which would also explain pricing-fetch timers not firing.

Note on Codex / parallel diagnostic: An independent agent (Codex) ran a parallel diagnostic on the same machine and concurs the runtime degradation is process-internal, not network-side. openclaw-sweep runId 8924e8d6-d776-4ed5-94be-a87fd194372b available on request.

Last known good version: unknown — bug present in oldest version we could test (2026.4.23). Not a recent regression.

Happy to provide: full gateway log (C:\tmp\openclaw\openclaw-2026-04-2*.log), --inspect profile, OPENCLAW_DEBUG_INGRESS_TIMING=1 / OPENCLAW_DEBUG_HEALTH=1 traces, doctor --deep output (216s runtime; no actionable network-layer findings), or any other diagnostic that helps narrow which layer (undici Agent, grammyjs runner, gateway event loop, Windows-specific socket behavior) is degrading.

extent analysis

TL;DR

The leading hypothesis for the gateway's multi-subsystem network/timer degradation is that the shared global undici dispatcher/Agent state degrades over time, which would affect every subsystem that uses it. Investigating that dispatcher is the most promising first step.

Guidance

  1. Verify undici version: Check the version of undici being used in the project and ensure it is up-to-date, as issues with the dispatcher/Agent state could be related to a specific version.
  2. Monitor event loop: Use tools like --inspect profile or OPENCLAW_DEBUG_INGRESS_TIMING=1 to monitor the event loop and identify any potential long-running synchronous tasks that could be blocking the loop.
  3. Investigate grammyjs polling runner: Look into the grammyjs polling runner's keep-alive socket management and stall detection to see if it's contributing to the issue.
  4. Check for resource leaks: With 3337 handles and 45 threads, it's possible that there's a resource leak; investigate and address any potential leaks to prevent further degradation.
  5. Test with a minimal setup: Try to reproduce the issue with a minimal setup, removing any unnecessary plugins or dependencies to isolate the root cause.
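For the resource-leak check, a periodic in-process snapshot is one possible starting point. The sketch below uses `process.getActiveResourcesInfo()` (marked experimental in the Node.js docs) and `process.memoryUsage()`. Note the assumption: these count Node-level active resources, which is not the same metric as the Windows handle count (3337) quoted in the report, so the numbers are only comparable as trends. `resourceSnapshot` is a hypothetical name.

```javascript
// Hypothetical helper: summarize active Node resources and memory.
function resourceSnapshot() {
  // Tally active resources by type, e.g. { Timeout: 3, TCPSocketWrap: 12 }.
  const activeResources = {};
  for (const type of process.getActiveResourcesInfo()) {
    activeResources[type] = (activeResources[type] ?? 0) + 1;
  }
  const { rss, heapUsed } = process.memoryUsage();
  return {
    uptimeSec: Math.round(process.uptime()),
    rssMB: Math.round(rss / 2 ** 20),
    heapUsedMB: Math.round(heapUsed / 2 ** 20),
    activeResources,
  };
}

// Example usage (commented out): log a snapshot once a minute.
// setInterval(() => console.log(JSON.stringify(resourceSnapshot())), 60_000).unref();
```

A socket or timer count that climbs steadily in the minutes before the first stall would support the leak hypothesis; a flat count would point elsewhere.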

Example

No specific code example is provided, as the issue seems to be related to the interaction between multiple subsystems and libraries. However, monitoring the event loop and investigating the undici dispatcher/Agent state could involve using Node.js built-in tools like process.cpuUsage() or performance.now() to track performance metrics.
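One way to act on the `performance.now()` suggestion is a timer-drift sampler: schedule a short interval and record how late each tick fires. Late ticks indicate that something held the event loop. This is a minimal sketch under that assumption; `startLagSampler` and its options are hypothetical names, and it measures only loop blocking, not socket or dispatcher state.

```javascript
// Hypothetical helper: detect event-loop blocking by measuring timer lateness.
function startLagSampler({ intervalMs = 100, warnMs = 200, onLag = () => {} } = {}) {
  let expected = performance.now() + intervalMs;
  let maxLagMs = 0;

  const timer = setInterval(() => {
    const now = performance.now();
    // How much later than scheduled did this tick run?
    const lag = Math.max(0, now - expected);
    maxLagMs = Math.max(maxLagMs, lag);
    if (lag >= warnMs) onLag(lag);
    expected = now + intervalMs;
  }, intervalMs);

  // Do not keep the process alive just for sampling.
  timer.unref();

  return {
    // Stops sampling and returns the worst lag observed, in milliseconds.
    stop() {
      clearInterval(timer);
      return maxLagMs;
    },
  };
}
```

If the sampler reports multi-second lag at the same moments that models.list takes 55–83s, that supports the event-loop starvation hypothesis; if lag stays near zero during stalls, the shared dispatcher or socket layer becomes the more likely suspect.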

Notes

The issue seems to be complex and related to the interaction between multiple subsystems. It's essential to methodically investigate each potential cause, starting with the most likely ones, to identify the root cause of the problem.

Recommendation

Start by investigating the shared global undici dispatcher/Agent state, since its degradation over time is the most likely cause of the issue. This may involve updating …
