openclaw - ✅(Solved) Fix /v1/* endpoints hang on healthy gateway; event loop blocked by bonjour/telegram/model-pricing timeouts [1 pull requests, 1 participants]

GitHub stats

openclaw/openclaw#74633 (fetched 2026-04-30 06:21:54)

Comments: 0 · Participants: 1 · Timeline: 1 · Reactions: 0

Timeline (top): cross-referenced ×1

Root Cause

`/healthz` survives because it's served on a fast path that doesn't depend on the stalled work; `/v1/*` sits behind whatever's blocked.

Fix Action

Fix / Workaround

Something on the gateway HTTP `/v1/*` middleware or its dispatcher shares the event loop with the stalling subsystems above. Either:

  • one of those subsystems (bonjour advertiser, telegram long-poll fetch, model-pricing fetch) is doing sync work / awaiting a non-cancellable promise, or
  • the `/v1/*` router is awaiting a resource (auth resolver? agent registry?) that the same blocked code path is gating.

PR fix notes

PR #74762: fix: gateway model catalog cache regression

Description (problem / solution / changelog)

Summary

Found one regression in the new gateway model catalog cache: it treats an empty catalog as a successful cached catalog, which breaks the underlying retry-on-empty contract.

What ClawSweeper Is Fixing

  • Medium: Gateway caches transient empty model catalogs until reload/restart (regression)
    • File: src/gateway/server-model-catalog.ts:49
    • Evidence: startGatewayModelCatalogRefresh() assigns lastSuccessfulCatalog = catalog for every resolved array, including []. Later, loadGatewayModelCatalog() returns lastSuccessfulCatalog whenever it is truthy, and empty arrays are truthy in JS. The underlying loader explicitly avoids caching empty results at src/agents/model-catalog.ts:215 because an empty catalog can come from transient dependency/filesystem/provider issues and should be retried.
    • Impact: if the first gateway catalog load returns [], models.list, TUI model surfaces, session/model metadata helpers, and related gateway callers keep seeing no models until a model config reload or process restart. This is worse than the prior behavior, where the next request retried immediately.
    • Suggested fix: preserve the underlying no-cache-on-empty behavior in the gateway wrapper. Do not mark an empty result as fresh; keep the cache stale or clear it so the next call retries. Add a regression test where the injected loader returns [] once and a non-empty catalog on the second call.
    • Confidence: high
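
The suggested fix can be sketched as a small cache wrapper that checks length rather than truthiness. This is an illustration only, not the code in `src/gateway/server-model-catalog.ts`: the names `ModelEntry`, `CatalogLoader`, and `createCatalogCache` are hypothetical, and the real loader and cache lifecycle may differ.

```typescript
// Hypothetical shape; the real catalog entry type in the repo may differ.
type ModelEntry = { id: string };
type CatalogLoader = () => Promise<ModelEntry[]>;

// Sketch of a gateway-side cache wrapper that never treats [] as fresh.
function createCatalogCache(load: CatalogLoader) {
  let lastSuccessfulCatalog: ModelEntry[] | undefined;

  return {
    async get(): Promise<ModelEntry[]> {
      // Check length explicitly: an empty array is truthy in JavaScript.
      if (lastSuccessfulCatalog && lastSuccessfulCatalog.length > 0) {
        return lastSuccessfulCatalog;
      }
      const catalog = await load();
      if (catalog.length > 0) {
        lastSuccessfulCatalog = catalog; // cache only non-empty results
      }
      return catalog; // an empty result is returned but not cached
    },
    // Would be called on a model config reload to force a refetch.
    clear(): void {
      lastSuccessfulCatalog = undefined;
    },
  };
}
```

With a loader that returns `[]` once and a model afterwards, the second `get()` calls the loader again instead of replaying the empty result, which is the behavior the proposed regression test would pin down.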

Expected Repair Surface

  • src/gateway/server-model-catalog.ts
  • src/gateway/server-model-catalog.test.ts
  • src/gateway/server-reload-handlers.ts

Source And Review Context

Expected validation

  • pnpm check:changed

ClawSweeper already ran:

  • pnpm docs:list
  • pnpm install after the first targeted test failed because node_modules was missing
  • pnpm test src/gateway/server-model-catalog.test.ts -- --reporter=verbose passed
  • Injected smoke with first loader call returning [] and second returning a model produced {"first":[],"second":[],"calls":1}, confirming the retry is suppressed
  • git diff --check 57a3d7f6e897f25073e313d5c24b6fb6f60575ae..6421e1f36a3cfdf3ab1b4502b36fe718e0d662d3
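
The injected smoke result above can be reproduced in isolation. The following is a minimal sketch of the buggy truthiness check, not ClawSweeper's actual smoke script; `createBuggyCatalogCache` and `runSmoke` are hypothetical names, and the loader is synchronous here only to keep the example short.

```typescript
type ModelEntry = { id: string };

// Buggy sketch: any returned array, including [], is stored as "successful".
function createBuggyCatalogCache(load: () => ModelEntry[]) {
  let lastSuccessfulCatalog: ModelEntry[] | undefined;
  return (): ModelEntry[] => {
    if (lastSuccessfulCatalog) return lastSuccessfulCatalog; // [] is truthy
    lastSuccessfulCatalog = load();
    return lastSuccessfulCatalog;
  };
}

// First loader call returns [], every later call would return one model.
function runSmoke(): string {
  let calls = 0;
  const get = createBuggyCatalogCache(() => {
    calls += 1;
    return calls === 1 ? [] : [{ id: "example-model" }];
  });
  const first = get();
  const second = get();
  return JSON.stringify({ first, second, calls });
}

console.log(runSmoke()); // prints {"first":[],"second":[],"calls":1}
```

The loader is invoked only once because the cached empty array passes the truthiness check, so the second call never reaches the retry.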

Known review limits:

  • Full suite and live gateway smoke were not run; review used focused gateway tests and an injected runtime proof.

ClawSweeper Guardrails

  • Re-check the finding against latest main before changing code.
  • Keep the patch to the narrowest behavior change and matching regression coverage.
  • Do not merge automatically; this PR stays for maintainer review.

ClawSweeper 🐠 replacement reef notes:

  • Cluster: clawsweeper-commit-openclaw-openclaw-6421e1f36a3c
  • Source PRs: none
  • Credit: Detected by ClawSweeper commit review for 6421e1f36a3cfdf3ab1b4502b36fe718e0d662d3; original commit author: Peter Steinberger.
  • Validation: pnpm check:changed

fish notes: model gpt-5.5, reasoning medium; reviewed against da5e171ffab1.

Changed files

  • src/gateway/server-model-catalog.test.ts (modified, +18/-0)
  • src/gateway/server-model-catalog.ts (modified, +1/-1)

Version: 2026.4.26 (be8c246)

Environment:

  • macOS 15.4 (24E248), Intel x86_64
  • Node v24.14.1 (Homebrew node@24)
  • OpenClaw installed via npm global (/usr/local/Cellar/node@24/.../lib/node_modules/openclaw)
  • Single gateway, default port 18789, bind: loopback

Symptom

The gateway reports healthy, but every request to the /v1/* route family hangs and returns 0 bytes.

```bash
$ curl -sS --max-time 5 http://127.0.0.1:18789/healthz
{"ok":true,"status":"live"}   # 200, instant

$ curl -sS --max-time 5 http://127.0.0.1:18789/v1/models
curl: (28) Operation timed out after 5001 ms with 0 bytes received

$ curl -sS --max-time 30 -X POST http://127.0.0.1:18789/v1/chat/completions \
    -H "Authorization: Bearer $TOKEN" -H "Content-Type: application/json" \
    -d '{"model":"openclaw:main","messages":[{"role":"user","content":"hi"}]}'
curl: (28) Operation timed out after 30062 ms with 0 bytes received
```

`/v1/*` hangs with or without the auth header, and before any request line is logged by the gateway HTTP layer.

Diagnosis: blocked Node event loop

`gateway.err.log` shows three unrelated subsystems all stalling on the same ~60s cycle, which is the classic signature of a blocked event loop:

```
[diagnostic] stuck session: sessionKey=agent:main:main state=processing age=154s queueDepth=1
[telegram] Polling stall detected (active getUpdates stuck for 153.4s); forcing restart
[telegram] [diag] polling cycle finished reason=polling stall detected ... durationMs=153406
[plugins] bonjour: service stuck in announcing for 69832ms; restarting advertiser
[plugins] bonjour: watchdog detected non-announced service; attempting re-advertise (state=probing)
[ws] handshake timeout conn=...
[model-pricing] OpenRouter pricing fetch failed (timeout 60s): TimeoutError: The operation was aborted due to timeout
[model-pricing] LiteLLM pricing fetch failed (timeout 60s): TimeoutError: The operation was aborted due to timeout
[ws] ⇄ res ✓ chat.history 19744ms   # WS round-trips also abnormally slow
[ws] ⇄ res ✓ models.list 19751ms
```

`/healthz` survives because it's served on a fast path that doesn't depend on the stalled work; `/v1/*` sits behind whatever's blocked.
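
One low-cost way to test the blocked-event-loop reading is a timer-drift probe: schedule a repeating timer and measure how late each tick fires. The sketch below is an assumption about how such a probe could look, not something that exists in OpenClaw; `timerLagMs` and `startLagProbe` are hypothetical names. Node's built-in `monitorEventLoopDelay` from `node:perf_hooks` gives a histogram-based alternative.

```typescript
// How late a timer fired relative to its schedule, in milliseconds.
function timerLagMs(scheduledAt: number, intervalMs: number, firedAt: number): number {
  return Math.max(0, firedAt - scheduledAt - intervalMs);
}

// Starts a repeating probe and reports ticks that fire at least
// `thresholdMs` late. Returns a function that stops the probe.
function startLagProbe(
  intervalMs: number,
  thresholdMs: number,
  onLag: (lagMs: number) => void,
): () => void {
  let scheduledAt = Date.now();
  const timer = setInterval(() => {
    const now = Date.now();
    const lag = timerLagMs(scheduledAt, intervalMs, now);
    if (lag >= thresholdMs) onLag(lag);
    scheduledAt = now; // the next tick is measured from this one
  }, intervalMs);
  return () => clearInterval(timer);
}

// Example wiring (illustrative values):
// const stop = startLagProbe(500, 1000, (lag) => console.warn(`event loop lag ${lag}ms`));
```

If the loop is blocked by synchronous work, the probe reports large lag values clustered around the stall; if the lag stays near zero while `/v1/*` still hangs, the second hypothesis (the router awaiting a gated resource) becomes more likely.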

What is not the cause (already ruled out)

| Hypothesis | Result |
| --- | --- |
| Auth profile expired | Re-issued OAuth; pinned valid profile via `openclaw models auth order set --provider openai-codex --agent main <profile> openai-codex:default`. No change. |
| Model alias (`gpt-5.4` → `gpt-5.5`) | Switched. No change. |
| Embedded agent runtime / specific provider | `main` (`openai-codex/gpt-5.5`) and `personal-assistant` (`deepseek/deepseek-chat`) hang identically. |
| Tailscale Serve | Reproduces on `127.0.0.1` localhost. |
| Outbound network | Direct `curl` to `api.openai.com`, `api.deepseek.com`, `api.moonshot.ai` all return. |
| `agents.defaults.timeoutSeconds` too low | Not bumped on purpose: a `"hi"` prompt should not legitimately need >30s. |

Repro

  1. Run a default OpenClaw gateway on macOS Intel + Node 24.
  2. Wait until `gateway.err.log` shows the bonjour / telegram / model-pricing timeout cycle (typically within a few minutes).
  3. `curl /healthz` → 200 instant.
  4. `curl /v1/models` → hangs to timeout, 0 bytes, no log line.

Suspected area

Something on the gateway HTTP `/v1/*` middleware or its dispatcher shares the event loop with the stalling subsystems above. Either:

  • one of those subsystems (bonjour advertiser, telegram long-poll fetch, model-pricing fetch) is doing sync work / awaiting a non-cancellable promise, or
  • the `/v1/*` router is awaiting a resource (auth resolver? agent registry?) that the same blocked code path is gating.
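
If the second hypothesis holds, bounding each await in the request path would at least turn a silent hang into a visible error. The following is a sketch under that assumption; `withDeadline` and `DeadlineError` are hypothetical helpers, not part of the OpenClaw codebase.

```typescript
class DeadlineError extends Error {
  constructor(label: string, ms: number) {
    super(`${label} did not settle within ${ms}ms`);
    this.name = "DeadlineError";
  }
}

// Rejects if `work` has not settled after `ms` milliseconds.
// Note: this stops waiting; it does not cancel the underlying work.
function withDeadline<T>(work: Promise<T>, ms: number, label: string): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new DeadlineError(label, ms)), ms);
    work.then(
      (value) => {
        clearTimeout(timer);
        resolve(value);
      },
      (err) => {
        clearTimeout(timer);
        reject(err);
      },
    );
  });
}

// Example wiring (resolveAuth is a placeholder for whatever the router awaits):
// const auth = await withDeadline(resolveAuth(req), 5000, "auth resolver");
```

This only helps when the loop is still turning. If a subsystem is doing synchronous work, the deadline timer cannot fire either, so the wrapper distinguishes the two hypotheses rather than fixing the first one.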

Happy to capture more (a `--inspect` CPU profile, `SIGUSR2` heap snapshot, or anything else) — just say the word.

extent analysis

TL;DR

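The likely cause is a blocked Node event loop. Either one of the stalling subsystems (bonjour advertiser, telegram long-poll fetch, model-pricing fetch) is doing synchronous work or awaiting a non-cancellable promise, or the `/v1/*` router is awaiting a resource gated by the same blocked code path. Both are hypotheses from the issue report; neither has been confirmed.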

Guidance

  • Investigate the bonjour advertiser, telegram long-poll fetch, and model-pricing fetch subsystems for potential synchronous work or non-cancellable promises that could be blocking the event loop.
  • Review the /v1/* router code to identify any resources (e.g., auth resolver, agent registry) that may be gated by the same blocked code path.
  • Consider using --inspect to capture a CPU profile or SIGUSR2 to capture a heap snapshot to further diagnose the issue.
  • Verify that the issue is not caused by a specific middleware or dispatcher by temporarily disabling or bypassing them.

Example

No code snippet is provided as the issue is more related to the overall system architecture and event loop blocking rather than a specific code snippet.

Notes

The issue seems to be related to the Node event loop being blocked, which is a complex problem that requires careful investigation and diagnosis. The provided guidance is meant to help narrow down the potential causes and identify the root of the issue.

Recommendation

Identify what is blocking the event loop and fix it at the source, for example by rewriting synchronous code to be asynchronous or by making long-running awaits cancellable. This will likely require a code change to the affected subsystems or middleware.
