openclaw - 💡(How to fix) Fix Plugin loaded twice per gateway run: hooks register at +2.0s but service.start() doesn't fire until +9.5s [1 comment, 2 participants]

openclaw/openclaw#72949 (fetched 2026-04-28 06:29:40)

Summary

In plain English: the deep-observability plugin's hook-registration block is logged twice in a single gateway start, once at +2.0s under source [gateway] and again at +9.5s under source [plugins], and service.start() runs only on the second load. This leaves a ~7-second window after the gateway reports "ready" in which hooks ARE registered but telemetry === null, because the closure-resolved telemetry is initialized only inside service.start(). Any agent invocation arriving in that window fires the hooks but silently produces no spans. The double-load also doubles the diagnostic noise in the logs.

Repro

Start the gateway with the deep-observability plugin loaded and watch gw.log:

2026-04-27T07:17:05.456 [gateway] [otel] Registered message_received hook (via api.on)
2026-04-27T07:17:05.457 [gateway] [otel] Registered message_sent hook (via api.on)
... (10 more "Registered ... hook" lines)
2026-04-27T07:17:05.466 [gateway] [otel] Registered gateway:startup hook (via api.registerHook)
2026-04-27T07:17:05.469 [gateway] [plugins] plugin register returned a promise; async registration is ignored (plugin=openclaw-deep-observability, source=...)

2026-04-27T07:17:05.518 [gateway] ready (6 plugins, 2.0s)

2026-04-27T07:17:12.723 [plugins] embedded acpx runtime backend registered (cwd: /sandbox/.openclaw/workspace)
2026-04-27T07:17:12.752 [gateway] [otel] Starting OpenTelemetry Observe PoC...
2026-04-27T07:17:12.756 [gateway] [otel] Trace exporter → http://172.17.0.1:14318/v1/traces (http)
2026-04-27T07:17:12.780 [gateway] [otel] ✅ Observe PoC pipeline active
2026-04-27T07:17:13.240 [plugins] [otel] Registered message_received hook (via api.on)
2026-04-27T07:17:13.241 [plugins] [otel] Registered message_sent hook (via api.on)
... (10 more lines)

There are two distinct hook-registration blocks: the first under source [gateway] at +2.0s, the second under source [plugins] at +9.5s, after the embedded acpx runtime backend registers and service.start() fires.
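To spot the double-load without reading the log by eye, the excerpt above can be scanned programmatically. The TypeScript below is a hypothetical helper (`analyzeGatewayLog`, `REGISTER_RE`, `READY_RE`, and `toMs` are illustrative names, not OpenClaw APIs). It assumes every line has the `<timestamp> [<source>] ...` shape shown in the excerpt, counts how often one hook was registered per log source, and measures the gap between the gateway's `ready` line and the first `[plugins]` registration.

```typescript
// Hypothetical log scanner; assumes the line format shown in the gw.log excerpt.
interface LoadReport {
  registrationsBySource: Map<string, number>; // e.g. "gateway" -> 1, "plugins" -> 1
  readyToPluginsMs: number | null; // gap between "ready" and the first [plugins] registration
}

// "<timestamp> [<source>] [otel] Registered <hook> hook ..."
const REGISTER_RE = /^(\S+) \[(\w+)\] \[otel\] Registered (\S+) hook/;
// "<timestamp> [gateway] ready (...)"
const READY_RE = /^(\S+) \[gateway\] ready\b/;

// The timestamps carry no zone; reading both as UTC keeps their difference exact.
const toMs = (ts: string): number => Date.parse(`${ts}Z`);

function analyzeGatewayLog(lines: string[], hook = "message_received"): LoadReport {
  const registrationsBySource = new Map<string, number>();
  let readyAt: number | null = null;
  let firstPluginsAt: number | null = null;

  for (const line of lines) {
    const ready = READY_RE.exec(line);
    if (ready && readyAt === null) readyAt = toMs(ready[1]);

    const reg = REGISTER_RE.exec(line);
    if (!reg || reg[3] !== hook) continue;
    const ts = reg[1];
    const source = reg[2];
    registrationsBySource.set(source, (registrationsBySource.get(source) ?? 0) + 1);
    if (source === "plugins" && firstPluginsAt === null) firstPluginsAt = toMs(ts);
  }

  const readyToPluginsMs =
    readyAt !== null && firstPluginsAt !== null ? firstPluginsAt - readyAt : null;
  return { registrationsBySource, readyToPluginsMs };
}

// Three lines copied from the excerpt above.
const sample = [
  "2026-04-27T07:17:05.456 [gateway] [otel] Registered message_received hook (via api.on)",
  "2026-04-27T07:17:05.518 [gateway] ready (6 plugins, 2.0s)",
  "2026-04-27T07:17:13.240 [plugins] [otel] Registered message_received hook (via api.on)",
];
const report = analyzeGatewayLog(sample);
console.log(report.registrationsBySource.get("gateway"), report.registrationsBySource.get("plugins")); // 1 1
console.log(report.readyToPluginsMs); // 7722
```

A count above 1 for the same hook, split across two sources, is the double-load signature; the 7722 ms gap is the window described in the summary.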

Confirmed via the deep-observability plugin's source: the [otel] ✅ Observe PoC pipeline active line is logged inside service.start()'s callback. So telemetry is initialized only during the second load, never during the first.

Why it matters

The plugin's hooks are registered with closures that look up telemetry lazily (() => telemetry). Until service.start() runs, telemetry === null for the FIRST plugin instance's hooks. If the gateway processes an agent invocation in the +2.0s → +9.5s window, those hooks fire but produce no spans, because they return early when telemetry is null.
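The mechanism reduces to a few lines. This is a hypothetical model, not the plugin's source: `createPluginInstance`, `makeTelemetry`, and `recordSpan` are illustrative names, and a plain `start()` function stands in for `service.start()`. A hook that resolves `telemetry` lazily and returns early on `null` drops the first event without any error, then records spans normally once `start()` has run.

```typescript
// Hypothetical reduction of the lazy-telemetry pattern; not the plugin's real source.
interface Telemetry {
  spans: string[];
  recordSpan(name: string): void;
}

function makeTelemetry(): Telemetry {
  const spans: string[] = [];
  return { spans, recordSpan: (name: string) => { spans.push(name); } };
}

function createPluginInstance() {
  // Stays null until the stand-in for service.start() runs.
  let telemetry: Telemetry | null = null;
  // Lazy lookup, mirroring the `() => telemetry` closures described above.
  const getTelemetry = (): Telemetry | null => telemetry;

  const onMessageReceived = (event: string): void => {
    const t = getTelemetry();
    if (t === null) return; // early return: the hook fires, but no span and no warning
    t.recordSpan(`message_received:${event}`);
  };

  const start = (): void => {
    telemetry = makeTelemetry();
  };

  return { onMessageReceived, start, getTelemetry };
}

const plugin = createPluginInstance();
plugin.onMessageReceived("ping-early"); // lands in the +2.0s to +9.5s window: dropped
plugin.start();
plugin.onMessageReceived("ping-late");
console.log(plugin.getTelemetry()?.spans.join(",")); // message_received:ping-late
```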

Empirically we saw this only in first-traffic-after-restart scenarios; under normal load the gap is hidden because real traffic rarely arrives within 9.5s of a restart. But it is a latent bug: a fast first request after a gateway restart can produce a span gap with no diagnostic.

Root cause (best guess)

The two log sources ([gateway] vs [plugins]) suggest two distinct plugin loaders:

  • The gateway's primary loader, which walks /sandbox/.openclaw-data/extensions/ and loads each plugin at gateway startup (+2.0s block).
  • The embedded acpx runtime backend's plugin loader, which loads its own copy of the plugin tree when the runtime backend registers (+9.5s block).

That second load might be intentional (the acpx runtime processes agent invocations and may need its own plugin context), but neither block is documented; we inferred this from log inspection.

Suggested fix (in priority order)

1. Single-load the plugin under one consistent registry

If the two loaders are accidentally redundant, pick one as the canonical load path and have the other reference the already-loaded plugin instance. This eliminates the double-registration noise and closes the 7-second window in which hooks fire against null telemetry.
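A minimal sketch of the single-load idea, assuming a registry keyed by plugin id that both loaders consult. `PluginRegistry`, `getOrLoad`, and `PluginInstance` are hypothetical; the stage names reuse the ones proposed in fix 2. Whichever loader asks first creates the instance, and the other receives the same object, so hooks are registered once.

```typescript
// Hypothetical single-load registry; not OpenClaw's actual loader API.
interface PluginInstance {
  id: string;
  loadedBy: string; // loader stage that created this instance
}

class PluginRegistry {
  private readonly instances = new Map<string, PluginInstance>();
  private loads = 0;

  // Returns the existing instance if present; otherwise runs `load` exactly once.
  getOrLoad(id: string, stage: string, load: (stage: string) => PluginInstance): PluginInstance {
    const existing = this.instances.get(id);
    if (existing) return existing;
    this.loads += 1;
    const instance = load(stage);
    this.instances.set(id, instance);
    return instance;
  }

  get loadCount(): number {
    return this.loads;
  }
}

const registry = new PluginRegistry();
const load = (stage: string): PluginInstance => ({ id: "openclaw-deep-observability", loadedBy: stage });

// Both loaders ask for the plugin; only the first request actually loads it.
const a = registry.getOrLoad("openclaw-deep-observability", "gateway-init", load);
const b = registry.getOrLoad("openclaw-deep-observability", "acpx-runtime", load);
console.log(a === b, a.loadedBy, registry.loadCount); // true gateway-init 1
```

This only helps if the acpx runtime can safely share the gateway's instance, which the risk section flags as needing source review.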

2. Document the two-phase load explicitly

If the double-load is intentional (e.g., one is for gateway-level hooks, the other for agent-runtime hooks): add a docs section explaining when each loader fires, what state is available at each stage, and what plugins should expect. Also add a diag log line at INFO level naming the loader stage:

[gateway] loading plugin (stage=gateway-init): openclaw-deep-observability
[plugins] loading plugin (stage=acpx-runtime): openclaw-deep-observability
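If the two-phase load is kept, these lines could come from one helper so the stage name and the log source cannot drift apart. A sketch only: `LoaderStage`, `SOURCE_FOR_STAGE`, and `loadingLine` are hypothetical, and the stage-to-source mapping is inferred from which source tag each block used in the gw.log excerpt. Emitting the string at INFO level is left to whatever logger the gateway already uses.

```typescript
// Hypothetical stage names, taken from the proposed log lines.
type LoaderStage = "gateway-init" | "acpx-runtime";

// Source tag each stage logged under in the observed gw.log (inferred, not documented).
const SOURCE_FOR_STAGE: Record<LoaderStage, string> = {
  "gateway-init": "gateway",
  "acpx-runtime": "plugins",
};

// Builds the proposed INFO-level line for one plugin load.
function loadingLine(stage: LoaderStage, pluginId: string): string {
  return `[${SOURCE_FOR_STAGE[stage]}] loading plugin (stage=${stage}): ${pluginId}`;
}

console.log(loadingLine("gateway-init", "openclaw-deep-observability"));
console.log(loadingLine("acpx-runtime", "openclaw-deep-observability"));
```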

3. Guarantee service.start() runs before any hook is invoked

Independent of the above: have the gateway's request-dispatch path WAIT for service.start() to complete before invoking any plugin hook. That eliminates the null-telemetry race entirely, even if the double-load remains.
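One way to express this, sketched under assumptions: a one-shot gate that resolves when `service.start()` completes, and a dispatcher that awaits it before calling any hook. `ServiceGate`, `Dispatcher`, and `markStarted` are hypothetical names; the issue does not show how OpenClaw actually dispatches hooks. An early request is held rather than dropped, which is the first-request latency the risk section mentions.

```typescript
// Hypothetical gate/dispatcher pair; not OpenClaw's real dispatch path.
type HookFn = (payload: string) => void;

class ServiceGate {
  readonly ready: Promise<void>;
  private resolveReady: () => void = () => {};

  constructor() {
    // The executor runs synchronously, so resolveReady is set before anyone can call it.
    this.ready = new Promise<void>((resolve) => {
      this.resolveReady = () => resolve();
    });
  }

  // Call once service.start() has finished initializing telemetry.
  markStarted(): void {
    this.resolveReady();
  }
}

class Dispatcher {
  private readonly hooks: HookFn[] = [];
  private readonly gate: ServiceGate;

  constructor(gate: ServiceGate) {
    this.gate = gate;
  }

  register(hook: HookFn): void {
    this.hooks.push(hook);
  }

  // Early requests wait here instead of reaching hooks while telemetry is null.
  async dispatch(payload: string): Promise<void> {
    await this.gate.ready;
    for (const hook of this.hooks) hook(payload);
  }
}

// Usage: an early request is held until the simulated service.start() completes.
let spans: string[] | null = null;
const gate = new ServiceGate();
const dispatcher = new Dispatcher(gate);
dispatcher.register((payload) => {
  if (spans !== null) spans.push(`span:${payload}`);
});

const early = dispatcher.dispatch("ping"); // arrives before start
spans = []; // body of the simulated service.start()
gate.markStarted();
void early.then(() => console.log(spans?.join(","))); // span:ping
```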

Alternatives considered

  • Skip the second load if a plugin instance is already registered: simplest dedup, but only if the two loaders are functionally equivalent (which we don't know without source-diving).
  • Force service.start() to be synchronous: doesn't fix the timing because services start as a separate phase from registration.

Test plan

# Restart gateway, immediately fire an agent request:
openclaw gateway > gw.log 2>&1 &
sleep 0.5
openclaw agent --agent main --message "ping"  # arrives in the +2.0s..+9.5s window
# After fix: this request should produce spans in ClickHouse.
# Before fix: spans for this request are missing (hooks fired but telemetry was null).

Risk / blast radius

  • Single-load fix (#1): risk depends on whether the acpx runtime needs its own plugin context. Need source review.
  • Documentation-only fix (#2): zero risk.
  • Wait-for-service-start (#3): adds a small startup latency to first-request-after-restart, but eliminates the race. Moderate risk if startup latency matters.

Open questions for maintainers

  1. Is the double-load intentional? If so, what's the design intent?
  2. Are there other plugins (besides deep-observability) that also see this double-registration?
  3. Related: A13 (plugin async register silently truncated) — is one of these two loaders responsible for that warning, or both?

Tested-against

  • OpenClaw v2026.4.9
  • Plugin: outshift-open/openclaw-deep-observability (main HEAD as of 2026-04-26)
  • NemoClaw v0.0.26 sandbox

Severity

Low day-to-day, but a sharp edge for plugin authors: hooks fire before services are ready, with no warning. Could lurk for a long time before producing a user-visible bug.

Extent analysis

TL;DR

The deep-observability plugin is loaded twice per gateway run, and telemetry is initialized only by the second load, leaving a ~7-second window in which its hooks fire but produce no spans. The most direct fix is to load the plugin once, under a single consistent registry.

Guidance

  • Review the plugin loading mechanism to determine if the double-load is intentional or accidental. If accidental, pick one loader as the canonical load path and have the other reference the already-loaded plugin instance.
  • Consider documenting the two-phase load explicitly, including adding diagnostic log lines to name the loader stage, to help plugin authors understand the loading process.
  • Implement a mechanism to guarantee that service.start() runs before any hook is invoked, eliminating the null-telemetry race.
  • Test the fix using the provided test plan to ensure that spans are produced for requests arriving in the previously problematic window.

Example

The issue does not include the plugin or loader source, so no code example can be given with confidence; any snippet is necessarily a hypothetical sketch.

Notes

The fix may depend on the specific requirements of the acpx runtime backend and the deep-observability plugin. Reviewing the source code and understanding the design intent behind the double-load is crucial to implementing the correct solution.

Recommendation

Apply the single-load fix once source review confirms that the acpx runtime does not need its own plugin context. It is the most direct way to remove both the double-registration noise and the 7-second window in which hooks fire but produce no spans, and unlike waiting for service.start(), it adds no first-request latency.
