openclaw - 💡(How to fix) Fix gateway: scheduler fairness — human-facing sessions starved by concurrent background agents on shared event loop

Problem

The OpenClaw gateway runs all agents (main user sessions, cron-driven agents, Bug Fixer, debug-agent, subagents, channel plugins) inside a single Node.js process on a single event loop, with no scheduler-level awareness of which agent's work should take priority when the loop is contended. As a result, an interactive human-facing session (e.g., a Slack DM) can sit in idle/model_call for many minutes while background/cron agents hold the loop with their own streaming, plugin hooks, and tool calls.

This is distinct from the existing event-loop saturation work (#78861, #77532, #82462, #58519): those address making individual heavy paths cheaper, or moving specific subsystems (keepalive, stream parsing) off the main thread. Even after those land, two heavy agents in the same process will still compete on whatever event-loop budget remains — and the gateway has no policy that says "prefer the human waiting on Slack over the cron sweeper."

#82017 (pluggable subagent backends) is the orthogonal "move workers off-host entirely" solution. This issue asks for the in-process complement: when multiple agents share one gateway, give the scheduler the policy hooks to differentiate them.

Empirical observation (2026-05-20)

Single-host deployment, OpenClaw current main, Node 22.x. A user sent three follow-up Slack DMs ("Done yet?") while the main DM session was waiting on a model response. Liveness samples from the relevant window:

11:57:16  active=1 q=1   work=[agent:main:slack:direct:<uid>(processing/model_call age=0s)]
                          eventLoopDelayP99=25.6   cpuCoreRatio=0.153
12:00:16  active=1 q=2   active=agent:main:cron-lp-reply-handler(tool_call age=6s)
                          queued=agent:main:slack:direct:<uid>(idle/model_call age=5s)
                          eventLoopDelayP99=99.7   cpuCoreRatio=0.437
12:02:18  active=0 q=2   queued=agent:main:slack:direct:<uid>(idle/model_call age=6s)
                          eventLoopDelayP99=191.5  cpuCoreRatio=0.496
12:04:48  active=1 q=3   active=agent:bug-fixer:cron:<id>(processing/model_call age=0s)
                          queued=agent:main:slack:direct:<uid>(idle/model_call age=15s)
                          eventLoopDelayP99=880.8  cpuCoreRatio=0.734

What the user sees: the bot goes silent for ~8+ minutes despite three "is it done?" pings. The DM session itself is healthy — it's queued behind a cron-lp-reply-handler tool call, then a Bug Fixer cron model call. There is no signal exposed to the scheduler that the human-facing DM should preempt these.

Why existing work doesn't close this

#78861 / #77532 / #82462: reduce per-turn cost. Helpful, but two agents at any nonzero cost still serialize on one loop.
#58519 (worker thread for Slack keepalive): keeps the socket alive, but doesn't get the user's reply out any faster.
#82017 (pluggable subagent backends): can move subagents off-host, but doesn't apply to first-class gateway agents (Bug Fixer cron, channel-attached main sessions) that need OpenClaw-internal state.
agents.defaults.maxConcurrent: caps concurrency, but it is agent-agnostic. Capping at 1 also stops legitimate parallelism (cron + DM) and only shifts the queue without changing its ordering.
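
To make the last point concrete, here is a minimal sketch contrasting arrival-order dequeueing (what a global cap leaves you with) against class-aware ordering. Everything in it is an assumption for illustration: the `PriorityClass` names follow the classes proposed in this issue, while `QueuedTurn`, `fifoOrder`, `priorityOrder`, the shortened session keys, and the class assigned to each agent are hypothetical and not OpenClaw's actual scheduler types.

```typescript
// Hypothetical types for illustration; not OpenClaw's actual scheduler API.
type PriorityClass = "interactive" | "service" | "background";

interface QueuedTurn {
  sessionKey: string;
  priority: PriorityClass;
  enqueuedAt: number; // arrival order (sequence number or timestamp)
}

// Lower rank runs first.
const RANK: Record<PriorityClass, number> = {
  interactive: 0,
  service: 1,
  background: 2,
};

// Agent-agnostic cap: with a limit of 1 the turns run one at a time,
// strictly in arrival order, regardless of who is waiting.
function fifoOrder(queue: readonly QueuedTurn[]): string[] {
  return [...queue]
    .sort((a, b) => a.enqueuedAt - b.enqueuedAt)
    .map((t) => t.sessionKey);
}

// Class-aware ordering: priority class first, arrival order within a class.
function priorityOrder(queue: readonly QueuedTurn[]): string[] {
  return [...queue]
    .sort(
      (a, b) =>
        RANK[a.priority] - RANK[b.priority] || a.enqueuedAt - b.enqueuedAt,
    )
    .map((t) => t.sessionKey);
}

// Loosely modeled on the agents in the samples above; class assignments are assumed.
const queue: QueuedTurn[] = [
  { sessionKey: "agent:bug-fixer:cron", priority: "background", enqueuedAt: 1 },
  { sessionKey: "agent:main:cron-lp-reply-handler", priority: "service", enqueuedAt: 2 },
  { sessionKey: "agent:main:slack:direct", priority: "interactive", enqueuedAt: 3 },
];

console.log(fifoOrder(queue));     // Slack DM is dequeued last
console.log(priorityOrder(queue)); // Slack DM is dequeued first
```

Lowering or raising the cap changes how many of these turns run at once, but only the second function changes which one goes first.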

Suggested directions

Not prescribing an implementation, but a sketch of the policy surfaces that would help:

Per-agent / per-trigger priority class. Config-declared priorities — e.g., interactive (Slack DM, Telegram DM, TUI), service (channel cron, scheduled), background (Bug Fixer cron, maintenance sweepers). Scheduler picks higher-priority work first when multiple agents have queued turns.
Preemption signal between turns. Background agents check a preemption flag at safe yield points (between SSE chunks, between tool calls) and voluntarily defer the next chunk if a higher-priority agent has work queued. Doesn't require interrupting a mid-flight tool call — just delaying the next unit of work.
Queue-depth-aware demotion. When the human-facing queue depth for a session exceeds N (e.g., user has sent 3 pings), automatically demote background agents until the queue drains. The 12:04:48 sample above is exactly this signal.
Operator-visible fairness metrics. Expose head-of-line wait per priority class in gateway diagnostics export and liveness warnings, so saturation events can be classified as "fair queue contention" vs "true starvation."
openclaw doctor rule. Warn when a priority class's p95 head-of-line wait exceeds a threshold — gives operators a concrete signal before users complain.
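
The last two directions (fairness metrics and a doctor rule) could be prototyped roughly as follows. This is a sketch under stated assumptions, not a description of the existing diagnostics export: `WaitSample`, `p95ByClass`, `fairnessWarnings`, the threshold values, and the sample numbers are all hypothetical, and the percentile uses the simple nearest-rank method.

```typescript
// Hypothetical names throughout; OpenClaw's diagnostics export and doctor
// rules may be shaped differently.
type PriorityClass = "interactive" | "service" | "background";

interface WaitSample {
  priority: PriorityClass;
  headOfLineWaitMs: number; // how long the oldest queued turn in the class waited
}

// Nearest-rank percentile over a non-empty list of numbers.
function percentile(values: readonly number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// p95 head-of-line wait per priority class; classes without samples are omitted.
function p95ByClass(samples: readonly WaitSample[]): Map<PriorityClass, number> {
  const grouped = new Map<PriorityClass, number[]>();
  for (const s of samples) {
    const bucket = grouped.get(s.priority) ?? [];
    bucket.push(s.headOfLineWaitMs);
    grouped.set(s.priority, bucket);
  }
  const result = new Map<PriorityClass, number>();
  grouped.forEach((waits, cls) => result.set(cls, percentile(waits, 95)));
  return result;
}

// Doctor-style check: one warning per class whose p95 exceeds its threshold.
function fairnessWarnings(
  samples: readonly WaitSample[],
  thresholdsMs: Partial<Record<PriorityClass, number>>,
): string[] {
  const warnings: string[] = [];
  p95ByClass(samples).forEach((p95, cls) => {
    const limit = thresholdsMs[cls];
    if (limit !== undefined && p95 > limit) {
      warnings.push(`${cls}: p95 head-of-line wait ${p95}ms exceeds ${limit}ms`);
    }
  });
  return warnings;
}

// Illustrative numbers only.
const samples: WaitSample[] = [
  { priority: "interactive", headOfLineWaitMs: 5000 },
  { priority: "interactive", headOfLineWaitMs: 6000 },
  { priority: "interactive", headOfLineWaitMs: 15000 },
  { priority: "background", headOfLineWaitMs: 200 },
];

console.log(fairnessWarnings(samples, { interactive: 10000, background: 60000 }));
// [ "interactive: p95 head-of-line wait 15000ms exceeds 10000ms" ]
```

Keeping the thresholds per class is a deliberate choice in this sketch: a wait that is acceptable for a background sweeper is already a user-visible stall for an interactive session.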

Acceptance criteria (possible first milestone)

Config schema for declaring agent priority classes (default mappings for built-in channels and cron agents).
Scheduler honors priority when selecting the next runnable turn from the active set.
Liveness work=[...] entries include the priority class label.
At least one yield-point between SSE chunks (or equivalent) so a long stream from a background agent doesn't block a queued interactive turn indefinitely.
Test: with one background agent mid-stream and one interactive turn queued, the interactive turn begins within a bounded number of chunks.
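
The yield-point criterion and its test can be explored with a small deterministic model before touching the real streaming path. The sketch below is a simulation, not gateway code: `Turn`, `simulate`, and `yieldEvery` are hypothetical names, one loop step stands in for one SSE chunk, and the scheduler only re-selects work at a yield point (every `yieldEvery` chunks, or when the current turn finishes).

```typescript
// Hypothetical model for illustration; not OpenClaw's scheduler.
type PriorityClass = "interactive" | "service" | "background";

const RANK: Record<PriorityClass, number> = { interactive: 0, service: 1, background: 2 };

interface Turn {
  name: string;
  priority: PriorityClass;
  chunks: number;        // units of work remaining (e.g. SSE chunks)
  arrivesAtStep: number; // step at which the turn becomes queued
}

// Runs one chunk per step and returns the step at which each turn first ran.
// At each yield point the highest-priority turn that has arrived is selected.
function simulate(turns: readonly Turn[], yieldEvery: number): Map<string, number> {
  const pending: Turn[] = turns.map((t) => ({ ...t }));
  const startedAt = new Map<string, number>();
  let current: Turn | undefined = undefined;
  let sinceYield = 0;
  let step = 0;

  while (pending.some((t) => t.chunks > 0)) {
    const atYieldPoint =
      current === undefined || current.chunks === 0 || sinceYield >= yieldEvery;
    if (atYieldPoint) {
      current = pending
        .filter((t) => t.chunks > 0 && t.arrivesAtStep <= step)
        .sort(
          (a, b) =>
            RANK[a.priority] - RANK[b.priority] || a.arrivesAtStep - b.arrivesAtStep,
        )[0];
      sinceYield = 0;
    }
    if (current !== undefined) {
      if (!startedAt.has(current.name)) startedAt.set(current.name, step);
      current.chunks -= 1;
      sinceYield += 1;
    }
    step += 1;
  }
  return startedAt;
}

// One background agent mid-stream, one interactive turn queued at step 3.
const turns: Turn[] = [
  { name: "background", priority: "background", chunks: 100, arrivesAtStep: 0 },
  { name: "interactive", priority: "interactive", chunks: 2, arrivesAtStep: 3 },
];

console.log(simulate(turns, 4).get("interactive"));    // 4
console.log(simulate(turns, 1000).get("interactive")); // 100
```

With a yield point every 4 chunks the interactive turn starts one step after it is queued, and in general no later than `yieldEvery` steps after arrival. With effectively no yield points it waits for the entire 100-chunk background stream, which is the behavior this issue describes.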

Environment

OpenClaw 2026.5.5 (running on a deployment that skipped 5.6 due to #78604; not yet on 5.7)
Node 22.x linux/x64
Gateway: mode: local, single instance
Channels: slack, whatsapp (externalized), with cron + Bug Fixer + occasional debug-agent activity
8 plugins loaded (post heavy-plugin disables per #77347/#77348)

Happy to share a sanitized gateway diagnostics export with the head-of-line timing if useful for repro/profiling.

Related issues

#78861: canonical broad gateway-pressure tracker
#77532: stream-ready / core-plugin-tools per-turn latency
#82462: model-fetch streaming starves event loop (closed; its chunk-yield direction overlaps with the "Preemption signal between turns" direction above)
#58519: Slack Socket Mode pong starvation (worker-thread direction)
#82017: pluggable subagent execution backends (orthogonal: moving work off-host vs. fairness in-host)
#75147: agent drops tool execution responses when a new user message arrives mid-task (adjacent: queue/turn boundary semantics)
