openclaw - 💡(How to fix) Fix Discord WebSocket drops every ~5min in multi-bot swarm (heartbeat starvation from slow LLM responses) [1 comment, 2 participants]

GitHub stats
openclaw/openclaw#48931 · Fetched 2026-04-08 00:50:55
Comments: 1 · Participants: 2 · Timeline events: 3 (referenced ×2, commented ×1) · Reactions: 0

Problem

Running a swarm of 15 Discord bots on a single OpenClaw gateway. Bots using slow local Ollama models (32B, ~30-60s response time) experience WebSocket connection drops approximately every 5 minutes.

Symptoms

  • Health monitor logs: health-monitor: restarting (reason: disconnected) and (reason: stale-socket) every ~5 minutes for specific agents
  • typing TTL reached (2m); stopping typing indicator — agents start typing, get restarted mid-processing, response is lost
  • Agents using fast models (8B, ~5s response) are stable; only 32B agents are affected
  • All 15 bots share one IP, so simultaneous reconnects compound the problem

Root Cause (diagnosed)

Discord's gateway requires a Heartbeat (opcode 1) at the interval announced in its Hello payload (heartbeat_interval, roughly every 40-45 seconds). When a bot is processing a long LLM call (30-60s on a 32B model), the event loop handling the WebSocket heartbeat appears to get blocked or starved. Once heartbeats stop arriving on time, Discord treats the connection as a zombie and drops it.
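
One way this kind of starvation can arise is sketched below in illustrative Python (not OpenClaw's own code). A stand-in heartbeat task records the gap between its ticks while a hypothetical slow_llm_call runs either directly on the event loop or offloaded with asyncio.to_thread. The function names and the shortened timings are assumptions chosen so the demo finishes in about a second.

```python
import asyncio
import time

async def ticker(gaps, interval=0.05):
    """Stand-in for a heartbeat task: records the gap between consecutive ticks."""
    last = time.monotonic()
    while True:
        await asyncio.sleep(interval)
        now = time.monotonic()
        gaps.append(now - last)
        last = now

def slow_llm_call(seconds=0.5):
    """Hypothetical blocking call standing in for a synchronous LLM request."""
    time.sleep(seconds)
    return "response"

async def run(offload):
    gaps = []
    task = asyncio.create_task(ticker(gaps))
    await asyncio.sleep(0.1)  # let the ticker start ticking
    if offload:
        await asyncio.to_thread(slow_llm_call)  # runs in a worker thread; loop stays free
    else:
        slow_llm_call()  # runs on the event loop thread; the ticker cannot run meanwhile
    await asyncio.sleep(0.1)  # give the ticker a chance to record the gap
    task.cancel()
    return max(gaps)

def measure():
    """Return (largest gap with a blocking call, largest gap with an offloaded call)."""
    blocked = asyncio.run(run(offload=False))
    offloaded = asyncio.run(run(offload=True))
    return blocked, offloaded
```

Calling measure() should show the largest tick gap growing to roughly the length of the blocking call in the first case, while staying near the tick interval in the second. Scaled up to a 30-60s call and a heartbeat of roughly 40-45 seconds, the same effect would be enough to miss a heartbeat.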

Secondary issues:

  1. No reconnect jitter — health monitor restarts all dropped bots simultaneously, triggering Discord's max_concurrency limit (1 IDENTIFY per 5s per token) and potentially Cloudflare's IP-level rate limit (10k invalid requests/10min)
  2. No state persistence — in-flight LLM responses are lost on restart, causing the "typed and then nothing" UX
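
The second point could be approached with a small on-disk journal of requests that have been accepted but not yet answered, so that a restarted agent can see what it still owes. The sketch below is a minimal illustration; the InflightJournal class, its method names, and the JSON file layout are all hypothetical and not part of OpenClaw.

```python
import json
import os
import tempfile

class InflightJournal:
    """Hypothetical journal of requests accepted but not yet answered."""

    def __init__(self, path):
        self.path = path
        self.pending = {}
        # Reload whatever was still pending when the previous process stopped.
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as f:
                self.pending = json.load(f)

    def _flush(self):
        # Write to a temp file, then rename, so a crash never leaves a half-written journal.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(self.pending, f)
        os.replace(tmp, self.path)

    def begin(self, message_id, channel_id, prompt):
        """Record a request before the LLM call starts."""
        self.pending[message_id] = {"channel_id": channel_id, "prompt": prompt}
        self._flush()

    def complete(self, message_id):
        """Drop a request once its response has been delivered."""
        self.pending.pop(message_id, None)
        self._flush()

    def unfinished(self):
        """Return a copy of the requests that still need a response."""
        return dict(self.pending)
```

After a restart, constructing the journal with the same path and calling unfinished() returns the requests whose responses were lost, which the agent could then retry or acknowledge to the user.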

Environment

  • Mac Studio (Apple Silicon), macOS
  • 15 Discord bots on one OpenClaw gateway
  • Models: ollama/qwen3-agent:32b and ollama/coder-agent:32b (via local Ollama)
  • OpenClaw version: latest

Requested Fixes

  1. Decouple heartbeat from agent processing — WebSocket keepalive should run on a dedicated thread/loop, never blocked by LLM calls
  2. Staggered reconnect with jitter — when health monitor restarts multiple bots, introduce randomised delay (1-15s) between each IDENTIFY to avoid rate limit spikes
  3. Optional: typing indicator loop — re-trigger typing every 9s while LLM is processing, so users know the bot is still thinking even after a reconnect
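
The staggered reconnect in request 2 could be planned up front as a schedule of start offsets, one per bot, with a random 1-15s gap between consecutive IDENTIFYs. This is a sketch under that assumption; staggered_offsets and its parameters are hypothetical names, and the bounds are the ones proposed in the request.

```python
import random

def staggered_offsets(bot_ids, min_gap=1.0, max_gap=15.0, rng=None):
    """Return {bot_id: seconds_from_now}, with a random gap between consecutive restarts."""
    rng = rng or random.Random()
    offsets = {}
    t = 0.0
    for bot_id in bot_ids:
        # Each bot starts a random min_gap..max_gap seconds after the previous one.
        t += rng.uniform(min_gap, max_gap)
        offsets[bot_id] = t
    return offsets
```

With 15 bots and a mean gap of 8s, the last bot would reconnect about two minutes after the first, so the upper bound may need tightening if recovery time matters more than spreading the load.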

Happy to provide gateway logs if useful. This likely affects anyone running multiple agents with slow models on a single gateway.

extent analysis

Fix Plan

To address the issues, we'll implement the following fixes:

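  1. Decouple Heartbeat from Agent Processing:
    • Run the WebSocket heartbeat in its own asyncio task or dedicated thread, separate from agent processing.
    • A separate asyncio task only helps if the LLM call is awaited or moved off the event loop (for example with asyncio.to_thread); a blocking call still starves every task on that loop.

Example using asyncio:

import asyncio

async def send_heartbeat(websocket, interval_s, state):
    # interval_s: heartbeat_interval from Discord's Hello payload, converted from ms to seconds
    # state: dict the receive loop updates with the latest sequence number (assumed)
    while True:
        await asyncio.sleep(interval_s)
        # "d" carries the last sequence number received, or None if there is none yet
        await websocket.send_json({"op": 1, "d": state.get("seq")})

# Create a separate task for the heartbeat
async def main(websocket, interval_s, state):
    # Keep a reference so the task is not garbage-collected while it runs
    heartbeat_task = asyncio.create_task(send_heartbeat(websocket, interval_s, state))
    # Rest of your code...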
  2. Staggered Reconnect with Jitter:
    • Introduce a random delay of 1-15 seconds between restarts when restarting multiple bots.
    • Python's standard random module is enough to generate the delay.

Example:

import asyncio
import random

async def restart_bots(bots):
    for bot in bots:
        # Restart the bot
        # ...
        delay = random.uniform(1, 15)  # Random gap of 1-15 seconds before the next restart
        await asyncio.sleep(delay)  # Non-blocking wait; time.sleep here would stall the event loop
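  3. Typing Indicator Loop (Optional):
    • Re-trigger the typing indicator every 9 seconds while the LLM is processing, and cancel the loop once the response has been sent.
    • Discord triggers typing through the REST endpoint POST /channels/{channel.id}/typing, not through the gateway; gateway opcode 3 is Presence Update.

Example:

import asyncio

async def typing_indicator_loop(session, channel_id, token):
    # session is assumed to be an aiohttp.ClientSession
    url = f"https://discord.com/api/v10/channels/{channel_id}/typing"
    while True:
        async with session.post(url, headers={"Authorization": f"Bot {token}"}):
            pass  # Typing indicator triggered; it expires after about 10 seconds
        await asyncio.sleep(9)  # Wait 9 seconds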
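
A typing loop that runs forever has to be stopped once the response is ready. One possible shape for that is a small wrapper that starts the loop, awaits the LLM work, and always cancels the loop afterwards. In this sketch, with_typing and trigger_typing are hypothetical names; trigger_typing stands for whatever coroutine function actually sends the typing request.

```python
import asyncio
import contextlib

async def with_typing(trigger_typing, work, interval=9.0):
    """Await `work` while calling `trigger_typing()` every `interval` seconds."""
    async def loop():
        while True:
            await trigger_typing()
            await asyncio.sleep(interval)

    typing_task = asyncio.create_task(loop())
    try:
        return await work
    finally:
        # Stop re-triggering as soon as the work finishes or fails.
        typing_task.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await typing_task
```

A call might look like `reply = await with_typing(send_typing, generate_reply(prompt))`, where both send_typing and generate_reply are placeholders for the caller's own coroutines.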

Verification

To verify the fixes, monitor the health monitor logs and WebSocket connections for the bots. Check for:

  • Reduced occurrences of health-monitor: restarting (reason: disconnected) and (reason: stale-socket) logs.
  • Stable connections for bots using slow models.
  • No simultaneous reconnects triggering Discord's max_concurrency limit or Cloudflare's IP-level rate limit.
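
To compare restart frequency before and after a change, the restart messages quoted in the symptoms can be counted by reason. The sketch below matches only the "health-monitor: restarting (reason: ...)" fragment shown in this issue; the rest of the log line format is unknown, so the pattern deliberately ignores it. The name count_restarts is hypothetical.

```python
import re
from collections import Counter

# Matches the fragment quoted in the issue, e.g. "health-monitor: restarting (reason: stale-socket)".
RESTART_RE = re.compile(r"health-monitor: restarting \(reason: (?P<reason>[\w-]+)\)")

def count_restarts(lines):
    """Count health-monitor restarts per reason across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = RESTART_RE.search(line)
        if match:
            counts[match.group("reason")] += 1
    return counts
```

Passing an open log file, or any other iterable of lines, returns a Counter keyed by reason. Running it over equal time windows before and after the fix gives a direct comparison.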

Extra Tips

  • Consider implementing a queueing system to handle in-flight LLM responses and prevent losses during restarts.
  • Monitor the performance of the bots and adjust the heartbeat interval and typing indicator loop as needed.
  • Keep the OpenClaw gateway and dependencies up-to-date to ensure the latest fixes and features.
