openclaw - 💡(How to fix) Fix Discord WebSocket drops every ~5min in multi-bot swarm (heartbeat starvation from slow LLM responses) [1 comments, 2 participants]

openclaw · 2026-03-17 10:34:36

openclaw/openclaw#48931 • Fetched 2026-04-08 00:50:55

Author: RoddyHill

Participants: RoddyHill, Ryce

Timeline (top): referenced ×2, commented ×1

Problem

Running a swarm of 15 Discord bots on a single OpenClaw gateway. Bots using slow local Ollama models (32B, ~30-60s response time) experience WebSocket connection drops approximately every 5 minutes.

Symptoms

- Health monitor logs show health-monitor: restarting (reason: disconnected) and (reason: stale-socket) every ~5 minutes for specific agents
- typing TTL reached (2m); stopping typing indicator: agents start typing, get restarted mid-processing, and the response is lost
- Agents using fast models (8B, ~5s response) are stable; only 32B agents are affected
- All 15 bots share one IP, so simultaneous reconnects compound the problem

Root Cause (diagnosed)

Discord's gateway expects a Heartbeat (opcode 1) at the interval it announces in the Hello payload (heartbeat_interval, roughly every 45 seconds). When a bot is processing a long LLM call (30-60s on a 32B model), the event loop handling the WebSocket heartbeat appears to get blocked or starved. Once heartbeats stop arriving on time, Discord treats the connection as a zombie and drops it.
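The starvation described above can be reproduced in isolation. The following is a minimal sketch, assuming a single-threaded event loop and using Python's asyncio purely for illustration; ticker and slow_llm_call are hypothetical stand-ins, not OpenClaw code. It measures the longest gap between ticks of a periodic task when a slow call runs on the loop versus in a worker thread.

```python
import asyncio
import time


async def ticker(period, gaps, stop):
    """Stand-in for a heartbeat loop: records the real gap between ticks."""
    last = time.monotonic()
    while not stop.is_set():
        await asyncio.sleep(period)
        now = time.monotonic()
        gaps.append(now - last)
        last = now


def slow_llm_call(duration):
    """Hypothetical stand-in for a synchronous, slow model call."""
    time.sleep(duration)


async def run(offload, period=0.05, duration=0.4):
    gaps, stop = [], asyncio.Event()
    task = asyncio.create_task(ticker(period, gaps, stop))
    await asyncio.sleep(period * 2)  # let the ticker start
    if offload:
        # The slow call runs in a worker thread; the loop keeps ticking
        await asyncio.to_thread(slow_llm_call, duration)
    else:
        # The slow call runs on the loop; no tick can fire until it returns
        slow_llm_call(duration)
    await asyncio.sleep(period * 2)
    stop.set()
    await task
    return max(gaps)


def worst_gap(offload):
    """Longest observed gap between ticks, in seconds."""
    return asyncio.run(run(offload))
```

With the call on the loop, the worst gap grows to roughly the duration of the call; with the call offloaded, it stays close to the tick period. Scaled up to a 30-60s model call and a heartbeat due roughly every 45 seconds, the first case would be enough to miss a beat.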

Secondary issues:

- No reconnect jitter: the health monitor restarts all dropped bots simultaneously, which can hit Discord's IDENTIFY rate limit (max_concurrency IDENTIFYs per 5 seconds, normally 1, per token) and potentially Cloudflare's IP-level limit (10k invalid requests per 10 minutes)
- No state persistence: in-flight LLM responses are lost on restart, causing the "typed and then nothing" UX

Environment

Mac Studio (Apple Silicon), macOS
15 Discord bots on one OpenClaw gateway
Models: ollama/qwen3-agent:32b and ollama/coder-agent:32b (via local Ollama)
OpenClaw version: latest

Requested Fixes

- Decouple heartbeat from agent processing: WebSocket keepalive should run on a dedicated thread/loop, never blocked by LLM calls
- Staggered reconnect with jitter: when the health monitor restarts multiple bots, introduce a randomised delay (1-15s) between each IDENTIFY to avoid rate limit spikes
- Optional typing indicator loop: re-trigger typing every 9s while the LLM is processing, so users know the bot is still thinking even after a reconnect

Happy to provide gateway logs if useful. This likely affects anyone running multiple agents with slow models on a single gateway.

extent analysis

Fix Plan

To address the issues, we'll implement the following fixes:

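Decouple Heartbeat from Agent Processing:
- Run WebSocket heartbeats in their own task or thread, separate from LLM processing.
- In Python, asyncio (a separate task on the event loop) or threading (a separate OS thread) can do this. An asyncio task only helps if the LLM call itself does not block the loop.

Example using asyncio:

import asyncio

async def send_heartbeat(websocket, interval_s, state):
    # websocket is assumed to expose send_json(), as aiohttp's client WebSocket does.
    # interval_s comes from heartbeat_interval (milliseconds) in Discord's Hello payload (opcode 10).
    while True:
        # "d" must carry the last sequence number received, or None if there is none yet
        await websocket.send_json({"op": 1, "d": state.get("seq")})
        await asyncio.sleep(interval_s)

# Create a separate task for the heartbeat
async def main(websocket, interval_s, state):
    # Keep a reference so the task is not garbage-collected
    heartbeat_task = asyncio.create_task(send_heartbeat(websocket, interval_s, state))
    # Rest of your code...
    # Blocking LLM calls must be awaited or moved off the loop (e.g. asyncio.to_thread),
    # otherwise this task is starved too.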
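The interval does not have to be hard-coded: Discord sends it in the Hello payload (opcode 10) as heartbeat_interval in milliseconds, and asks clients to send the first heartbeat after interval × jitter, with jitter a random value between 0 and 1. A small sketch of deriving both values; the function name and the injectable rng parameter are illustrative assumptions:

```python
import random


def heartbeat_schedule(hello, rng=random.random):
    """Return (first_delay_s, interval_s) from a Hello (opcode 10) payload.

    rng is injectable only to make the jitter testable; it should
    return a float in [0, 1).
    """
    if hello.get("op") != 10:
        raise ValueError("not a Hello payload")
    interval_s = hello["d"]["heartbeat_interval"] / 1000.0
    # The first beat is jittered so that many bots started together do not
    # all send their first heartbeat at the same instant.
    return interval_s * rng(), interval_s
```

In a 15-bot swarm sharing one process, the jittered first beat also spreads heartbeat traffic out rather than sending all of it in one burst.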

Staggered Reconnect with Jitter:
- Insert a random pause of 1-15 seconds between restarts when several bots have dropped.
- Python's standard random module is enough to generate the delay.

Example:

import random
import time

def restart_bots(bots):
    # time.sleep blocks the calling thread, so run this outside the event loop
    for index, bot in enumerate(bots):
        if index:
            # Random pause before every bot except the first
            time.sleep(random.uniform(1, 15))
        # Restart the bot
        # ...
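If the restarts are driven from an asyncio loop, a non-blocking variant avoids stalling other bots' heartbeats while waiting. This is a sketch under assumptions: restart is a hypothetical coroutine supplied by the caller, and the 5-second floor is chosen here to stay clear of the 5-second IDENTIFY window mentioned in the issue, giving 5-15s between restarts rather than 1-15s.

```python
import asyncio
import random


async def staggered_restart(bots, restart, min_gap=5.0, max_jitter=10.0,
                            sleep=asyncio.sleep, rng=random.uniform):
    """Restart bots one at a time with a jittered pause between them.

    restart is a caller-supplied coroutine function (hypothetical here);
    sleep and rng are injectable so the schedule can be tested.
    Returns the list of delays that were applied.
    """
    delays = []
    for index, bot in enumerate(bots):
        if index:  # no wait before the first bot
            delay = min_gap + rng(0.0, max_jitter)
            delays.append(delay)
            await sleep(delay)  # yields to the loop; heartbeats keep running
        await restart(bot)
    return delays
```

Restarting serially trades recovery time for predictability: with 15 bots the last one may wait a couple of minutes, which seems acceptable compared with repeated rate-limited reconnect storms.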

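Typing Indicator Loop (Optional):
- Re-trigger the typing indicator every 9 seconds while the LLM is processing.
- Discord exposes typing as a REST call (POST /channels/{channel.id}/typing), not a gateway message; gateway opcode 3 is Presence Update. The loop therefore needs an HTTP client, with asyncio managing the loop itself.

Example:

import asyncio

async def typing_indicator_loop(session, channel_id, token):  # session: aiohttp.ClientSession (assumed)
    url = f"https://discord.com/api/v10/channels/{channel_id}/typing"
    while True:  # cancel this task once the reply has been sent
        async with session.post(url, headers={"Authorization": f"Bot {token}"}):
            pass  # success is 204 No Content, so there is nothing to read
        await asyncio.sleep(9)  # the indicator lasts roughly 10 seconds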
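A typing loop like the one above runs forever unless something cancels it. One way to tie its lifetime to the LLM call is an async context manager; this is a sketch, and trigger_typing is a hypothetical caller-supplied coroutine that performs whatever request actually shows the indicator.

```python
import asyncio
import contextlib


@contextlib.asynccontextmanager
async def typing_while(trigger_typing, every=9.0):
    """Call trigger_typing() repeatedly until the wrapped block finishes."""

    async def loop():
        while True:
            await trigger_typing()
            await asyncio.sleep(every)

    task = asyncio.create_task(loop())
    try:
        yield
    finally:
        # Stop the loop whether the block succeeded or raised
        task.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await task
```

Usage would look like `async with typing_while(trigger): reply = await generate(prompt)`, where generate is whatever coroutine produces the response. The indicator stops as soon as the block exits, including on errors.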

Verification

To verify the fixes, monitor the health monitor logs and WebSocket connections for the bots. Check for:

Reduced occurrences of health-monitor: restarting (reason: disconnected) and (reason: stale-socket) logs.
Stable connections for bots using slow models.
No simultaneous reconnects triggering Discord's max_concurrency limit or Cloudflare's IP-level rate limit.
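To put numbers on the first check, restart events can be counted per reason before and after the change. The sketch below matches only the message fragment quoted in the issue; the rest of the log line format (timestamps, agent names) is unknown, so the pattern deliberately ignores it.

```python
import re
from collections import Counter

# Matches the fragment quoted in the issue; surrounding fields are not assumed
RESTART_RE = re.compile(r"health-monitor: restarting \(reason: ([\w-]+)\)")


def count_restarts(lines):
    """Count health-monitor restarts by reason across an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = RESTART_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

Running it over equal-length log windows from before and after the fix gives a direct comparison of disconnected and stale-socket restarts.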

Extra Tips

Consider implementing a queueing system to handle in-flight LLM responses and prevent losses during restarts.
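Monitor the performance of the bots and tune the typing indicator loop and reconnect delays as needed. The heartbeat interval itself is dictated by Discord's Hello payload and should not be changed on the client.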
Keep the OpenClaw gateway and dependencies up-to-date to ensure the latest fixes and features.
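As an illustration of the queueing tip, in-flight requests can be recorded on disk before the LLM call starts and cleared once the reply is sent, so a restarted bot can see what it still owes. This is a minimal sketch, not an OpenClaw feature; PendingStore, its file layout, and the payload fields are all hypothetical.

```python
import json
import os
import tempfile


class PendingStore:
    """Tracks in-flight requests in a JSON file so they survive a restart."""

    def __init__(self, path):
        self.path = path

    def _load(self):
        if not os.path.exists(self.path):
            return {}
        with open(self.path, encoding="utf-8") as fh:
            return json.load(fh)

    def _save(self, data):
        # Write to a temp file, then rename, so a crash mid-write
        # cannot leave a half-written file behind
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(data, fh)
        os.replace(tmp, self.path)

    def begin(self, request_id, payload):
        """Record a request before the LLM call starts."""
        data = self._load()
        data[request_id] = payload
        self._save(data)

    def finish(self, request_id):
        """Clear a request once its reply has been delivered."""
        data = self._load()
        data.pop(request_id, None)
        self._save(data)

    def pending(self):
        """Requests that were started but never finished."""
        return self._load()
```

On startup, a bot could read pending() and either re-run those prompts or tell the user the earlier request was interrupted, which addresses the "typed and then nothing" experience.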
