hermes - [OPS] CPU spikes to 144%+ due to uncoordinated kanban worker CI parallelism

GitHub stats
NousResearch/hermes-agent#28706 (fetched 2026-05-20 04:02:25)
Comments: 0 · Participants: 1 · Timeline events: 7 · Reactions: 0
Timeline (top): labeled ×4, cross-referenced ×2, closed ×1

Root Cause

Root Cause Analysis


Observed

Load average was 7.58–8.93 on a 4-core VM (189–223% of CPU capacity). Linux pressure stall information (PSI) showed CPU pressure "some" avg10 at 14.88 and a sustained avg300 at 48.16.
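The figures above can be reproduced on the host with a few lines of Python. This is a minimal sketch, not part of Hermes: the helper names `load_percent` and `parse_psi_line` are hypothetical, and it assumes a Linux kernel that exposes PSI under `/proc/pressure/`.

```python
import os


def load_percent(load: float, cores: int) -> float:
    """Express a load average as a percentage of total core capacity."""
    return 100.0 * load / cores


def parse_psi_line(line: str) -> dict:
    """Parse one line of /proc/pressure/<resource> into a dict.

    A line has the shape: 'some avg10=<x> avg60=<x> avg300=<x> total=<n>'.
    """
    kind, *fields = line.split()
    parsed = {"kind": kind}
    for field in fields:
        key, _, raw = field.partition("=")
        parsed[key] = float(raw)
    return parsed


if __name__ == "__main__":
    one_min, five_min, fifteen_min = os.getloadavg()
    cores = os.cpu_count() or 1
    print(f"1-min load: {one_min:.2f} ({load_percent(one_min, cores):.0f}% of {cores} cores)")
    # /proc/pressure/cpu only exists on kernels built with PSI support.
    with open("/proc/pressure/cpu") as psi:
        for line in psi:
            print(parse_psi_line(line))
```

With the reported values, a load of 7.58 on 4 cores works out to about 189.5%, which matches the lower bound quoted above.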

Top CPU Consumers (sampled Mon 13:34 UTC)

| Process | CPU% | Board | What |
|---|---|---|---|
| tsc --noEmit | 134% | the-swarm | TypeScript type-check on SoldierPanel.ts |
| npm exec tsc --noEmit | 87.5% | shop | TypeScript type-check on user-validation.test.ts |
| vitest (3 workers) | 69% + 51% + 50% | shop | Test runner with 3 parallel workers |
| vitest | 13.3% | music-library | Test runner |
| tsserver.js x3 | ~48% total | music-library, shop | TypeScript language servers |
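Adding up the sampled figures shows how far demand exceeded capacity. The sketch below is simple arithmetic on the numbers listed above; the dictionary keys are labels chosen for illustration, and the tsserver figure is the approximate total from the sample.

```python
# CPU% per process, taken from the sample above (tsserver is an approximate total).
sampled_cpu_percent = {
    "tsc --noEmit (the-swarm)": 134.0,
    "npm exec tsc --noEmit (shop)": 87.5,
    "vitest worker 1 (shop)": 69.0,
    "vitest worker 2 (shop)": 51.0,
    "vitest worker 3 (shop)": 50.0,
    "vitest (music-library)": 13.3,
    "tsserver.js x3": 48.0,
}

CORES = 4
capacity_percent = CORES * 100  # 100% per core

total_percent = sum(sampled_cpu_percent.values())
overcommit = total_percent / capacity_percent

print(f"demand: {total_percent:.1f}% of {capacity_percent}% capacity")
print(f"overcommit factor: {overcommit:.2f}x")
```

The listed processes alone account for roughly 453% against 400% of available capacity, before counting the 17 worker processes themselves.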

17 Hermes kanban worker Python processes were active across 10 boards. Multiple boards were running CI simultaneously, each spawning tsc --noEmit and vitest with no inter-board coordination.

Memory Pressure

  • 6.5G / 7.8G used (83%)
  • 6.4G / 14G swap used: significant swap thrashing, which adds IO pressure (IO some avg10: 1.93)

Root Cause Chain

  1. max_spawn=5 allows 5 concurrent kanban workers (whether this limit applies per board or globally is unconfirmed)
  2. Multiple boards run project-ci workflow simultaneously → each spawns tsc --noEmit + vitest
  3. tsc processes are CPU-bound (~130% each), vitest workers compete for remaining cores
  4. No admission control to throttle CPU-intensive CI steps when system is overloaded
  5. Memory exhaustion leads to swap → compounding IO wait
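Step 4 of the chain is the missing piece: nothing checks system load before a CPU-heavy step starts. A minimal sketch of such an admission check follows. The function names, the threshold of 1.0 load per core, and the polling interval are all assumptions for illustration; none of this is taken from the Hermes codebase.

```python
import os
import time


def load_ratio() -> float:
    """Return the 1-minute load average divided by the number of cores."""
    cores = os.cpu_count() or 1
    one_min, _, _ = os.getloadavg()
    return one_min / cores


def wait_for_capacity(
    threshold: float = 1.0,        # assumed limit: at most 1.0 load per core
    poll_seconds: float = 5.0,
    timeout_seconds: float = 300.0,
    get_ratio=load_ratio,          # injectable so the guard can be tested
    sleep=time.sleep,
    clock=time.monotonic,
) -> bool:
    """Block until load drops to the threshold; return False on timeout."""
    deadline = clock() + timeout_seconds
    while True:
        if get_ratio() <= threshold:
            return True
        if clock() >= deadline:
            return False
        sleep(poll_seconds)


if __name__ == "__main__":
    if wait_for_capacity():
        print("capacity available, safe to start the CI step")
    else:
        print("system still overloaded, deferring the CI step")
```

Returning False on timeout rather than raising leaves the caller free to defer or requeue the task, which avoids adding retries to an already overloaded host.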

Impact

  • Worker timeouts and slowdowns
  • Increased failure rate from resource starvation
  • Potential cascading failures when timed-out workers retry

Recommended Actions

  1. Immediate: Reduce max_spawn to 2–3 until per-worker CPU cgroup limits are in place
  2. Short-term: Add CPU load guard — workers should check loadavg before spawning CI steps
  3. Short-term: Serialize tsc across boards (single file-lock or queue for type-checking)
  4. Medium-term: cgroup CPU limits per kanban worker process
  5. Medium-term: vitest workers should be capped at 1 when system load > 75%
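Actions 3 and 5 can be sketched in a few lines. The example below uses an advisory file lock (`fcntl.flock`) so that only one type-check runs at a time across boards, and builds a vitest command that is capped to one worker under load. The lock path, the function names, and the reading of "load > 75%" as a load-per-core ratio above 0.75 are assumptions; `--maxWorkers` is a vitest CLI option, but check it against the vitest version in use.

```python
import fcntl
import os
import subprocess
import tempfile
from contextlib import contextmanager

# Hypothetical lock location shared by all boards on the host.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "hermes-tsc.lock")


@contextmanager
def exclusive_lock(path: str = LOCK_PATH):
    """Hold an exclusive advisory lock on `path` for the duration of the block."""
    with open(path, "a") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks until other holders release
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def run_typecheck(project_dir: str) -> int:
    """Run tsc --noEmit, serialized across all processes that use the same lock."""
    with exclusive_lock():
        return subprocess.run(["npx", "tsc", "--noEmit"], cwd=project_dir).returncode


def vitest_command(load_ratio: float, threshold: float = 0.75) -> list:
    """Build the vitest command, capping workers at 1 when load is high."""
    command = ["npx", "vitest", "run"]
    if load_ratio > threshold:
        command.append("--maxWorkers=1")
    return command
```

A blocking lock keeps the sketch simple, but it means waiting workers still count against max_spawn; a queue or a non-blocking lock with requeue may fit better if worker slots are scarce.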

System Details

  • Host: Linux 6.8.0-117-generic, 4 CPUs, 7.8G RAM, 14G swap
  • Disk: 72G, 75% used
  • 10 active kanban boards
