hermes: [OPS] CPU spikes to 144%+ due to uncoordinated kanban worker CI parallelism

Seven74AI · 2026-05-19T11:36:38Z



Source: NousResearch/hermes-agent#28706 (fetched 2026-05-20 04:02:25)


Timeline: labeled ×4, cross-referenced ×2, closed ×1


## Root Cause Analysis

### Observed

Load average **7.58–8.93 on a 4-core VM** (189–223% utilization), with CPU pressure `some` avg10 at **14.88** and sustained avg300 at **48.16**.
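
The figures above can be reproduced from two kernel interfaces. The sketch below assumes the pressure numbers come from Linux PSI (`/proc/pressure/cpu`), which reports lines of `key=value` pairs. The helper names `parse_psi_line` and `load_per_core` are illustrative and not part of Hermes. The sample line carries only the two values reported in this issue. Real PSI lines also include `avg60` and `total`, which the parser handles the same way.

```python
def parse_psi_line(line: str) -> dict:
    """Parse one PSI line such as 'some avg10=14.88 avg300=48.16'.

    The first token is the pressure kind ('some' or 'full'); every other
    token is a key=value pair whose value is parsed as a float.
    """
    kind, *fields = line.split()
    parsed = {"kind": kind}
    for field in fields:
        key, _, raw = field.partition("=")
        parsed[key] = float(raw)
    return parsed


def load_per_core(load_average: float, cores: int) -> float:
    """Load average divided by core count (1.0 means one runnable task per core)."""
    return load_average / cores


# Only the values reported in the issue; not a full /proc/pressure/cpu line.
sample = "some avg10=14.88 avg300=48.16"

if __name__ == "__main__":
    print(parse_psi_line(sample))
    # 7.58 / 4 = 1.895 and 8.93 / 4 = 2.2325, the 189-223% range quoted above.
    print(load_per_core(7.58, 4), load_per_core(8.93, 4))
```

On a live host the same parser could be fed each line read from `/proc/pressure/cpu`, and `os.getloadavg()` would supply the load averages.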

### Top CPU Consumers (sampled Mon 13:34 UTC)

| Process | CPU% | Board | What |
|---|---|---|---|
| `tsc --noEmit` | 134% | the-swarm | TypeScript type-check on SoldierPanel.ts |
| `npm exec tsc --noEmit` | 87.5% | shop | TypeScript type-check on user-validation.test.ts |
| `vitest` (3 workers) | 69% + 51% + 50% | shop | Test runner with 3 parallel workers |
| `vitest` | 13.3% | music-library | Test runner |
| `tsserver.js` x3 | ~48% total | music-library, shop | TypeScript language servers |

**17 Hermes kanban worker Python processes** active across 10 boards. Multiple boards run CI simultaneously, each spawning `tsc --noEmit` + `vitest` with no inter-board coordination.
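
As a rough cross-check, the sampled CPU figures can be summed and compared with what a 4-core host can supply (400% in per-core terms). This is only arithmetic over the table above. The dictionary keys are descriptive labels, and the `tsserver.js` entry uses the approximate total from the table.

```python
# Per-process CPU% as sampled in the table (one core = 100%).
sampled_cpu_percent = {
    "tsc --noEmit (the-swarm)": 134.0,
    "npm exec tsc --noEmit (shop)": 87.5,
    "vitest worker 1 (shop)": 69.0,
    "vitest worker 2 (shop)": 51.0,
    "vitest worker 3 (shop)": 50.0,
    "vitest (music-library)": 13.3,
    "tsserver.js x3 (approximate total)": 48.0,
}

CORES = 4
capacity_percent = CORES * 100

total_percent = sum(sampled_cpu_percent.values())
demand_ratio = total_percent / capacity_percent

if __name__ == "__main__":
    print(f"sampled demand: {total_percent:.1f}% of one core")
    print(f"capacity:       {capacity_percent}%")
    print(f"demand / capacity: {demand_ratio:.2f}")
```

The listed processes alone add up to about 453% against 400% of capacity, before counting the 17 Python workers themselves, which is consistent with the sustained CPU pressure reported.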

### Memory Pressure

- 6.5G / 7.8G used (83%)
- 6.4G / 14G swap used: significant swap thrashing adding IO pressure (IO some avg10: 1.93)

### Root Cause Chain

1. `max_spawn=5` allows 5 concurrent kanban workers (whether this limit applies per board or globally is unclear)
2. Multiple boards run the `project-ci` workflow simultaneously → each spawns `tsc --noEmit` + `vitest`
3. `tsc` processes are CPU-bound (~130% each), and `vitest` workers compete for the remaining cores
4. No admission control throttles CPU-intensive CI steps when the system is overloaded
5. Memory exhaustion leads to swap → compounding IO wait
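
The missing admission control in step 4 could look roughly like the sketch below. It is not Hermes code: the function names, the 0.75 threshold, and the polling parameters are assumptions, and "load" is interpreted here as the 1-minute load average divided by the core count. `os.getloadavg()` is available on Unix only.

```python
import os
import time


def should_admit(load1: float, cores: int, threshold: float = 0.75) -> bool:
    """Admit a CPU-heavy step only while per-core load is at or below the threshold."""
    return load1 / cores <= threshold


def capped_workers(requested: int, load1: float, cores: int,
                   threshold: float = 0.75) -> int:
    """Fall back to a single test worker when per-core load exceeds the threshold."""
    return 1 if load1 / cores > threshold else requested


def wait_for_capacity(threshold: float = 0.75, poll_seconds: float = 5.0,
                      timeout_seconds: float = 300.0,
                      get_load=os.getloadavg, cores=None) -> bool:
    """Poll the load average until a step may start; False if the timeout expires."""
    core_count = cores or os.cpu_count() or 1
    deadline = time.monotonic() + timeout_seconds
    while True:
        if should_admit(get_load()[0], core_count, threshold):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_seconds)


if __name__ == "__main__":
    # With the load reported in this issue, a new CI step would be held back.
    print(should_admit(8.93, 4))       # 8.93 / 4 is well above 0.75
    print(capped_workers(3, 8.93, 4))  # test workers drop to 1 under that load
```

A worker would call `wait_for_capacity()` before launching `tsc --noEmit` or `vitest` and skip or requeue the step if it returns `False`. How the worker cap is passed to the test runner is left out, since that depends on the runner configuration in each board.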

### Impact

- Worker timeouts and slowdowns
- Increased failure rate from resource starvation
- Potential cascading failures when timed-out workers retry

### Recommended Actions

1. **Immediate**: Reduce `max_spawn` to 2–3 until per-worker CPU cgroup limits are in place
2. **Short-term**: Add a CPU load guard: workers should check loadavg before spawning CI steps
3. **Short-term**: Serialize `tsc` across boards (single file-lock or queue for type-checking)
4. **Medium-term**: cgroup CPU limits per kanban worker process
5. **Medium-term**: `vitest` workers should be capped at 1 when system load > 75%
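
The file-lock option for serializing `tsc` across boards could be sketched as follows. This is a minimal illustration using an advisory `flock` on a shared lock file, assuming all workers run on the same Linux host. The lock path and the helper names `exclusive_lock` and `run_serialized` are hypothetical and not part of Hermes.

```python
import fcntl
import subprocess
from contextlib import contextmanager

# Hypothetical lock path; any path every worker can open would do.
LOCK_PATH = "/tmp/hermes-tsc.lock"


@contextmanager
def exclusive_lock(path: str):
    """Hold an exclusive advisory lock on `path` for the duration of the block."""
    with open(path, "a") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)  # blocks until no other holder remains
        try:
            yield
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)


def run_serialized(command: list, cwd: str, lock_path: str = LOCK_PATH) -> int:
    """Run `command` in `cwd` while holding the lock and return its exit code."""
    with exclusive_lock(lock_path):
        return subprocess.run(command, cwd=cwd).returncode
```

A worker would then call something like `run_serialized(["npx", "tsc", "--noEmit"], cwd=board_checkout)`, where `board_checkout` is a placeholder for the board's working directory, so that at most one type-check runs on the host at a time. Because the kernel releases the lock when the holding process exits, a crashed worker does not leave the lock stuck. The trade-off is that type-checks queue behind each other, so CI latency per board rises while host load falls.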

### System Details

- Host: Linux 6.8.0-117-generic, 4 CPUs, 7.8G RAM, 14G swap
- Disk: 72G, 75% used
- 10 active kanban boards
