hermes - RFC: Cross-process write-intent registry (proactive scheduler-level alternative/complement to #12684)

Summary

I'd like to gauge interest in a cross-process write-intent registry that prevents two agents / processes from racing on the same file by declaring write targets at task submission time rather than at file-open time.

This generalizes the surface that #12684 raises (concurrent save_trajectory() → JSONL corruption). Instead of adding fcntl/msvcrt locks file-by-file, declare each task's write set up-front so the scheduler can refuse / warn / queue conflicting submissions before any file is opened.

I've been running this as a --writes flag on a task dispatch CLI for ~3 weeks. Not perfect — but the class of "two workers stomped each other and we didn't notice for an hour" bugs has gone to zero in my deployment.

Why proactive registration beats reactive file locks

A reactive file lock (#12684's likely fix path) protects the byte writes but not the task semantics:

  • Two workers can both start with conflicting plans, contend for the lock mid-execution, and leave one worker's multi-step change half-applied while it waits
  • The user / parent agent doesn't learn about the conflict until late (mid-execution stack trace)
  • Compound writes (rename + write + chmod) can't be made atomic with a single-file lock
  • Different process types (one spawning subprocess, one in-process) need different lock APIs
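
To make the first and third bullets concrete, here is a small deterministic illustration (not taken from any real incident). It assumes each individual write is perfectly protected by its own per-file lock, and uses a plain dict as a stand-in for the filesystem. The file names and the `write` helper are hypothetical.

```python
# Stand-in for the filesystem: path -> current contents.
files = {"state.json": "v0", "cache.db": "v0"}


def write(path, value):
    # Stands in for "take the per-file lock, write, release the lock".
    # Each call is atomic on its own.
    files[path] = value


# Each task intends to move BOTH files to its own version together.
plan_a = [("state.json", "a"), ("cache.db", "a")]
plan_b = [("state.json", "b"), ("cache.db", "b")]

# One interleaving that per-file locks allow: A1, B1, B2, A2.
for path, value in (plan_a[0], plan_b[0], plan_b[1], plan_a[1]):
    write(path, value)

# The files now hold a mix of both plans, although no single write was torn.
mixed_state = dict(files)
```

Every byte write was serialized, yet the end state matches neither task's intent. That is the gap a submit-time check is meant to close.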

Proactive registration shifts the conflict-detection point to before the worker spawns at all:

  1. Worker / scheduler calls submit(task, writes_to=[path_a, path_b]) to enter a task in the queue
  2. Submit-time check against an in-flight registry: are any of writes_to already claimed?
  3. Conflict → refuse with a structured error naming the conflicting in-flight task ID, or queue behind it, or warn-and-allow (configurable per-deployment)
  4. On task completion / timeout / cancellation → entries auto-released
  5. Crashed worker → entries stale for N seconds → auto-released by reaper
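
Steps 1 to 4 can be sketched as follows. This is a minimal in-process illustration of the admission logic under the refuse policy, not the implementation behind the --writes flag. The names `IntentRegistry`, `WriteConflict`, and `release` are assumptions, and a real registry would need shared storage to work across processes.

```python
from dataclasses import dataclass, field


class WriteConflict(Exception):
    """Raised when a submission overlaps an in-flight task's write set."""

    def __init__(self, path: str, holder: str):
        super().__init__(f"{path!r} is already claimed by in-flight task {holder}")
        self.path = path
        self.holder = holder


@dataclass
class IntentRegistry:
    # Maps each claimed path to the task ID that currently holds it.
    claims: dict[str, str] = field(default_factory=dict)

    def submit(self, task_id: str, writes_to: list[str]) -> None:
        # Step 2: check every declared path before claiming any of them,
        # so a refused submission leaves the registry untouched.
        for path in writes_to:
            holder = self.claims.get(path)
            if holder is not None and holder != task_id:
                raise WriteConflict(path, holder)  # step 3, refuse policy
        for path in writes_to:
            self.claims[path] = task_id

    def release(self, task_id: str) -> None:
        # Step 4: drop every claim held by the finished task.
        self.claims = {p: t for p, t in self.claims.items() if t != task_id}


registry = IntentRegistry()
registry.submit("td-001", [".hermes/state.json", "data/cache.db"])
try:
    registry.submit("td-002", ["data/cache.db"])
except WriteConflict as err:
    conflict = err  # names the path and the in-flight holder
registry.release("td-001")
registry.submit("td-002", ["data/cache.db"])  # admitted after release
```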

This is complementary to #12684's file-level locking, not a replacement: registration prevents most races at scheduling time; file locks remain a defense-in-depth measure at write time, covering cases where two workers genuinely have no scheduler in front of them.

Design sketch

┌────────────────────────────┐
│  task.submit(              │
│    target="profile_b",     │
│    writes_to=[             │
│      ".hermes/state.json", │
│      "data/cache.db",      │
│    ],                      │
│    ...                     │
│  )                         │
└────────────────────────────┘
┌────────────────────────────┐
│  in_flight_registry        │  (SQLite or flat JSON; ~100 bytes / entry)
│  --------                  │
│  task_id  | path  | TTL    │
│  td-001   | A     | 600s   │
│  td-002   | B     | 600s   │
└────────────────────────────┘
   conflict?  ─yes→  policy: refuse | queue | warn
              ─no →  insert + admit
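
One way the registry table in the diagram could be backed by SQLite is sketched below. The table name, the column names, and `try_admit` are assumptions made for illustration. The sketch leans on a PRIMARY KEY over `path` so that the conflict check and the insert happen in a single transaction. It uses an in-memory database, whereas a cross-process deployment would point at a shared file.

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")  # a real deployment would use a shared file
conn.execute(
    """
    CREATE TABLE in_flight (
        path       TEXT PRIMARY KEY,  -- at most one holder per path
        task_id    TEXT NOT NULL,
        expires_at REAL NOT NULL      -- claim time + TTL, epoch seconds
    )
    """
)


def try_admit(task_id, writes_to, ttl=600.0, now=None):
    """Return (True, None) if admitted, else (False, holder_task_id)."""
    now = time.time() if now is None else now
    paths = list(dict.fromkeys(writes_to))  # drop duplicate paths, keep order
    try:
        with conn:  # one transaction: every path is claimed, or none are
            for path in paths:
                conn.execute(
                    "INSERT INTO in_flight (path, task_id, expires_at) "
                    "VALUES (?, ?, ?)",
                    (path, task_id, now + ttl),
                )
    except sqlite3.IntegrityError:
        # The transaction was rolled back; look up who holds a conflicting path.
        marks = ",".join("?" * len(paths))
        row = conn.execute(
            f"SELECT task_id FROM in_flight WHERE path IN ({marks}) LIMIT 1",
            paths,
        ).fetchone()
        return (False, row[0] if row else None)
    return (True, None)


first = try_admit("td-001", [".hermes/state.json", "data/cache.db"])
second = try_admit("td-002", ["logs/out.txt", "data/cache.db"])
row_count = conn.execute("SELECT COUNT(*) FROM in_flight").fetchone()[0]
```

The refused submission leaves no partial claim behind: its non-conflicting path is rolled back together with the conflicting one. Expired rows are not purged here; a reaper would delete them based on `expires_at`.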

Policies (per-deployment config):

  • refuse (default for production): hard error, caller decides whether to retry
  • queue: hold submission; admit when in-flight releases
  • warn: admit anyway but log the overlap + post a notification (useful during migration)
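
The three policies could be dispatched roughly as below. This is a simplified in-memory sketch. `Policy`, `Scheduler`, and the return strings are hypothetical names, and under warn the overlapping path stays attributed to its first holder, which a real registry might want to track differently.

```python
from collections import deque
from enum import Enum


class Policy(Enum):
    REFUSE = "refuse"
    QUEUE = "queue"
    WARN = "warn"


class Scheduler:
    def __init__(self, policy):
        self.policy = policy
        self.claims = {}        # path -> task_id of the in-flight holder
        self.waiting = deque()  # (task_id, writes_to) held by the queue policy
        self.warnings = []      # overlaps admitted under the warn policy

    def submit(self, task_id, writes_to):
        conflicts = {p: self.claims[p] for p in writes_to if p in self.claims}
        if conflicts:
            if self.policy is Policy.REFUSE:
                raise RuntimeError(f"{task_id} refused; conflicts: {conflicts}")
            if self.policy is Policy.QUEUE:
                self.waiting.append((task_id, writes_to))
                return "queued"
            self.warnings.append((task_id, conflicts))  # warn: record, then admit
        for path in writes_to:
            self.claims.setdefault(path, task_id)
        return "admitted"

    def release(self, task_id):
        self.claims = {p: t for p, t in self.claims.items() if t != task_id}
        # Retry held submissions in arrival order now that paths are free.
        pending, self.waiting = self.waiting, deque()
        for queued_id, writes_to in pending:
            self.submit(queued_id, writes_to)


queueing = Scheduler(Policy.QUEUE)
queueing.submit("td-001", ["A"])
queued_result = queueing.submit("td-002", ["A"])  # held behind td-001
queueing.release("td-001")                        # td-002 is admitted now
```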

Auto-release triggers:

  • Task status → completed / failed / cancelled
  • Task TTL exceeded
  • Worker process death detected (heartbeat lapses)
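
The TTL and heartbeat triggers could be handled by a periodic reaper along these lines. The `Claim` fields, the `reap` function, and the 30-second heartbeat timeout are assumptions for illustration. Release on a status change would be an explicit call and is not shown.

```python
import time
from dataclasses import dataclass


@dataclass
class Claim:
    task_id: str
    path: str
    expires_at: float      # claim time + TTL, epoch seconds
    last_heartbeat: float  # refreshed by the worker while it is alive


def reap(claims, now=None, heartbeat_timeout=30.0):
    """Split claims into (kept, released) based on TTL and heartbeat age."""
    now = time.time() if now is None else now
    kept, released = [], []
    for claim in claims:
        ttl_exceeded = now >= claim.expires_at
        heartbeat_lapsed = now - claim.last_heartbeat > heartbeat_timeout
        (released if ttl_exceeded or heartbeat_lapsed else kept).append(claim)
    return kept, released


claims = [
    Claim("td-001", "A", expires_at=600.0, last_heartbeat=95.0),  # healthy
    Claim("td-002", "B", expires_at=600.0, last_heartbeat=10.0),  # heartbeat lapsed
    Claim("td-003", "C", expires_at=90.0, last_heartbeat=99.0),   # TTL exceeded
]
kept, released = reap(claims, now=100.0)
```

Passing `now` explicitly keeps the function easy to test; in production it would default to the wall clock.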

Telemetry:

Per-conflict log entry → users can find "what's stomping what" without grepping.
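
A per-conflict entry might look like the JSON line below. The field names and the logger name are hypothetical; the point is only that each record names the path, the submitting task, and the in-flight holder.

```python
import json
import logging

logger = logging.getLogger("write_intent")  # hypothetical logger name


def conflict_record(submitted_task, holder_task, path, policy):
    """Build one JSON line describing a single overlap."""
    return json.dumps(
        {
            "event": "write_intent_conflict",
            "path": path,
            "submitted_task": submitted_task,
            "holder_task": holder_task,
            "policy": policy,
        },
        sort_keys=True,
    )


line = conflict_record("td-002", "td-001", "data/cache.db", "refuse")
logger.warning(line)
```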

What this is NOT

  • Not a transactional store (no rollback on partial write)
  • Not a substitute for the file-level lock #12684 wants — the two complement each other
  • Not a permissions system (registration is advisory; nothing forces the worker to actually limit its writes to the declared set)
  • Not a workflow engine; pure scheduling-time admission control

Use case it solves for me

My deployment has multiple worker profiles that occasionally need to write the same state file. Before --writes, two workers could both grab the file, one's transactional update would clobber the other's, and we'd find out via an inconsistency hours later. Now: submit-time refusal with the conflicting task ID, immediate visibility, zero corruption.

For trajectory writes specifically (#12684), the worker would declare writes_to=[<trajectory_path>] at spawn; concurrent trajectory writers would then be refused or queued at submit time (under the refuse or queue policy), eliminating the JSONL interleave at the source.

Questions before I open anything

  1. Relationship to #12684: would you prefer this RFC be merged as a comment on #12684 (since it offers an alternative or complementary fix), or kept separate as a broader proposal?
  2. In scope? Belongs in the dispatch / scheduler layer (alongside what's discussed in #31392) or in a lower-level abstraction?
  3. Storage — SQLite vs flat JSON for the registry? My current implementation uses flat JSON for simplicity; SQLite would make crash-recovery cleaner.
  4. Scope — accept just refuse policy first (simplest), defer queue + warn?
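
For question 3, a flat-JSON registry can at least avoid torn files by writing to a temporary file and renaming it over the target. The sketch below shows that pattern with hypothetical helper names and file name; it is not the current implementation. Note that load, check, and save are still three separate steps, so without an outer lock two processes can race between them. That is one reason a SQLite transaction may be the cleaner option.

```python
import json
import os
import tempfile


def load_registry(path):
    """Return the claims dict, or an empty one if no registry exists yet."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {}


def save_registry(path, claims):
    # Write to a temp file in the same directory, then rename over the target,
    # so readers never observe a partially written registry.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(claims, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)  # the rename did not happen; clean up the temp file
        raise


with tempfile.TemporaryDirectory() as workdir:
    registry_path = os.path.join(workdir, "in_flight.json")
    claims = load_registry(registry_path)   # {} on first use
    claims["data/cache.db"] = "td-001"
    save_registry(registry_path, claims)
    reloaded = load_registry(registry_path)
```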

Not opening a PR yet. Related batch: #31385 (bridge), #31387 (drift hook, withdrawn), #31388 (multi-profile memory), #31392 (task relay), and parallel proposals on SKILL scheduling + output compressor I'm filing alongside this.

Thanks!
