hermes - RFC: Cross-process write-intent registry (proactive scheduler-level alternative/complement to #12684)

I'd like to gauge interest in a cross-process write-intent registry that prevents two agents / processes from racing on the same file by declaring write targets at task submission time rather than at file-open time.

I've been running this as a --writes flag on a task dispatch CLI for ~3 weeks. It isn't perfect, but the class of "two workers stomped each other and we didn't notice for an hour" bugs has gone to zero in my deployment.

This generalizes the problem that #12684 raises (concurrent save_trajectory() calls corrupting JSONL). Instead of adding fcntl/msvcrt locks file by file, each task declares its write set up front so the scheduler can refuse, warn about, or queue conflicting submissions before any file is opened.

Why proactive registration beats reactive file locks

A reactive file lock (#12684's likely fix path) protects the byte writes but not the task semantics:

Two workers can both start with conflicting plans and collide on the lock mid-execution, leaving one worker's multi-step change half-applied while it waits
The user / parent agent doesn't learn about the conflict until late (mid-execution stack trace)
Compound writes (rename + write + chmod) can't be made atomic with a single-file lock
Different process types (one spawning subprocess, one in-process) need different lock APIs

Proactive registration shifts the conflict-detection point to before the worker spawns at all:

1. Worker / scheduler calls submit(task, writes_to=[path_a, path_b]) to enter a task in the queue
2. Submit-time check against an in-flight registry: are any of writes_to already claimed?
3. Conflict → refuse with a structured error naming the conflicting in-flight task ID, or queue behind it, or warn-and-allow (configurable per-deployment)
4. On task completion / timeout / cancellation → entries auto-released
5. Crashed worker → entries stale for N seconds → auto-released by reaper
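
The steps above can be sketched in a few lines of Python. This is a minimal single-process illustration, not the implementation behind the --writes flag: InFlightRegistry, WriteConflict, and Claim are hypothetical names, the 600-second default is borrowed from the TTL column in the design sketch, and paths are compared as plain strings without normalization.

```python
import time
from dataclasses import dataclass


class WriteConflict(Exception):
    """Raised at submit time when a declared path is already claimed."""

    def __init__(self, path: str, holder: str):
        super().__init__(f"{path!r} is already claimed by in-flight task {holder}")
        self.path = path
        self.holder = holder


@dataclass
class Claim:
    task_id: str
    expires_at: float  # clock reading after which the reaper may drop the claim


class InFlightRegistry:
    """Hypothetical in-memory registry: path -> claim of the in-flight task."""

    def __init__(self, ttl_seconds: float = 600.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._claims = {}

    def reap(self) -> None:
        """Step 5: drop claims whose TTL has passed (e.g. the worker crashed)."""
        now = self._clock()
        for path in [p for p, c in self._claims.items() if c.expires_at <= now]:
            del self._claims[path]

    def submit(self, task_id: str, writes_to: list) -> None:
        """Steps 1-3: admit the task only if no declared path is claimed."""
        self.reap()
        for path in writes_to:
            claim = self._claims.get(path)
            if claim is not None and claim.task_id != task_id:
                raise WriteConflict(path, claim.task_id)
        # All-or-nothing: insert only after every path passed the check.
        deadline = self._clock() + self._ttl
        for path in writes_to:
            self._claims[path] = Claim(task_id, deadline)

    def release(self, task_id: str) -> None:
        """Step 4: called on completion, timeout, or cancellation."""
        for path in [p for p, c in self._claims.items() if c.task_id == task_id]:
            del self._claims[path]
```

A cross-process version would need the check and the insert to happen atomically in shared storage; the in-memory dict here only shows the control flow.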

This is complementary to #12684's file-level locking, not a replacement: registration prevents most races at scheduling time; file locks remain a defense-in-depth at write time for cases where two workers genuinely have no scheduler in front of them.
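
For the defense-in-depth half, a write-time lock around a JSONL append might look like the sketch below. It assumes a POSIX system (fcntl is not available on Windows, where msvcrt would be the counterpart), and append_jsonl_locked is a hypothetical helper, not a function from hermes or from #12684.

```python
import fcntl
import json
import os


def append_jsonl_locked(path: str, record: dict) -> None:
    """Append one JSON line while holding an exclusive advisory lock (POSIX only)."""
    line = json.dumps(record, ensure_ascii=False) + "\n"
    with open(path, "a", encoding="utf-8") as f:
        fcntl.flock(f.fileno(), fcntl.LOCK_EX)  # blocks until other writers unlock
        try:
            f.write(line)
            f.flush()             # push the whole line out before unlocking
            os.fsync(f.fileno())
        finally:
            fcntl.flock(f.fileno(), fcntl.LOCK_UN)
```

The lock is advisory: it only protects against other writers that also take it, which is why it complements rather than replaces submit-time registration.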

Design sketch

┌────────────────────────────┐
│  task.submit(              │
│    target="profile_b",     │
│    writes_to=[             │
│      ".hermes/state.json", │
│      "data/cache.db",      │
│    ],                      │
│    ...                     │
│  )                         │
└────────────────────────────┘
              │
              ▼
┌────────────────────────────┐
│  in_flight_registry        │  (SQLite or flat JSON; ~100 bytes / entry)
│  --------                  │
│  task_id  | path  | TTL    │
│  td-001   | A     | 600s   │
│  td-002   | B     | 600s   │
└────────────────────────────┘
              │
              ▼
   conflict?  ─yes→  policy: refuse | queue | warn
              ─no →  insert + admit

Policies (per-deployment config):

refuse (default for production): hard error, caller decides whether to retry
queue: hold submission; admit when in-flight releases
warn: admit anyway but log the overlap + post a notification (useful during migration)
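
One way to keep the three policies in a single decision point is sketched here. Policy, Decision, and decide are illustrative names I'm assuming for the example; the conflict list is taken to be (path, holder task ID) pairs produced by the registry lookup.

```python
from enum import Enum


class Policy(Enum):
    REFUSE = "refuse"
    QUEUE = "queue"
    WARN = "warn"


class Decision(Enum):
    ADMIT = "admit"    # insert claims and start the task
    REJECT = "reject"  # hard error back to the caller
    HOLD = "hold"      # keep the submission until the holder releases


def decide(policy: Policy, conflicts: list) -> tuple:
    """Map a policy and a list of (path, holder_task_id) pairs to a decision.

    Returns (decision, notes); notes describe each overlap so the caller can
    put them in the error, the queue entry, or the warning log.
    """
    if not conflicts:
        return Decision.ADMIT, []
    notes = [f"{path} held by {holder}" for path, holder in conflicts]
    if policy is Policy.REFUSE:
        return Decision.REJECT, notes
    if policy is Policy.QUEUE:
        return Decision.HOLD, notes
    # Policy.WARN: admit anyway; the caller logs and notifies using the notes.
    return Decision.ADMIT, notes
```

Shipping only the refuse branch first would mean dropping the HOLD case and the queue bookkeeping behind it.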

Auto-release triggers:

Task status → completed / failed / cancelled
Task TTL exceeded
Worker process death detected (heartbeat lapses)
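
The heartbeat trigger could be tracked separately from the claims themselves. The sketch below assumes workers call beat() periodically and a reaper polls lapsed(); HeartbeatMonitor and the 30-second silence window are assumptions for illustration, not values from my deployment.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per task; names and defaults are illustrative."""

    def __init__(self, max_silence: float = 30.0, clock=time.monotonic):
        self._max_silence = max_silence
        self._clock = clock
        self._last_beat = {}  # task_id -> clock reading at the last heartbeat

    def beat(self, task_id: str) -> None:
        """Called by the worker while it is alive."""
        self._last_beat[task_id] = self._clock()

    def forget(self, task_id: str) -> None:
        """Called once the task's claims have been released."""
        self._last_beat.pop(task_id, None)

    def lapsed(self) -> list:
        """Task IDs whose last heartbeat is older than max_silence."""
        now = self._clock()
        return sorted(
            t for t, seen in self._last_beat.items()
            if now - seen > self._max_silence
        )
```

A reaper loop would release the registry entries of every task returned by lapsed() and then call forget() on it.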

Telemetry:

Per-conflict log entry → users can find "what's stomping what" without grepping.
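
A per-conflict entry could be emitted as one JSON object per line so it can be filtered by path or task ID. The field names and the "write_intent" logger name below are assumptions, not an existing schema.

```python
import json
import logging

logger = logging.getLogger("write_intent")  # hypothetical logger name


def conflict_record(path, holder_task_id, requester_task_id, policy):
    """Build the structured entry for one overlap (field names are illustrative)."""
    return {
        "event": "write_intent_conflict",
        "path": path,
        "holder": holder_task_id,
        "requester": requester_task_id,
        "policy": policy,
    }


def log_conflict(path, holder_task_id, requester_task_id, policy):
    """Log the entry as a single JSON line and return that line."""
    line = json.dumps(
        conflict_record(path, holder_task_id, requester_task_id, policy),
        sort_keys=True,
    )
    logger.warning(line)
    return line
```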

What this is NOT

Not a transactional store (no rollback on partial write)
Not a substitute for the file-level lock #12684 wants — the two complement each other
Not a permissions system (registration is advisory; nothing forces the worker to actually limit its writes to the declared set)
Not a workflow engine; pure scheduling-time admission control

Use case it solves for me

My deployment has multiple worker profiles that occasionally need to write the same state file. Before --writes, two workers could both grab the file, one's transactional update would clobber the other's, and we'd find out via an inconsistency hours later. Now: submit-time refusal with the conflicting task ID, immediate visibility, zero corruption.

For trajectory writes specifically (#12684), the worker would declare writes_to=[<trajectory_path>] at spawn; concurrent trajectory writers would then be refused or queued at submit time (depending on policy), eliminating the JSONL interleave at the source.

Questions before I open anything

Relationship to #12684: would you prefer this RFC be merged as a comment on #12684 (since it offers an alternative or complementary fix), or kept separate as a broader proposal?
In scope? Belongs in the dispatch / scheduler layer (alongside what's discussed in #31392) or in a lower-level abstraction?
Storage — SQLite vs flat JSON for the registry? My current implementation uses flat JSON for simplicity; SQLite would make crash-recovery cleaner.
Scope — accept just refuse policy first (simplest), defer queue + warn?
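
To make the storage question concrete, here is roughly what the SQLite option could look like using Python's standard sqlite3 module. The table layout mirrors the design sketch (task_id, path, TTL); open_registry, try_claim, and release are hypothetical names, and this is not my current flat-JSON implementation.

```python
import sqlite3
import time
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS in_flight (
    path       TEXT PRIMARY KEY,
    task_id    TEXT NOT NULL,
    expires_at REAL NOT NULL
)
"""


def open_registry(db_path: str) -> sqlite3.Connection:
    # isolation_level=None: no implicit transactions; BEGIN is issued explicitly.
    conn = sqlite3.connect(db_path, isolation_level=None)
    conn.execute(SCHEMA)
    return conn


def try_claim(conn, task_id, paths, ttl=600.0, now=None) -> Optional[str]:
    """Return None if all paths were claimed, else the conflicting holder's task ID."""
    now = time.time() if now is None else now
    # BEGIN IMMEDIATE takes the write lock up front, so the check and the
    # insert are atomic with respect to other processes using the same file.
    conn.execute("BEGIN IMMEDIATE")
    try:
        conn.execute("DELETE FROM in_flight WHERE expires_at <= ?", (now,))
        for path in paths:
            row = conn.execute(
                "SELECT task_id FROM in_flight WHERE path = ?", (path,)
            ).fetchone()
            if row is not None and row[0] != task_id:
                conn.execute("ROLLBACK")
                return row[0]
        conn.executemany(
            "INSERT OR REPLACE INTO in_flight (path, task_id, expires_at) "
            "VALUES (?, ?, ?)",
            [(p, task_id, now + ttl) for p in paths],
        )
        conn.execute("COMMIT")
        return None
    except Exception:
        conn.execute("ROLLBACK")
        raise


def release(conn, task_id) -> None:
    """Drop every claim held by the task (completion, failure, cancellation)."""
    conn.execute("DELETE FROM in_flight WHERE task_id = ?", (task_id,))
```

The crash-recovery argument for SQLite is visible here: a process that dies mid-claim leaves no partial rows, because the transaction never commits.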

Not opening a PR yet. Related batch: #31385 (bridge), #31387 (drift hook, withdrawn), #31388 (multi-profile memory), #31392 (task relay), and parallel proposals on SKILL scheduling + output compressor I'm filing alongside this.

Thanks!
