01 Issue
Feature · Open · Swamp CLI
Assignees: None

#829 Intra-namespace write concurrency: whole-index sync under the lock serializes fan-out workloads (split from shipped #666)

Opened by mgreten · 6/26/2026

Summary

On a single high-churn namespace, datastore write throughput is gated not by the per-namespace lock itself but by the work done while it is held: every write pulls and pushes the whole index under the lock, so lock-hold time scales with total index size rather than with the partition a write touches. For a workload that fans many concurrent writers into one namespace, this serializes every writer and produces 60s lock-acquire timeouts.

Per-namespace locking (#666, shipped) solved contention between namespaces and is great. This is the remaining intra-namespace concurrency problem, split out from #666 since that issue is marked shipped. Filing it on its own so it can be tracked, with the full workload context that I don't think was ever on the record.

The workload (the part that wasn't communicated before)

I run an automated dev-workflow system. The shape that matters for the datastore:

  • Many pipeline runs execute concurrently, each in its own git worktree (isolated filesystem, branch, ports).
  • Each run is event-driven: it historically emitted a datastore write on every phase transition and (for some subscribers) every agent call — so a single run produced dozens of writes.
  • All of those writes funnel into one namespace. The worktrees isolate the filesystem, but every run points its datastore calls at the same control-plane repo → same namespace → same lock. Worktree isolation is filesystem isolation, not datastore isolation.
  • Separately, ~13 scheduled jobs on one machine, plus a second machine, write to that same namespace continuously.

So per-namespace locking doesn't help this workload: the entire concurrent fan-out lives inside one namespace and serializes on its lock.

Measurements

Sampling the lock object every 2s and attributing each hold to its command, with everything aligned on the latest versions (see Environment):

  • Lock-hold p50 ≈ 94–102s, max 268s, across a ~20-minute window.
  • The longest holds were spread across completely different operations — a notification send, a small provider lookup, a poller refresh — all converging on the same ~268s ceiling. That uniformity is the tell: the duration tracks the shared whole-index sync, not what the operation does.
  • Index was ~93 MB; a single high-cardinality data stream was ~40% of it, inflating the hold for every writer.
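
For anyone wanting to reproduce this kind of measurement, here is a minimal sketch of how periodic lock samples could be collapsed into per-command hold durations. It is not the actual sampling script: the `Sample` fields, `holds_from_samples`, and `summarize` are hypothetical names, and it assumes each sample records who held the lock (if anyone) and which command they were running. With a 2s sampling interval, each duration is only accurate to within a couple of seconds.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Sample:
    t: float                 # sample time, in seconds
    holder: Optional[str]    # hypothetical lock-holder id; None if the lock is free
    command: Optional[str]   # command attributed to the holder


def holds_from_samples(samples: List[Sample]) -> List[Tuple[str, float]]:
    """Collapse consecutive samples with the same holder into (command, duration)."""
    holds = []
    current = None   # first sample of the hold in progress
    last_t = None    # time of the most recent sample
    for s in samples:
        if current is not None and s.holder != current.holder:
            # Holder changed or lock was released: close the hold at the
            # last sample where it was still observed.
            holds.append((current.command, last_t - current.t))
            current = None
        if s.holder is not None and current is None:
            current = s
        last_t = s.t
    if current is not None:
        holds.append((current.command, last_t - current.t))
    return holds


def summarize(holds: List[Tuple[str, float]]) -> dict:
    """p50 and max of the hold durations."""
    durations = [d for _, d in holds]
    return {"p50": median(durations), "max": max(durations)}


samples = [
    Sample(0, "A", "notify"), Sample(2, "A", "notify"), Sample(4, "A", "notify"),
    Sample(6, None, None),
    Sample(8, "B", "lookup"), Sample(10, "B", "lookup"),
    Sample(12, None, None),
]
print(summarize(holds_from_samples(samples)))
```

Grouping the resulting holds by command is what shows whether unrelated operations converge on the same ceiling.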

Concrete impact: a batch of back-to-back runs that should complete in a few hours instead ran overnight; the wall-clock time was dominated by writers waiting on the lock, not by the actual work.

What I changed on my side (so this is scoped to what's genuinely the datastore's)

I've already removed the avoidable share of this:

  1. Collapsed a per-invocation unique-named artifact (resolve-<phase>-<timestamp>) to a stable per-phase name — it was thousands of write-only index entries nothing read back.
  2. Aligned the write unit with the run unit: telemetry-class writes are now buffered in order during a run and replayed as a single batch at completion, so one run holds the lock once instead of dozens of times.

Those cut the number of lock acquisitions dramatically. But the per-acquisition cost (whole-index sync) is unchanged, so concurrent runs and the scheduled writers still serialize on the single lock — that part is the datastore's to solve, which is why I'm filing it.
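
As an illustration of change 2, the buffering pattern looks roughly like the sketch below. This is a simplified stand-in rather than the real code: `RunWriteBuffer` and the `flush_batch` callable are hypothetical names, and `flush_batch` is assumed to be whatever performs one locked datastore write for a whole list of entries.

```python
from typing import Any, Callable, List, Tuple

Write = Tuple[str, Any]  # (key, value); the shape is an assumption for illustration


class RunWriteBuffer:
    """Buffer a run's telemetry-class writes and replay them as one batch."""

    def __init__(self, flush_batch: Callable[[List[Write]], None]) -> None:
        # flush_batch is hypothetical: one call == one lock acquisition.
        self._flush_batch = flush_batch
        self._pending: List[Write] = []

    def record(self, key: str, value: Any) -> None:
        """Queue a write in arrival order without touching the datastore."""
        self._pending.append((key, value))

    def flush(self) -> int:
        """Replay all queued writes in order as a single batch."""
        if not self._pending:
            return 0
        batch = list(self._pending)
        self._flush_batch(batch)   # if this raises, the buffer is kept for a retry
        self._pending.clear()
        return len(batch)


# Usage: dozens of phase transitions, one locked write at run completion.
batches: List[List[Write]] = []
buf = RunWriteBuffer(batches.append)
for phase in ("plan", "build", "test"):
    buf.record(f"phase/{phase}", {"status": "done"})
buf.flush()
```

The buffer is cleared only after the batch call returns, so a failed flush (for example a lock-acquire timeout) can be retried without losing the run's writes.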

Two directions that would each independently unblock fan-out workloads

Thinking out loud, not prescribing:

  1. Incremental / scoped index sync: if a write synced only the partition it touched rather than the whole index, same-namespace concurrent writers would stop blocking each other for tens of seconds. This is the more general fix and helps every at-scale user, not just fan-out ones. (The partitioned _index/ shards already exist; this would make the sync under the lock honor that partitioning.)

  2. A lightweight per-run / ephemeral namespace primitive — let a short-lived job cheaply get its own lock scope and fold its data into a parent namespace afterward, without standing up a separate repo checkout and recreating model instances by hand. Today the only way to get a second lock is a second checkout, which is too heavy for a per-run pattern.
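
To make direction 1 concrete, here is a toy cost model, not a description of how the datastore actually syncs. It assumes the data moved under the lock is one pull plus one push of whichever shards are synced. The ~93 MB total and the ~40% stream come from the measurements above; the shard names and the split of the remaining ~56 MB are made up for illustration, as is the assumption that the high-cardinality stream sits in its own shard.

```python
from typing import Dict, Set

# Shard sizes in MB. Only the total (~93) and the large stream (~40%) are measured;
# the other names and sizes are hypothetical.
SHARDS: Dict[str, int] = {
    "high_cardinality_stream": 37,
    "run_telemetry": 6,
    "everything_else": 50,
}


def synced_mb(shards: Dict[str, int], touched: Set[str], scoped: bool) -> int:
    """MB moved under the lock for one write: a pull plus a push of the synced shards."""
    names = touched if scoped else shards.keys()
    return 2 * sum(shards[name] for name in names)


touched = {"run_telemetry"}
print(synced_mb(SHARDS, touched, scoped=False))  # whole-index sync: 186
print(synced_mb(SHARDS, touched, scoped=True))   # partition-scoped sync: 12
```

Under these assumptions, a writer that touches only a small shard stops paying for the large stream on every acquisition, which is the effect a scoped sync would be after.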

What I'm considering if neither lands

If intra-namespace concurrency stays serialized, I'll likely have to break my workload apart at the repo level — a separate checkout (hence separate namespace + lock) per writer-class (pollers vs. pipeline vs. the second machine), and possibly an ephemeral per-run namespace that I provision and tear down myself, harvesting each run's data into a central analytics namespace afterward. That works (my analytics already reassembles from an on-disk source of truth, so runs can live in any namespace), but it's a lot of self-managed namespace plumbing to work around the lock — exactly the kind of thing a primitive like (2), or simply (1), would make unnecessary.

Happy to share the lock-sampling script, the per-run write trace, or profiling data if any of it would help. Thanks again for all the recent datastore work — #788 and the per-namespace locks have both been real improvements even as I work through this.

Environment

  • @swamp/s3-datastore@2026.06.24.1
  • swamp 20260625.225837.0
  • MinIO backend, two writer machines sharing one bucket
02 Bog Flow
OPEN · TRIAGED · IN PROGRESS · SHIPPED

Open

6/26/2026, 3:52:46 AM
