01 Issue
Bug · Shipped · Extensions
Assignees: stack72

#222 Redundant push after pullChanged: localMtime clobbered by pullIndex(forceRemote)

Opened by stack72 · 5/4/2026 · Shipped 5/4/2026

Problem

The first explicit `pushChanged` call after a `pullChanged` against a populated remote re-uploads every just-downloaded file with byte-identical content. This was discovered during verification of swamp lab #220 / PR https://github.com/systeminit/swamp/pull/1290. It was previously masked by the coordinator's implicit pull+push fast-pathing past the explicit ones, and became visible after #220 decoupled `swamp datastore sync` from the coordinator's implicit phase.

This is not a correctness issue: the data is byte-identical and the system reaches steady state after the first redundant push. It does waste bandwidth (N S3 PUTs once per fresh-cache bootstrap), and it is visible to users as a confusing `filesPushed: 17` in sync output immediately after a successful `filesPulled: 17`.

Reproduction

  1. Run a workflow N times against an S3 datastore from machine A, populating the bucket.
  2. On a fresh machine B, run `swamp datastore setup extension @swamp/s3-datastore --skip-migration` against the same bucket. After PR #1290 this hydrates the cache and reports `filesPulled: 17`.
  3. Run `swamp datastore sync --push` immediately. Reports `filesPushed: 17` even though the cache already matches the remote — those exact files were just downloaded.
  4. Run `swamp datastore sync` (default) again. Now correctly reports `0/0` — steady state reached after the redundant push completes and rewrites the index with proper localMtime values.

Verified end-to-end against MinIO.

Root cause (traced through extensions/datastores/_lib/s3_cache_sync.ts)

`pullChanged` (around lines 808-816) sets `localMtime` on each downloaded entry in memory only. When `pulled > 0`, the sidecar isn't updated (line 843 only calls `markSynced` for the zero-diff case) and the local `.datastore-index.json` file isn't rewritten, so those `localMtime` values live exclusively in the `S3CacheSyncService` instance's in-memory state.

The next `pushChanged` (around line 920) calls `pullIndex({ forceRemote: true })`, which at lines 691-692 does `this.index = JSON.parse(text)`. The remote payload doesn't carry `localMtime` (those values are local-only state), so the assignment replaces the in-memory index wholesale and wipes the values `pullChanged` just set.
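The clobber can be shown in isolation. This is a minimal sketch, not the code in `s3_cache_sync.ts`: the `IndexEntry` and `CacheIndex` shapes, the `entries` field, and the sample key are assumptions made for illustration. Only the `localMtime` field and the `JSON.parse` assignment come from the trace above.

```typescript
// Hypothetical index shapes; the real types in s3_cache_sync.ts may differ.
interface IndexEntry {
  size: number;
  localMtime?: number; // local-only state, never present in the remote payload
}
interface CacheIndex {
  entries: Record<string, IndexEntry>;
}

// In-memory index after pullChanged: localMtime was set on the downloaded entry.
let index: CacheIndex = {
  entries: { "runs/0001.json": { size: 512, localMtime: 1777900000000 } },
};

// Remote index payload: same entry, but without localMtime.
const remoteText = JSON.stringify({
  entries: { "runs/0001.json": { size: 512 } },
});

// Stand-in for `this.index = JSON.parse(text)` on a forced remote refresh:
// the whole object is replaced, so the local-only field is lost.
index = JSON.parse(remoteText) as CacheIndex;

const mtimeAfterRefresh = index.entries["runs/0001.json"]?.localMtime;
console.log(mtimeAfterRefresh); // undefined
```

The entry itself survives the refresh with the right size; only the local-only field disappears, which matches the state the `pushChanged` walk then observes.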

The walk in `pushChanged` (lines 941-955) then sees `existing.size === stat.size` true and `existing.localMtime === undefined`, which SHOULD hit the `continue` at line 951 and skip the file. Empirically it doesn't, and the file ends up in `toPush`. I haven't fully resolved why; possibilities include a control-flow path I'm misreading, or an interaction with `scrubIndex`/`indexMutated` that I haven't traced. Either way the observable effect is N redundant uploads.
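For a regression test it may help to pin down the expected per-file decision. The sketch below is one reading of the walk as described above, not a copy of lines 941-955: `expectedToPush`, `WalkIndexEntry`, `FileStat`, and `mtimeMs` are hypothetical names, and the handling of entries that are missing from the index or that do carry a `localMtime` is an assumption.

```typescript
// Hypothetical shapes; only `size` and `localMtime` are named in the issue.
interface WalkIndexEntry {
  size: number;
  localMtime?: number;
}
interface FileStat {
  size: number;
  mtimeMs: number;
}

// Expected decision for one file during the pushChanged walk.
function expectedToPush(
  existing: WalkIndexEntry | undefined,
  stat: FileStat,
): boolean {
  // Assumption: a file with no index entry has never been uploaded.
  if (existing === undefined) return true;
  // A size mismatch means the local file differs from the indexed one.
  if (existing.size !== stat.size) return true;
  // The case from the issue: same size, no recorded mtime, so skip (`continue`).
  if (existing.localMtime === undefined) return false;
  // Assumption: with a recorded mtime, push only if the file changed since.
  return stat.mtimeMs !== existing.localMtime;
}

// State right after pullIndex(forceRemote) wiped localMtime:
const decision = expectedToPush({ size: 512 }, { size: 512, mtimeMs: 1777900000000 });
console.log(decision); // false
```

If the real walk returns the equivalent of `true` for that last input, the discrepancy is in the walk itself rather than in the index state it receives.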

Suggested fixes (pick one)

  1. Persist `localMtime` after `pullChanged` — write the in-memory index back to disk at the end of a non-zero pull, so the values survive across `pullIndex(forceRemote)` calls. Cleanest semantically.
  2. Merge `localMtime` values when `pullIndex(forceRemote)` overwrites the index — preserve the in-memory `localMtime` map across the refresh, since it's local-only state the remote can't authoritatively replace.
  3. Track downloaded files in a separate just-pulled set within the service instance and have `pushChanged` skip them on the first call after a pull.

Options 1 and 2 fit the existing fast-path / sidecar design more cleanly.
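As a rough illustration of option 2, the refresh could carry `localMtime` over instead of assigning the parsed payload directly. This is a sketch under assumptions, not a patch: `mergeRemoteIndex`, the `CacheIndex` shape, and the `entries` field are hypothetical, and the rule of keeping `localMtime` only when the remote size still matches is a design choice added here, not something the issue specifies.

```typescript
// Hypothetical index shapes; the real types in s3_cache_sync.ts may differ.
interface IndexEntry {
  size: number;
  localMtime?: number; // local-only state the remote cannot supply
}
interface CacheIndex {
  entries: Record<string, IndexEntry>;
}

// Replace the index with the remote copy, but carry localMtime over for
// entries that still exist remotely with the same size.
function mergeRemoteIndex(local: CacheIndex, remoteText: string): CacheIndex {
  const remote = JSON.parse(remoteText) as CacheIndex;
  for (const [key, entry] of Object.entries(remote.entries)) {
    const previous = local.entries[key];
    if (previous?.localMtime !== undefined && previous.size === entry.size) {
      entry.localMtime = previous.localMtime;
    }
  }
  return remote;
}

const localIndex: CacheIndex = {
  entries: {
    "runs/0001.json": { size: 512, localMtime: 1777900000000 },
    "runs/0002.json": { size: 100, localMtime: 1777900000001 },
  },
};
const remotePayload = JSON.stringify({
  entries: {
    "runs/0001.json": { size: 512 }, // unchanged remotely: mtime kept
    "runs/0002.json": { size: 256 }, // changed remotely: stale mtime dropped
    "runs/0003.json": { size: 64 }, // new remotely: no local mtime yet
  },
});

const merged = mergeRemoteIndex(localIndex, remotePayload);
console.log(merged.entries["runs/0001.json"]?.localMtime); // 1777900000000
console.log(merged.entries["runs/0002.json"]?.localMtime); // undefined
```

The remote stays authoritative for which entries exist and for their sizes; only the local-only field is preserved, and only where it still plausibly describes the cached file.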

Context

  • Discovery: swamp lab #220 / PR https://github.com/systeminit/swamp/pull/1290 (now shipped).
  • Cross-link: `design/datastores.md` already documents the markDirty + sidecar contract; this issue is consistent with that design but reveals a gap in the pull→push handoff.
  • Repro harness from #220 (MinIO + scratch repos A/B, populate bucket, observe `filesPushed` mismatch) is reusable.

Upstream repository: https://github.com/systeminit/swamp-extensions

Environment

  • Extension: @swamp/[email protected]
  • swamp: 20260501.234710.0-sha.f1687b62
  • OS: darwin (aarch64)
  • Deno: 2.7.14
  • Shell: /bin/zsh
02 Bog Flow

Shipped

5/4/2026, 4:40:52 PM


03 Sludge Pulse
stack72 assigned stack72 · 5/4/2026, 3:39:26 PM
