01 Issue
Bug · Shipped · Swamp CLI
Assignees: stack72

#788 swamp data gc prunes the catalog but never deletes objects from S3 datastores (markDirty hook not wired) — sync manifest never shrinks

Opened by mgreten · 6/23/2026 · Shipped 6/24/2026

Summary

On an S3-backed datastore, swamp data gc prunes the local catalog (data-version index) but never deletes the underlying objects from S3. It reports a large bytesReclaimed, but the objects remain in the bucket and the file-level sync manifest (.datastore-index.json) — which enumerates every object under the prefix and is pulled/pushed under the global lock on every sync — does not shrink. So GC does not relieve the sync/lock cost it appears to address.

This is distinct from the global-lock contention in #666 (lock scope). This is the GC ↔ datastore object lifecycle: reclaimed versions are orphaned in S3.

Root cause (swamp core)

createDataGcDeps() in src/libswamp/data/gc.ts constructs the data repository without a markDirty hook:

const unifiedDataRepo = new FileSystemUnifiedDataRepository(
  repoDir,
  dsPath(SWAMP_SUBDIRS.data),
  catalogStore,            // <- 4th arg (markDirty) omitted
);

FileSystemUnifiedDataRepository.delete() calls notifyDirty(), which is a no-op when the hook is absent (src/infrastructure/persistence/unified_data_repository.ts):

private async notifyDirty(relPath?: string): Promise<void> {
  if (this.markDirty) await this.markDirty(relPath);   // markDirty undefined under GC
}

With no dirty signal, the end-of-command sync push has nothing to propagate, so the S3 DELETEs are never issued. Additionally, for lazy-cached S3 datastores the version content isn't present locally, so Deno.remove(versionDir) hits NotFound and only the catalog row (catalogStore.removeVersion) is dropped. bytesReclaimed is computed from catalog/size accounting, not from bytes actually freed in the backend.
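The no-op described above can be reproduced in isolation. The following is a minimal sketch, not the swamp implementation: `SketchRepo`, `trackDirty`, and `dirtyPaths` are hypothetical names, and the methods are synchronous here for brevity even though the real `notifyDirty` is async. It only shows that a delete through a repository built without the hook leaves nothing for a later push to act on.

```typescript
// Simplified model of the hook wiring. All names are illustrative assumptions.
type MarkDirty = (relPath?: string) => void;

class SketchRepo {
  // markDirty is optional, mirroring the omitted 4th constructor argument.
  constructor(private readonly markDirty?: MarkDirty) {}

  delete(relPath: string): void {
    // ...local catalog/file removal would happen here...
    this.notifyDirty(relPath);
  }

  private notifyDirty(relPath?: string): void {
    // Silently skipped when no hook was supplied.
    if (this.markDirty) this.markDirty(relPath);
  }
}

// Hypothetical dirty-path tracker standing in for whatever the sync layer uses.
const dirtyPaths = new Set<string>();
const trackDirty: MarkDirty = (relPath) => {
  if (relPath !== undefined) dirtyPaths.add(relPath);
};

// Repository constructed without the hook, as in the GC path described above.
const unwired = new SketchRepo();
unwired.delete("data/model-a/v1");
const dirtyAfterUnwired = dirtyPaths.size; // 0: nothing for the push to propagate

// Same delete with the hook passed in.
const wired = new SketchRepo(trackDirty);
wired.delete("data/model-a/v1");
const dirtyAfterWired = dirtyPaths.size; // 1: the push now has a path to delete
```

Under this model the fix direction is simply to pass a real hook when the repository is built for GC; how swamp's sync layer consumes the dirty set is not shown in this report.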

Reproduction

  1. Repo on @swamp/s3-datastore (MinIO), namespaced, with substantial version history.
  2. swamp data gc --force → logs e.g. "deleted 7150 expired items, 100621 excess versions reclaimed (192171800 bytes)", then "push complete, no changes".
  3. Inspect the bucket: object count and total size under the prefix are unchanged; {prefix}/.datastore-index.json is rewritten at the same (large) size.

Observed (real numbers)

  • GC reported ~192 MB (192,171,800 bytes, about 183 MiB) / 100k+ versions reclaimed; sync logged push complete, no changes.
  • Post-GC catalog: ~10k rows / ~5 MB.
  • Bucket reality, same prefix: 321,678 objects / 470 MB (…/data/ alone 281,583 objects / 270 MB); {prefix}/.datastore-index.json still 154 MB, rewritten on the GC push.
  • Sync wall-time unchanged — the global-lock hold is gated by the 154 MB manifest, which still lists all ~321k objects.

Expected

GC should physically delete reclaimed objects from the datastore backend (or mark them dirty so the push deletes them), so that object count, total size, and the sync manifest all shrink — and bytesReclaimed reflects bytes actually freed.
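One way to picture the expected behaviour is a collector that keeps catalog accounting and backend accounting separate. This is a hedged sketch only: `ObjectBackend`, `InMemoryBackend`, `GcResult`, and `collect` are hypothetical names invented for illustration, and the in-memory map stands in for either the local cache or the bucket.

```typescript
// Hypothetical backend interface: remove() returns the bytes it actually freed,
// or 0 when the object was not present (e.g. never cached locally).
interface ObjectBackend {
  remove(key: string): number;
}

class InMemoryBackend implements ObjectBackend {
  constructor(private readonly objects: Map<string, number>) {}

  remove(key: string): number {
    const size = this.objects.get(key);
    if (size === undefined) return 0; // not-found is tolerated but frees nothing
    this.objects.delete(key);
    return size;
  }

  objectCount(): number {
    return this.objects.size;
  }
}

interface GcResult {
  catalogBytes: number; // what the catalog says was reclaimed
  freedBytes: number;   // what the backend actually released
  dirty: string[];      // keys the next push should delete remotely
}

function collect(
  expired: Array<{ key: string; catalogSize: number }>,
  backend: ObjectBackend,
): GcResult {
  const result: GcResult = { catalogBytes: 0, freedBytes: 0, dirty: [] };
  for (const { key, catalogSize } of expired) {
    result.catalogBytes += catalogSize;
    result.freedBytes += backend.remove(key);
    result.dirty.push(key); // always signal, even if nothing was cached locally
  }
  return result;
}

const backend = new InMemoryBackend(
  new Map<string, number>([["a/v1", 100], ["a/v2", 250]]),
);
const gcResult = collect(
  [
    { key: "a/v1", catalogSize: 100 },
    { key: "a/v3", catalogSize: 400 }, // catalog row exists, object does not
  ],
  backend,
);
// gcResult.catalogBytes is 500 while gcResult.freedBytes is 100.
```

Reporting both numbers would make a gap like the one observed here (large reclaimed figure, unchanged bucket) visible in the GC output itself.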

Impact

The mitigation users are told to apply for large-index sync cost (swamp data gc) silently has no effect on S3 datastores. Reclaimed-version objects accumulate forever, the per-file sync manifest keeps growing, and global-lock hold time per sync never improves. Directly compounds #666.

Notes

  • A monolithic root .datastore-index.json (103 MB) continues to be rewritten after a namespace migration even though the repo is namespaced — possibly related (the manifest layer seems to track objects independently of catalog/namespace state).
  • The ephemeral lifetime only logs "Ephemeral lifetime is not yet implemented", so ephemeral-tagged data is never collected either.

Environment

  • swamp 20260617.212026.0-sha.396e0952
  • @swamp/s3-datastore (MinIO backend, conditional-write locking), darwin/arm64, Deno 2.8.x
  • Adjacent to #666 (single global datastore lock).
02 Bog Flow
OPEN · TRIAGED · IN PROGRESS · SHIPPED (+1 more) · ASSIGNED (+3 more) · FINDINGS (+3 more) · PR_MERGED (+1 more) · CONTRIBUTOR_NOTIFIED

Shipped

6/24/2026, 3:04:49 PM


03 Sludge Pulse
stack72 assigned stack72 · 6/23/2026, 6:33:20 PM

mgreten commented 6/23/2026, 8:38:37 PM

New evidence: GC leaves objects in the local cache too, so manual S3 cleanup is silently undone by the next push — and declared per-artifact retention is never enforced

Concrete reproduction on a shared MinIO S3 datastore (namespace agentic-tooling). Three findings that extend the original report.

1. swamp data gc reclaims catalog rows but deletes objects from neither S3 nor the local repo cache.

After swamp data gc --force (reported "100,621 excess versions reclaimed, 192 MB"), the underlying objects remained in both places:

  • S3: dead-version objects still present under agentic-tooling/data/...
  • Local cache: ~/.swamp/repos/<repoId>/ still held the full set — 330,242 files / 1.6 GB (141,536 of them from a single high-frequency model).

So "reclaimed" only mutated the catalog; the object set — and therefore .datastore-index.json size — was unchanged. This matches the markDirty-hook root cause already identified here.

2. Manual S3 deletion is healed by re-push from the local cache — the cleanup cannot hold.

Because ~/.swamp/repos/<repoId>/ is authoritative and push reconciles S3 up to match it, deleting objects directly from the bucket is silently reverted:

  • Manually deleted ~256k dead-version objects from S3, rebuilt the index → agentic-tooling/.datastore-index.json dropped 154 MB → 22 MB, data/ 282k → 25k objects. Looked fixed.
  • The very next push from the owning machine (an unrelated vault put, plus routine pollers) re-uploaded the deleted objects: ~193k objects re-written in a ~10-minute window, dominated by one model's snapshots (~122k).
  • The index re-bloated 22 MB → 145 MB → 255 MB and kept climbing, converging back toward the local cache's full object count.

Net: there is currently no operator-accessible way to shrink the object set — GC won't delete objects (catalog-only), and manual bucket deletion is re-pushed from the unpruned local cache.
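The revert in finding 2 follows directly from a push that treats the local cache as authoritative. The sketch below is an assumed model of that reconciliation, not swamp's actual sync code: `planPush` and `PushPlan` are hypothetical, and the rule "upload whatever is local but missing remotely, delete remotely only what was explicitly marked deleted" is inferred from the observed behaviour.

```typescript
interface PushPlan {
  upload: string[];
  deleteRemote: string[];
}

// Assumed reconciliation rule (inferred from the behaviour reported above).
function planPush(
  local: Set<string>,
  remote: Set<string>,
  markedDeleted: Set<string>,
): PushPlan {
  const upload = Array.from(local)
    .filter((key) => !remote.has(key))
    .sort();
  const deleteRemote = Array.from(markedDeleted)
    .filter((key) => remote.has(key) && !local.has(key))
    .sort();
  return { upload, deleteRemote };
}

// Case 1: operator deletes v1 and v2 directly in the bucket, local cache unpruned.
const unprunedCache = new Set<string>(["m/v1", "m/v2", "m/v3"]);
const cleanedBucket = new Set<string>(["m/v3"]);
const revertPlan = planPush(unprunedCache, cleanedBucket, new Set<string>());
// revertPlan.upload is ["m/v1", "m/v2"]: the manual cleanup is undone.

// Case 2: GC prunes the local cache and marks the same keys as deleted.
const prunedCache = new Set<string>(["m/v3"]);
const fullBucket = new Set<string>(["m/v1", "m/v2", "m/v3"]);
const shrinkPlan = planPush(
  prunedCache,
  fullBucket,
  new Set<string>(["m/v1", "m/v2"]),
);
// shrinkPlan.deleteRemote is ["m/v1", "m/v2"] and nothing is re-uploaded.
```

In this model the object set can only shrink durably when both the local cache and the dirty signal reflect the deletion, which is the scope suggested further down in this comment.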

3. Declared per-artifact retention is never enforced.

The bloat is dominated by a poller whose artifacts explicitly declare retention:

lifetime: 30d
garbageCollection: 100   # keep 100 versions

…yet individual data names have 10,553 / 10,535 / 7,317 versions on disk. The policy is recorded in every version's metadata.yaml but nothing prunes to it — consistent with the prune decision being computed while the delete never reaches S3 or the local cache.
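To make the size of the gap concrete, here is a sketch of what enforcing the declared policy would select. The semantics are an assumption, not taken from swamp: a version is treated as prunable if it falls outside the newest `keepVersions` or is older than `lifetimeDays`. `Version`, `Retention`, and `selectPrunable` are hypothetical names, and the ten-minute refresh interval is invented purely to generate test data.

```typescript
interface Version {
  id: number;
  createdAt: Date;
}

interface Retention {
  lifetimeDays: number; // from "lifetime: 30d"
  keepVersions: number; // from "garbageCollection: 100"
}

// Assumed rule: prune anything beyond the newest N, or older than the lifetime.
function selectPrunable(
  versions: Version[],
  policy: Retention,
  now: Date,
): Version[] {
  const cutoff = now.getTime() - policy.lifetimeDays * 24 * 60 * 60 * 1000;
  const newestFirst = versions
    .slice()
    .sort((a, b) => b.createdAt.getTime() - a.createdAt.getTime());
  return newestFirst.filter(
    (v, index) => index >= policy.keepVersions || v.createdAt.getTime() < cutoff,
  );
}

// Synthetic history: 10,553 versions (the largest count reported above),
// one every ten minutes going back from a fixed "now".
const now = new Date("2026-06-23T00:00:00Z");
const TEN_MINUTES_MS = 10 * 60 * 1000;
const history: Version[] = [];
for (let i = 0; i < 10553; i++) {
  history.push({ id: i, createdAt: new Date(now.getTime() - i * TEN_MINUTES_MS) });
}

const policy: Retention = { lifetimeDays: 30, keepVersions: 100 };
const prunable = selectPrunable(history, policy, now);
const kept = history.length - prunable.length;
// prunable.length is 10453 and kept is 100 under these assumptions.
```

Under this reading, roughly 99% of that data name's versions should already have been eligible for deletion.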

Interaction with #666. The unbounded index this produces is what makes the per-namespace structural lock holds long: a ~1 KB vault put held .locks/agentic-tooling.lock for ~6 minutes while pushing the inflated manifest, which timed out a scheduled workflow on a second host sharing the namespace (Workflow execution failed: Lock ".locks/agentic-tooling.lock" ... timed out after 60s). So #788 (object/cache bloat) and #666 (single structural lock) compound each other.

Suggested fix scope. GC's object deletion needs to cover (a) S3 objects and (b) the local repo cache under ~/.swamp/repos/<repoId>/, and the per-artifact garbageCollection / lifetime policy should actually drive pruning. Otherwise retention settings are advisory only and the manifest grows without bound, and any manual cleanup is undone on the next sync.

Environment

  • swamp 20260617.212026.0 (CLI); older binaries 20260609 / 20260516 also in the mix on a second host
  • @swamp/s3-datastore (MinIO backend), namespace agentic-tooling, shared across two machines
  • Affected model: a high-frequency GitHub PR-snapshot poller writing one new version per refresh

stack72 commented 6/24/2026, 3:07:39 PM

Thanks @mgreten for reporting this! The fix has been merged and a release is on its way. We appreciate your contribution to swamp.

mgreten commented 6/24/2026, 7:02:34 PM

Confirming the fix shipped in @swamp/s3-datastore@2026.06.24.1. Thank you for the fast turnaround. Picked it up via swamp extension update @swamp/s3-datastore. Note for anyone else tracking the fix: the binary-level commands swamp update and swamp repo upgrade don't touch the pulled extension, so the lockfile pin had to be bumped explicitly.

Will report back once a real data gc has run against the shared backend and I can confirm that the .datastore-index.json manifest actually shrinks (it had grown to ~110 MB / ~262k objects, which was also the dominant lock-hold cost in #666). Early sign is good: post-update the winning sync's lock hold dropped from ~117s to ~50s. Appreciate it.

mgreten commented 6/24/2026, 7:35:50 PM

Verified on @swamp/s3-datastore@2026.06.24.1 — the fix works. Ran a real swamp data gc -f against a namespace that had accumulated a lot of expired poller history:

  • ~104,352 versions / 7,799 entries / ~199 MB reclaimed, all duration-expired.
  • Follow-up gc --dry-run now reports 0 expired entries — confirming the objects were actually deleted from the backend and the catalog shrank durably (previously this churn never cleared). 👍

One operational note for anyone GC'ing a large backlog: the gc deletions committed to the local catalog fine, but the post-delete push to S3 timed out at the default 300000ms (it had to rewrite the large index + delete the objects in one push), so gc exited non-zero with a trailing push-timeout error even though the deletions were durable. Re-running with SWAMP_DATASTORE_SYNC_TIMEOUT_MS=1800000 set on the command completed cleanly. Might be worth either bumping the implicit-push timeout for gc specifically, or noting in the gc docs that a large first GC needs the env var. (This is the same monolithic-index-under-lock cost as #666.)
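For reference, the two options in the note above (the env var override, or a larger implicit default for gc) could be combined along these lines. This is only a sketch: the env var name and the 300000 ms default come from the note, while `resolveSyncTimeoutMs` and the idea of a per-command fallback are hypothetical and do not describe how swamp actually reads the setting.

```typescript
const DEFAULT_SYNC_TIMEOUT_MS = 300_000; // default reported in the note above

// Hypothetical helper: an explicit env override wins, otherwise use the
// fallback supplied by the calling command.
function resolveSyncTimeoutMs(
  env: Record<string, string | undefined>,
  fallbackMs: number = DEFAULT_SYNC_TIMEOUT_MS,
): number {
  const raw = env["SWAMP_DATASTORE_SYNC_TIMEOUT_MS"];
  if (raw === undefined || raw.trim() === "") return fallbackMs;
  const parsed = Number(raw);
  // Ignore values that are not positive, finite numbers.
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallbackMs;
}

const defaultTimeout = resolveSyncTimeoutMs({});
const overridden = resolveSyncTimeoutMs({
  SWAMP_DATASTORE_SYNC_TIMEOUT_MS: "1800000",
});
const overriddenMinutes = overridden / 60_000; // 30 minutes, versus 5 by default

// A gc-specific caller could pass a larger fallback instead of relying on the env var.
const gcTimeout = resolveSyncTimeoutMs({}, 1_800_000);
```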

Thanks again — great to see this land.
