feat(clickhouse): tracked archive→CH backfill tooling (backfill.sql)

On an S3-backed datastore, swamp data gc prunes the local catalog (data-version index) but never deletes the underlying objects from S3. It reports a large bytesReclaimed, but the objects remain in the bucket and the file-level sync manifest (.datastore-index.json) — which enumerates every object under the prefix and is pulled/pushed under the global lock on every sync — does not shrink. So GC does not relieve the sync/lock cost it appears to address.

This is distinct from the global-lock contention in #666 (lock scope). This is the GC ↔ datastore object lifecycle: reclaimed versions are orphaned in S3.

Root cause (swamp core)

createDataGcDeps() in src/libswamp/data/gc.ts constructs the data repository without a markDirty hook:

const unifiedDataRepo = new FileSystemUnifiedDataRepository(
  repoDir,
  dsPath(SWAMP_SUBDIRS.data),
  catalogStore,            // <- 4th arg (markDirty) omitted
);

FileSystemUnifiedDataRepository.delete() calls notifyDirty(), which is a no-op when the hook is absent (src/infrastructure/persistence/unified_data_repository.ts):

private async notifyDirty(relPath?: string): Promise<void> {
  if (this.markDirty) await this.markDirty(relPath);   // markDirty undefined under GC
}

With no dirty signal, the end-of-command sync push has nothing to propagate, so the S3 DELETEs are never issued. Additionally, for lazy-cached S3 datastores the version content isn't present locally, so Deno.remove(versionDir) hits NotFound and only the catalog row (catalogStore.removeVersion) is dropped. bytesReclaimed is computed from catalog/size accounting, not from bytes actually freed in the backend.
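
The effect of the missing hook can be illustrated with a minimal stand-in. This is only a sketch: `SketchDataRepository` and `deleteAndCollectDirty` are hypothetical names, the hook is synchronous for brevity (the real one is async, per the excerpt above), and nothing here is swamp's actual repository code.

```typescript
// Hypothetical, simplified stand-in for the repository described above.
// Only the markDirty / notifyDirty names come from the report.
type MarkDirty = (relPath?: string) => void;

class SketchDataRepository {
  private files: Map<string, number>; // relPath -> size in bytes
  private markDirty: MarkDirty | undefined; // optional hook, as in the report

  constructor(files: Map<string, number>, markDirty?: MarkDirty) {
    this.files = files;
    this.markDirty = markDirty;
  }

  // Removes the local entry, then signals the sync layer if a hook exists.
  delete(relPath: string): void {
    this.files.delete(relPath);
    this.notifyDirty(relPath);
  }

  private notifyDirty(relPath?: string): void {
    if (this.markDirty) this.markDirty(relPath); // no-op when the hook is absent
  }
}

// Deletes one version and returns the paths a later push would learn about.
function deleteAndCollectDirty(wireHook: boolean): string[] {
  const dirty: string[] = [];
  const hook: MarkDirty = (p) => {
    if (p !== undefined) dirty.push(p);
  };
  const files = new Map<string, number>([
    ["data/model/v1", 10],
    ["data/model/v2", 20],
  ]);
  const repo = new SketchDataRepository(files, wireHook ? hook : undefined);
  repo.delete("data/model/v1");
  return dirty;
}
```

Without the hook the returned list is empty, so a push driven by that list has nothing to delete remotely; with the hook wired it contains the deleted path.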

Reproduction

1. Use a repo on @swamp/s3-datastore (MinIO), namespaced, with substantial version history.
2. Run swamp data gc --force. It logs e.g. "deleted 7150 expired items, 100621 excess versions reclaimed (192171800 bytes)", then "push complete, no changes".
3. Inspect the bucket: object count and total size under the prefix are unchanged; {prefix}/.datastore-index.json is rewritten at the same (large) size.

Observed (real numbers)

GC reported ~183 MiB (192,171,800 bytes) / 100k+ versions reclaimed; sync logged "push complete, no changes".
Post-GC catalog: ~10k rows / ~5 MB.
Bucket reality, same prefix: 321,678 objects / 470 MB (…/data/ alone 281,583 objects / 270 MB); {prefix}/.datastore-index.json still 154 MB, rewritten on the GC push.
Sync wall-time unchanged — the global-lock hold is gated by the 154 MB manifest, which still lists all ~321k objects.

Expected

GC should physically delete reclaimed objects from the datastore backend (or mark them dirty so the push deletes them), so that object count, total size, and the sync manifest all shrink — and bytesReclaimed reflects bytes actually freed.
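
One way to make bytesReclaimed reflect bytes actually freed is to sum the sizes the backend reports for each successful delete, and to track objects that were already absent separately. The sketch below assumes a hypothetical `ObjectBackend` interface and an in-memory stand-in; it is not the @swamp/s3-datastore API.

```typescript
// Hypothetical backend interface: deleteObject returns the number of bytes
// freed, or null when the object was not present.
interface ObjectBackend {
  deleteObject(key: string): number | null;
}

// In-memory stand-in for a bucket, used only for illustration.
class InMemoryBackend implements ObjectBackend {
  private objects: Map<string, number>; // key -> size in bytes

  constructor(objects: Map<string, number>) {
    this.objects = objects;
  }

  deleteObject(key: string): number | null {
    const size = this.objects.get(key);
    if (size === undefined) return null;
    this.objects.delete(key);
    return size;
  }

  count(): number {
    return this.objects.size;
  }
}

interface GcResult {
  versionsRemoved: number;
  bytesReclaimed: number; // bytes the backend confirmed as freed
  missing: string[]; // keys that were already absent
}

// Deletes each reclaimed key and counts only what was really removed.
function collectGarbage(backend: ObjectBackend, reclaimedKeys: string[]): GcResult {
  const result: GcResult = { versionsRemoved: 0, bytesReclaimed: 0, missing: [] };
  for (const key of reclaimedKeys) {
    const freed = backend.deleteObject(key);
    if (freed === null) {
      result.missing.push(key);
      continue;
    }
    result.versionsRemoved += 1;
    result.bytesReclaimed += freed;
  }
  return result;
}
```

With this accounting, a delete that hits a missing object contributes nothing to the reported total instead of inflating it.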

Impact

The mitigation users are told to apply for large-index sync cost (swamp data gc) silently has no effect on S3 datastores. Reclaimed-version objects accumulate forever, the per-file sync manifest keeps growing, and global-lock hold time per sync never improves. Directly compounds #666.

Notes

A monolithic root .datastore-index.json (103 MB) continues to be rewritten after a namespace migration even though the repo is namespaced — possibly related (the manifest layer seems to track objects independently of catalog/namespace state).
The ephemeral lifetime logs "Ephemeral lifetime is not yet implemented", so ephemeral-tagged data is never collected either.

Environment

swamp 20260617.212026.0-sha.396e0952
@swamp/s3-datastore (MinIO backend, conditional-write locking), darwin/arm64, Deno 2.8.x
Adjacent to #666 (single global datastore lock).

02 Bog Flow

Shipped

6/24/2026, 3:04:49 PM

03 Sludge Pulse

stack72 assigned stack72 6/23/2026, 6:33:20 PM

mgreten commented 6/23/2026, 8:38:37 PM

New evidence: GC leaves objects in the local cache too, so manual S3 cleanup is silently undone by the next push — and declared per-artifact retention is never enforced

Concrete reproduction on a shared MinIO S3 datastore (namespace agentic-tooling). Three findings that extend the original report.

1. swamp data gc reclaims catalog rows but deletes objects from neither S3 nor the local repo cache.

After swamp data gc --force (reported "100,621 excess versions reclaimed, 192 MB"), the underlying objects remained in both places:

S3: dead-version objects still present under agentic-tooling/data/...
Local cache: ~/.swamp/repos/<repoId>/ still held the full set — 330,242 files / 1.6 GB (141,536 of them from a single high-frequency model).

So "reclaimed" only mutated the catalog; the object set — and therefore .datastore-index.json size — was unchanged. This matches the markDirty-hook root cause already identified here.

2. Manual S3 deletion is healed by re-push from the local cache — the cleanup cannot hold.

Because ~/.swamp/repos/<repoId>/ is authoritative and push reconciles S3 up to match it, deleting objects directly from the bucket is silently reverted:

Manually deleted ~256k dead-version objects from S3, rebuilt the index → agentic-tooling/.datastore-index.json dropped 154 MB → 22 MB, data/ 282k → 25k objects. Looked fixed.
The very next push from the owning machine (an unrelated vault put, plus routine pollers) re-uploaded the deleted objects: ~193k objects re-written in a ~10-minute window, dominated by one model's snapshots (~122k).
The index re-bloated 22 MB → 145 MB → 255 MB and kept climbing, converging back toward the local cache's full object count.

Net: there is currently no operator-accessible way to shrink the object set — GC won't delete objects (catalog-only), and manual bucket deletion is re-pushed from the unpruned local cache.
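
The revert in finding 2 is what any push produces if it treats the local cache as authoritative and only deletes remotely when a deletion was signalled. Here is a minimal sketch of such a plan, with hypothetical names (`planPush`, `PushPlan`); it models the observed behaviour and is not swamp's push implementation.

```typescript
// Hypothetical model: the local cache and the bucket are sets of object keys.
interface PushPlan {
  upload: string[]; // keys present locally but missing remotely
  remove: string[]; // keys to delete remotely
}

// dirtyDeletes holds keys whose local deletion was signalled to the sync layer.
function planPush(
  local: Set<string>,
  remote: Set<string>,
  dirtyDeletes: Set<string>,
): PushPlan {
  const upload = Array.from(local)
    .filter((key) => !remote.has(key))
    .sort();
  const remove = Array.from(dirtyDeletes)
    .filter((key) => remote.has(key) && !local.has(key))
    .sort();
  return { upload, remove };
}

// Manual bucket cleanup while the local cache is unpruned: b and c come back.
const afterManualCleanup = planPush(
  new Set(["a", "b", "c"]),
  new Set(["a"]),
  new Set<string>(),
);
console.log(afterManualCleanup.upload); // [ 'b', 'c' ]

// GC that prunes locally and signals the deletions: b and c are removed remotely.
const afterWiredGc = planPush(
  new Set(["a"]),
  new Set(["a", "b", "c"]),
  new Set(["b", "c"]),
);
console.log(afterWiredGc.remove); // [ 'b', 'c' ]
```

Under this model the object set only shrinks when the local cache is pruned and the deletions reach the sync layer, which matches the suggested fix scope.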

3. Declared per-artifact retention is never enforced.

The bloat is dominated by a poller whose artifacts explicitly declare retention:

lifetime: 30d
garbageCollection: 100   # keep 100 versions

…yet individual data names have 10,553 / 10,535 / 7,317 versions on disk. The policy is recorded in every version's metadata.yaml but nothing prunes to it — consistent with the prune decision being computed while the delete never reaches S3 or the local cache.
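
For reference, a prune decision driven by the declared policy could look roughly like the sketch below. How swamp combines the two settings is not stated in this thread; the sketch assumes a version is pruned when it is older than the lifetime or falls outside the newest N versions. All type and function names are hypothetical.

```typescript
interface VersionInfo {
  version: number; // monotonically increasing version number
  createdAtMs: number; // creation time in epoch milliseconds
}

interface RetentionPolicy {
  lifetimeDays: number; // e.g. 30 for "lifetime: 30d"
  keepVersions: number; // e.g. 100 for "garbageCollection: 100"
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns the version numbers to prune, in ascending order.
// Assumption: expired OR beyond the newest keepVersions means prune.
function selectVersionsToPrune(
  versions: VersionInfo[],
  policy: RetentionPolicy,
  nowMs: number,
): number[] {
  const cutoffMs = nowMs - policy.lifetimeDays * MS_PER_DAY;
  const newestFirst = versions.slice().sort((a, b) => b.version - a.version);
  const prune: number[] = [];
  newestFirst.forEach((v, index) => {
    const expired = v.createdAtMs < cutoffMs;
    const excess = index >= policy.keepVersions;
    if (expired || excess) prune.push(v.version);
  });
  return prune.sort((a, b) => a - b);
}
```

With garbageCollection: 100, a data name holding 10,553 versions would have at least 10,453 of them selected under this rule, which is the gap between the declared policy and what is on disk.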

Interaction with #666. The unbounded index this produces is what makes the per-namespace structural lock holds long: a ~1 KB vault put held .locks/agentic-tooling.lock for ~6 minutes while pushing the inflated manifest, which timed out a scheduled workflow on a second host sharing the namespace (Workflow execution failed: Lock ".locks/agentic-tooling.lock" ... timed out after 60s). So #788 (object/cache bloat) and #666 (single structural lock) compound each other.

Suggested fix scope. GC's object deletion needs to cover (a) S3 objects and (b) the local repo cache under ~/.swamp/repos/<repoId>/, and the per-artifact garbageCollection / lifetime policy should actually drive pruning. Otherwise retention settings are advisory only and the manifest grows without bound, and any manual cleanup is undone on the next sync.

Environment

swamp 20260617.212026.0 (CLI); older binaries 20260609 / 20260516 also in the mix on a second host
@swamp/s3-datastore (MinIO backend), namespace agentic-tooling, shared across two machines
Affected model: a high-frequency GitHub PR-snapshot poller writing one new version per refresh

stack72 commented 6/24/2026, 3:07:39 PM

Thanks @mgreten for reporting this! The fix has been merged and a release is on its way. We appreciate your contribution to swamp.

mgreten commented 6/24/2026, 7:02:34 PM

Confirming the fix shipped in @swamp/s3-datastore@2026.06.24.1 — thank you for the fast turnaround. Picked it up via swamp extension update @swamp/s3-datastore (the binary swamp update / swamp repo upgrade don't touch the pulled extension, so the lockfile pin had to be bumped explicitly — worth a mention for anyone else tracking the fix).

Will report back once a real data gc has run against the shared backend, to confirm that the .datastore-index.json manifest actually shrinks (it had grown to ~110 MB / ~262k objects, which was also the dominant lock-hold cost in #666). Early sign is good: post-update the winning sync's lock hold dropped from ~117s to ~50s. Appreciate it.

mgreten commented 6/24/2026, 7:35:50 PM

Verified on @swamp/s3-datastore@2026.06.24.1 — the fix works. Ran a real swamp data gc -f against a namespace that had accumulated a lot of expired poller history:

~104,352 versions / 7,799 entries / ~199 MB reclaimed, all duration-expired.
Follow-up gc --dry-run now reports 0 expired entries — confirming the objects were actually deleted from the backend and the catalog shrank durably (previously this churn never cleared). 👍

One operational note for anyone GC'ing a large backlog: the gc deletions committed to the local catalog fine, but the post-delete push to S3 timed out at the default 300000ms (it had to rewrite the large index + delete the objects in one push), so gc exited non-zero with a trailing push-timeout error even though the deletions were durable. Re-running with SWAMP_DATASTORE_SYNC_TIMEOUT_MS=1800000 set on the command completed cleanly. Might be worth either bumping the implicit-push timeout for gc specifically, or noting in the gc docs that a large first GC needs the env var. (This is the same monolithic-index-under-lock cost as #666.)
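
For anyone scripting around this, the override amounts to replacing the 300000 ms default (5 minutes) with 1800000 ms (30 minutes). The sketch below shows one plausible way such an environment override could be resolved. The variable name and default come from the note above; `resolveSyncTimeoutMs` and its fallback on invalid values are assumptions, not swamp's actual parsing.

```typescript
// Default reported in the note above: 300000 ms = 5 minutes.
const DEFAULT_SYNC_TIMEOUT_MS = 300000;

// Hypothetical resolver: returns the override when it is a positive integer,
// otherwise falls back to the default.
function resolveSyncTimeoutMs(env: Record<string, string | undefined>): number {
  const raw = env["SWAMP_DATASTORE_SYNC_TIMEOUT_MS"];
  if (raw === undefined || raw.trim() === "") return DEFAULT_SYNC_TIMEOUT_MS;
  const parsed = Number(raw);
  if (!Number.isInteger(parsed) || parsed <= 0) return DEFAULT_SYNC_TIMEOUT_MS;
  return parsed;
}

// The value used in the note above: 1800000 ms = 30 minutes.
console.log(resolveSyncTimeoutMs({ SWAMP_DATASTORE_SYNC_TIMEOUT_MS: "1800000" }) / 60000); // 30
```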

Thanks again — great to see this land.
