Opt-in automatic garbage collection for datastore data

Datastore data versions accumulate indefinitely because garbage collection only runs when the user manually invokes swamp data gc. For high-churn datastores — particularly those with frequent model method runs — this causes the catalog export (.catalog-export.json) to grow unboundedly, increasing lock-hold time on every write in the namespace.

This issue consolidates two related requests:

#806: Optional scheduled/automatic datastore GC for duration-expired data on high-churn S3 datastores, so users don't have to hand-roll their own swamp data gc scheduler wrappers.
#811: Auto-generated method-summary report artifacts (garbageCollection: 5, lifetime: 30d) dominate the catalog export because their GC policy is never enforced automatically. One profiled namespace had 109,152 catalog rows / 156.6 MB tracked, but only 13,884 rows (7.1 MB) were current state. Roughly 95% of the tracked bytes (about 87% of the rows) were non-latest versions, with method-summary artifacts alone accounting for ~108 MB.
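
As a quick check on the profiling figures quoted for #811, the non-latest share can be recomputed from the row and size totals. This is only arithmetic on the numbers reported above; the variable names are illustrative.

```python
# Figures quoted above for the profiled namespace in #811.
total_rows, total_mb = 109_152, 156.6
current_rows, current_mb = 13_884, 7.1

# Everything that is not current state is a non-latest version.
stale_rows = total_rows - current_rows
stale_mb = total_mb - current_mb

stale_row_share = stale_rows / total_rows
stale_byte_share = stale_mb / total_mb

print(f"non-latest rows:  {stale_rows} ({stale_row_share:.1%})")
print(f"non-latest bytes: {stale_mb:.1f} MB ({stale_byte_share:.1%})")
```

The ~95% figure therefore matches the share measured by size; measured by row count the non-latest share is somewhat lower.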

Both issues share the same root cause: GC policies are declared on data metadata but only enforced during manual swamp data gc runs. The catalog export serializes every row (all versions) to JSON on every push, so accumulated versions directly increase lock-hold time for all writers in the namespace.

Why the catalog export matters

Every datastore write triggers a push of .catalog-export.json, which is a full JSON dump of all catalog rows in the namespace. More accumulated versions → more rows → bigger export → longer lock-hold during push. Even unrelated writes (vault operations, small state updates) pay the full cost of the bloated export.
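
The scaling described above can be illustrated with a toy model: if every version is a row and every push serializes all rows, the export grows with the number of retained versions rather than with the number of live items. The row shape below is hypothetical; the real catalog schema is not shown in this issue.

```python
import json


def export_size(rows):
    """Bytes needed to serialize the given catalog rows as one JSON document."""
    return len(json.dumps(rows).encode("utf-8"))


def make_catalog(items, versions_per_item):
    # Hypothetical row shape, used only to make the growth visible.
    return [
        {"name": f"item-{i}", "version": v, "latest": v == versions_per_item}
        for i in range(items)
        for v in range(1, versions_per_item + 1)
    ]


# Same 100 items, with 1 retained version each versus 20 retained versions each.
lean = make_catalog(items=100, versions_per_item=1)
bloated = make_catalog(items=100, versions_per_item=20)

print("lean:   ", len(lean), "rows,", export_size(lean), "bytes")
print("bloated:", len(bloated), "rows,", export_size(bloated), "bytes")
```

In this model the number of live items is identical in both catalogs; only the retained history differs, and the serialized size follows the history.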

Proposed approach

Add an opt-in automatic GC mechanism that enforces existing GC policies (version-count caps like garbageCollection: 5 and duration-based retention like lifetime: 30d) without requiring manual intervention. This should be opt-in rather than default behavior, since automatic data deletion is a significant behavioral change.
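
A minimal sketch of how the two policy kinds might be evaluated for one data item, under stated assumptions: garbageCollection: N is read as "keep the newest N versions", lifetime: 30d is read as "versions older than 30 days are collectable", and either condition is sufficient. The Version type and select_collectable function are hypothetical and are not swamp's implementation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass(frozen=True)
class Version:
    number: int
    created_at: datetime


def select_collectable(
    versions: List[Version],
    keep_count: Optional[int],
    lifetime: Optional[timedelta],
    now: datetime,
) -> List[Version]:
    """Return the versions a policy would collect, newest first.

    Assumed semantics (not confirmed by the issue): a version is collectable
    if it falls outside the newest `keep_count` versions OR is older than
    `lifetime`. A policy of None disables that check.
    """
    newest_first = sorted(versions, key=lambda v: v.number, reverse=True)
    collectable = []
    for index, version in enumerate(newest_first):
        over_cap = keep_count is not None and index >= keep_count
        expired = lifetime is not None and now - version.created_at > lifetime
        if over_cap or expired:
            collectable.append(version)
    return collectable


# Eight versions created five days apart; version 8 is the newest (age 0 days).
now = datetime(2026, 6, 26, tzinfo=timezone.utc)
versions = [Version(n, now - timedelta(days=40 - 5 * n)) for n in range(1, 9)]

capped = select_collectable(versions, keep_count=5, lifetime=timedelta(days=30), now=now)
age_only = select_collectable(versions, keep_count=None, lifetime=timedelta(days=30), now=now)
print([v.number for v in capped])
print([v.number for v in age_only])
```

Whether the latest version should ever be collected by the duration rule is a design decision this sketch does not settle; it simply applies both checks uniformly.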

Possible entry points:

Repository-level config (.swamp.yaml) — e.g. autoGc: true or autoGc: { afterMethodRun: true } to enable GC after method runs
Scheduled/periodic GC — a built-in scheduler or datastore-level config for periodic GC runs
Per-write inline GC — GC scoped to the specific data items just written, triggered after each write
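
For the repository-level option, the two example shapes (autoGc: true and autoGc: { afterMethodRun: true }) could be normalized into one internal form before use. The sketch below assumes the boolean form is shorthand for enabling the after-method-run trigger; that reading, and the normalize_auto_gc helper itself, are assumptions rather than documented behavior.

```python
from typing import Any, Dict


def normalize_auto_gc(value: Any) -> Dict[str, bool]:
    """Normalize a parsed autoGc value into {"afterMethodRun": bool}.

    Hypothetical helper: assumes `autoGc: true` is shorthand for
    `autoGc: { afterMethodRun: true }` and that a missing key means "off",
    in keeping with the opt-in requirement.
    """
    if value is None or value is False:
        return {"afterMethodRun": False}
    if value is True:
        return {"afterMethodRun": True}
    if isinstance(value, dict):
        unknown = set(value) - {"afterMethodRun"}
        if unknown:
            raise ValueError(f"unknown autoGc keys: {sorted(unknown)}")
        return {"afterMethodRun": bool(value.get("afterMethodRun", False))}
    raise TypeError(f"autoGc must be a boolean or a mapping, got {type(value).__name__}")


print(normalize_auto_gc(True))
print(normalize_auto_gc({"afterMethodRun": True}))
print(normalize_auto_gc(None))
```

Rejecting unknown keys keeps a typo in the config from silently leaving automatic deletion disabled or enabled.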

Data and catalog entries must be pruned together to avoid orphaning data files on remote datastores.
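
The pairing requirement can be sketched with in-memory stand-ins for the catalog and the remote datastore. Everything here is hypothetical (the dict-based catalog, the path layout, the function names); the point is only that each version's data file and catalog row are removed as a unit, and that dropping a row without its file leaves an orphan.

```python
from typing import Dict, Iterable, List, Set, Tuple

Key = Tuple[str, int]  # (data name, version number)


def prune_versions(
    catalog: Dict[Key, str], store: Dict[str, bytes], doomed: Iterable[Key]
) -> List[Key]:
    """Remove each doomed version's data file and catalog row as a pair."""
    pruned = []
    for key in doomed:
        path = catalog.get(key)
        if path is None:
            continue  # row already gone; nothing to pair with
        # Assumed ordering: delete the file first, then the row. If the run is
        # interrupted in between, the row still points at the missing file and
        # can be cleaned up later, instead of leaving an unreferenced file.
        store.pop(path, None)
        del catalog[key]
        pruned.append(key)
    return pruned


def orphaned_paths(catalog: Dict[Key, str], store: Dict[str, bytes]) -> Set[str]:
    """Data files that no catalog row references any more."""
    return set(store) - set(catalog.values())


catalog = {("report", 1): "report/1", ("report", 2): "report/2", ("report", 3): "report/3"}
store = {"report/1": b"a", "report/2": b"b", "report/3": b"c"}

pruned = prune_versions(catalog, store, [("report", 1)])
orphans_after_paired_prune = orphaned_paths(catalog, store)

# Dropping only the catalog row leaves the file behind with no reference.
del catalog[("report", 2)]
orphans_after_row_only_delete = orphaned_paths(catalog, store)

print(pruned, orphans_after_paired_prune, orphans_after_row_only_delete)
```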

Context from triage

During triage of #811, we explored several approaches:

Inline GC after report writes — effective but risky as a default; deletes data automatically on every method run
Latest-only catalog export — would shrink the export but orphans data files on remote datastores since the catalog no longer references non-latest versions
Overwrite instead of version — eliminates accumulation for reports but loses version history entirely

The opt-in auto-GC approach was chosen because it addresses both issues cleanly, keeps data and catalog in sync, and lets users control when automatic deletion is acceptable.

Environment

swamp: 20260624.181631.0-sha.aa2ae00f
Relevant to all datastore backends (local filesystem, S3)

02 Bog Flow

Shipped

6/26/2026, 9:47:52 AM


03 Sludge Pulse

stack72 assigned stack72 6/25/2026, 9:57:07 PM

mgreten commented 6/26/2026, 2:02:44 AM

Excited for this one — thanks for picking it up. While profiling our own datastore to prep for auto-GC, I turned up something that might be worth folding into the design: swamp data gc currently doesn't collect ephemeral-lifetime data.

Concretely, on @swamp/s3-datastore@2026.06.24.1 (swamp 20260625.225837.0):

One of our models writes a small per-call result artifact declared lifetime: "ephemeral" with garbageCollection: 50.
These accumulated to 2,559 distinct ephemeral artifacts that were never collected.
A swamp data gc --dry-run reported dataEntriesExpired: 8350 and would reclaim ~128 MB — but 0 of those 2,559 ephemeral artifacts were in the expired set. GC reclaimed expired duration data and version history, but left the ephemeral artifacts entirely untouched.

So today ephemeral doesn't appear to translate into anything GC acts on — an ephemeral artifact lives as long as a lifetime: infinite one. If auto-GC is going to be the mechanism that keeps high-churn datastores lean, it might be worth having it (or data gc) treat ephemeral as collectable — e.g. collect ephemeral artifacts not referenced by the latest run, or honor a max-age/garbageCollection count for them. That would let models opt unimportant per-run scratch data out of the index cleanly, which is exactly the accumulation pattern that bloats the catalog export.

Totally possible this is already in scope for #823 — just flagging the measurement in case it's useful for the design. Happy to share more profiling detail if it'd help, and thanks again for the auto-GC work.

mgreten commented 6/26/2026, 5:21:15 PM

Thanks for shipping this — picked it up right away and I'm hoping it'll help, but I want to make sure I'm enabling it correctly before I trust it.

I set autoGc: true as a top-level key in .swamp.yaml (it sits as a sibling of gitignoreManaged / swampSha, and it survives swamp repo upgrade un-stripped, so it looks recognized). Both my writer machines are on 20260626.102849.0 with the key set.

But I can't tell whether it's actually firing. Peeking at the binary, the GC call that emits auto_gc_completed looks gated on a per-invocation input.autoGc rather than directly on the repo-config field — and swamp model method run doesn't seem to expose an --auto-gc flag. So I'm unsure whether the top-level autoGc: true config propagates into that per-run input automatically, or whether it needs to be set some other way.

Two quick questions when you have a moment (no rush):

Is top-level autoGc: true in .swamp.yaml the intended enable, or does it belong somewhere else (e.g. nested under datastore:, or per-model)?
Is there a way to confirm it's running — a log line / event at a particular log level, or a field in the method-run output I should look for? I tried watching --log output for auto_gc_completed but my namespace is busy enough that I couldn't get a clean observation.

Context for why I care: this is the high-churn / catalog-bloat namespace from the earlier profiling (the one where the index dominates the per-write lock-hold). Auto-GC keeping that index from re-growing between runs is exactly what I'm after. Happy to share whatever would help confirm it. Thanks again.
