feat(clickhouse): tracked archive→CH backfill tooling (backfill.sql)

The telemetry watcher runs unconditionally in every pod — startWatcher at services/telemetry/main.ts:137 has no leader election, lease, or singleton guard (confirmed: no leader/lease/claim coordination anywhere in services/telemetry/). The deployment now runs 2 replicas.

The batch claim is non-atomic: processPending does collection.find({ status: "pending" }).sort().limit() (services/telemetry/lib/watcher.ts:163) and then, as a separate statement, collection.updateMany({ _id: { $in } }, { $set: { status: "processing" } }) (:233). Between the find and the updateMany, a second replica can find the same pending docs — so both replicas mark the same batch processing and both dispatch it through the full consumer fan-out.
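The interleaving can be reproduced without a database. The sketch below is an in-memory stand-in, not the code in watcher.ts: `findPending` and `markProcessing` are hypothetical helpers that mimic the find and the unconditional updateMany, and the two "replicas" are simply two calls made in the problematic order.

```typescript
type Doc = { _id: number; status: "pending" | "processing" };

// Hypothetical stand-in for find({ status: "pending" }).sort().limit(n).
function findPending(docs: Doc[], limit: number): number[] {
  return docs
    .filter((d) => d.status === "pending")
    .slice(0, limit)
    .map((d) => d._id);
}

// Hypothetical stand-in for the unconditional updateMany: it matches on _id
// only, so it "succeeds" even when the doc is already marked processing.
function markProcessing(docs: Doc[], ids: number[]): void {
  for (const d of docs) {
    if (ids.includes(d._id)) d.status = "processing";
  }
}

function simulateRace(): { a: number[]; b: number[] } {
  const docs: Doc[] = [1, 2, 3].map((id) => ({ _id: id, status: "pending" as const }));
  const a = findPending(docs, 2); // replica A reads the head of the FIFO
  const b = findPending(docs, 2); // replica B reads before A has updated
  markProcessing(docs, a);
  markProcessing(docs, b);
  return { a, b }; // both replicas now dispatch the same batch
}

console.log(simulateRace());
```

Both replicas end up holding the same two ids, which is the duplicate dispatch described above.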

Why it doesn't corrupt data — but is still a bug

Correctness rests entirely on per-consumer idempotency: insert_id dedup (stats/counter/metrics *_processed_inserts), deterministic queue _id (discord/scoring), and upsert (extensions). The archive consumers (s3/clickhouse) are at-least-once rather than idempotent. So there are no double-counts, but:

Wasted duplicate work: both replicas run the entire consumer fan-out for the same events — ~2x Mongo + S3 + ClickHouse load for zero gain.
Duplicate archive inserts: S3 + ClickHouse are at-least-once; clickhouse.ts already notes ~3% dup batches and relies on query-time dedup — this amplifies it.
Mongo write contention: concurrent updateMany(processing) / deleteMany on the same _ids.
Throughput ceiling: at high rate (post-#819) the two watchers race on the head of the FIFO constantly, so effective throughput is well below 2x replicas and a large fraction of work is duplicated. This becomes the next bottleneck the moment the consumers stop being the bottleneck.

Evidence (prod, 2026-06-25)

2 telemetry-api replicas logging batch processing concurrently against the same FIFO — e.g. pod lm8lj batches at 17:22:06 / 17:22:12, pod m6pg2 at 17:22:09 / 17:22:14 (interleaved). No coordination in code.

Proposed (pick one)

1. Atomic claim: replace find-then-updateMany with a findOneAndUpdate (pending→processing) per doc/batch, or a claimed_by + claimed_at token so a batch is owned by exactly one worker. Keeps horizontal scaling.
2. Sharded poll: each replica takes a disjoint partition (e.g. hash(_id) % replicaCount), so no two replicas see the same doc. Keeps horizontal scaling.
3. Leader election / single-writer: a Mongo lease doc with TTL so only one replica runs the watcher; scale reads, not the writer. Simplest, but no write throughput gain from replicas.
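A minimal sketch of the claim-token variant of the atomic claim, again in memory. `claimIfPending` is a hypothetical helper standing in for a conditional update whose filter includes status "pending" alongside the _id list. In MongoDB terms that would presumably be an updateMany with that filter followed by a read of the docs carrying this worker's token, but that mapping is an assumption and not taken from watcher.ts.

```typescript
type Status = "pending" | "processing";
interface QueueDoc {
  _id: number;
  status: Status;
  claimed_by?: string;
  claimed_at?: number;
}

// Hypothetical conditional update: only docs that are STILL pending are
// modified. Returns the ids this worker actually won.
function claimIfPending(
  docs: QueueDoc[],
  ids: number[],
  worker: string,
  now: number,
): number[] {
  const won: number[] = [];
  for (const d of docs) {
    if (ids.includes(d._id) && d.status === "pending") {
      d.status = "processing";
      d.claimed_by = worker;
      d.claimed_at = now;
      won.push(d._id);
    }
  }
  return won;
}

function demoClaim(): { ownedByA: number[]; ownedByB: number[] } {
  const docs: QueueDoc[] = [1, 2, 3].map((id) => ({ _id: id, status: "pending" as Status }));
  const seen = [1, 2]; // both replicas read the same head of the FIFO
  const ownedByA = claimIfPending(docs, seen, "pod-a", 0);
  const ownedByB = claimIfPending(docs, seen, "pod-b", 0); // matches nothing
  return { ownedByA, ownedByB };
}

console.log(demoClaim());
```

Each worker dispatches only the ids it won, so the stale read by the second replica costs one empty update rather than a duplicated fan-out. The worker names and the numeric timestamp are placeholders.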

Prefer (1) or (2) if the intent of running 2 replicas was throughput.
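For the sharded poll, the partition function could look like the following sketch. It assumes each replica knows its own index and the total replica count (how those are obtained is not covered here), and it hashes the _id as a string with 32-bit FNV-1a; the hash choice and the function names are illustrative, not something the codebase uses.

```typescript
// 32-bit FNV-1a over the UTF-16 code units of the id string.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// Which replica owns a given doc id (hypothetical helper).
function ownerOf(id: string, replicaCount: number): number {
  return fnv1a(id) % replicaCount;
}

// The subset of ids that a given replica is allowed to process.
function partition(ids: string[], replicaIndex: number, replicaCount: number): string[] {
  return ids.filter((id) => ownerOf(id, replicaCount) === replicaIndex);
}

const sampleIds = ["a1", "b2", "c3", "d4", "e5", "f6"];
console.log(partition(sampleIds, 0, 2), partition(sampleIds, 1, 2));
```

Every id maps to exactly one replica, so the partitions are disjoint by construction. Filtering after the find still reads docs that belong to other replicas; storing the shard number on the doc at insert time would let the query filter do the work, at the cost of re-sharding when the replica count changes.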

Environment

prod telemetry-api (DigitalOcean sfo3), 2 replicas, MongoDB Atlas. Related: #817 (throughput scaling), #819 (deferred per-user recompute). This race is the next ceiling once #819 lands.

02 Bog Flow

Shipped

6/26/2026, 4:07:12 AM

03 Sludge Pulse

keeb assigned keeb 6/25/2026, 5:37:49 PM

keeb commented 6/26/2026, 3:21:01 AM

Follow-ups for the deliberately out-of-scope races identified while fixing this (both still covered by idempotent consumers today): #827 — retry/failed path has the same non-atomic claim; #828 — recoverOrphaned startup race (created_at-based). Fix is in PR #749 (pending path only).
