feat(clickhouse): tracked archive→CH backfill tooling (backfill.sql)

The telemetry watcher's throughput is gated by two consumers (stats, counter) that do per-event sequential MongoDB round-trips. As a result, ingest tops out at roughly 20 events/sec and does not scale with BATCH_SIZE or POLL_INTERVAL_MS; tuning those env vars has no effect.

Impact

On 2026-06-25 the watcher stalled (13:12Z) and accumulated a large pending backlog in telemetry_events. To drain it faster, we set BATCH_SIZE=500 / POLL_INTERVAL_MS=20 on the prod telemetry-api deployment and rolled it. Measured drain after the change: **21 events/sec**, essentially identical to the default (BATCH_SIZE=50 / POLL_INTERVAL_MS=2000 ≈ 25/s). One 500-event batch turns over in ~25s, so a ~200k backlog takes hours (roughly 2.6 h at 21 events/sec), not minutes. Those env settings are the wrong lever.

Root cause

ConsumerRegistry.dispatchTo runs all consumers in parallel (Promise.allSettled, services/telemetry/lib/consumer.ts:53), so a batch's wall-time = the slowest consumer. The two slowest do work that is linear AND sequential in batch size:

services/telemetry/lib/consumers/stats.ts:121-137: the dedup-inserts step runs await inserts.insertOne(...) inside a for loop, one awaited round-trip per event. The code's own comment calls this "the typical bottleneck" (stats.ts:107-108). On top of that come per-user sequential updateOne loops in upsertPerDistinctId (:282) and updateIdentityMap (:417, two writes/user), plus a full-history re-aggregate per affected username every batch (recomputeUsernameMetricsProjection, :477, which reads every device doc and re-sums).
services/telemetry/lib/consumers/counter.ts:143-161: the same per-event sequential insertOne dedup loop.

Because the work is N sequential awaits, a 10x batch = ~10x the time = same throughput. POLL_INTERVAL_MS=20 is moot since batch processing (~25s) >> the poll gap.
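
The arithmetic behind that claim can be sketched as a toy model. It assumes a fixed cost per awaited round-trip (50 ms, derived from the ~25 s observed for a 500-event batch) and ignores the poll gap and any non-sequential work; the function names are hypothetical and not taken from the codebase.

```typescript
// Toy model (assumption): every event costs one awaited round-trip of fixed latency.
// perEventMs = 50 comes from the observed ~25 s per 500-event batch.
function sequentialBatchMs(batchSize: number, perEventMs: number): number {
  return batchSize * perEventMs;
}

// Events per second while a batch is being processed.
function eventsPerSec(batchSize: number, perEventMs: number): number {
  return batchSize / (sequentialBatchMs(batchSize, perEventMs) / 1000);
}

for (const batchSize of [50, 500, 5000]) {
  const seconds = sequentialBatchMs(batchSize, 50) / 1000;
  console.log(`BATCH_SIZE=${batchSize}: ${seconds}s per batch, ${eventsPerSec(batchSize, 50)} events/sec`);
}
```

Under this model the batch size cancels out: throughput is simply 1000 / perEventMs, so only reducing the number of round-trips per event changes it.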

services/telemetry/lib/consumers/metrics.ts:353 already fixed this exact pattern with a single bulkWrite(ordered:false) for dedup (and :415 for rollups). counter and stats were never converted. (clickhouse and s3 are also fine — one bulk call per batch.)
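
For illustration, a single-round-trip dedup could look roughly like the sketch below. It is not the code at metrics.ts:353. The types are local stand-ins for the MongoDB driver's shapes, and claimNewEvents, InMemoryCollection and FakeBulkWriteError are hypothetical names. The sketch assumes that an unordered bulk insert reports duplicates as an error carrying a writeErrors list with an index and code 11000 per failed op; that shape should be checked against the driver version in use.

```typescript
// Local stand-ins (assumptions) for the driver's bulkWrite shapes, so this runs without a database.
type InsertOp = { insertOne: { document: { _id: string } } };
type WriteErrorLike = { index: number; code: number };

interface DedupCollection {
  bulkWrite(ops: InsertOp[], opts: { ordered: boolean }): Promise<unknown>;
}

const DUPLICATE_KEY = 11000; // MongoDB duplicate-key error code

class FakeBulkWriteError extends Error {
  writeErrors: WriteErrorLike[];
  constructor(writeErrors: WriteErrorLike[]) {
    super("bulk write failed");
    this.writeErrors = writeErrors;
  }
}

// Returns the ids that were newly inserted, using one round-trip for the whole batch.
async function claimNewEvents(coll: DedupCollection, ids: string[]): Promise<string[]> {
  if (ids.length === 0) return [];
  const ops: InsertOp[] = ids.map((id) => ({ insertOne: { document: { _id: id } } }));
  try {
    await coll.bulkWrite(ops, { ordered: false });
    return ids; // no duplicates at all
  } catch (err) {
    const writeErrors = (err as { writeErrors?: WriteErrorLike[] } | null)?.writeErrors;
    // Anything other than duplicate-key failures is a real error: rethrow.
    if (!Array.isArray(writeErrors) || writeErrors.some((e) => e.code !== DUPLICATE_KEY)) {
      throw err;
    }
    const duplicates = new Set(writeErrors.map((e) => e.index));
    return ids.filter((_, i) => !duplicates.has(i));
  }
}

// In-memory fake that mimics a unique index on _id with unordered semantics.
class InMemoryCollection implements DedupCollection {
  private seen = new Set<string>();
  roundTrips = 0;

  async bulkWrite(ops: InsertOp[], _opts: { ordered: boolean }): Promise<unknown> {
    this.roundTrips++;
    const errors: WriteErrorLike[] = [];
    ops.forEach((op, index) => {
      const id = op.insertOne.document._id;
      if (this.seen.has(id)) errors.push({ index, code: DUPLICATE_KEY });
      else this.seen.add(id);
    });
    if (errors.length > 0) throw new FakeBulkWriteError(errors);
    return { insertedCount: ops.length };
  }
}
```

With ordered:false, a duplicate does not stop the remaining inserts, so the batch costs one round-trip however many events were already processed.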

Cross-cutting: dedup is done three separate times against three collections (processed_inserts, counter_processed_inserts, metrics_processed_inserts).
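
A minimal sketch of what deduplicating once before fan-out might look like, assuming a single shared claim step replaces the three per-consumer collections. TelemetryEvent, Consumer, dispatchDeduped and makeInMemoryClaim are hypothetical names, not the ConsumerRegistry API. One trade-off this sketch does not resolve: with a single claim, an event counts as processed even if one consumer later fails, so retries would need separate handling.

```typescript
type TelemetryEvent = { id: string };
type Consumer = { name: string; handle(events: TelemetryEvent[]): Promise<void> };
// claim returns the subset of ids that had not been seen before (assumed interface).
type Claim = (ids: string[]) => Promise<Set<string>>;

// Dedup once, then fan out the same filtered batch to every consumer in parallel.
async function dispatchDeduped(events: TelemetryEvent[], claim: Claim, consumers: Consumer[]) {
  const fresh = await claim(events.map((e) => e.id));
  const emitted = new Set<string>();
  const unique = events.filter((e) => {
    if (!fresh.has(e.id) || emitted.has(e.id)) return false; // replay or in-batch duplicate
    emitted.add(e.id);
    return true;
  });
  return Promise.allSettled(consumers.map((c) => c.handle(unique)));
}

// In-memory stand-in for the shared dedup store, for experimentation only.
function makeInMemoryClaim(): Claim {
  const store = new Set<string>();
  return async (ids) => {
    const fresh = new Set<string>();
    for (const id of ids) {
      if (!store.has(id)) {
        store.add(id);
        fresh.add(id);
      }
    }
    return fresh;
  };
}
```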

Proposed fixes (ranked)

1. Convert counter + stats dedup to bulkWrite(ordered:false): copy the metrics pattern. N round-trips -> 1. Makes BATCH_SIZE actually buy throughput. Low risk; pattern already in-tree.
2. Dedup once in the watcher before fan-out instead of 3x per-consumer.
3. Batch stats' per-user writes (bulkWrite over users, not sequential updateOne).
4. Stop re-aggregating full per-user history every batch (recomputeUsernameMetricsProjection): go incremental or move off the hot path.
5. Take derived bookkeeping (per-user metrics, identity map, username projection, milestones) off the FIFO critical path. The archive (s3/clickhouse) is already one-shot; the slowness is all derived state piled into the synchronous fan-out. This also means one stuck consumer freezes the entire pipeline (as happened in the stall above).
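
To make the per-user batching and incremental-aggregation fixes more concrete, here is a minimal sketch that folds a batch into one $inc update per user, suitable for a single bulkWrite call. StatEvent, the counts.<kind> field layout and buildPerUserOps are assumptions for illustration; the real per-user documents in stats.ts may look quite different. Because $inc is not idempotent, this only works on events that have already passed dedup.

```typescript
// Hypothetical event and op shapes; the op mirrors MongoDB's updateOne bulk operation.
type StatEvent = { username: string; kind: string };
type UpdateOneOp = {
  updateOne: {
    filter: { username: string };
    update: { $inc: Record<string, number> };
    upsert: boolean;
  };
};

// Collapse a batch into one incremental update per user instead of
// one write per event plus a full-history re-sum.
function buildPerUserOps(events: StatEvent[]): UpdateOneOp[] {
  const perUser = new Map<string, Record<string, number>>();
  for (const ev of events) {
    const inc = perUser.get(ev.username) ?? {};
    const field = `counts.${ev.kind}`; // assumed field layout
    inc[field] = (inc[field] ?? 0) + 1;
    perUser.set(ev.username, inc);
  }
  return Array.from(perUser, ([username, inc]): UpdateOneOp => ({
    updateOne: { filter: { username }, update: { $inc: inc }, upsert: true },
  }));
}

const ops = buildPerUserOps([
  { username: "ann", kind: "run" },
  { username: "bob", kind: "run" },
  { username: "ann", kind: "run" },
]);
console.log(JSON.stringify(ops));
```

The resulting array has one entry per distinct user in the batch, so the write cost scales with users touched rather than with events or with history length.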

02Bog Flow

Shipped

6/25/2026, 4:59:03 PM

03Sludge Pulse

keeb assigned keeb · 6/25/2026, 4:01:12 PM