feat(clickhouse): tracked archive→CH backfill tooling (backfill.sql)

Workflow steps today must reference a pre-existing model instance by modelIdOrName. This forces the workflow author (and ultimately the operator) to manually create every model instance the workflow uses before the first run, with the right --global-arg values.

For an extension that ships a useful "diagnose this whole namespace" workflow (@john/debug-namespace-deep is the motivating case), this means the user can't just run:

swamp extension pull @john/k8s
swamp workflow run @john/debug-namespace-deep --input namespace=my-broken-ns

They first have to make eight manual swamp model create calls, one per resource type (pod, service, deployment, event, configmap, pvc, secret, netpol), each with the right --global-arg namespace=my-broken-ns, before the workflow will execute. That is far more friction than the equivalent imperative path (kubectl get …), and it is the single biggest reason an LLM agent given a debugging task chooses raw kubectl over invoking an existing swamp workflow.

Empirically: in our k8s-debug benchmark (https://github.com/systeminit/swamp-benchmark), agents find the @john/debug-namespace-deep workflow but consistently choose to reimplement its diagnosis with model_method primitives because of the prerequisite cost.

Proposed solution

Extend the existing model_method task type to accept a model type plus globalArgs as an alternative to a pre-existing instance name. When modelType is given, swamp creates an ephemeral instance for the call (or caches one per workflow run), invokes the method, and tears the instance down (or marks it for GC) when the run completes.

# Today — requires pre-existing `foo-pod` instance
- name: list-pods
  task:
    type: model_method
    modelIdOrName: foo-pod
    methodName: list

# Proposed — workflow author specifies the type + globalArgs;
# swamp instantiates ephemerally for this call.
- name: list-pods
  task:
    type: model_method
    modelType: "@john/pod"           # NEW — alternative to modelIdOrName
    globalArgs:                      # NEW — only used when modelType is given
      namespace: ${{ inputs.namespace }}
    methodName: list

modelIdOrName and modelType are mutually exclusive on a single task. When modelType is used:

A transient instance is created if one with matching (type, globalArgs) doesn't already exist for this workflow run.
It's reused across steps within the same run that match the same (type, globalArgs).
It's torn down at workflow-run completion (or marked for GC under the existing data lifecycle rules).
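The rules above can be sketched in a few lines. This is only an illustrative Python model of the proposed semantics, not swamp's implementation: ModelMethodTask, RunScopedInstances, and the create/destroy callbacks are hypothetical names, and rejecting globalArgs without modelType (rather than ignoring it) is an assumption.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional, Tuple


@dataclass
class ModelMethodTask:
    """Hypothetical in-memory form of a model_method task."""
    method_name: str
    model_id_or_name: Optional[str] = None
    model_type: Optional[str] = None
    global_args: Optional[Dict[str, Any]] = None


def validate(task: ModelMethodTask) -> None:
    # modelIdOrName and modelType are mutually exclusive; exactly one must be set.
    if (task.model_id_or_name is None) == (task.model_type is None):
        raise ValueError("set exactly one of modelIdOrName or modelType")
    # Assumption: globalArgs without modelType is rejected rather than ignored.
    if task.global_args is not None and task.model_type is None:
        raise ValueError("globalArgs is only valid together with modelType")


CacheKey = Tuple[str, Tuple[Tuple[str, str], ...]]


def cache_key(model_type: str, global_args: Dict[str, Any]) -> CacheKey:
    # Sort the arguments so their order does not affect matching;
    # repr() keeps arbitrary values hashable.
    return (model_type, tuple(sorted((k, repr(v)) for k, v in global_args.items())))


class RunScopedInstances:
    """Tracks the ephemeral instances created for one workflow run."""

    def __init__(self, create: Callable[[str, Dict[str, Any]], str],
                 destroy: Callable[[str], None]) -> None:
        self._create = create      # hypothetical hook: returns the new instance name
        self._destroy = destroy    # hypothetical hook: tears down (or marks for GC)
        self._instances: Dict[CacheKey, str] = {}

    def resolve(self, task: ModelMethodTask) -> str:
        """Return the instance name the task's method should be invoked on."""
        validate(task)
        if task.model_id_or_name is not None:
            return task.model_id_or_name  # today's path: pre-existing instance
        args = task.global_args or {}
        key = cache_key(task.model_type, args)
        if key not in self._instances:
            # First step in this run with this (type, globalArgs): create once.
            self._instances[key] = self._create(task.model_type, args)
        return self._instances[key]

    def teardown(self) -> None:
        """Called at workflow-run completion."""
        for name in self._instances.values():
            self._destroy(name)
        self._instances.clear()


# Usage with in-memory stand-ins for instance creation and teardown.
created, destroyed = [], []


def fake_create(model_type: str, global_args: Dict[str, Any]) -> str:
    name = f"ephemeral-{len(created)}"
    created.append((model_type, dict(global_args)))
    return name


run = RunScopedInstances(fake_create, destroyed.append)
args = {"namespace": "my-broken-ns"}
first = run.resolve(ModelMethodTask("list", model_type="@john/pod", global_args=args))
second = run.resolve(ModelMethodTask("get", model_type="@john/pod", global_args=args))
run.teardown()
```

In this sketch both steps resolve to the same transient instance because they share (type, globalArgs), and the instance is destroyed once when the run ends. Keying on the sorted arguments is one possible matching rule; the proposal itself does not specify how globalArgs equality is determined.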

Data artefacts produced by the ephemeral instance are tagged with the workflow run id (already done) so output lookups via data.findBySpec(...) continue to work — they reference the produced dataNames rather than the (transient) instance name.

Alternatives considered

Workflow bootstrap job using command/shell: works, but requires a one-time swamp model create command/shell shell and pollutes the workflow with shelling out, shell quoting, and error swallowing. Tried this; it works but feels like a workaround.
Extension manifests declare default instances: the extension ships with a manifest entry like "create a default pod instance of type @john/pod on extension pull." Cleaner than bootstrap-via-shell, but doesn't solve namespace parameterisation (the global arg is fixed at create time, so one instance can only serve one namespace).
Make models accept namespace as a method input rather than a global arg: done in the @john/k8s extension as a backwards-compatible refinement. Helps direct CLI usage and pairs nicely with this proposal (the ephemeral instance can serve any namespace), but on its own still requires an instance to exist before the workflow can run. So this is complementary, not a substitute.

Impact

Workflow authors: ship genuinely zero-prereq workflows. "Run this command, get the diagnosis."
Operators / agents: from swamp extension pull X to swamp workflow run X/foo is one hop, no instance bookkeeping.
LLM agents specifically: removes the largest friction we observe pushing them toward reimplementing diagnosis in shell rather than using the framework.

Empirical context

Benchmarked in https://github.com/systeminit/swamp-benchmark (the k8s-debug-v2 challenge). With the current pre-create requirement, swamp-prompted agents take ~140 turns / ~$1.50 per run for a 4-fault namespace because they spend most turns rebuilding diagnosis from primitives. Agents with the same toolchain but no swamp steer (raw kubectl) finish the same task in ~30 turns / ~$0.30. The gap in turns and cost is largely attributable to the prerequisite-instance step the workflow path forces before any structured diagnosis can run.

02 Bog Flow

Closed

5/13/2026, 3:31:35 PM

No activity in this phase yet.

03 Sludge Pulse

stack72 commented 5/13/2026, 3:31:34 PM

This is now working - https://github.com/systeminit/swamp/pull/1352
