Lab #655: SQLite run tracker subsystem for in-flight run lifecycle

Problem

swamp has no subsystem for tracking the lifecycle of in-flight runs. Today, model method run persists a ModelOutput YAML file with status: "running" at the start of execution and updates it to succeeded or failed on completion. This design has three fundamental problems:

1. Process death leaves orphaned "running" records (#636)

When a method run process dies unexpectedly (V8 OOM, SIGKILL, power failure, laptop sleep), the catch block never executes. The output YAML stays on disk with status: "running" permanently. There is no heartbeat, no TTL, no reaper — nothing to distinguish "crashed 11 hours ago" from "still legitimately running after 11 hours." The existing datastore lock self-heals via heartbeat + TTL, but run records have no equivalent mechanism.

2. Runs are not observable or addressable outside the owning terminal (#519)

A running workflow exists only as a process in a specific terminal. There's no way to check status from another shell, list what's in flight, or cancel a misfired run cooperatively. Multi-user repos on shared machines are blind to each other's runs. Terminal death loses the view even when the underlying work continues.

3. The output YAML is the wrong place to track in-flight state

The YamlOutputRepository has a write-once invariant: output YAML files are written exactly once at completion, and findAllGlobalSince() uses file mtime as a fast pre-filter to skip old files without parsing. Heartbeating into the output YAML would break this invariant and trigger remote datastore sync (S3 PUT) on every heartbeat. It's the wrong storage layer for mutable, high-frequency liveness data.

Prior art

We researched how other execution frameworks handle this:

| System | Mechanism | Key insight |
| --- | --- | --- |
| Apache Airflow | DB heartbeat column, scheduler zombie scan every 10s | Heartbeat + polling is the standard pattern |
| Temporal | `RecordActivityTaskHeartbeat` RPC, server-side timeout | Heartbeats can carry progress data for resumable retries |
| Sidekiq Pro | Redis key with 60s TTL, per-process private queues | Orphaned jobs from dead processes' queues are re-enqueued |
| Celery | Broker-level unacked message redelivery | Delegates crash detection to transport layer |
| Terraform | Manual `force-unlock`, no automatic recovery | Deliberately avoids automatic recovery (state corruption risk) |

All centralized systems converge on heartbeat + timeout polling. The key design variable is heartbeat interval vs detection latency. swamp's challenge is that it has no long-running supervisor — detection is opportunistic (next CLI invocation).

Proposed solution: SQLite run tracker

The proposal is a local SQLite database (.swamp/run_tracker.db) that becomes the single source of truth for in-flight run lifecycle. SQLite is already used in the codebase (node:sqlite / DatabaseSync in extension_catalog_store.ts), so this adds no new dependency.

Why SQLite

Cheap heartbeat writes — single-row UPDATE, no YAML serialization, no file creation/deletion per heartbeat
ACID guarantees — no torn reads from concurrent processes
Query semantics — SELECT * FROM active_runs WHERE heartbeat_at < ? finds all stale runs in one call
Local-only — liveness tracking is inherently a local concern; the DB stays out of the remote datastore sync path
Concurrent access — WAL mode handles multiple readers/writers cleanly
Established pattern — mirrors the extension catalog store's use of DatabaseSync

Schema (initial)

CREATE TABLE active_runs (
  id            TEXT PRIMARY KEY,   -- run UUID (matches ModelOutput.id or WorkflowRun.id)
  run_kind      TEXT NOT NULL,      -- 'model_method' | 'workflow'
  model_type    TEXT,               -- normalized model type (for model method runs)
  method_name   TEXT,               -- method name (for model method runs)
  workflow_name TEXT,               -- workflow name (for workflow runs)
  pid           INTEGER NOT NULL,   -- Deno.pid of the owning process
  hostname      TEXT NOT NULL,      -- os.hostname() of the owning machine
  started_at    TEXT NOT NULL,      -- ISO 8601 timestamp
  heartbeat_at  TEXT NOT NULL,      -- ISO 8601 timestamp, updated every N seconds
  status        TEXT NOT NULL DEFAULT 'running'  -- 'running' | 'completed' | 'failed' | 'cancelled'
);

CREATE INDEX idx_active_runs_status ON active_runs(status);
CREATE INDEX idx_active_runs_heartbeat ON active_runs(heartbeat_at);
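
On the TypeScript side, a row of this table could be mirrored by a type along the following lines. This is only a sketch: `ActiveRunRow`, `NewRunInput`, and `newActiveRun` are hypothetical names, not existing swamp APIs, and pid and hostname are passed in rather than read from the runtime so the example stays self-contained.

```typescript
// Hypothetical TypeScript mirror of the active_runs table (all names are assumptions).
type RunKind = "model_method" | "workflow";
type RunStatus = "running" | "completed" | "failed" | "cancelled";

interface ActiveRunRow {
  id: string;
  run_kind: RunKind;
  model_type: string | null;    // set for model method runs
  method_name: string | null;   // set for model method runs
  workflow_name: string | null; // set for workflow runs
  pid: number;
  hostname: string;
  started_at: string;   // ISO 8601, UTC
  heartbeat_at: string; // ISO 8601, UTC
  status: RunStatus;
}

interface NewRunInput {
  id: string;
  kind: RunKind;
  pid: number;
  hostname: string;
  now: Date;
  modelType?: string;
  methodName?: string;
  workflowName?: string;
}

// Builds the row that the Register step would INSERT: started_at and
// heartbeat_at share one timestamp and status starts as 'running'.
function newActiveRun(input: NewRunInput): ActiveRunRow {
  const ts = input.now.toISOString();
  return {
    id: input.id,
    run_kind: input.kind,
    model_type: input.modelType ?? null,
    method_name: input.methodName ?? null,
    workflow_name: input.workflowName ?? null,
    pid: input.pid,
    hostname: input.hostname,
    started_at: ts,
    heartbeat_at: ts,
    status: "running",
  };
}
```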

Lifecycle

1. Register: on method/workflow start, INSERT a row with pid, hostname, started_at, heartbeat_at = now, status = 'running'
2. Heartbeat: every 30s during execution, UPDATE active_runs SET heartbeat_at = now WHERE id = ? (trivially cheap)
3. Complete: on success/failure, UPDATE status to completed or failed, or DELETE the row (a design choice; keeping rows enables history/query for #519)
4. Reap: on next CLI invocation, query for rows where status = 'running' AND heartbeat is stale:
   - Same machine (hostname matches): check isProcessDead(pid) first (instant), fall back to heartbeat TTL
   - Different machine (hostname mismatch): use heartbeat TTL only
   - Mark reaped runs as failed in both the tracker DB and the output YAML
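
The reap rules above can be sketched as a pure decision function. This is one reading of those rules, in which the same-machine PID check is consulted before the TTL; whether the PID check should also apply to rows whose heartbeat is still fresh is left open by the text. `RunningRow`, `ReapContext`, and `decideReap` are hypothetical names, and the process check is injected as a callback rather than calling the real isProcessDead() from file_lock.ts.

```typescript
// Minimal sketch of the reap decision (names and shapes are assumptions).
interface RunningRow {
  id: string;
  pid: number;
  hostname: string;
  heartbeat_at: string; // ISO 8601 timestamp
}

interface ReapContext {
  hostname: string; // hostname of the machine running the reaper
  nowMs: number;    // current time in epoch milliseconds
  ttlMs: number;    // heartbeat TTL, e.g. 90_000
  isProcessDead: (pid: number) => boolean; // injected stand-in for the real check
}

type ReapDecision = "reap" | "keep";

function decideReap(row: RunningRow, ctx: ReapContext): ReapDecision {
  const sameHost = row.hostname === ctx.hostname;

  // Fast path: a PID is only meaningful on the machine that owns it.
  if (sameHost && ctx.isProcessDead(row.pid)) return "reap";

  // Fallback (and the only rule across machines): heartbeat older than the TTL.
  // A live PID is not proof of a live run, because PIDs can be reused.
  const ageMs = ctx.nowMs - Date.parse(row.heartbeat_at);
  return ageMs > ctx.ttlMs ? "reap" : "keep";
}
```

Keeping the decision free of I/O makes the same-machine and cross-machine branches easy to cover in unit tests.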

Key design decisions

Don't persist "running" to the output YAML. The run tracker owns the "in-flight" lifecycle. Output YAMLs are only written on terminal states (succeeded/failed), preserving the write-once invariant that findAllGlobalSince() depends on. This means the current flow changes:

Today: create output → markRunning → save YAML → execute → markSucceeded/Failed → save YAML
Proposed: register in tracker → execute with heartbeat → on completion: create output in terminal state → save YAML once → deregister from tracker
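
A minimal sketch of the proposed flow, assuming a hypothetical `RunTracker` interface with `register`, `heartbeat`, and `deregister` methods and a hypothetical `saveOutput` callback that stands in for writing the output YAML. None of these names come from the existing codebase.

```typescript
// Sketch only: RunTracker, Outcome, and runWithHeartbeat are hypothetical names.
interface RunTracker {
  register(id: string): void;
  heartbeat(id: string): void;
  deregister(id: string): void;
}

type Outcome<T> =
  | { status: "succeeded"; value: T }
  | { status: "failed"; error: unknown };

async function runWithHeartbeat<T>(
  tracker: RunTracker,
  id: string,
  work: () => Promise<T>,
  saveOutput: (outcome: Outcome<T>) => void,
  intervalMs: number = 30_000,
): Promise<Outcome<T>> {
  // 1. Register in the tracker before any work starts.
  tracker.register(id);

  // 2. Execute with a periodic heartbeat.
  const timer = setInterval(() => tracker.heartbeat(id), intervalMs);
  let outcome: Outcome<T>;
  try {
    outcome = { status: "succeeded", value: await work() };
  } catch (error) {
    outcome = { status: "failed", error };
  } finally {
    clearInterval(timer);
  }

  // 3. Write the output exactly once, already in a terminal state.
  saveOutput(outcome);

  // 4. Deregister. If the process dies before this point, the row stays
  //    'running' with an aging heartbeat, which is what the reaper looks for.
  tracker.deregister(id);
  return outcome;
}
```

In this shape the output is never written with a "running" status: a crash leaves only a tracker row behind, not a half-finished YAML file.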

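Heartbeat interval of 30s with 90s TTL. The TTL spans three heartbeat intervals, so a live process can miss two consecutive heartbeats without being treated as stale. On the same machine a crashed process is detectable immediately via the PID check; from any other machine it becomes detectable once its last heartbeat is more than 90s old. In both cases detection happens on the next CLI invocation, since there is no supervisor. These defaults should be configurable.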
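
Because heartbeat_at is stored as ISO 8601 text, the staleness check can be a plain string comparison against a cutoff. The sketch below assumes every timestamp is written with `Date.prototype.toISOString()`, which always emits UTC in a fixed-width format, so lexicographic order matches chronological order. Timestamps written in another format or with a local offset would break that assumption. `staleCutoff` and `isStale` are hypothetical helper names.

```typescript
// Defaults taken from the proposal; both are meant to be configurable.
const HEARTBEAT_INTERVAL_MS = 30_000;
const HEARTBEAT_TTL_MS = 90_000;

// The value to bind to `?` in: ... WHERE heartbeat_at < ?
function staleCutoff(now: Date, ttlMs: number = HEARTBEAT_TTL_MS): string {
  return new Date(now.getTime() - ttlMs).toISOString();
}

// In-process equivalent of the SQL predicate. String comparison is valid
// only because both sides are fixed-width UTC ISO 8601 strings.
function isStale(heartbeatAt: string, now: Date, ttlMs: number = HEARTBEAT_TTL_MS): boolean {
  return heartbeatAt < staleCutoff(now, ttlMs);
}
```

A heartbeat exactly at the cutoff is treated as still fresh here; whether the boundary is inclusive is a detail the proposal does not specify.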

PID check as fast-path, not primary mechanism. isProcessDead() (already in file_lock.ts) gives instant same-machine detection. But it's unreliable cross-machine (remote datastores, shared repos). The heartbeat TTL is the real mechanism; PID is an optimization.

Cross-machine detection. When the hostname in the tracker row differs from the current machine, PID check is meaningless. Detection relies entirely on heartbeat TTL. This is acceptable — the common case is same-machine, and the TTL window (90s) is short enough for cross-machine scenarios.

Extract isProcessDead() to shared utility. Currently private to file_lock.ts. Both the lock system and the run tracker need it. Move to src/infrastructure/runtime/process.ts.

Idempotent reaping. Two processes racing to reap the same stale run must not crash. The reaper should handle "already reaped" as a no-op.
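
One way to get this behaviour is a conditional update: only flip a row to failed if it is still 'running', and treat "zero rows changed" as "someone else already handled it". The sketch below uses an in-memory stand-in for the table so that it runs on its own; against SQLite the same guard would be an UPDATE with `WHERE id = ? AND status = 'running'`, using the affected-row count. `InMemoryRuns` and `reapRun` are hypothetical names.

```typescript
type Status = "running" | "completed" | "failed" | "cancelled";

// In-memory stand-in for the active_runs table (illustration only).
class InMemoryRuns {
  private rows = new Map<string, Status>();

  insert(id: string): void {
    this.rows.set(id, "running");
  }

  status(id: string): Status | undefined {
    return this.rows.get(id);
  }

  // Mirrors: UPDATE active_runs SET status = 'failed'
  //          WHERE id = ? AND status = 'running'
  // Returns the number of rows changed (0 or 1).
  markFailedIfRunning(id: string): number {
    if (this.rows.get(id) !== "running") return 0;
    this.rows.set(id, "failed");
    return 1;
  }
}

// Returns true only for the caller that actually performed the reap.
function reapRun(store: InMemoryRuns, id: string, onReaped: (id: string) => void): boolean {
  const changed = store.markFailedIfRunning(id);
  if (changed === 0) return false; // already reaped, finished, or unknown: no-op
  onReaped(id); // follow-up work (e.g. writing the failed output) runs once
  return true;
}
```

Only the caller that wins the conditional update performs the follow-up work, so a second reaper arriving later sees zero changed rows and returns quietly.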

What this enables

Immediate (#636)

Stale "running" model outputs are automatically detected and marked failed on next invocation
No more permanently wedged run records after OOM/crash

Future (#519)

swamp workflow status <run-id> — query the tracker DB
swamp workflow list — SELECT * FROM active_runs with formatting
swamp workflow cancel <run-id> — set a cancellation flag that the heartbeat loop checks
Detached runs — the tracker persists state independently of the terminal
Multi-user visibility — all runs in the repo are queryable

Future (beyond #519)

Run history and analytics (if rows are kept after completion)
Concurrent run limits (check active count before starting)
Stale workflow run detection (same pattern as model method runs)

Scope

This issue covers building the run tracker subsystem and wiring it into model method run as the first consumer:

SQLite run tracker infrastructure (schema, open/close, migrations)
Domain interface (RunTracker repository pattern)
Heartbeat mechanism (interval timer during execution)
Stale run detection and reaping
Wire into modelMethodRun in src/libswamp/models/run.ts
Change output YAML to write-once (terminal states only)
Extract isProcessDead() to shared utility
Tests

Wiring into workflow runs, adding CLI query commands, cancel/detach — those are #519's scope, building on this foundation.

#636 — model method run OOMs at 4GB V8 heap; crash leaves run stuck in "running" (the stale reaping symptom this subsystem fixes)
#519 — persistent, queryable workflow runs (the observability/cancel features this subsystem enables)
