Relationships
⊘ blocks #636
#655 SQLite run tracker subsystem for in-flight run lifecycle
Opened by stack72 · 6/15/2026
Problem
swamp has no subsystem for tracking the lifecycle of in-flight runs. Today, model method run persists a ModelOutput YAML file with status: "running" at the start of execution and updates it to succeeded or failed on completion. This design has three fundamental problems:
1. Process death leaves orphaned "running" records (#636)
When a method run process dies unexpectedly (V8 OOM, SIGKILL, power failure, laptop sleep), the catch block never executes. The output YAML stays on disk with status: "running" permanently. There is no heartbeat, no TTL, no reaper — nothing to distinguish "crashed 11 hours ago" from "still legitimately running after 11 hours." The existing datastore lock self-heals via heartbeat + TTL, but run records have no equivalent mechanism.
2. Runs are not observable or addressable outside the owning terminal (#519)
A running workflow exists only as a process in a specific terminal. There's no way to check status from another shell, list what's in flight, or cancel a misfired run cooperatively. Multi-user repos on shared machines are blind to each other's runs. Terminal death loses the view even when the underlying work continues.
3. The output YAML is the wrong place to track in-flight state
The YamlOutputRepository has a write-once invariant: output YAML files are written exactly once at completion, and findAllGlobalSince() uses file mtime as a fast pre-filter to skip old files without parsing. Heartbeating into the output YAML would break this invariant and trigger remote datastore sync (S3 PUT) on every heartbeat. It's the wrong storage layer for mutable, high-frequency liveness data.
Prior art
We researched how other execution frameworks handle this:
| System | Mechanism | Key insight |
|---|---|---|
| Apache Airflow | DB heartbeat column, scheduler zombie scan every 10s | Heartbeat + polling is the standard pattern |
| Temporal | RecordActivityTaskHeartbeat RPC, server-side timeout | Heartbeats can carry progress data for resumable retries |
| Sidekiq Pro | Redis key with 60s TTL, per-process private queues | Orphaned jobs from dead processes' queues are re-enqueued |
| Celery | Broker-level unacked message redelivery | Delegates crash detection to transport layer |
| Terraform | Manual force-unlock, no automatic recovery | Deliberately avoids automatic recovery (state corruption risk) |
All centralized systems converge on heartbeat + timeout polling. The key design variable is heartbeat interval vs detection latency. swamp's challenge is that it has no long-running supervisor — detection is opportunistic (next CLI invocation).
Proposed solution: SQLite run tracker
A local SQLite database (.swamp/run_tracker.db) that becomes the single source of truth for in-flight run lifecycle. SQLite is already used in the codebase (node:sqlite / DatabaseSync in extension_catalog_store.ts), so this adds no new dependency.
Why SQLite
- Cheap heartbeat writes: single-row UPDATE, no YAML serialization, no file creation/deletion per heartbeat
- ACID guarantees: no torn reads from concurrent processes
- Query semantics: `SELECT * FROM active_runs WHERE heartbeat_at < ?` finds all stale runs in one call
- Local-only: liveness tracking is inherently a local concern; the DB stays out of the remote datastore sync path
- Concurrent access: WAL mode lets readers proceed alongside a writer, and concurrent writers serialize on the write lock instead of corrupting each other
- Established pattern: mirrors the extension catalog store's use of `DatabaseSync`
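The stale-run query above takes a single bound parameter: the cutoff timestamp. The sketch below shows one way that cutoff could be computed. It assumes `heartbeat_at` is always written with `Date.prototype.toISOString()` (fixed-width UTC), so that SQLite's default bytewise TEXT comparison matches chronological order. `staleCutoff` and `isStaleRow` are hypothetical names used for illustration only.

```typescript
// Hypothetical helper: builds the parameter for
//   SELECT * FROM active_runs WHERE heartbeat_at < ?
// Assumption: every heartbeat_at value was produced by toISOString(),
// so all timestamps share the same width and the "Z" (UTC) suffix.
function staleCutoff(now: Date, ttlMs: number): string {
  return new Date(now.getTime() - ttlMs).toISOString();
}

// In-memory stand-in for the SQL predicate. Plain string comparison is
// enough because fixed-width ISO 8601 UTC strings sort chronologically.
function isStaleRow(heartbeatAt: string, cutoff: string): boolean {
  return heartbeatAt < cutoff;
}

const now = new Date("2026-06-15T21:25:55.000Z");
const cutoff = staleCutoff(now, 90_000); // 90s TTL
// cutoff is "2026-06-15T21:24:25.000Z"
```

If timestamps were ever written in a local time zone or without milliseconds, the string comparison would stop being reliable, which is one argument for storing epoch milliseconds in an INTEGER column instead.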
Schema (initial)
CREATE TABLE active_runs (
id TEXT PRIMARY KEY, -- run UUID (matches ModelOutput.id or WorkflowRun.id)
run_kind TEXT NOT NULL, -- 'model_method' | 'workflow'
model_type TEXT, -- normalized model type (for model method runs)
method_name TEXT, -- method name (for model method runs)
workflow_name TEXT, -- workflow name (for workflow runs)
pid INTEGER NOT NULL, -- Deno.pid of the owning process
hostname TEXT NOT NULL, -- os.hostname() of the owning machine
started_at TEXT NOT NULL, -- ISO 8601 timestamp
heartbeat_at TEXT NOT NULL, -- ISO 8601 timestamp, updated every N seconds
status TEXT NOT NULL DEFAULT 'running' -- 'running' | 'completed' | 'failed' | 'cancelled'
);
CREATE INDEX idx_active_runs_status ON active_runs(status);
CREATE INDEX idx_active_runs_heartbeat ON active_runs(heartbeat_at);
Lifecycle
- Register: on method/workflow start, INSERT a row with `pid`, `hostname`, `started_at`, `heartbeat_at = now`, `status = 'running'`
- Heartbeat: every 30s during execution, `UPDATE active_runs SET heartbeat_at = now WHERE id = ?` (trivially cheap)
- Complete: on success/failure, UPDATE status to `completed` or `failed`, or DELETE the row (design choice; keeping rows enables history/query for #519)
- Reap: on next CLI invocation, query for rows where `status = 'running'` AND heartbeat is stale:
  - Same machine (hostname matches): check `isProcessDead(pid)` first (instant), fall back to heartbeat TTL
  - Different machine (hostname mismatch): use heartbeat TTL only
  - Mark reaped runs as `failed` in both the tracker DB and the output YAML
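The reap step above can be sketched as a pure decision function. This is an illustration under stated assumptions, not the proposed implementation: `ActiveRunRow`, `ReapContext`, and `shouldReap` are hypothetical names, `isProcessDead` is injected because its real signature in `file_lock.ts` is not shown here, and the sketch assumes a dead PID on the same machine is enough to reap even before the TTL expires.

```typescript
// Hypothetical row shape: only the active_runs columns the reaper reads.
interface ActiveRunRow {
  id: string;
  pid: number;
  hostname: string;
  heartbeatAt: string; // ISO 8601, as stored in heartbeat_at
  status: "running" | "completed" | "failed" | "cancelled";
}

// Everything environment-specific is injected so the decision stays pure.
interface ReapContext {
  localHostname: string;
  nowMs: number;
  ttlMs: number;
  // Stand-in for the existing isProcessDead() helper (signature assumed).
  isProcessDead: (pid: number) => boolean;
}

function shouldReap(row: ActiveRunRow, ctx: ReapContext): boolean {
  // Only in-flight rows are candidates; terminal rows are left alone.
  if (row.status !== "running") return false;

  const heartbeatStale = ctx.nowMs - Date.parse(row.heartbeatAt) > ctx.ttlMs;

  if (row.hostname === ctx.localHostname) {
    // Same machine: a dead PID is conclusive. A live PID may have been
    // reused by another process, so the TTL still applies as a fallback.
    return ctx.isProcessDead(row.pid) || heartbeatStale;
  }
  // Different machine: the PID means nothing here, so only the TTL counts.
  return heartbeatStale;
}

const ctx: ReapContext = {
  localHostname: "dev-box",
  nowMs: Date.parse("2026-06-15T12:02:00.000Z"),
  ttlMs: 90_000,
  isProcessDead: (pid) => pid === 111, // pretend only PID 111 has exited
};
```

Keeping the decision separate from the database write makes the hostname and PID branches easy to unit test without a real process table.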
Key design decisions
Don't persist "running" to the output YAML. The run tracker owns the "in-flight" lifecycle. Output YAMLs are only written on terminal states (succeeded/failed), preserving the write-once invariant that findAllGlobalSince() depends on. This means the current flow changes:
- Today: create output → markRunning → save YAML → execute → markSucceeded/Failed → save YAML
- Proposed: register in tracker → execute with heartbeat → on completion: create output in terminal state → save YAML once → deregister from tracker
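A minimal sketch of the proposed ordering, assuming two hypothetical ports (`TrackerPort`, `OutputPort`) in place of the real tracker and `YamlOutputRepository`. To stay synchronous and self-contained, the heartbeat is driven by a callback handed to the work function rather than by the interval timer the proposal describes.

```typescript
type TerminalStatus = "succeeded" | "failed";

// Hypothetical ports; the real interfaces are not defined in this issue.
interface TrackerPort {
  register(runId: string): void;
  heartbeat(runId: string): void;
  deregister(runId: string): void;
}
interface OutputPort {
  save(runId: string, status: TerminalStatus): void;
}

function runTracked(
  runId: string,
  tracker: TrackerPort,
  outputs: OutputPort,
  execute: (beat: () => void) => void,
): TerminalStatus {
  tracker.register(runId); // in-flight state lives only in the tracker
  let status: TerminalStatus = "succeeded";
  try {
    execute(() => tracker.heartbeat(runId));
  } catch {
    status = "failed";
  }
  outputs.save(runId, status); // the single write of the output, terminal state only
  tracker.deregister(runId);
  return status;
}

// Recording fakes that log the call order.
const events: string[] = [];
const fakeTracker: TrackerPort = {
  register: () => { events.push("register"); },
  heartbeat: () => { events.push("heartbeat"); },
  deregister: () => { events.push("deregister"); },
};
const fakeOutputs: OutputPort = {
  save: (_id, status) => { events.push("save:" + status); },
};

const okStatus = runTracked("run-1", fakeTracker, fakeOutputs, (beat) => { beat(); beat(); });
const failStatus = runTracked("run-2", fakeTracker, fakeOutputs, () => { throw new Error("boom"); });
```

If the process dies inside `execute`, neither `save` nor `deregister` runs, so no output YAML exists and the tracker row is left for the reaper to find.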
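Heartbeat interval of 30s with 90s TTL. The TTL spans three heartbeat intervals, so a live process that keeps heartbeating is never stale. A crashed process is detectable immediately on the same machine via the PID check, and from any machine once 90s have passed without a heartbeat; in both cases detection happens on the next CLI invocation. These defaults should be configurable.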
PID check as fast-path, not primary mechanism. isProcessDead() (already in file_lock.ts) gives instant same-machine detection. But it's unreliable cross-machine (remote datastores, shared repos). The heartbeat TTL is the real mechanism; PID is an optimization.
Cross-machine detection. When the hostname in the tracker row differs from the current machine, PID check is meaningless. Detection relies entirely on heartbeat TTL. This is acceptable — the common case is same-machine, and the TTL window (90s) is short enough for cross-machine scenarios.
Extract isProcessDead() to shared utility. Currently private to file_lock.ts. Both the lock system and the run tracker need it. Move to src/infrastructure/runtime/process.ts.
Idempotent reaping. Two processes racing to reap the same stale run must not crash. The reaper should handle "already reaped" as a no-op.
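One way to get idempotent reaping is a conditional state transition: only a row that is still `running` can be moved to `failed`, and the caller learns whether it performed the move. The sketch below models that with a hypothetical in-memory store (`InMemoryRuns`). In SQLite the equivalent would presumably be an `UPDATE ... WHERE id = ? AND status = 'running'` followed by a check of the affected row count, but that is an assumption rather than something this issue specifies.

```typescript
type RunStatus = "running" | "completed" | "failed" | "cancelled";

// Hypothetical in-memory stand-in for the active_runs table.
class InMemoryRuns {
  private rows = new Map<string, RunStatus>();

  insert(id: string): void {
    this.rows.set(id, "running");
  }

  statusOf(id: string): RunStatus | undefined {
    return this.rows.get(id);
  }

  // Conditional transition, modelled on
  //   UPDATE active_runs SET status = 'failed'
  //   WHERE id = ? AND status = 'running'
  // Returns true only for the caller that actually changed the row.
  reap(id: string): boolean {
    if (this.rows.get(id) !== "running") return false; // already reaped or unknown: no-op
    this.rows.set(id, "failed");
    return true;
  }
}

const runs = new InMemoryRuns();
runs.insert("run-1");
const firstReaper = runs.reap("run-1");  // wins the race
const secondReaper = runs.reap("run-1"); // sees a terminal row, does nothing
```

The losing reaper can use the `false` result to skip follow-up work such as updating the output YAML.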
What this enables
Immediate (#636)
- Stale "running" model outputs are automatically detected and marked failed on next invocation
- No more permanently wedged run records after OOM/crash
Future (#519)
- `swamp workflow status <run-id>`: query the tracker DB
- `swamp workflow list`: `SELECT * FROM active_runs` with formatting
- `swamp workflow cancel <run-id>`: set a cancellation flag that the heartbeat loop checks
- Detached runs: the tracker persists state independently of the terminal
- Multi-user visibility: all runs in the repo are queryable
Future (beyond #519)
- Run history and analytics (if rows are kept after completion)
- Concurrent run limits (check active count before starting)
- Stale workflow run detection (same pattern as model method runs)
Scope
This issue covers building the run tracker subsystem and wiring it into model method run as the first consumer:
- SQLite run tracker infrastructure (schema, open/close, migrations)
- Domain interface (RunTracker repository pattern)
- Heartbeat mechanism (interval timer during execution)
- Stale run detection and reaping
- Wire into `modelMethodRun` in `src/libswamp/models/run.ts`
- Change output YAML to write-once (terminal states only)
- Extract `isProcessDead()` to shared utility
- Tests
Wiring into workflow runs, adding CLI query commands, cancel/detach — those are #519's scope, building on this foundation.
Related issues
- #636 — model method run OOMs at 4GB V8 heap; crash leaves run stuck in "running" (the stale reaping symptom this subsystem fixes)
- #519 — persistent, queryable workflow runs (the observability/cancel features this subsystem enables)
stack72 commented 6/15/2026, 9:25:55 PM
Architecture note: the RunTracker domain interface (register, heartbeat, complete, fail, findStale, reap) must be implementation-agnostic — no SQLite types or concepts in the domain layer. The SQLite backend is the first implementation, but the interface must be swappable for a server-backed implementation (HTTP/WebSocket/gRPC) when a central server component is built. This follows the same pattern as DistributedLock (domain interface) → FileLock (local implementation) / S3 lock (remote implementation). The modelMethodRun code and heartbeat loop should depend only on the domain interface, never on SQLite directly.
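As a rough illustration of the note above, here is what a storage-agnostic interface with the six named operations might look like, together with an in-memory implementation showing that callers never see SQLite. The type names, field names, and parameter lists are all assumptions. The methods are synchronous here for brevity; a server-backed implementation would likely need Promise-returning signatures.

```typescript
// Hypothetical domain types: no SQL or SQLite concepts appear here.
interface RunRegistration {
  id: string;
  runKind: "model_method" | "workflow";
  pid: number;
  hostname: string;
}
interface TrackedRun extends RunRegistration {
  startedAtMs: number;
  heartbeatAtMs: number;
  status: "running" | "completed" | "failed";
}

// The six operations named in the note; signatures are assumed.
interface RunTracker {
  register(run: RunRegistration, nowMs: number): void;
  heartbeat(id: string, nowMs: number): void;
  complete(id: string): void;
  fail(id: string): void;
  findStale(nowMs: number, ttlMs: number): TrackedRun[];
  reap(id: string): boolean; // true if this call moved the run to failed
}

// In-memory implementation, e.g. for tests. A SQLite or server-backed
// implementation would satisfy the same interface.
class InMemoryRunTracker implements RunTracker {
  private runs = new Map<string, TrackedRun>();

  register(run: RunRegistration, nowMs: number): void {
    this.runs.set(run.id, { ...run, startedAtMs: nowMs, heartbeatAtMs: nowMs, status: "running" });
  }
  heartbeat(id: string, nowMs: number): void {
    const run = this.runs.get(id);
    if (run && run.status === "running") run.heartbeatAtMs = nowMs;
  }
  complete(id: string): void {
    this.finish(id, "completed");
  }
  fail(id: string): void {
    this.finish(id, "failed");
  }
  findStale(nowMs: number, ttlMs: number): TrackedRun[] {
    return Array.from(this.runs.values()).filter(
      (r) => r.status === "running" && nowMs - r.heartbeatAtMs > ttlMs,
    );
  }
  reap(id: string): boolean {
    return this.finish(id, "failed");
  }
  // Shared conditional transition: only running rows can change state.
  private finish(id: string, status: "completed" | "failed"): boolean {
    const run = this.runs.get(id);
    if (!run || run.status !== "running") return false;
    run.status = status;
    return true;
  }
}

// Callers depend on the interface only.
const tracker: RunTracker = new InMemoryRunTracker();
tracker.register({ id: "run-1", runKind: "model_method", pid: 4242, hostname: "dev-box" }, 0);
tracker.heartbeat("run-1", 30_000);
const staleAt100s = tracker.findStale(100_000, 90_000); // 70s since last beat
const staleAt130s = tracker.findStale(130_000, 90_000); // 100s since last beat
```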