DATASTORE ARCHITECTURE

Every model method execution, workflow run, audit event, cached bundle, and encrypted secret passes through one datastore. A single persistence layer makes backup, migration, and sharing atomic operations rather than ad-hoc scripts that chase files across scattered directories.

Source-of-truth definitions -- model files, workflow files, vault configs -- stay in git-tracked directories (models/, workflows/, vaults/) so that version control and code review apply to them. The datastore holds only runtime artifacts: the outputs those definitions produce when executed. See How Swamp Works for the broader execution model.

Why three backend options

The default backend stores data in .swamp/ inside the repository -- no configuration, no network, no external dependencies. This is the right choice for a single operative working on one machine.

An external filesystem path (such as an NFS mount) enables multiple machines to share state without cloud dependencies. The tradeoff is operational: shared filesystems introduce network latency and require mount management.

An S3-compatible backend enables geo-distributed access with object versioning. It is itself a Swamp extension (@swamp/s3-datastore), not a built-in -- the datastore system is extensible by design. See the datastore configuration reference for backend setup.
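One way to picture the extensibility described above is a small storage contract that every backend satisfies. The sketch below is illustrative only: `Backend`, `FilesystemBackend`, and `roundtrip` are hypothetical names, not Swamp's actual extension API. It assumes a backend only needs to put and get bytes by key.

```python
from pathlib import Path
from typing import Protocol


class Backend(Protocol):
    """Hypothetical minimal contract a storage backend might satisfy."""

    def put(self, key: str, data: bytes) -> None: ...

    def get(self, key: str) -> bytes: ...


class FilesystemBackend:
    """Stores each artifact as a file under a root directory.

    The same class covers a repo-local directory and a shared mount,
    because both are just filesystem paths.
    """

    def __init__(self, root: Path) -> None:
        self.root = root

    def put(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()


def roundtrip(backend: Backend, key: str, data: bytes) -> bytes:
    """Write then read through whichever backend was supplied."""
    backend.put(key, data)
    return backend.get(key)
```

Under this assumption, an object-store backend would implement the same two methods against a bucket, and callers such as `roundtrip` would not change.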

The catalog database

On top of whatever backend stores the raw files, the datastore maintains a SQLite catalog that indexes all artifacts. The catalog exists because scanning a filesystem (or making S3 LIST calls) for every data.query() or data.latest() would be prohibitively slow. Indexed lookup through SQLite turns those operations into millisecond queries regardless of how many artifacts exist.
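To illustrate why an index helps, here is a minimal sketch of a "latest version" lookup answered from SQLite instead of a directory scan. The table layout and the helper names (`open_catalog`, `record`, `latest`) are assumptions made for this example; the real catalog schema is not described here.

```python
import sqlite3
from typing import Optional


def open_catalog(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a catalog with a hypothetical artifacts table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS artifacts (
            model   TEXT    NOT NULL,
            name    TEXT    NOT NULL,
            version INTEGER NOT NULL,
            path    TEXT    NOT NULL,
            PRIMARY KEY (model, name, version)  -- doubles as the lookup index
        )
        """
    )
    return conn


def record(conn: sqlite3.Connection, model: str, name: str,
           version: int, path: str) -> None:
    """Index one artifact that already exists in the storage backend."""
    with conn:  # commits on success
        conn.execute(
            "INSERT OR REPLACE INTO artifacts VALUES (?, ?, ?, ?)",
            (model, name, version, path),
        )


def latest(conn: sqlite3.Connection, model: str, name: str) -> Optional[str]:
    """Return the storage path of the highest version, or None if absent."""
    row = conn.execute(
        "SELECT path FROM artifacts WHERE model = ? AND name = ? "
        "ORDER BY version DESC LIMIT 1",
        (model, name),
    ).fetchone()
    return row[0] if row else None
```

Because the primary key covers `(model, name, version)`, SQLite can answer `latest` from the index without reading every row.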

The catalog is a cache, not a source of truth. If it is deleted or corrupted, the datastore rebuilds it from the underlying storage. Because it can always be rebuilt, the catalog is never synced to remote backends -- each machine builds its own from whatever data it has locally.
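The rebuild-from-storage idea can be sketched as a function that discards the catalog file and repopulates it by walking the data directory. This is an assumption-laden illustration: `rebuild_catalog`, the two-column table, and the requirement that the catalog file live outside the data directory are all choices made for this example, not Swamp's documented behavior.

```python
import sqlite3
from pathlib import Path


def rebuild_catalog(data_root: Path, catalog_path: Path) -> int:
    """Recreate the catalog from files on disk; return the number indexed.

    Assumes catalog_path is located outside data_root so the catalog
    does not index itself.
    """
    # Throw away whatever is there, whether it is healthy or corrupted.
    catalog_path.unlink(missing_ok=True)

    rows = [
        (p.relative_to(data_root).as_posix(), p.stat().st_size)
        for p in sorted(data_root.rglob("*"))
        if p.is_file()
    ]

    conn = sqlite3.connect(catalog_path)
    try:
        with conn:  # one transaction for table creation and inserts
            conn.execute(
                "CREATE TABLE artifacts ("
                "path TEXT PRIMARY KEY, size INTEGER NOT NULL)"
            )
            conn.executemany("INSERT INTO artifacts VALUES (?, ?)", rows)
    finally:
        conn.close()
    return len(rows)
```

Since the function depends only on local files, each machine can run it independently, which matches the point that the catalog never needs to be synced.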

Why lazy hydration is the default

When using an S3 backend, the datastore must decide what to download. Eager hydration (downloading everything on startup) guarantees all data is available locally, but most commands touch a small fraction of the total data. Downloading gigabytes of historical model outputs to run one method wastes bandwidth and startup time.

Lazy hydration downloads metadata for catalog visibility but defers content retrieval until something actually reads the artifact. The tradeoff is explicit: queries filtering on attributes content may see null for un-hydrated data. This is a documented limitation -- the alternative (eager sync) makes the opposite trade, paying bandwidth and startup time so that every query sees complete data.
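A minimal sketch of the lazy pattern, assuming an artifact record that carries metadata up front and a downloader that is called only on first read. `LazyArtifact`, `peek`, and the `fetch` callback are hypothetical names for illustration; they are not part of Swamp's API.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class LazyArtifact:
    key: str
    size: int                       # metadata: known without downloading
    fetch: Callable[[str], bytes]   # hypothetical downloader, e.g. a bucket GET
    _content: Optional[bytes] = field(default=None, repr=False)

    @property
    def hydrated(self) -> bool:
        return self._content is not None

    def peek(self) -> Optional[bytes]:
        """Return content only if already local; never triggers a download.

        This is the view a filter sees: None for un-hydrated data.
        """
        return self._content

    def read(self) -> bytes:
        """Download on first access, then serve the cached bytes."""
        if self._content is None:
            self._content = self.fetch(self.key)
        return self._content
```

The `peek` method makes the limitation concrete: anything that inspects content without forcing a read gets `None` until some earlier call has hydrated the artifact.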

Per-model locking

When multiple machines share a datastore, concurrent writes to the same model must be serialized. The locking scope is per-model rather than global because a workflow running a deploy model should not block a separate scan model from executing concurrently on another machine. Per-model locks keep contention proportional to actual conflicts.
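The scoping argument can be shown with an in-process stand-in: one lock per model name, created on demand. This is only a sketch of the scoping idea; a datastore shared across machines would need a lock held in the shared storage itself, and `ModelLocks` with its methods is a hypothetical name.

```python
import threading
from collections import defaultdict
from typing import DefaultDict


class ModelLocks:
    """One lock per model name, so unrelated models never contend."""

    def __init__(self) -> None:
        self._guard = threading.Lock()  # protects the dictionary itself
        self._locks: DefaultDict[str, threading.Lock] = defaultdict(threading.Lock)

    def _lock_for(self, model: str) -> threading.Lock:
        with self._guard:
            return self._locks[model]

    def try_acquire(self, model: str) -> bool:
        """Take the model's lock without blocking; False if already held."""
        return self._lock_for(model).acquire(blocking=False)

    def release(self, model: str) -> None:
        self._lock_for(model).release()
```

With this shape, holding the lock for `deploy` leaves `scan` free, while a second writer to `deploy` is turned away until the first releases.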

Global locks exist for structural operations -- garbage collection, model deletion, datastore migration -- where cross-model consistency matters. These are rare and short-lived by design. For why fan-out methods reduce lock contention within a single model, see How Swamp Works.
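The relationship between global and per-model locks resembles a shared/exclusive gate: model operations enter in shared mode, and a structural operation needs the gate to itself. The sketch below is an in-process illustration under that assumption. `StructuralGate` is a hypothetical name, and the non-blocking "try" style is chosen to keep the example short, not because Swamp is known to work this way.

```python
import threading


class StructuralGate:
    """Many model operations may run together; a structural one runs alone."""

    def __init__(self) -> None:
        self._mutex = threading.Lock()
        self._active_model_ops = 0
        self._structural_running = False

    def try_begin_model_op(self) -> bool:
        """Enter in shared mode unless a structural operation is running."""
        with self._mutex:
            if self._structural_running:
                return False
            self._active_model_ops += 1
            return True

    def end_model_op(self) -> None:
        with self._mutex:
            self._active_model_ops -= 1

    def try_begin_structural(self) -> bool:
        """Enter in exclusive mode only when nothing else is in flight."""
        with self._mutex:
            if self._structural_running or self._active_model_ops > 0:
                return False
            self._structural_running = True
            return True

    def end_structural(self) -> None:
        with self._mutex:
            self._structural_running = False
```

Keeping structural operations rare and short matters in this model because every model operation is refused while one is running.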

Namespace partitioning

Namespaces partition a shared datastore by repository. Each namespace gets its own data, definitions, and workflow runs while sharing the underlying storage backend. This is how Giga-Swamp repositories enable multiple teams to work against a single datastore without data collisions. See Giga-Swamp for the full namespace design and the isolation guarantees it provides.
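One simple way to get this kind of partitioning is to prefix every storage key with the namespace. The sketch below assumes that approach purely for illustration; `namespaced_key` is a hypothetical helper, and the actual layout is covered in the Giga-Swamp documentation.

```python
def namespaced_key(namespace: str, key: str) -> str:
    """Prefix a storage key with its namespace.

    Rejects empty namespaces and ones containing '/', so one namespace
    cannot be made to look like a path inside another.
    """
    if not namespace or "/" in namespace:
        raise ValueError(f"invalid namespace: {namespace!r}")
    return f"{namespace}/{key.lstrip('/')}"
```

Two repositories writing the same relative key then land on different objects in the shared backend, for example `team-a/data/result.json` and `team-b/data/result.json`.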

Secrets follow a different path entirely -- they pass through the vault system, which manages encryption and access control independently of the datastore backend. See the vaults reference for how secrets are stored and scoped.