DATASTORE ARCHITECTURE
Every model method execution, workflow run, audit event, cached bundle, and encrypted secret passes through one datastore. A single persistence layer makes backup, migration, and sharing atomic operations rather than ad-hoc scripts that chase files across scattered directories.
Source-of-truth definitions -- model files, workflow files, vault configs --
stay in git-tracked directories (models/, workflows/, vaults/) so that
version control and code review apply to them. The datastore holds only runtime
artifacts: the outputs those definitions produce when executed. See
How Swamp Works for the broader execution
model.
Why three backend options
The default backend stores data in .swamp/ inside the repository -- no
configuration, no network, no external dependencies. This is the right choice
for a single operative working on one machine.
An external filesystem path (such as an NFS mount) enables multiple machines to share state without cloud dependencies. The tradeoff is operational: shared filesystems introduce network latency and require mount management.
An S3-compatible backend enables geo-distributed access with object versioning.
It is itself a Swamp extension (@swamp/s3-datastore), not a built-in -- the
datastore system is extensible by design. See the
datastore configuration reference
for backend setup.
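One way to picture "extensible by design" is a small storage contract that every backend satisfies. The sketch below is illustrative only: Backend, DirectoryBackend, and store_run_output are hypothetical names, not Swamp's actual extension API. It assumes the default backend and an external filesystem path differ only in the root directory they point at.

```python
import tempfile
from pathlib import Path
from typing import Protocol


class Backend(Protocol):
    """Hypothetical minimal contract a storage backend would satisfy."""

    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...
    def list_keys(self) -> list[str]: ...


class DirectoryBackend:
    """Stores each artifact as a file under a root directory.

    Assumption: the same class could serve the default location (.swamp/ in
    the repository) and an external path such as an NFS mount.
    """

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, key: str, data: bytes) -> None:
        path = self.root / key
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_bytes(data)

    def get(self, key: str) -> bytes:
        return (self.root / key).read_bytes()

    def list_keys(self) -> list[str]:
        return sorted(
            p.relative_to(self.root).as_posix()
            for p in self.root.rglob("*")
            if p.is_file()
        )


def store_run_output(backend: Backend, key: str, data: bytes) -> None:
    """Calling code depends only on the contract, not on the storage kind."""
    backend.put(key, data)


with tempfile.TemporaryDirectory() as tmp:
    local = DirectoryBackend(Path(tmp) / ".swamp")
    store_run_output(local, "deploy/run-1.json", b'{"ok": true}')
    stored_keys = local.list_keys()
    round_trip = local.get("deploy/run-1.json")
```

Under this assumption, an S3-style backend would be another class with the same three methods, which is why it can ship as an extension rather than a built-in.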
The catalog database
On top of whatever backend stores the raw files, the datastore maintains a
SQLite catalog that indexes all artifacts. The catalog exists because scanning a
filesystem (or making S3 LIST calls) for every data.query() or data.latest()
would be prohibitively slow. Indexed lookup through SQLite keeps those
operations at millisecond scale even as the number of artifacts grows.
The catalog is a cache, not a source of truth. If it is deleted or corrupted, the datastore rebuilds it from the underlying storage. Because it can always be rebuilt, the catalog is never synced to remote backends -- each machine builds its own from whatever data it has locally.
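The rebuild-from-storage idea can be sketched with Python's built-in sqlite3 module. Everything specific here is an assumption made for illustration: the directory layout (model/name/version.json), the table schema, and the function names rebuild_catalog and latest_version are not Swamp's real catalog format.

```python
import sqlite3
import tempfile
from pathlib import Path
from typing import Optional


def rebuild_catalog(storage_root: Path, catalog_path: Path) -> sqlite3.Connection:
    """Recreate the index from scratch by walking the stored files."""
    catalog_path.unlink(missing_ok=True)  # discard any stale or corrupt catalog
    conn = sqlite3.connect(catalog_path)
    conn.execute(
        "CREATE TABLE artifacts (model TEXT, name TEXT, version INTEGER, path TEXT)"
    )
    conn.execute("CREATE INDEX idx_latest ON artifacts (model, name, version)")
    # Assumed layout: <root>/<model>/<name>/<version>.json
    for f in storage_root.glob("*/*/*.json"):
        model, name = f.parts[-3], f.parts[-2]
        conn.execute(
            "INSERT INTO artifacts VALUES (?, ?, ?, ?)",
            (model, name, int(f.stem), str(f)),
        )
    conn.commit()
    return conn


def latest_version(conn: sqlite3.Connection, model: str, name: str) -> Optional[int]:
    """Indexed lookup instead of a directory scan."""
    row = conn.execute(
        "SELECT version FROM artifacts WHERE model = ? AND name = ? "
        "ORDER BY version DESC LIMIT 1",
        (model, name),
    ).fetchone()
    return row[0] if row else None


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "data"
    for version in (1, 2, 10):
        f = root / "scan" / "report" / f"{version}.json"
        f.parent.mkdir(parents=True, exist_ok=True)
        f.write_text("{}")
    catalog = Path(tmp) / "catalog.db"

    conn = rebuild_catalog(root, catalog)
    before = latest_version(conn, "scan", "report")
    conn.close()

    catalog.unlink()  # simulate a lost catalog
    conn = rebuild_catalog(root, catalog)
    after = latest_version(conn, "scan", "report")
    conn.close()
```

Deleting the catalog loses nothing in this sketch: the second rebuild answers the same query with the same result, because the files are the only source of truth.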
Why lazy hydration is the default
When using an S3 backend, the datastore must decide what to download. Eager hydration (downloading everything on startup) guarantees all data is available locally, but most commands touch a small fraction of the total data. Downloading gigabytes of historical model outputs to run one method wastes bandwidth and startup time.
Lazy hydration downloads metadata for catalog visibility but defers content
retrieval until something actually reads the artifact. The tradeoff is explicit:
queries that filter on attributes content may see null for un-hydrated data.
This is a documented limitation -- the alternative (eager sync) makes the
opposite trade, paying bandwidth and startup time for complete query results.
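A minimal sketch of the lazy behavior, under stated assumptions: Artifact, LazyStore, and the fetch callback are hypothetical names, and the choice to let un-hydrated artifacts simply fail a content filter is one plausible reading of "may see null", not a description of Swamp's query engine.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Artifact:
    name: str
    size: int                          # metadata, available up front
    attributes: Optional[dict] = None  # content, None until hydrated


class LazyStore:
    def __init__(self, fetch: Callable[[str], dict]) -> None:
        self._fetch = fetch            # stand-in for a remote download
        self._artifacts: dict[str, Artifact] = {}
        self.downloads = 0

    def register(self, name: str, size: int) -> None:
        """Record metadata only; no content is transferred."""
        self._artifacts[name] = Artifact(name, size)

    def read(self, name: str) -> dict:
        """Hydrate on first read, then serve from the local copy."""
        artifact = self._artifacts[name]
        if artifact.attributes is None:
            artifact.attributes = self._fetch(name)
            self.downloads += 1
        return artifact.attributes

    def query(self, key: str, value: object) -> list[str]:
        """Filter on content without hydrating; un-hydrated artifacts never match."""
        return [
            a.name
            for a in self._artifacts.values()
            if a.attributes is not None and a.attributes.get(key) == value
        ]


remote = {"a": {"status": "ok"}, "b": {"status": "ok"}}
store = LazyStore(remote.__getitem__)
store.register("a", size=14)
store.register("b", size=14)

before_read = store.query("status", "ok")  # nothing hydrated yet
store.read("a")
store.read("a")                            # second read hits the local copy
after_read = store.query("status", "ok")
```

Both artifacts match the filter remotely, but the query only sees the one that has been read. That gap is the limitation the text describes, and a single download instead of two is what it buys.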
Per-model locking
When multiple machines share a datastore, concurrent writes to the same model must be serialized. The locking scope is per-model rather than global because a workflow running a deploy model should not block a separate scan model from executing concurrently on another machine. Per-model locks keep contention proportional to actual conflicts.
Global locks exist for structural operations -- garbage collection, model deletion, datastore migration -- where cross-model consistency matters. These are rare and short-lived by design. For why fan-out methods reduce lock contention within a single model, see How Swamp Works.
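The two lock scopes can be illustrated in a single process with a condition variable. This is a stand-in only: LockManager and its methods are hypothetical, real cross-machine locking would have to live in the shared backend, and the sketch ignores fairness (a steady stream of model locks could starve a structural operation).

```python
import threading
from contextlib import contextmanager
from typing import Optional


class LockManager:
    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._held_models: set[str] = set()
        self._global_held = False

    @contextmanager
    def model(self, name: str, timeout: Optional[float] = None):
        """Exclusive per model; different models proceed concurrently."""
        with self._cond:
            free = self._cond.wait_for(
                lambda: not self._global_held and name not in self._held_models,
                timeout,
            )
            if not free:
                raise TimeoutError(f"model {name!r} is locked")
            self._held_models.add(name)
        try:
            yield
        finally:
            with self._cond:
                self._held_models.discard(name)
                self._cond.notify_all()

    @contextmanager
    def structural(self, timeout: Optional[float] = None):
        """Global lock: waits until no model lock is held, then blocks all."""
        with self._cond:
            free = self._cond.wait_for(
                lambda: not self._global_held and not self._held_models,
                timeout,
            )
            if not free:
                raise TimeoutError("datastore is busy")
            self._global_held = True
        try:
            yield
        finally:
            with self._cond:
                self._global_held = False
                self._cond.notify_all()


locks = LockManager()
scan_ran = deploy_reentered = gc_during = gc_after = False

with locks.model("deploy"):
    with locks.model("scan"):          # different model: no contention
        scan_ran = True
    try:
        with locks.model("deploy", timeout=0.05):  # same model: must wait
            deploy_reentered = True
    except TimeoutError:
        pass
    try:
        with locks.structural(timeout=0.05):       # blocked by the model lock
            gc_during = True
    except TimeoutError:
        pass

with locks.structural():               # all model locks released: proceeds
    gc_after = True
```

The scan lock is granted while deploy is held, whereas a second deploy lock and the structural lock both time out, which mirrors the claim that contention stays proportional to actual conflicts.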
Namespace partitioning
Namespaces partition a shared datastore by repository. Each namespace gets its own data, definitions, and workflow runs while sharing the underlying storage backend. This is how Giga-Swamp repositories enable multiple teams to work against a single datastore without data collisions. See Giga-Swamp for the full namespace design and the isolation guarantees it provides.
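At its simplest, partitioning by namespace can be modeled as key prefixing over one shared store. The sketch below assumes exactly that; NamespacedStore, the repository names, and the key layout are hypothetical, and the real isolation guarantees are the ones described in the Giga-Swamp documentation, not whatever this toy provides.

```python
class NamespacedStore:
    """View of a shared key-value store restricted to one namespace."""

    def __init__(self, shared: dict, namespace: str) -> None:
        if not namespace or "/" in namespace:
            raise ValueError("namespace must be a non-empty name without '/'")
        self._shared = shared
        self._prefix = f"{namespace}/"

    def put(self, key: str, value: bytes) -> None:
        self._shared[self._prefix + key] = value

    def get(self, key: str) -> bytes:
        return self._shared[self._prefix + key]

    def keys(self) -> list[str]:
        """Only keys in this namespace, with the prefix stripped."""
        return sorted(
            k[len(self._prefix):]
            for k in self._shared
            if k.startswith(self._prefix)
        )


shared_backend: dict = {}  # stand-in for the shared storage backend
platform = NamespacedStore(shared_backend, "platform-repo")
security = NamespacedStore(shared_backend, "security-repo")

# Both repositories use the same logical key without colliding.
platform.put("runs/deploy/latest", b"platform output")
security.put("runs/deploy/latest", b"security output")

platform_view = platform.keys()
platform_value = platform.get("runs/deploy/latest")
security_value = security.get("runs/deploy/latest")
```

Each view sees only its own keys, while the backend holds both entries side by side.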
Secrets follow a different path entirely -- they pass through the vault system, which manages encryption and access control independently of the datastore backend. See the vaults reference for how secrets are stored and scoped.