THE DATA LAYER
Models in Swamp do not call each other. There is no RPC, no shared memory, no
event bus. Instead, every method execution produces versioned data artifacts,
and any other model can reference those artifacts through CEL expressions like
data.latest("model-name", "data-name").attributes.field. The data layer --
also known as The Swamp -- is the integration surface. See
How Swamp Works for the broader
composition model.
Why indirect coupling
Direct model-to-model calls create ordering dependencies: A must know B exists, B must be running, and the two must agree on an interface at call time. Indirect coupling through data removes all three constraints. A model that produces a VPC ID does not know whether zero or ten other models will consume it. A model that reads a VPC ID does not know which model produced it or when. The only shared contract is the data name and its schema, making the system additive: wiring a new model into an existing pipeline requires no changes to the models already there.
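The additive property can be illustrated with a small Python sketch. This is not Swamp code: the `store` dictionary and the `write` and `latest` helpers are hypothetical stand-ins for the data layer, and the model names are invented for the example.

```python
from collections import defaultdict

# Hypothetical stand-in for the data layer: (model, data name) -> list of versions.
store = defaultdict(list)

def write(model, name, attributes):
    """Append a new version. The producer never learns who reads it."""
    store[(model, name)].append(dict(attributes))

def latest(model, name):
    """Return the newest version. The consumer never calls the producer."""
    return store[(model, name)][-1]

# Producer: publishes a VPC ID and is done.
write("network", "vpc", {"vpc_id": "vpc-123"})

# Two consumers wired in later. The producer code above is unchanged.
subnet_plan = {"vpc": latest("network", "vpc")["vpc_id"], "cidr": "10.0.1.0/24"}
firewall_plan = {"vpc": latest("network", "vpc")["vpc_id"], "allow": [443]}
```

The only thing the three pieces share is the pair ("network", "vpc") and the vpc_id field, which mirrors the data name and schema contract described above.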
Versioning and immutability
Every method execution produces a new version of its data outputs. Previous versions are retained until garbage collection removes them under lifetime and version retention limits. Data is never updated in place.
Immutability buys three things. Auditability: you can always reconstruct what a
model saw at any point in time. Rollback: reverting to a previous version is a
read, not a write. Safe concurrency: two models writing to the same data name
produce two distinct versions instead of racing to overwrite one value.
data.latest() returns the
most recent version; data.version() retrieves a specific one. The
CEL expressions reference covers the full
retrieval syntax.
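As a rough model of append-only versioning, the sketch below keeps every write as a new read-only version. The `VersionedData` class and its method names are assumptions made for illustration; they mirror data.latest() and data.version() in spirit only and say nothing about how Swamp stores data.

```python
from types import MappingProxyType

class VersionedData:
    """Append-only history for one data name (illustrative only)."""

    def __init__(self):
        self._versions = []  # index 0 holds version 1

    def write(self, attributes):
        # Copy, then wrap read-only so a stored version cannot be changed in place.
        self._versions.append(MappingProxyType(dict(attributes)))
        return len(self._versions)  # the new version number

    def latest(self):
        return self._versions[-1]

    def version(self, n):
        return self._versions[n - 1]

vpc = VersionedData()
v1 = vpc.write({"vpc_id": "vpc-111"})
v2 = vpc.write({"vpc_id": "vpc-222"})  # a second write adds a version, nothing is overwritten

current = vpc.latest()["vpc_id"]
previous = vpc.version(v1)["vpc_id"]   # rollback is a read of an older version
```

Because each write only appends, reading version 1 after version 2 exists needs no undo step, which is what makes rollback a read.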
Data lifetime and garbage collection
Not all data needs to live forever. Intermediate computation results lose value
the moment the workflow completes; long-lived state needs to persist across
runs. Lifetime policies -- ephemeral, job, workflow, duration, infinite --
express this distinction. swamp data gc enforces both lifetime expiry and
version retention limits. Shorter lifetimes reduce storage and noise in queries;
longer lifetimes preserve history and enable rollback. The
data reference documents the available policies.
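The two rules that garbage collection enforces, lifetime expiry and version retention, can be sketched as a single pass over a version history. Everything here is hypothetical: the `Version` fields, the `collect` function, and the `max_versions` parameter are illustration names, and only duration-style and infinite lifetimes are modeled because the other policies' exact semantics are not described in this section.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Version:
    number: int
    created_at: float             # seconds since some epoch
    ttl_seconds: Optional[float]  # None models an infinite lifetime

def collect(versions, now, max_versions):
    """Return the versions that survive one garbage-collection pass.

    Assumes max_versions is a positive integer.
    """
    # Rule 1: lifetime expiry.
    alive = [
        v for v in versions
        if v.ttl_seconds is None or now - v.created_at < v.ttl_seconds
    ]
    # Rule 2: version retention, keeping only the newest max_versions.
    alive.sort(key=lambda v: v.number)
    return alive[-max_versions:]

history = [
    Version(1, created_at=0, ttl_seconds=None),
    Version(2, created_at=100, ttl_seconds=60),
    Version(3, created_at=200, ttl_seconds=None),
    Version(4, created_at=300, ttl_seconds=3600),
]
kept = [v.number for v in collect(history, now=400, max_versions=2)]
```

In this run version 2 is dropped because it expired, and version 1 is dropped because only the two newest survivors are retained, leaving versions 3 and 4.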
Querying across models
data.latest() retrieves a known artifact by name. But some operations need to
discover data rather than reference it -- find all resources tagged with
environment=production, or locate every failed result across a repository.
data.query() in CEL and swamp data query on the CLI search across all stored
data using CEL predicates. The query layer exists because composition is not
always planned in advance: reporting, debugging, and ad-hoc inspection all
require asking questions that no single model anticipated. See the
data reference for query syntax and filtering.
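The difference between referencing and discovering can be shown with a predicate filter. In this sketch a Python lambda stands in for a CEL predicate, and the artifact dictionaries, their keys, and the `query` helper are invented for the example rather than taken from Swamp's query syntax.

```python
# Hypothetical stored artifacts from three unrelated models.
artifacts = [
    {"model": "web", "name": "instance", "tags": {"environment": "production"}, "status": "ok"},
    {"model": "db", "name": "instance", "tags": {"environment": "staging"}, "status": "failed"},
    {"model": "cache", "name": "node", "tags": {"environment": "production"}, "status": "failed"},
]

def query(items, predicate):
    """Return every artifact the predicate accepts, regardless of which model wrote it."""
    return [a for a in items if predicate(a)]

# Discovery by tag: no model or data name is specified up front.
production = [a["model"] for a in query(artifacts, lambda a: a["tags"].get("environment") == "production")]

# Discovery by outcome: every failed result across the whole collection.
failed = [a["model"] for a in query(artifacts, lambda a: a["status"] == "failed")]
```

Neither question names a producing model, which is the point: the caller asks about the data and learns afterwards where it came from.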
Tags and provenance
Every data artifact carries provenance metadata: which model produced it, which method, which workflow run (if any), and timestamps. Operatives can also attach user-defined tags. Provenance is automatic because manual tracking does not scale -- when an artifact looks wrong, the audit trail connects it to the exact execution that created it without anyone having planned for that investigation.
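One way to picture automatic provenance is a write path that stamps the metadata itself, so callers cannot forget it. The `Provenance` and `Artifact` classes, their field names, and the `record` function below are assumptions for illustration, not Swamp's actual metadata schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass(frozen=True)
class Provenance:
    model: str
    method: str
    workflow_run: Optional[str]  # None when the method ran outside a workflow
    created_at: datetime

@dataclass(frozen=True)
class Artifact:
    name: str
    attributes: dict
    provenance: Provenance
    tags: dict = field(default_factory=dict)  # user-defined, optional

def record(model, method, name, attributes, workflow_run=None, tags=None):
    """Stamp provenance at write time so no caller supplies it by hand."""
    prov = Provenance(model, method, workflow_run, datetime.now(timezone.utc))
    return Artifact(name, dict(attributes), prov, dict(tags or {}))

artifact = record(
    "network", "create", "vpc", {"vpc_id": "vpc-123"},
    tags={"environment": "production"},
)
```

When an artifact looks wrong, the provenance field alone identifies the producing model, method, and time, without the writer having anticipated the investigation.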
Sensitive fields and redaction
Some data that passes through models is sensitive -- an API key returned by a cloud provider, a connection string with embedded credentials. Output specs can mark fields as sensitive. Sensitive fields are stored but redacted from CLI output and logs.
Excluding sensitive fields entirely would break downstream models that need
those values. Storing them only in the vault would
require models to coordinate on vault paths rather than data names,
re-introducing the coupling the data layer is designed to avoid. Redaction
preserves the uniform interface -- downstream models read sensitive fields
through data.latest() like any other field -- while keeping secrets out of
terminal sessions and audit logs.
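The store-but-redact behavior can be sketched as a display-time mask applied to a copy of the data. The `redact` helper, the "***" placeholder, and the field names are hypothetical; how Swamp marks and renders sensitive fields is defined by its output specs, not by this example.

```python
REDACTED = "***"

def redact(attributes, sensitive_fields):
    """Return a display copy with sensitive values masked. The input is not modified."""
    return {
        key: REDACTED if key in sensitive_fields else value
        for key, value in attributes.items()
    }

# Hypothetical stored output with one field marked sensitive.
stored = {"endpoint": "db.internal:5432", "password": "s3cret"}
sensitive = {"password"}

shown = redact(stored, sensitive)  # what a CLI or log line would print
used = stored["password"]          # what a downstream model still reads
```

The stored value stays intact for consumers while the printed form is masked, so the read path is the same for sensitive and non-sensitive fields.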