
THE DATA LAYER

Models in Swamp do not call each other. There is no RPC, no shared memory, no event bus. Instead, every method execution produces versioned data artifacts, and any other model can reference those artifacts through CEL expressions like data.latest("model-name", "data-name").attributes.field. The data layer -- also known as The Swamp -- is the integration surface. See How Swamp Works for the broader composition model.

Why indirect coupling

Direct model-to-model calls couple the caller to the callee in three ways: A must know B exists, B must be running, and the two must agree on an interface at call time. Indirect coupling through data removes all three constraints. A model that produces a VPC ID does not know whether zero or ten other models will consume it. A model that reads a VPC ID does not know which model produced it or when. The only shared contract is the data name and its schema, which makes the system additive: wiring a new model into an existing pipeline requires no changes to the models already there.
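The additive property can be illustrated with a small sketch. This is not Swamp code: the `store` dictionary, the function names, and the `vpc-123` value are all hypothetical stand-ins, and a plain dictionary stands in for the data layer.

```python
# Hypothetical in-memory stand-in for the data layer; not Swamp's API.
store = {}  # (model_name, data_name) -> attributes


def produce_vpc():
    # The producer only writes. It holds no list of consumers.
    store[("network", "vpc")] = {"vpc_id": "vpc-123"}


def plan_subnet():
    # The consumer only reads by name. It never calls the producer.
    return {"subnet_in": store[("network", "vpc")]["vpc_id"]}


def plan_firewall():
    # Added later: neither produce_vpc nor plan_subnet had to change.
    return {"firewall_for": store[("network", "vpc")]["vpc_id"]}


produce_vpc()
subnet = plan_subnet()
firewall = plan_firewall()
```

The second consumer was wired in by reading the same data name; the only thing the three functions share is that name and the shape of the value behind it.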

Versioning and immutability

Every method execution produces a new version of its data outputs. Data is never updated in place, and previous versions are retained until garbage collection limits remove them.

Immutability buys three things. Auditability: you can always reconstruct what a model saw at any point in time. Rollback: reverting to a previous version is a read, not a write. Safe concurrency: two models writing to the same data name produce two versions rather than a race condition. data.latest() returns the most recent version; data.version() retrieves a specific one. The CEL expressions reference covers the full retrieval syntax.
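As a rough illustration of append-only versioning, the sketch below models a store where every write adds a version and reads never mutate anything. The `VersionedStore` class, its method names, and version numbers starting at 1 are assumptions made for the example, not a description of how Swamp stores data.

```python
from collections import defaultdict


class VersionedStore:
    """Append-only store: every write adds a version, nothing is overwritten."""

    def __init__(self):
        # (model, name) -> list of attribute dicts, oldest first
        self._versions = defaultdict(list)

    def write(self, model, name, attributes):
        versions = self._versions[(model, name)]
        versions.append(dict(attributes))  # copy so later caller mutations do not leak in
        return len(versions)  # assumption: version numbers start at 1

    def latest(self, model, name):
        return self._versions[(model, name)][-1]

    def version(self, model, name, number):
        if number < 1:
            raise ValueError("version numbers start at 1 in this sketch")
        return self._versions[(model, name)][number - 1]


store = VersionedStore()
v1 = store.write("network", "vpc", {"cidr": "10.0.0.0/16"})
v2 = store.write("network", "vpc", {"cidr": "10.1.0.0/16"})

# "Rollback" is just a read of an older version; nothing is rewritten.
rolled_back = store.version("network", "vpc", v1)
```

Two writers targeting the same name would each append their own entry here, which is the sense in which concurrent writes yield two versions instead of one overwritten value.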

Data lifetime and garbage collection

Not all data needs to live forever. Intermediate computation results lose value the moment the workflow completes; long-lived state needs to persist across runs. Lifetime policies -- ephemeral, job, workflow, duration, infinite -- express this distinction. swamp data gc enforces both lifetime expiry and version retention limits. Shorter lifetimes reduce storage and noise in queries; longer lifetimes preserve history and enable rollback. The data reference documents the available policies.
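The two checks that garbage collection applies, lifetime expiry and version retention, can be sketched as follows. The exact semantics of each policy are not specified here, so this example models only three of the named policies (workflow, duration, infinite) under assumed meanings; the `Artifact` fields, `is_expired`, and `gc` are hypothetical and do not reflect how `swamp data gc` is implemented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Artifact:
    name: str
    version: int
    lifetime: str                        # "workflow", "duration", or "infinite" in this sketch
    created_at: float                    # seconds on an arbitrary clock
    ttl_seconds: Optional[float] = None  # used only by "duration"
    workflow_done: bool = False          # used only by "workflow"


def is_expired(artifact, now):
    # Assumed semantics: "workflow" data expires once its workflow has finished,
    # "duration" data expires after a fixed time-to-live.
    if artifact.lifetime == "workflow":
        return artifact.workflow_done
    if artifact.lifetime == "duration":
        return now - artifact.created_at >= artifact.ttl_seconds
    return False  # "infinite" never expires by age


def gc(artifacts, now, max_versions):
    """Drop expired artifacts, then keep only the newest max_versions per name."""
    if max_versions < 1:
        raise ValueError("max_versions must be at least 1")
    live = [a for a in artifacts if not is_expired(a, now)]
    survivors = []
    for name in sorted({a.name for a in live}):
        versions = sorted((a for a in live if a.name == name), key=lambda a: a.version)
        survivors.extend(versions[-max_versions:])
    return survivors


artifacts = [
    Artifact("scratch", 1, "workflow", 0.0, workflow_done=True),
    Artifact("token", 1, "duration", 0.0, ttl_seconds=60.0),
    Artifact("vpc", 1, "infinite", 0.0),
    Artifact("vpc", 2, "infinite", 10.0),
    Artifact("vpc", 3, "infinite", 20.0),
]
kept = gc(artifacts, now=120.0, max_versions=2)
```

In this run the finished workflow's scratch data and the expired token are removed by lifetime, while the oldest `vpc` version is removed by the retention limit even though its lifetime is infinite.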

Querying across models

data.latest() retrieves a known artifact by name. But some operations need to discover data rather than reference it -- find all resources tagged with environment=production, or locate every failed result across a repository. data.query() in CEL and swamp data query on the CLI search across all stored data using CEL predicates. The query layer exists because composition is not always planned in advance: reporting, debugging, and ad-hoc inspection all require asking questions that no single model anticipated. See the data reference for query syntax and filtering.
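The difference between referencing and discovering data can be shown with a small sketch. Python lambdas stand in for CEL predicates here, and the artifact layout (`model`, `name`, `tags`, `attributes`) and the `query` helper are assumptions for illustration, not the actual query syntax.

```python
# Hypothetical artifacts produced by three unrelated models.
artifacts = [
    {"model": "web", "name": "instance",
     "tags": {"environment": "production"}, "attributes": {"status": "ok"}},
    {"model": "db", "name": "instance",
     "tags": {"environment": "staging"}, "attributes": {"status": "failed"}},
    {"model": "cache", "name": "node",
     "tags": {"environment": "production"}, "attributes": {"status": "failed"}},
]


def query(items, predicate):
    """Return every artifact for which the predicate holds, in stored order."""
    return [item for item in items if predicate(item)]


# Discovery by tag: no model or data name is given up front.
production = query(artifacts, lambda a: a["tags"].get("environment") == "production")

# Discovery by content: every failed result, whichever model produced it.
failed = query(artifacts, lambda a: a["attributes"].get("status") == "failed")
```

Neither question names a producer, which is what makes this kind of lookup useful for reporting and debugging that nobody planned for when the models were written.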

Tags and provenance

Every data artifact carries provenance metadata: which model produced it, which method, which workflow run (if any), and timestamps. Operatives can also attach user-defined tags. Provenance is automatic because manual tracking does not scale -- when an artifact looks wrong, the audit trail connects it to the exact execution that created it without anyone having planned for that investigation.
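One way to picture automatic provenance is a write path that stamps the metadata itself, so callers cannot forget it. The sketch below is only an illustration: the `Provenance` and `Artifact` classes, their field names, and the `record` helper are hypothetical and do not describe the stored format.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Provenance:
    model: str
    method: str
    workflow_run: Optional[str]  # None when the method ran outside a workflow
    created_at: datetime


@dataclass(frozen=True)
class Artifact:
    name: str
    attributes: dict
    provenance: Provenance
    tags: dict = field(default_factory=dict)


def record(model, method, name, attributes, workflow_run=None, tags=None):
    # Provenance is filled in by the write path, not by the caller.
    provenance = Provenance(model, method, workflow_run, datetime.now(timezone.utc))
    return Artifact(name, attributes, provenance, tags or {})


artifact = record(
    "network", "create", "vpc", {"vpc_id": "vpc-123"}, tags={"team": "platform"}
)
```

The user-defined tag is optional and supplied by the caller, whereas the provenance fields are always present because the write path produces them.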

Sensitive fields and redaction

Some data that passes through models is sensitive -- an API key returned by a cloud provider, a connection string with embedded credentials. Output specs can mark fields as sensitive. Sensitive fields are stored with their real values but are redacted in CLI output and logs.

Excluding sensitive fields entirely would break downstream models that need those values. Storing them only in the vault would require models to coordinate on vault paths rather than data names, re-introducing the coupling the data layer is designed to avoid. Redaction preserves the uniform interface -- downstream models read sensitive fields through data.latest() like any other field -- while keeping secrets out of terminal sessions and audit logs.
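The trade-off can be sketched as redaction applied at display time only. The `redact` helper, the `***` placeholder, and the field names below are assumptions for the example; the actual redaction format may differ.

```python
REDACTED = "***"  # hypothetical placeholder


def redact(attributes, sensitive_fields):
    """Return a display copy with sensitive values masked; the input is not modified."""
    return {
        key: (REDACTED if key in sensitive_fields else value)
        for key, value in attributes.items()
    }


stored = {"endpoint": "db.internal:5432", "password": "s3cret"}
sensitive = {"password"}

shown = redact(stored, sensitive)  # what a CLI or log line would print
used = stored["password"]          # what a downstream model would read
```

The stored value is untouched, so a downstream reader gets the real credential through the same lookup it would use for any other field, while anything printed goes through the masked copy.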