DATA LIFETIMES
Every data artifact in Swamp has a lifetime that determines when it becomes eligible for garbage collection. The choice is not about storage mechanics — it expresses what the data means and how long it remains true.
The spectrum
Five lifetimes span from no durability to permanent retention:
| Lifetime | Survives process exit | Survives workflow end | Survives indefinitely |
|---|---|---|---|
| ephemeral | No | No | No |
| job | Yes | No | No |
| workflow | Yes | No | No |
| duration | Yes | Yes | No |
| infinite | Yes | Yes | Yes |
The first durability column, "Survives process exit", is the important one. Ephemeral data exists only in memory for the duration of the process. Everything else is written to disk and survives restarts. For disk-backed data, the question is how long it stays before garbage collection removes it.
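As a rough illustration, the table can be encoded as a lookup. This is a minimal sketch for reasoning about the spectrum, not Swamp's implementation; the Durability type, the DURABILITY mapping, and is_disk_backed are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Durability:
    survives_process_exit: bool
    survives_workflow_end: bool
    survives_indefinitely: bool


# One entry per row of the table above (hypothetical encoding).
DURABILITY = {
    "ephemeral": Durability(False, False, False),
    "job": Durability(True, False, False),
    "workflow": Durability(True, False, False),
    "duration": Durability(True, True, False),
    "infinite": Durability(True, True, True),
}


def is_disk_backed(lifetime: str) -> bool:
    # Only ephemeral data lives purely in memory; everything else
    # survives a process exit because it is written to disk.
    return DURABILITY[lifetime].survives_process_exit
```

Note that job and workflow have identical rows in this view; they differ in scope rather than in durability.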
When ephemeral fits
Ephemeral data is for values that are true right now and meaningless moments later. The canonical examples are dynamic infrastructure inventories: the set of running Kubernetes pods, the current members of an AWS Auto Scaling group, the active containers in a Docker Swarm service. These change constantly. Storing yesterday's pod list does not tell you what is running today — it tells you what was running, which is a different question served by audit logs, not by data artifacts.
The same reasoning applies to intermediate computation results in multi-step workflows. When step one fetches raw API responses and step two aggregates them, the raw responses have served their purpose once the aggregation is complete. Persisting them creates storage debt and noise in queries without adding value.
The tradeoff is durability. If the process crashes mid-workflow, ephemeral data
is lost and the workflow cannot resume from where it stopped — it must restart
from scratch. For workflows where each step takes seconds, this is acceptable.
For workflows with expensive steps (long-running scans, rate-limited API calls),
consider workflow lifetime instead, which survives crashes but is still
cleaned up when the run ends.
Ephemeral data has a memory budget — 512 MB by default, configurable via
SWAMP_EPHEMERAL_BUDGET. This prevents a misbehaving step from consuming all
available memory. The budget is a signal: if your intermediate data routinely
approaches it, the data may be too large for in-memory storage and a disk-backed
lifetime (workflow or a short duration) may be more appropriate.
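The budget behavior described above can be sketched as a small in-memory store that rejects writes past a byte limit. This is an illustrative assumption, not Swamp's code: EphemeralStore, BudgetExceeded, and utilization are hypothetical names, and the default treats 512 MB as 512 * 1024 * 1024 bytes. How SWAMP_EPHEMERAL_BUDGET is parsed is not shown because its format is not specified here.

```python
class BudgetExceeded(Exception):
    """Raised when a write would push the store past its memory budget."""


class EphemeralStore:
    def __init__(self, budget_bytes: int = 512 * 1024 * 1024):
        # Assumption: the budget is a plain byte count.
        self.budget_bytes = budget_bytes
        self.used_bytes = 0
        self._items: dict[str, bytes] = {}

    def put(self, key: str, payload: bytes) -> None:
        # Overwriting a key releases the bytes held by the old value.
        previous = len(self._items.get(key, b""))
        projected = self.used_bytes - previous + len(payload)
        if projected > self.budget_bytes:
            raise BudgetExceeded(
                f"{projected} bytes would exceed budget of {self.budget_bytes}"
            )
        self._items[key] = payload
        self.used_bytes = projected

    def utilization(self) -> float:
        # A value that is routinely close to 1.0 suggests a disk-backed lifetime.
        return self.used_bytes / self.budget_bytes
```

Tracking utilization alongside the hard limit reflects the point that the budget is a signal as well as a safeguard.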
When job and workflow fit
job and workflow lifetimes scope data to the execution that produced it.
They are written to disk, so they survive process crashes and support workflow
resume. Once the workflow run is cleaned up, the data goes with it.
These lifetimes suit artifacts that need to be durable within a run but have no value between runs. Test results within a CI pipeline, deployment manifests generated for a specific release, temporary credentials rotated per execution — all are tied to a specific run and should not outlive it.
The distinction between job and workflow matters in multi-job workflows. A
job-scoped artifact is garbage-collected when its job completes, even if the
workflow continues. A workflow-scoped artifact survives until the entire run
ends. Choose job when a later job in the same workflow should not see the
data; choose workflow when it should.
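The visibility difference can be sketched as follows. This is a simplified model that assumes job-scoped artifacts are collected as soon as their producing job completes; the Artifact type and visible_artifacts function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Artifact:
    name: str
    lifetime: str       # "job" or "workflow"
    producer_job: str   # job that wrote the artifact


def visible_artifacts(artifacts: list[Artifact], completed_jobs: set[str]) -> list[str]:
    """Names of artifacts a later job in the same run could still read."""
    return [
        a.name
        for a in artifacts
        # Job-scoped data is gone once its producing job has completed;
        # workflow-scoped data stays until the whole run ends.
        if not (a.lifetime == "job" and a.producer_job in completed_jobs)
    ]


artifacts = [
    Artifact("test-results", "job", "build"),
    Artifact("manifest", "workflow", "build"),
]
after_build = visible_artifacts(artifacts, {"build"})
```

In this example a deploy job that runs after build would see manifest but not test-results.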
When duration fits
Duration lifetimes (1h, 7d, 30d, 1y) express data that is valid for a
bounded window. The window is wall-clock time from creation, not tied to any
execution.
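To make the wall-clock rule concrete, here is a minimal sketch of how a duration string could be turned into an expiry time. It is an assumption for illustration only: it handles just the h, d, and y suffixes shown above, treats a year as 365 days, and the function names are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Assumed unit table; a year is approximated as 365 days.
_UNITS = {
    "h": timedelta(hours=1),
    "d": timedelta(days=1),
    "y": timedelta(days=365),
}


def parse_duration(text: str) -> timedelta:
    match = re.fullmatch(r"(\d+)([hdy])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    count, unit = match.groups()
    return int(count) * _UNITS[unit]


def is_expired(created_at: datetime, lifetime: str, now: datetime) -> bool:
    # Expiry is measured from creation time, independent of any workflow run.
    return now >= created_at + parse_duration(lifetime)
```

Because the check depends only on the creation time and the current time, it gives the same answer regardless of which run, if any, produced the data.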
This fits reporting and compliance use cases. A weekly security scan produces a
report that is actionable for about seven days; after the next scan runs, the
old report is superseded. Setting lifetime: 7d means the report is available
for review throughout the week and becomes eligible for cleanup seven days after
creation, around the time the next one arrives. No manual deletion, no unbounded
storage growth.
Duration also fits caching patterns. A model that resolves DNS records might
cache results with lifetime: 1h. Downstream models that read the cached data
within the hour get a fast lookup; after the hour, the stale data is collected
and the next query triggers a fresh resolution.
The choice of duration is a judgement call that depends on how quickly the data loses relevance. Shorter durations reduce storage cost and query noise. Longer durations preserve history for rollback and investigation. There is no universal right answer — it depends on the domain.
When infinite fits
infinite lifetime means the data never expires on its own. It persists
until someone explicitly removes it or the repository is destroyed; only older
versions of the same output are subject to pruning, as described below.
This is the right choice for data that serves as a historical record or a long-lived configuration artifact. The result of a model that provisions a VPC should be infinite — the VPC ID is referenced by other models indefinitely, and deleting the data would break those references. Audit logs, compliance evidence, and configuration baselines all belong at infinite lifetime.
The cost is storage. Every method execution that writes to an infinite-lifetime
output creates a new version that is retained until garbage collection prunes
old versions (controlled by the garbageCollection policy). For models that run
frequently, this can accumulate. The garbage collection policy — "keep the 10
most recent versions" or "keep versions from the last 30 days" — provides the
counterbalance, keeping storage bounded while preserving recent history.
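The two policy shapes quoted above can be sketched as simple pruning functions. This is an illustration under assumptions, not the garbageCollection policy's actual implementation; versions are modeled as hypothetical (version_id, created_at) pairs.

```python
from datetime import datetime, timedelta

Version = tuple[int, datetime]  # (version_id, created_at), a hypothetical shape


def prune_by_count(versions: list[Version], keep: int) -> list[int]:
    """Return ids to delete so that only the `keep` newest versions remain."""
    newest_first = sorted(versions, key=lambda v: v[1], reverse=True)
    return [version_id for version_id, _ in newest_first[keep:]]


def prune_by_age(versions: list[Version], max_age: timedelta, now: datetime) -> list[int]:
    """Return ids of versions created more than `max_age` before `now`."""
    cutoff = now - max_age
    return [version_id for version_id, created in versions if created < cutoff]
```

A count-based policy bounds storage regardless of run frequency, while an age-based policy bounds how far back history reaches; which is preferable depends on how often the model runs.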
Infinite is the default for many model types, including command/shell. This is
deliberate: it is better to retain data that turns out to be unnecessary than to
discard data that turns out to be needed. When you know a particular use case
produces disposable data, override the lifetime explicitly rather than relying
on the default.
Mixing lifetimes in a workflow
A single workflow can mix lifetimes across steps using dataOutputOverrides.
This is the common pattern for pipelines that produce both intermediate and
final outputs:
- Early steps (data fetching, raw API responses): ephemeral or workflow
- Middle steps (transformation, aggregation): ephemeral
- Final steps (reports, notifications, provisioned resources): infinite or a duration
The override applies per step and per output spec, so a step that produces both a resource and a log can set them to different lifetimes. See the how-to guide for the mechanics.
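The per-step, per-output-spec resolution can be pictured as a lookup with a fallback to the model's default. The sketch below is conceptual only: it does not reproduce the real dataOutputOverrides syntax, and the step names, spec names, and effective_lifetime function are hypothetical.

```python
# Hypothetical overrides keyed by (step name, output spec name).
overrides = {
    ("fetch", "response"): "ephemeral",
    ("report", "resource"): "infinite",
    ("report", "log"): "7d",
}


def effective_lifetime(model_default: str, step: str, spec: str) -> str:
    # An override wins when present; otherwise the model's default applies.
    return overrides.get((step, spec), model_default)


report_log = effective_lifetime("infinite", "report", "log")
aggregate_result = effective_lifetime("infinite", "aggregate", "result")
```

Here the report step keeps its resource indefinitely while its log expires after seven days, and the aggregate step, which has no override, falls back to the default.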
Deferred resolution and ephemeral data
Ephemeral data introduces a timing constraint that does not apply to
disk-persisted data. Because ephemeral data exists only in memory, it must be
read while the process is still running. In a workflow, this means the
downstream step's data.latest() expression must be evaluated at step execution
time — after the upstream step has written its output — not during the initial
workflow evaluation pass, when the data does not yet exist.
The workflow engine handles this automatically. Data functions (data.latest(),
data.query(), data.findByTag(), and others) in step task.inputs are
treated as deferred step-output dependencies. The engine skips them during
evaluation and resolves them when the step is about to run. This is the same
mechanism that makes non-ephemeral step-to-step data flow work, but it is
essential for ephemeral data — without deferral, the expression would evaluate
against an empty store and fail.
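The deferral idea can be sketched with a toy engine that postpones data lookups until each step runs. This is a simplified model, not the Swamp workflow engine: Deferred, latest, and run_workflow are hypothetical stand-ins, and the in-memory dict plays the role of ephemeral storage.

```python
from typing import Any, Callable


class Deferred:
    """A data expression whose lookup is postponed until step execution."""

    def __init__(self, lookup: Callable[[dict], Any]):
        self.lookup = lookup


def latest(name: str) -> Deferred:
    # Hypothetical stand-in for a data.latest() expression in step inputs.
    return Deferred(lambda store: store[name])


def run_workflow(steps: list[dict]) -> dict:
    store: dict[str, Any] = {}  # in-memory only, like ephemeral data
    for step in steps:
        # Resolve deferred inputs just before the step runs, after
        # upstream steps have written their outputs.
        inputs = {
            key: value.lookup(store) if isinstance(value, Deferred) else value
            for key, value in step["inputs"].items()
        }
        store[step["name"]] = step["run"](inputs)
    return store


steps = [
    {"name": "list-pods", "inputs": {}, "run": lambda _: ["pod-a", "pod-b"]},
    {
        "name": "count-pods",
        "inputs": {"pods": latest("list-pods")},
        "run": lambda inputs: len(inputs["pods"]),
    },
]
result = run_workflow(steps)
```

If latest("list-pods") were looked up before any step had run, the lookup would hit an empty store and raise KeyError, which mirrors the failure that deferral avoids.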
The --last-evaluated flag preserves this behavior: it skips the full
evaluation pass but still resolves deferred data expressions at each step's
execution time. For scheduled workflows that poll dynamic infrastructure
repeatedly, this combination — ephemeral data with --last-evaluated — avoids
both unnecessary re-evaluation and unnecessary storage.
See The Workflow Execution Model for the broader design of workflow data flow.