DATA LIFETIMES

Every data artifact in Swamp has a lifetime that determines when it becomes eligible for garbage collection. The choice is not about storage mechanics — it expresses what the data means and how long it remains true.

The spectrum

Five lifetimes span from no durability to permanent retention:

Lifetime     Survives process exit    Survives workflow end    Survives indefinitely
ephemeral    No                       No                       No
job          Yes                      No                       No
workflow     Yes                      No                       No
duration     Yes                      Yes                      No
infinite     Yes                      Yes                      Yes

The first column, Survives process exit, is the important one. Ephemeral data exists only in memory for the duration of the process. Every other lifetime is written to disk and survives restarts; what differs is how long the data stays before garbage collection removes it.
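As a rough illustration, the spectrum can be modelled as a small lookup table. This is a sketch for reasoning about the lifetimes, not Swamp's implementation; the names Lifetime, SURVIVAL, and is_disk_backed are hypothetical.

```python
from enum import Enum


class Lifetime(Enum):
    EPHEMERAL = "ephemeral"
    JOB = "job"
    WORKFLOW = "workflow"
    DURATION = "duration"
    INFINITE = "infinite"


# (survives process exit, survives workflow end, survives indefinitely),
# copied from the table in this section.
SURVIVAL = {
    Lifetime.EPHEMERAL: (False, False, False),
    Lifetime.JOB: (True, False, False),
    Lifetime.WORKFLOW: (True, False, False),
    Lifetime.DURATION: (True, True, False),
    Lifetime.INFINITE: (True, True, True),
}


def is_disk_backed(lifetime: Lifetime) -> bool:
    """Data that survives process exit must have been written to disk."""
    return SURVIVAL[lifetime][0]
```

Only ephemeral returns False from is_disk_backed, which is the split the rest of this page builds on.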

When ephemeral fits

Ephemeral data is for values that are true right now and meaningless moments later. The canonical examples are dynamic infrastructure inventories: the set of running Kubernetes pods, the current members of an AWS Auto Scaling group, the active containers in a Docker Swarm service. These change constantly. Storing yesterday's pod list does not tell you what is running today — it tells you what was running, which is a different question served by audit logs, not by data artifacts.

The same reasoning applies to intermediate computation results in multi-step workflows. When step one fetches raw API responses and step two aggregates them, the raw responses have served their purpose once the aggregation is complete. Persisting them creates storage debt and noise in queries without adding value.

The tradeoff is durability. If the process crashes mid-workflow, ephemeral data is lost and the workflow cannot resume from where it stopped — it must restart from scratch. For workflows where each step takes seconds, this is acceptable. For workflows with expensive steps (long-running scans, rate-limited API calls), consider workflow lifetime instead, which survives crashes but is still cleaned up when the run ends.

Ephemeral data has a memory budget — 512 MB by default, configurable via SWAMP_EPHEMERAL_BUDGET. This prevents a misbehaving step from consuming all available memory. The budget is a signal: if your intermediate data routinely approaches it, the data may be too large for in-memory storage and a disk-backed lifetime (workflow or a short duration) may be more appropriate.
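The idea of a memory budget can be sketched with a toy in-memory store that rejects writes once the total payload size would exceed a limit. This is an assumption-laden illustration: EphemeralStore and BudgetExceeded are hypothetical names, the default treats 512 MB as 512 × 1024 × 1024 bytes, and how Swamp actually measures usage or reacts to an overrun is not described here.

```python
class BudgetExceeded(Exception):
    """Raised when a write would push the store past its budget."""


class EphemeralStore:
    """Toy in-memory store with a byte budget (hypothetical, for illustration)."""

    def __init__(self, budget_bytes: int = 512 * 1024 * 1024):
        self.budget_bytes = budget_bytes
        self.used_bytes = 0
        self._items: dict[str, bytes] = {}

    def put(self, name: str, payload: bytes) -> None:
        # Overwriting an entry frees the bytes held by the old payload.
        previous = len(self._items.get(name, b""))
        projected = self.used_bytes - previous + len(payload)
        if projected > self.budget_bytes:
            raise BudgetExceeded(
                f"{name}: {projected} bytes would exceed budget of {self.budget_bytes}"
            )
        self._items[name] = payload
        self.used_bytes = projected

    def get(self, name: str) -> bytes:
        return self._items[name]
```

In this sketch a step that hits BudgetExceeded fails fast instead of starving the rest of the process, which is the point of having a budget at all.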

When job and workflow fit

job and workflow lifetimes scope data to the execution that produced it. They are written to disk, so they survive process crashes and support workflow resume. Once the workflow run is cleaned up, the data goes with it.

These lifetimes suit artifacts that need to be durable within a run but have no value between runs. Test results within a CI pipeline, deployment manifests generated for a specific release, temporary credentials rotated per execution — all are tied to a specific run and should not outlive it.

The distinction between job and workflow matters in multi-job workflows. A job-scoped artifact is garbage-collected when its job completes, even if the workflow continues. A workflow-scoped artifact survives until the entire run ends. Choose job when a later job in the same workflow should not see the data; choose workflow when it should.
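The visibility rule can be sketched as a small function. This is a simplified model rather than the engine's logic: Artifact and visible_artifacts are hypothetical names, and it treats "eligible for collection" as "no longer visible".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    name: str
    lifetime: str       # "job" or "workflow"
    producer_job: str   # the job that wrote the artifact


def visible_artifacts(artifacts, finished_jobs, workflow_finished=False):
    """Return the names of artifacts a running job could still read."""
    if workflow_finished:
        return []  # both scopes end with the run
    visible = []
    for artifact in artifacts:
        if artifact.lifetime == "job" and artifact.producer_job in finished_jobs:
            continue  # job-scoped data goes away with its job
        visible.append(artifact.name)
    return visible
```

For example, once a build job has finished, a later deploy job in the same run would see the build's workflow-scoped outputs but not its job-scoped ones.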

When duration fits

Duration lifetimes (1h, 7d, 30d, 1y) express data that is valid for a bounded window. The window is wall-clock time from creation, not tied to any execution.
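A minimal sketch of how such a duration string could map to an expiry time, measured from creation. It handles only the h, d, and y suffixes shown above and assumes one year means 365 days; Swamp's own parser may accept other units or define a year differently. The function names are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Assumed unit sizes; "y" as 365 days is a simplification.
_UNITS = {
    "h": timedelta(hours=1),
    "d": timedelta(days=1),
    "y": timedelta(days=365),
}


def parse_duration(text: str) -> timedelta:
    """Parse strings such as '1h', '7d', '30d', or '1y'."""
    match = re.fullmatch(r"(\d+)([hdy])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def is_expired(created_at: datetime, lifetime: str, now: datetime) -> bool:
    """True once the wall-clock window that started at creation has elapsed."""
    return now >= created_at + parse_duration(lifetime)
```

Note that is_expired depends only on created_at and the clock, not on whether any workflow is still running.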

This fits reporting and compliance use cases. A weekly security scan produces a report that is actionable for about seven days; once the next scan runs, the old report is superseded. Setting lifetime: 7d means the report is available for review throughout the week and becomes eligible for cleanup seven days after creation, around the time the next one arrives. No manual deletion, no unbounded storage growth.

Duration also fits caching patterns. A model that resolves DNS records might cache results with lifetime: 1h. Downstream models that read the cached data within the hour get a fast lookup; after the hour, the stale data is collected and the next query triggers a fresh resolution.

The choice of duration is a judgement call that depends on how quickly the data loses relevance. Shorter durations reduce storage cost and query noise. Longer durations preserve history for rollback and investigation. There is no universal right answer — it depends on the domain.

When infinite fits

infinite lifetime means the data never expires. It persists until someone explicitly removes it or the repository is destroyed, although older versions of it can still be pruned by the garbage collection policy discussed below.

This is the right choice for data that serves as a historical record or a long-lived configuration artifact. The result of a model that provisions a VPC should be infinite — the VPC ID is referenced by other models indefinitely, and deleting the data would break those references. Audit logs, compliance evidence, and configuration baselines all belong at infinite lifetime.

The cost is storage. Every method execution that writes to an infinite-lifetime output creates a new version that is retained until garbage collection prunes old versions (controlled by the garbageCollection policy). For models that run frequently, this can accumulate. The garbage collection policy — "keep the 10 most recent versions" or "keep versions from the last 30 days" — provides the counterbalance, keeping storage bounded while preserving recent history.
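The two example policies can be sketched as pruning functions over a version list. This is an illustration only: prune_by_count and prune_by_age are hypothetical names, and the real garbageCollection policy syntax and semantics are not shown here.

```python
from datetime import datetime, timedelta


def prune_by_count(versions, keep):
    """Return ids of versions to delete, keeping the `keep` most recent.

    `versions` is a list of (version_id, created_at) tuples.
    """
    newest_first = sorted(versions, key=lambda v: v[1], reverse=True)
    return [version_id for version_id, _ in newest_first[keep:]]


def prune_by_age(versions, max_age, now):
    """Return ids of versions created more than `max_age` before `now`."""
    return [
        version_id
        for version_id, created_at in versions
        if now - created_at > max_age
    ]
```

Either function bounds storage for a frequently run model: the first caps the number of retained versions, the second caps how far back the history reaches.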

Infinite is the default for many model types, including command/shell. This is deliberate: it is better to retain data that turns out to be unnecessary than to discard data that turns out to be needed. When you know a particular use case produces disposable data, override the lifetime explicitly rather than relying on the default.

Mixing lifetimes in a workflow

A single workflow can mix lifetimes across steps using dataOutputOverrides. This is the common pattern for pipelines that produce both intermediate and final outputs:

  • Early steps (data fetching, raw API responses): ephemeral or workflow
  • Middle steps (transformation, aggregation): ephemeral
  • Final steps (reports, notifications, provisioned resources): infinite or a duration

The override applies per step and per output spec, so a step that produces both a resource and a log can set them to different lifetimes. See the how-to guide for the mechanics.

Deferred resolution and ephemeral data

Ephemeral data introduces a timing constraint that does not apply to disk-persisted data. Because ephemeral data exists only in memory, it must be read while the process is still running. In a workflow, this means the downstream step's data.latest() expression must be evaluated at step execution time — after the upstream step has written its output — not during the initial workflow evaluation pass, when the data does not yet exist.

The workflow engine handles this automatically. Data functions (data.latest(), data.query(), data.findByTag(), and others) in step task.inputs are treated as deferred step-output dependencies. The engine skips them during evaluation and resolves them when the step is about to run. This is the same mechanism that makes non-ephemeral step-to-step data flow work, but it is essential for ephemeral data — without deferral, the expression would evaluate against an empty store and fail.
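The deferral idea can be sketched with a toy engine. None of this is Swamp's code: Deferred, latest, and run_workflow are hypothetical stand-ins, and a plain dict plays the role of the in-memory store. The point is only that an input is looked up just before its step runs, after upstream steps have written their output.

```python
class Deferred:
    """Marks an input that must be resolved at step execution time."""

    def __init__(self, resolve):
        self.resolve = resolve  # callable that takes the data store


def latest(name):
    # Hypothetical stand-in for a data.latest() expression in step inputs.
    return Deferred(lambda store: store[name])


def run_workflow(steps):
    store = {}  # stands in for ephemeral, in-memory data
    for step in steps:
        # Resolve deferred inputs now, not when the workflow was defined.
        inputs = {
            key: value.resolve(store) if isinstance(value, Deferred) else value
            for key, value in step["inputs"].items()
        }
        store[step["name"]] = step["run"](inputs)
    return store


steps = [
    {"name": "fetch", "inputs": {}, "run": lambda inputs: [3, 1, 2]},
    {
        "name": "aggregate",
        "inputs": {"raw": latest("fetch")},
        "run": lambda inputs: sum(inputs["raw"]),
    },
]
result = run_workflow(steps)
```

Resolving latest("fetch") against an empty store before any step has run would raise KeyError, which mirrors the failure that deferral avoids.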

The --last-evaluated flag preserves this behavior: it skips the full evaluation pass but still resolves deferred data expressions at each step's execution time. For scheduled workflows that poll dynamic infrastructure repeatedly, this combination — ephemeral data with --last-evaluated — avoids both unnecessary re-evaluation and unnecessary storage.

See The Workflow Execution Model for the broader design of workflow data flow.