Failure Semantics

This page defines the failure and recovery behavior of remote workers during execution. It covers what happens when a control socket drops, how in-flight dispatches are handled depending on step type, how cancellation propagates, and how session credentials are maintained.

Reconnection grace window

When the WebSocket control socket between a worker and the orchestrator drops, the worker automatically attempts to reconnect. This behavior is on by default; passing --no-reconnect to swamp worker disables it and causes the worker to exit immediately on disconnect.

The orchestrator holds the worker's registration for a grace period after the socket drops. During this window:

The worker's pool membership is preserved. It is not removed from the dispatch roster.
In-flight dispatches assigned to that worker are held in a pending state. They are not immediately failed or re-dispatched.
If the worker reconnects within the grace window, it resumes its registration and in-flight dispatches continue normally.
If the grace window expires without reconnection, the worker is deregistered and in-flight dispatches are resolved according to the rules in the next section.
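
The grace-window rules above can be read as a small state machine. The following is a minimal sketch of that reading, not the orchestrator's implementation: the class, its fields, and the grace_seconds setting are hypothetical names chosen for illustration, and the window length is not a value taken from this page.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class WorkerState(Enum):
    CONNECTED = "connected"
    GRACE = "grace"                # socket dropped, registration still held
    DEREGISTERED = "deregistered"


@dataclass
class WorkerRegistration:
    grace_seconds: float           # hypothetical setting, not a documented option
    state: WorkerState = WorkerState.CONNECTED
    dropped_at: Optional[float] = None
    in_flight: List[str] = field(default_factory=list)

    def on_socket_drop(self, now: float) -> None:
        # Pool membership is kept; in-flight dispatches are held, not failed.
        self.state = WorkerState.GRACE
        self.dropped_at = now

    def on_reconnect(self, now: float) -> bool:
        # Reconnecting inside the window resumes the existing registration.
        if self.state is WorkerState.GRACE and now - self.dropped_at < self.grace_seconds:
            self.state = WorkerState.CONNECTED
            self.dropped_at = None
            return True
        return False

    def expire_if_due(self, now: float) -> List[str]:
        # Returns the dispatches that must now be resolved, or [] if not expired.
        if self.state is WorkerState.GRACE and now - self.dropped_at >= self.grace_seconds:
            self.state = WorkerState.DEREGISTERED
            to_resolve, self.in_flight = self.in_flight, []
            return to_resolve
        return []
```

In this sketch, expire_if_due hands back the held dispatches only once, at the moment the worker is deregistered, which is the point where the per-step rules in the next section apply.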

In-flight dispatch on socket drop

When a worker disconnects and the grace window expires (or --no-reconnect was set), the orchestrator must decide how to handle each in-flight dispatch. The decision depends on whether the step performs writes.

Important

The write-then-fail rule. Once a disconnect is final (the grace window has expired, or --no-reconnect was set), write-bearing steps are failed and never re-dispatched. A write may have partially completed on the worker before the socket dropped, so re-dispatching the step to another worker could cause double-writes or leave data in an inconsistent state. The workflow run records the step as failed.

No-write steps

Steps that perform only read operations (queries, fetches, validations) are safe to re-dispatch. After the grace window expires, the orchestrator re-dispatches the step to the same worker (if it reconnects later) or to another matching worker in the pool. The step executes from the beginning — there is no partial resume.

Write-bearing steps

Steps that modify state (creates, updates, deletes, mutations) are failed immediately. The orchestrator marks the step as failed with a disconnect reason. The workflow run records the failure. No retry is attempted by the orchestrator.

The distinction between no-write and write-bearing is declared at the step level in the workflow definition. The orchestrator does not inspect step behavior at runtime — it relies on the declared step metadata.
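
As an illustration, the decision reduces to a lookup on declared metadata. This is a minimal sketch: the Step type and its writes field are hypothetical stand-ins for whatever the workflow definition actually declares.

```python
from dataclasses import dataclass
from enum import Enum


class Resolution(Enum):
    REDISPATCH = "redispatch"
    FAIL = "fail"


@dataclass(frozen=True)
class Step:
    name: str
    writes: bool  # hypothetical field standing in for the declared step metadata


def resolve_in_flight(step: Step) -> Resolution:
    # The decision uses declared metadata only; step behavior is never inspected.
    return Resolution.FAIL if step.writes else Resolution.REDISPATCH
```

A step declared as no-write that in fact writes would be re-dispatched under this rule, which is why the declaration has to be accurate.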

Queue timeout

When a placed step is queued because no matching worker is available, it waits for a configurable timeout. If no matching worker appears within the timeout, the step fails with an error naming the unmet placement requirement.

The timeout is resolved in layers, highest precedence first: the per-step queueTimeout (in seconds, set in the workflow YAML), then the --queue-timeout flag on swamp serve, then the default of 10m. A value of 0 at any layer disables the timeout: the step queues indefinitely until a matching worker enrolls or the workflow is cancelled.
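
The layering can be sketched as a small resolution function. Two assumptions are made here and are not confirmed by this page: the highest-precedence layer that is set decides the outcome (so a 0 there disables the timeout even if a lower layer is non-zero), and both inputs have already been converted to seconds. The function and parameter names are hypothetical.

```python
from typing import Optional

DEFAULT_QUEUE_TIMEOUT_SECONDS = 600  # the 10m default, expressed in seconds


def effective_queue_timeout(
    step_timeout: Optional[int] = None,        # per-step queueTimeout, if set
    serve_flag_timeout: Optional[int] = None,  # --queue-timeout, if passed
) -> Optional[int]:
    """Return the timeout in seconds, or None if the step queues indefinitely."""
    if step_timeout is not None:
        chosen = step_timeout
    elif serve_flag_timeout is not None:
        chosen = serve_flag_timeout
    else:
        chosen = DEFAULT_QUEUE_TIMEOUT_SECONDS
    # Assumption: 0 at the deciding layer means "no timeout".
    return None if chosen == 0 else chosen
```

For example, effective_queue_timeout(step_timeout=30, serve_flag_timeout=120) resolves to 30, and effective_queue_timeout(serve_flag_timeout=0) resolves to None.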

A queue timeout is an ordinary step failure. Downstream steps with dependsOn: failed conditions fire normally. The step never executed on a worker, so there is no in-flight dispatch to resolve — the orchestrator marks the step as failed and proceeds with dependency evaluation.

Cooperative cancellation

When the orchestrator cancels a dispatched step — because the workflow was cancelled, a timeout was reached, or a dependent step failed — it sends a cancellation signal over the control socket to the worker executing that step.

The worker cooperatively terminates the running step. "Cooperatively" means the worker acknowledges the signal and stops execution at the next safe point. The step is not killed mid-syscall.
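
A common way to implement this kind of cooperative stop is a shared flag that the running step checks between units of work. The sketch below uses Python's threading.Event for the flag; it illustrates the general technique and is not a description of how the worker is built. The run_step function and the doubling "work" are placeholders.

```python
import threading
from typing import Iterable, List, Tuple


def run_step(work_items: Iterable[int], cancel: threading.Event) -> Tuple[str, List[int]]:
    """Process items one at a time, checking for cancellation between items."""
    done: List[int] = []
    for item in work_items:
        if cancel.is_set():
            # Safe point: no unit of work is in progress, so stopping is clean.
            return "cancelled", done
        done.append(item * 2)  # stand-in for one uninterruptible unit of work
    return "completed", done
```

The signal handler only sets the event; the step itself decides where stopping is safe, so a unit of work that has started always finishes.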

If the control socket is already disconnected when the orchestrator decides to cancel:

Cancellation is best-effort. The orchestrator cannot deliver the signal.
The step may run to completion on the worker side, but its result is discarded by the orchestrator. The orchestrator has already marked the step as cancelled.
If the worker later reconnects and attempts to report the step's result, the orchestrator rejects the report with a stale-dispatch error.
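
The disconnected-cancel case can be sketched as a ledger that records the cancellation first and then rejects any late report. The class, method names, and StaleDispatchError type are hypothetical; only the behavior (mark cancelled, discard the result, reject the late report) comes from the rules above.

```python
from typing import Dict


class StaleDispatchError(Exception):
    """Hypothetical error type standing in for the stale-dispatch rejection."""


class DispatchLedger:
    def __init__(self) -> None:
        # dispatch id -> "running" | "cancelled" | "succeeded"
        self._status: Dict[str, str] = {}

    def dispatch(self, dispatch_id: str) -> None:
        self._status[dispatch_id] = "running"

    def cancel(self, dispatch_id: str, socket_connected: bool) -> bool:
        # The step is marked cancelled whether or not the signal can be sent.
        self._status[dispatch_id] = "cancelled"
        return socket_connected  # True if the signal could be delivered

    def report_result(self, dispatch_id: str, result: str) -> None:
        # A report for anything that is no longer running is rejected.
        if self._status.get(dispatch_id) != "running":
            raise StaleDispatchError(f"dispatch {dispatch_id} is no longer active")
        self._status[dispatch_id] = "succeeded"

    def status(self, dispatch_id: str) -> str:
        return self._status[dispatch_id]
```

Because the ledger is updated before any delivery attempt, the orchestrator's view stays consistent even when the worker never hears about the cancellation.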

Graceful drain

A drain causes the worker to finish in-flight work, reject new dispatches, and disconnect with exit code 0. Three triggers initiate a drain:

Signal — SIGTERM (Unix) or SIGINT. A second signal during drain force-exits the process with a non-zero exit code.
--max-dispatches <n> — the worker drains after completing n dispatches.
--idle-timeout <duration> — the worker drains after being continuously idle for the specified duration.

All three triggers follow the same sequence:

The worker stops accepting new dispatches.
In-flight dispatches run to completion.
The worker sends a worker.drain message to the orchestrator.
The worker disconnects and exits 0.

The orchestrator records the worker's state as draining and excludes it from dispatch scheduling. When the worker disconnects, no reconnection grace window applies and no in-flight dispatches are re-dispatched.

A dispatch that arrives while the worker is draining is rejected with worker_draining. The orchestrator re-queues the step, following the same re-queue path as worker_busy.
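
The drain sequence and the worker_draining rejection can be sketched together. This models only the --max-dispatches and signal-style triggers; the class and method names are hypothetical, and the sent list stands in for messages to the orchestrator.

```python
from typing import List, Optional, Set


class DrainingWorker:
    def __init__(self, max_dispatches: Optional[int] = None) -> None:
        self.max_dispatches = max_dispatches
        self.completed = 0
        self.in_flight: Set[str] = set()
        self.draining = False
        self.sent: List[str] = []            # messages sent to the orchestrator
        self.exit_code: Optional[int] = None

    def begin_drain(self) -> None:
        # Entry point for a signal or idle-timeout trigger.
        self.draining = True
        self._finish_if_idle()

    def offer(self, dispatch_id: str) -> str:
        # Step 1: a draining worker rejects new dispatches.
        if self.draining:
            return "worker_draining"
        self.in_flight.add(dispatch_id)
        return "accepted"

    def complete(self, dispatch_id: str) -> None:
        # Step 2: in-flight dispatches run to completion.
        self.in_flight.remove(dispatch_id)
        self.completed += 1
        if self.max_dispatches is not None and self.completed >= self.max_dispatches:
            self.draining = True
        self._finish_if_idle()

    def _finish_if_idle(self) -> None:
        # Steps 3 and 4: announce the drain, then disconnect and exit 0.
        if self.draining and not self.in_flight and self.exit_code is None:
            self.sent.append("worker.drain")
            self.exit_code = 0
```

A caller that receives "worker_draining" from offer would re-queue the step, in the same way it would for worker_busy.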

For the exit-code contract, see Worker Commands — Exit-code contract.

Session credential lifetime

A worker receives a session credential from the orchestrator during enrollment. This credential authenticates the worker on subsequent connections and is distinct from the enrollment token (which is consumed on first use).

The session credential uses a sliding refresh mechanism:

While the control socket is healthy, the orchestrator periodically re-issues the credential. Each re-issue extends the credential's validity window.
The worker logs each successful refresh at the default log level. Under normal operation, these log lines confirm the connection is healthy. No action is needed.
If a credential refresh fails (the orchestrator does not respond or returns an error), the worker logs a warning. If the credential expires without a successful refresh, the worker disconnects.

Credential refresh failures are an early signal of connectivity problems. A warning log from the worker typically precedes a full socket drop by one refresh interval.
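
The sliding refresh can be sketched as an expiry time that moves forward on every successful re-issue. The class, the ttl_seconds field, and the log messages are hypothetical; the actual validity window, refresh interval, and log text are not specified on this page.

```python
import logging
from dataclasses import dataclass

log = logging.getLogger("worker.credential")  # hypothetical logger name


@dataclass
class SessionCredential:
    ttl_seconds: float   # hypothetical length of the validity window
    expires_at: float    # absolute time at which the credential lapses

    def on_refresh(self, now: float, ok: bool) -> None:
        if ok:
            # Each successful re-issue slides the validity window forward.
            self.expires_at = now + self.ttl_seconds
            log.info("session credential refreshed")
        else:
            # A failed refresh leaves the old expiry in place.
            log.warning("session credential refresh failed")

    def must_disconnect(self, now: float) -> bool:
        # Expiry without a successful refresh forces the worker to disconnect.
        return now >= self.expires_at
```

With this shape, a single failed refresh does not disconnect the worker on its own; the worker only disconnects if no later refresh succeeds before expires_at.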

For details on credential isolation, rotation, and the trust model between workers and the orchestrator, see the Security reference.