FAILURE SEMANTICS

This page defines the failure and recovery behavior of remote workers during execution. It covers what happens when a control socket drops, how in-flight dispatches are handled depending on step type, how cancellation propagates, and how session credentials are maintained.

Reconnection grace window

When the WebSocket control socket between a worker and the orchestrator drops, the worker automatically attempts to reconnect. This behavior is on by default; passing --no-reconnect to swamp worker disables it and causes the worker to exit immediately on disconnect.

The orchestrator holds the worker's registration for a grace period after the socket drops. During this window:

  • The worker's pool membership is preserved. It is not removed from the dispatch roster.
  • In-flight dispatches assigned to that worker are held in a pending state. They are not immediately failed or re-dispatched.
  • If the worker reconnects within the grace window, it resumes its registration and in-flight dispatches continue normally.
  • If the grace window expires without reconnection, the worker is deregistered and in-flight dispatches are resolved according to the rules in the next section.
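The grace-window rules above can be read as a small state machine. The following is a minimal sketch of that reading, not the orchestrator's actual implementation; `WorkerRegistration`, `WorkerState`, and the `grace_seconds` setting are hypothetical names introduced for illustration, and the length of the real grace period is not specified on this page.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class WorkerState(Enum):
    CONNECTED = "connected"
    GRACE = "grace"                # socket dropped, registration still held
    DEREGISTERED = "deregistered"  # grace window expired without reconnection


@dataclass
class WorkerRegistration:
    grace_seconds: float  # hypothetical setting; the real value is not documented here
    state: WorkerState = WorkerState.CONNECTED
    dropped_at: Optional[float] = None

    def on_socket_drop(self, now: float) -> None:
        # Start the grace window; pool membership is preserved meanwhile.
        if self.state is WorkerState.CONNECTED:
            self.state = WorkerState.GRACE
            self.dropped_at = now

    def tick(self, now: float) -> None:
        # Deregister once the grace window has fully elapsed.
        if (
            self.state is WorkerState.GRACE
            and self.dropped_at is not None
            and now - self.dropped_at >= self.grace_seconds
        ):
            self.state = WorkerState.DEREGISTERED

    def on_reconnect(self, now: float) -> bool:
        """Return True if the existing registration is resumed."""
        self.tick(now)
        if self.state is WorkerState.GRACE:
            self.state = WorkerState.CONNECTED
            self.dropped_at = None
        return self.state is WorkerState.CONNECTED

    @property
    def in_pool(self) -> bool:
        # The worker stays on the dispatch roster until it is deregistered.
        return self.state is not WorkerState.DEREGISTERED
```

With a 30-second window in this sketch, a worker that drops at t=100 and reconnects at t=120 resumes its registration, while one that is still disconnected at t=130 or later is deregistered.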

In-flight dispatch on socket drop

When a worker disconnects and the grace window expires (or --no-reconnect was set), the orchestrator must decide how to handle each in-flight dispatch. The decision depends on whether the step performs writes.

Important

The write-then-fail rule. Write-bearing steps are never re-dispatched. Once the disconnect is final (the grace window has expired, or --no-reconnect was set), they are failed immediately. A write may have partially completed on the worker before the socket dropped; re-dispatching the step to another worker could cause double-writes or leave data in an inconsistent state. The step is marked as failed, and the workflow run records the failure.

No-write steps

Steps that perform only read operations (queries, fetches, validations) are safe to re-dispatch. After the grace window expires, the orchestrator re-dispatches the step to the same worker (if it reconnects later) or to another matching worker in the pool. The step executes from the beginning — there is no partial resume.

Write-bearing steps

Steps that modify state (creates, updates, deletes, mutations) are failed as soon as the disconnect is final. The orchestrator marks the step as failed with a disconnect reason, and the workflow run records the failure. The orchestrator does not attempt a retry.

The distinction between no-write and write-bearing is declared at the step level in the workflow definition. The orchestrator does not inspect step behavior at runtime — it relies on the declared step metadata.
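As an illustration of the rule in this section, the decision reduces to a lookup on declared metadata. This is a sketch under assumptions: the `Step` type, its `writes` field, and the `Resolution` names are hypothetical, since this page does not show how the workflow definition spells the declaration.

```python
from dataclasses import dataclass
from enum import Enum


class Resolution(Enum):
    REDISPATCH = "redispatch"  # safe: the step only reads
    FAIL = "fail"              # unsafe to repeat: a write may have partially landed


@dataclass(frozen=True)
class Step:
    name: str
    writes: bool  # hypothetical field standing in for the declared step metadata


def resolve_on_disconnect(step: Step) -> Resolution:
    # The decision uses only the declared metadata, never observed behavior.
    return Resolution.FAIL if step.writes else Resolution.REDISPATCH
```

One consequence of relying on the declaration: a step that writes but is declared as no-write would be re-dispatched under this model, so the declaration has to be accurate.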

Cooperative cancellation

When the orchestrator cancels a dispatched step — because the workflow was cancelled, a timeout was reached, or a dependent step failed — it sends a cancellation signal over the control socket to the worker executing that step.

The worker cooperatively terminates the running step. "Cooperatively" means the worker acknowledges the signal and stops execution at the next safe point. The step is not killed mid-syscall.
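One common way to implement "stop at the next safe point" is to check a cancellation flag between units of work. The sketch below assumes that pattern; `run_step` and the idea of a step as a sequence of callables are illustrative only and are not taken from the worker's code.

```python
import threading
from typing import Callable, Iterable


def run_step(units: Iterable[Callable[[], None]], cancel: threading.Event) -> str:
    """Run units of work in order, stopping cleanly if cancellation is requested."""
    for unit in units:
        # Safe point: between units, never in the middle of one.
        if cancel.is_set():
            return "cancelled"
        unit()
    return "completed"
```

A unit that has already started always finishes in this sketch, which mirrors the statement that a step is not killed mid-syscall.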

If the control socket is already disconnected when the orchestrator decides to cancel:

  • Cancellation is best-effort. The orchestrator cannot deliver the signal.
  • The step may run to completion on the worker side, but its result is discarded by the orchestrator. The orchestrator has already marked the step as cancelled.
  • If the worker later reconnects and attempts to report the step's result, the orchestrator rejects the report with a stale-dispatch error.
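The disconnected-cancel case above can be modeled as a ledger that records the cancellation locally and refuses late results. This is a minimal sketch; `DispatchLedger` and `StaleDispatchError` are hypothetical names, and the real error shape returned by the orchestrator is not described on this page.

```python
from typing import Dict


class StaleDispatchError(Exception):
    """Hypothetical error for a result reported against a dispatch that is no longer running."""


class DispatchLedger:
    def __init__(self) -> None:
        self._status: Dict[str, str] = {}

    def dispatch(self, dispatch_id: str) -> None:
        self._status[dispatch_id] = "running"

    def cancel(self, dispatch_id: str) -> None:
        # Recorded on the orchestrator side even if the signal cannot be delivered.
        self._status[dispatch_id] = "cancelled"

    def report_result(self, dispatch_id: str, result: object) -> None:
        # Only a dispatch that is still running may accept a result.
        if self._status.get(dispatch_id) != "running":
            raise StaleDispatchError(dispatch_id)
        self._status[dispatch_id] = "completed"

    def status(self, dispatch_id: str) -> str:
        return self._status[dispatch_id]
```

In this model a worker that finishes a cancelled step after reconnecting gets a `StaleDispatchError`, and the step's recorded status stays "cancelled".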

Session credential lifetime

A worker receives a session credential from the orchestrator during enrollment. This credential authenticates the worker on subsequent connections and is distinct from the enrollment token (which is consumed on first use).

The session credential uses a sliding refresh mechanism:

  • While the control socket is healthy, the orchestrator periodically re-issues the credential. Each re-issue extends the credential's validity window.
  • The worker logs each successful refresh at the default log level. Under normal operation, these log lines confirm the connection is healthy. No operator action is needed.
  • If a credential refresh fails (the orchestrator does not respond or returns an error), the worker logs a warning. If the credential expires without a successful refresh, the worker disconnects.

Credential refresh failures are an early signal of connectivity problems. A warning log from the worker typically precedes a full socket drop by one refresh interval.
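The sliding refresh described in this section can be sketched as follows. The names `SessionCredential`, `validity_seconds`, and `on_refresh_tick` are assumptions made for illustration, as are the returned action strings; the actual validity window, refresh interval, and log messages are not given here.

```python
from dataclasses import dataclass


@dataclass
class SessionCredential:
    expires_at: float
    validity_seconds: float  # hypothetical length of the validity window

    def refresh(self, now: float) -> None:
        # Sliding window: the new expiry is measured from the refresh time.
        self.expires_at = now + self.validity_seconds

    def is_expired(self, now: float) -> bool:
        return now >= self.expires_at


def on_refresh_tick(cred: SessionCredential, now: float, refresh_ok: bool) -> str:
    """Return the action the worker takes at one refresh attempt."""
    if refresh_ok:
        cred.refresh(now)
        return "refreshed"   # logged at the default level
    if cred.is_expired(now):
        return "disconnect"  # credential lapsed without a successful refresh
    return "warn"            # refresh failed, credential still valid for now
```

Because a failed refresh leaves the old expiry in place, the gap between the first "warn" and the eventual "disconnect" in this sketch is whatever validity remains at the time of the failure.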

For details on credential isolation, rotation, and the trust model between workers and the orchestrator, see the Security reference.