01 Issue
Bug · Shipped · Swamp CLI
Assignees: stack72

#734 workflow resume holds the global lock across the resumed step, deadlocking any datastore op the step performs

Opened by vcjdeboer · 6/21/2026 · Shipped 6/21/2026

Summary

swamp workflow resume executes the resumed step while holding the global __global__ datastore lock (as a "structural command") for the entire step. Any datastore operation the step performs during its execution, e.g. an independent swamp model method run on another model (recorders, fan-out, nested orchestration), blocks on that global lock until the structural command finishes. The structural command cannot finish until the step finishes, so the step deadlocks against itself and hangs until its own timeout. A plain swamp workflow run (and a direct swamp model method run) does not take this global lock, so the identical step succeeds. No concurrency is involved: a single resumed workflow starves its own subprocess.

This is distinct from #520 (per-model lock under concurrent access). Here it is the global lock, on the resume / structural-command path — swamp's own log names both.
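The cycle described above can be illustrated with a toy model. This is a minimal sketch, not swamp's implementation: LockTable, innerOp, runPath and resumePath are hypothetical names, the per-model lock taken on the run path is an assumption, and "blocked" stands in for an inner call that would wait indefinitely.

```typescript
// Toy model of an exclusive lock table; not swamp's actual implementation.
class LockTable {
  private holders = new Map<string, string>();

  // Returns true if `owner` obtained `name`, false if it is already held.
  tryAcquire(name: string, owner: string): boolean {
    if (this.holders.has(name)) return false;
    this.holders.set(name, owner);
    return true;
  }

  // Releases `name` only if `owner` is the current holder.
  release(name: string, owner: string): void {
    if (this.holders.get(name) === owner) this.holders.delete(name);
  }

  isHeld(name: string): boolean {
    return this.holders.has(name);
  }
}

const GLOBAL = "__global__";

// The inner datastore op (the subprocess) waits while the global lock is held,
// then takes the per-model lock of the model it operates on.
function innerOp(locks: LockTable, model: string): "ok" | "blocked" {
  if (locks.isHeld(GLOBAL)) return "blocked"; // "waiting for structural command to finish"
  if (!locks.tryAcquire(model, "inner")) return "blocked";
  locks.release(model, "inner");
  return "ok";
}

// Run path (assumed): the step holds only its own model's lock.
function runPath(): "ok" | "blocked" {
  const locks = new LockTable();
  locks.tryAcquire("spawn", "step");
  const result = innerOp(locks, "sink");
  locks.release("spawn", "step");
  return result;
}

// Resume path (as reported): the global lock is held across the whole step,
// so the inner op can never proceed while the step is still running.
function resumePath(): "ok" | "blocked" {
  const locks = new LockTable();
  locks.tryAcquire(GLOBAL, "resume");
  const result = innerOp(locks, "sink");
  locks.release(GLOBAL, "resume");
  return result;
}
```

In this model runPath() yields "ok" and resumePath() yields "blocked", matching the A/B difference reported under Observed Behavior.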

Environment

  • swamp 20260621.154815.0-sha.b4781e86 (also reproduced on the prior …d9abe1a2)
  • macOS (darwin), local filesystem datastore (no Postgres/S3)

Steps to Reproduce

Two trivial models (no domain deps):

  1. sink — its run method writes one small resource.
  2. spawn — its run method synchronously shells out (Deno.Command) to swamp model method run <sink> run (a different model), bounded by a kill-timeout, recording each inner call's exit code + elapsed ms.

Two workflows:

  • A (executes in a run process): one step, spawn.run.
  • B (executes in a resume process): a manual_approval gate, then spawn.run.

Run A directly. For B: swamp workflow run B (suspends at the gate), swamp workflow approve B gate, swamp workflow resume B.
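For anyone rebuilding the repro, the bounded inner call in spawn could look roughly like the sketch below. It is an assumption about shape only: InnerCallResult, ChildLike and runBounded are hypothetical names, and the child process is passed in so the helper does not depend on any particular runtime.

```typescript
// Result fields mirror those quoted in this report (code, ms, timedOut).
interface InnerCallResult {
  code: number;
  ms: number;
  timedOut: boolean;
}

// Minimal view of a child process. Under Deno, the object returned by
// `new Deno.Command(...).spawn()` has this shape.
interface ChildLike {
  status: Promise<{ code: number }>;
  kill(signal: "SIGKILL"): void;
}

// Waits for the child to exit, sending SIGKILL once capMs has elapsed.
async function runBounded(child: ChildLike, capMs: number): Promise<InnerCallResult> {
  const started = Date.now();
  let timedOut = false;
  const timer = setTimeout(() => {
    timedOut = true;
    child.kill("SIGKILL");
  }, capMs);
  try {
    const { code } = await child.status;
    return { code, ms: Date.now() - started, timedOut };
  } finally {
    clearTimeout(timer); // no stray kill after a normal exit
  }
}

// Under Deno, the spawn model might call it like this (sink name is a placeholder):
//   const child = new Deno.Command("swamp", {
//     args: ["model", "method", "run", "<sink>", "run"],
//   }).spawn();
//   const result = await runBounded(child, 10_000);
```

Recording the result of each inner call is what makes the two workflows comparable: the same helper runs in both A and B, and only the enclosing process differs.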

Observed Behavior

A (run process): inner call code=0 ms≈1100 timedOut=false — succeeds in ~1s.

B (resume process):

INF datastore·lock Acquiring lock for "__global__"
INF datastore·lock Global lock held by "user@host" — waiting for structural command to finish
... inner call: code=137 ms=10036 timedOut=true   (SIGKILL at the 10s kill-cap)

The inner cross-model call blocks on __global__ until it is killed. With no kill-cap it hangs until the step's own timeout — in a real recorder-style workload (a step that ships records via synchronous swamp … run subprocess calls) the step timed out at a full 5 minutes.
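The code=137 in the log follows the common convention of reporting a signal-terminated process as 128 plus the signal number, and SIGKILL is signal 9. A small helper (signalFromExitCode is a hypothetical name, not part of swamp) makes the arithmetic explicit:

```typescript
// Maps an exit code to a signal number under the 128 + N convention,
// or returns null when the code does not indicate a signal.
function signalFromExitCode(code: number): number | null {
  return code > 128 && code < 256 ? code - 128 : null;
}

const killedBy = signalFromExitCode(137); // 137 - 128 = 9, i.e. SIGKILL
```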

Expected Behavior

  • workflow resume should not run the resumed step under the global structural-command lock; it should use the same per-model / released-between-steps locking as workflow run.
  • A step must be able to perform datastore operations on other models during its execution without deadlocking on a lock the resume itself holds.
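One possible shape for the expected behavior is to hold the global lock only for the structural bookkeeping and release it before the step body runs. The sketch below is illustrative and is not the merged fix: held, withLock, innerDatastoreOp and resumeScoped are hypothetical names, and an in-memory set stands in for the datastore lock.

```typescript
// Hypothetical in-memory set of held lock names; stands in for the datastore lock.
const held = new Set<string>();

// Holds `name` only for the duration of `fn`, releasing even if `fn` throws.
function withLock<T>(name: string, fn: () => T): T {
  if (held.has(name)) throw new Error(`lock "${name}" already held`);
  held.add(name);
  try {
    return fn();
  } finally {
    held.delete(name);
  }
}

// Stand-in for the step's inner datastore op, which needs the global lock to be free.
function innerDatastoreOp(): "ok" | "blocked" {
  return held.has("__global__") ? "blocked" : "ok";
}

// Structural bookkeeping runs under the global lock; the step itself then
// runs under a per-model lock only, so its inner op is not starved.
function resumeScoped(): "ok" | "blocked" {
  withLock("__global__", () => {
    // e.g. load the suspended run and record that the gate was approved
  });
  return withLock("spawn", () => innerDatastoreOp());
}
```

With the global lock scoped this way, resumeScoped() returns "ok" and no lock remains held afterwards.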

Impact

  • Any workflow with a manual_approval gate (or one that is otherwise resumed) whose post-resume step calls back into swamp (recorders, fan-out, nested workflows) deadlocks on resume.
  • The deadlock is silent from the holder's side until the step's timeout; no LockTimeoutError surfaces.

Notes

  • Distinct from #520: that is the per-model lock under concurrency; this is the global lock and is reproducible single-threaded. swamp's log here explicitly names the global lock and a "structural command".
  • The same step runs fine via workflow run and via direct model method run, so the problem is specific to the resume / structural-command code path.
02 Bog Flow
OPEN · TRIAGED · IN PROGRESS · SHIPPED · + 1 MORE · ASSIGNED · + 2 MORE · REVIEW · + 3 MORE · PR_MERGED · + 1 MORE · CONTRIBUTOR_NOTIFIED

Shipped

6/21/2026, 8:24:45 PM

03 Sludge Pulse
stack72 assigned stack72 · 6/21/2026, 7:29:57 PM

stack72 commented 6/21/2026, 8:24:56 PM

Thanks @vcjdeboer for reporting this! The fix has been merged and a release is on its way. We appreciate your contribution to swamp.