01 Issue
Feature · Shipped · Swamp CLI
Assignees: stack72

#843 Make serve WebSocket idle/keepalive timeout configurable (untunable default aborts runs when serve's loop briefly blocks)

Opened by swamp_lord · 6/26/2026 · Shipped 6/27/2026

Problem

swamp serve closes a connection's WebSocket when Deno's WS keepalive gets no pong in time, and there is no way to tune the ping interval or pong timeout — no serve flag, no workflow run flag, no SWAMP_* env (workflow run --timeout is only a run cancellation deadline). serve calls Deno.upgradeWebSocket(req) with NO options (src/cli/commands/serve.ts:898/902), so the liveness behavior is Deno's hardcoded default.

The failure mode: whenever serve's single JS event loop is blocked longer than that timeout, it can't process incoming pong frames on ANY connection, so Deno's keepalive fires on all of them at the same instant and serve closes them:

serve·connection  WebSocket error: "No response from ping frame."   (×N — every live connection, same millisecond)

CORRECTION: an earlier version of this report attributed the drop to client-side CPU starvation / worker cold-start spikes. That is wrong. After tracing: the peers (workers + client) were healthy and ponging; serve simply could not read them while its own loop was blocked. The host was a 12-core box, orchestrator uncapped, load ~5.5 — not CPU-bound.

What blocked the loop in our case (~58s): a forEach fans out to 7 model instances that all share ONE extension type, and serve rebuilds that type's worker ship-bundle once per instance (bundleSourceFactory is memoized per model-definition, not per type) via a deno bundle subprocess, ~8.4s each, run serially → ~58s (the per-instance "Found model …" events land exactly 8.4s apart). That redundant-bundling inefficiency is a separate concern; the point for THIS issue is that any multi-second serve-loop block silently kills every live connection because the keepalive is un-tunable.

Two things make it worse than a cosmetic disconnect:

  1. The run is coupled to the connection — when the sockets drop, the in-flight placed methods fail (Cannot push to a closed AsyncQueue) and the run aborts mid-flight.
  2. Downstream teardown is skipped, so provisioned resources leak — 7 worker containers orphaned, reaped by hand.

Proposed Solution

Make the WebSocket liveness timeout explicit and configurable (defaults preserved):

  • The minimal fix: pass an explicit idleTimeout to the two Deno.upgradeWebSocket(req) calls (serve.ts:898/902) instead of inheriting Deno's default, and make it configurable:
    • swamp serve --ws-idle-timeout <duration> (env SWAMP_WS_IDLE_TIMEOUT).
  • Optionally split into --ws-ping-interval / --ws-pong-timeout if finer control is wanted.
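
A minimal sketch of the plumbing this proposal implies, assuming the flag wins over the env var and that leaving both unset keeps the runtime default. The helper names (parseDurationSeconds, resolveIdleTimeout, upgradeOptions) and the accepted duration syntax are hypothetical; the issue only specifies "<duration>". Deno documents idleTimeout in seconds, so the sketch rounds up to whole seconds.

```typescript
// Hypothetical unit table; the duration syntax is an assumption, not swamp's actual parser.
const UNIT_MS: Record<string, number> = { ms: 1, s: 1_000, m: 60_000, h: 3_600_000 };

// Parse "90s", "2m", "1500ms" or a bare number of seconds into whole seconds (rounded up).
function parseDurationSeconds(input: string): number {
  const match = /^(\d+(?:\.\d+)?)(ms|s|m|h)?$/.exec(input.trim());
  if (!match) throw new Error(`invalid duration: ${input}`);
  const factor = UNIT_MS[match[2] ?? "s"];
  if (factor === undefined) throw new Error(`invalid duration unit: ${input}`);
  return Math.ceil((Number(match[1]) * factor) / 1000);
}

// The --ws-idle-timeout value takes precedence over SWAMP_WS_IDLE_TIMEOUT.
// Returns undefined when neither is set, so the existing default is preserved.
function resolveIdleTimeout(flag?: string, env?: string): number | undefined {
  const raw = flag ?? env;
  return raw === undefined ? undefined : parseDurationSeconds(raw);
}

// Build the options object for the upgrade call. In serve this would be used
// roughly as: Deno.upgradeWebSocket(req, upgradeOptions(seconds))
function upgradeOptions(idleTimeout?: number): { idleTimeout?: number } {
  return idleTimeout === undefined ? {} : { idleTimeout };
}
```

Returning an empty options object when nothing is configured keeps today's behavior unchanged, which matches the "defaults preserved" requirement above.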

Raising the timeout lets a transiently-blocked serve loop recover without dropping healthy connections, and lets operators confirm the behavior (set it generous, re-run, see the long fan-out survive).

NOTE: a generous timeout is a BACKSTOP, not the real fix for our case. The real fix is to stop blocking the loop — cache the ship-bundle per type so a same-type fan-out builds once instead of N times, and/or build off the heartbeat path. But a configurable timeout is valuable independently: serve should not abort a run because its own loop was briefly busy.
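
The per-type caching mentioned in the note could look roughly like the sketch below. It is illustrative only: createPerTypeBundleCache, the Bundle shape, and the type name are hypothetical, and the stand-in builder replaces the real deno bundle subprocess. The idea is to memoize the in-flight promise by extension type, so concurrent callers of the same type share one build.

```typescript
// Hypothetical bundle shape; serve's real ship-bundle type is not shown in this issue.
type Bundle = { typeName: string; code: string };

// Memoize builds by extension type name rather than by model definition.
// Caching the promise (not the result) lets callers that arrive mid-build share it.
function createPerTypeBundleCache(build: (typeName: string) => Promise<Bundle>) {
  const inFlight = new Map<string, Promise<Bundle>>();
  return (typeName: string): Promise<Bundle> => {
    let pending = inFlight.get(typeName);
    if (pending === undefined) {
      pending = build(typeName);
      inFlight.set(typeName, pending);
      // Evict failed builds so a later call can retry instead of reusing the rejection.
      pending.catch(() => inFlight.delete(typeName));
    }
    return pending;
  };
}

// Usage: 7 instances sharing one extension type trigger a single build.
let buildCount = 0;
const getBundle = createPerTypeBundleCache(async (typeName) => {
  buildCount += 1; // stands in for one `deno bundle` subprocess
  return { typeName, code: `/* bundle for ${typeName} */` };
});
const instances = Array.from({ length: 7 }, () => "shared-extension-type");
const bundles = instances.map((typeName) => getBundle(typeName));
```

Under these assumptions the 7-instance fan-out pays for one build rather than seven; whether that single build still blocks the loop depends on how the subprocess is awaited, which this sketch does not address.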

Affected Components

  • serve WebSocket setup — pass an explicit, configurable idleTimeout to Deno.upgradeWebSocket (serve.ts:898/902) instead of relying on the runtime default.
  • swamp serve flag + env plumbing.
  • Docs — list the new flag/env alongside the existing serve options.

Why It Matters

serve's event loop will occasionally be busy for several seconds (bundling, large dispatch prep, GC). With an un-tunable, runtime-default keepalive, any such pause silently drops every connected worker AND the client at once, fails the in-flight methods, and leaks whatever the run provisioned. Long-running self-hosted fan-out workflows — exactly what serve + remote execution exist for — are the most exposed. A configurable timeout makes the system resilient to its own brief stalls.

  • The tactical complement to detached / resumable runs (Lab #519): #519 removes the run↔connection coupling so a drop can't abort a run; this makes the drop far less likely in the first place.
02 Bog Flow
OPEN · TRIAGED · IN PROGRESS · SHIPPED · + 1 MORE · ASSIGNED · + 2 MORE · REVIEW · + 3 MORE · PR_MERGED · + 1 MORE · CONTRIBUTOR_NOTIFIED

Shipped

6/27/2026, 12:28:39 AM

03 Sludge Pulse
stack72 assigned stack72 · 6/26/2026, 11:29:55 PM

stack72 commented 6/27/2026, 12:29:24 AM

Thanks @swamp_lord for reporting this! The fix has been merged and a release is on its way. We appreciate your contribution to swamp.