01 Issue
Bug · Open · Swamp Club
Assignees: None

#828 Telemetry recoverOrphaned startup race with multiple replicas (created_at-based)

Opened by keeb · 6/26/2026

Follow-up surfaced during #820 (PR #749). Pre-existing, independent of the pending-claim race.

Summary

recoverOrphaned in services/telemetry/lib/watcher.ts runs once at watcher startup and resets stuck processing docs back to pending using a created_at-based threshold:

updateMany({ status: "processing", created_at: { $lt: now - 5min } }, { $set: { status: "pending" } })

With 2+ replicas, when replica B starts up (deploy/restart) while replica A is actively processing a batch, B's recoverOrphaned can flip A's in-flight processing docs back to pending if their created_at is older than the threshold (e.g. during a backlog). B then re-claims and re-dispatches docs that A is still working on, causing a transient double-process at startup.

This is still covered by idempotent consumers (no corruption) and only happens at startup, so severity is low. But created_at is the wrong clock: it is the event's creation time, not its claim time.
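The race can be illustrated with a small in-memory sketch. Nothing here is the real watcher code: `TelemetryDoc`, `recoverByCreatedAt`, the millisecond timestamps, and the sample docs are all assumptions made for illustration; only the field names, the statuses, and the 5-minute threshold come from this issue.

```typescript
// Hypothetical in-memory stand-in for the telemetry collection. Field names
// follow the issue; the types and values are assumed for illustration.
interface TelemetryDoc {
  id: string;
  status: "pending" | "processing";
  created_at: number; // event creation time, ms since epoch (assumed type)
  claimed_at?: number; // claim time, ms since epoch (field added in #820)
}

const FIVE_MIN_MS = 5 * 60 * 1000;

// Mimics the created_at-based recovery: every processing doc whose event is
// older than the threshold is reset, no matter how recently it was claimed.
function recoverByCreatedAt(docs: TelemetryDoc[], nowMs: number): string[] {
  const resetIds: string[] = [];
  for (const doc of docs) {
    if (doc.status === "processing" && doc.created_at < nowMs - FIVE_MIN_MS) {
      doc.status = "pending";
      resetIds.push(doc.id);
    }
  }
  return resetIds;
}

const bStartupMs = Date.UTC(2026, 5, 26, 3, 20, 50);
const inFlightDocs: TelemetryDoc[] = [
  // Backlogged event: created 30 min ago, claimed by replica A 2 s ago.
  {
    id: "backlog",
    status: "processing",
    created_at: bStartupMs - 30 * 60 * 1000,
    claimed_at: bStartupMs - 2000,
  },
  // Fresh event: created 10 s ago, claimed by replica A 1 s ago.
  {
    id: "fresh",
    status: "processing",
    created_at: bStartupMs - 10000,
    claimed_at: bStartupMs - 1000,
  },
];

// Replica B starts and runs recovery while A is still working on both docs.
const wronglyReset = recoverByCreatedAt(inFlightDocs, bStartupMs);
console.log(wronglyReset); // only "backlog" is reset, although A claimed it 2 s ago
```

Both docs were claimed seconds ago, yet the backlogged one is reset purely because its event is old, which is the double-process described above.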

Fix

#820 added a claimed_at field (set alongside claim_token on claim) precisely to enable this. Switch recoverOrphaned to a claim-age threshold:

{ status: "processing", $or: [ { claimed_at: { $lt: threshold } }, { claimed_at: { $exists: false } } ] }

The $exists:false branch keeps recovering any legacy/edge processing doc that predates claimed_at (so nothing is stranded after deploy). A doc that A just claimed has a recent claimed_at, so B's startup recovery no longer yanks it.
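A minimal sketch of how the claim-age filter might be built and checked without a database follows. `ClaimedDoc`, `ClaimAgeFilter`, `buildClaimAgeFilter`, and `matchesClaimAgeFilter` are hypothetical names; storing claimed_at as a `Date` and reusing a 5-minute window for claim age are assumptions, since the issue only says "threshold".

```typescript
// Assumed document shape; only status and claimed_at come from the issue.
interface ClaimedDoc {
  id: string;
  status: "pending" | "processing";
  claimed_at?: Date; // absent on docs claimed before #820 (assumed Date type)
}

// Shape of the claim-age filter from the issue, typed narrowly for this sketch.
interface ClaimAgeFilter {
  status: "processing";
  $or: [
    { claimed_at: { $lt: Date } },
    { claimed_at: { $exists: false } },
  ];
}

// Hypothetical helper: builds the filter for a given stale-claim window.
function buildClaimAgeFilter(nowMs: number, maxClaimAgeMs: number): ClaimAgeFilter {
  const threshold = new Date(nowMs - maxClaimAgeMs);
  return {
    status: "processing",
    $or: [
      { claimed_at: { $lt: threshold } },
      { claimed_at: { $exists: false } },
    ],
  };
}

// Evaluates that one filter shape in memory. Not a general query matcher.
function matchesClaimAgeFilter(doc: ClaimedDoc, filter: ClaimAgeFilter): boolean {
  if (doc.status !== filter.status) return false;
  if (doc.claimed_at === undefined) return true; // $exists: false branch
  return doc.claimed_at.getTime() < filter.$or[0].claimed_at.$lt.getTime();
}

const startupMs = Date.UTC(2026, 5, 26, 3, 20, 50);
const claimAgeFilter = buildClaimAgeFilter(startupMs, 5 * 60 * 1000);

const candidates: ClaimedDoc[] = [
  // Claimed by replica A 2 s before B started: must be left alone.
  { id: "in-flight", status: "processing", claimed_at: new Date(startupMs - 2000) },
  // Claimed 10 min ago and never finished: a genuine orphan.
  { id: "orphaned", status: "processing", claimed_at: new Date(startupMs - 10 * 60 * 1000) },
  // Legacy processing doc with no claimed_at at all.
  { id: "legacy", status: "processing" },
  // Not in processing, so recovery never touches it.
  { id: "waiting", status: "pending" },
];

const recoveredIds = candidates
  .filter((doc) => matchesClaimAgeFilter(doc, claimAgeFilter))
  .map((doc) => doc.id);
console.log(recoveredIds); // "orphaned" and "legacy"; "in-flight" is untouched
```

In the real watcher the same filter object would presumably be passed to updateMany with the existing `{ $set: { status: "pending" } }` update; the in-memory matcher is only there so the three cases can be checked in isolation.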

Notes

  • Update the recoverOrphaned tests in watcher_test.ts / watcher_recovery_test.ts to drive the new claimed_at threshold.
  • Depends on #820 (claimed_at field) being merged.
02 Bog Flow
OPEN · TRIAGED · IN PROGRESS · SHIPPED

Open

6/26/2026, 3:20:50 AM

No activity in this phase yet.
