01 Issue
Bug · Open · Swamp Club
Assignees: None

#827 Telemetry retry/failed path has the same non-atomic claim as #820

Opened by keeb · 6/26/2026

Follow-up to #820 (PR #749), which fixed the pending path only.

Summary

retryFailed in services/telemetry/lib/watcher.ts has the same non-atomic read-then-update structure that #820 fixed for the pending path. It runs find({status:"failed", attempts:{$lt:max}}).sort().limit(), and dispatchBatch then issues a separate updateMany({_id:$in},{status:"processing"}). Between those two calls a second replica can find the same failed docs, so both replicas re-dispatch the same retry batch through the consumer fan-out.

The impact is lower than on the pending path: retry runs every RETRY_INTERVAL_MS (default 60s) over a usually small failed set, and consumers are idempotent, so there is no data corruption. It is still the same race, and it still wastes work on duplicate dispatches.
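The interleaving described in this summary can be reproduced without a database. The sketch below is illustrative only: the in-memory store, findFailed, markProcessing and MAX_ATTEMPTS are hypothetical stand-ins for the collection and the find/updateMany calls, not code from watcher.ts.

```typescript
type Status = "pending" | "processing" | "failed";

interface EventDoc {
  _id: number;
  status: Status;
  attempts: number;
}

// Hypothetical in-memory stand-in for the telemetry collection.
const store: EventDoc[] = [
  { _id: 1, status: "failed", attempts: 1 },
  { _id: 2, status: "failed", attempts: 2 },
];
const MAX_ATTEMPTS = 5; // assumed value, not taken from the service config

// Read step: failed docs still under the attempt limit (returns copies).
function findFailed(): EventDoc[] {
  return store
    .filter((d) => d.status === "failed" && d.attempts < MAX_ATTEMPTS)
    .map((d) => ({ ...d }));
}

// Update step: matches on _id only, so it cannot tell whether another
// replica has already moved the doc to "processing".
function markProcessing(ids: number[]): void {
  for (const d of store) {
    if (ids.indexOf(d._id) !== -1) d.status = "processing";
  }
}

// Replica B reads before replica A's update lands.
const batchA = findFailed().map((d) => d._id);
const batchB = findFailed().map((d) => d._id);
markProcessing(batchA);
markProcessing(batchB);

// Ids that both replicas would go on to dispatch.
const duplicated = batchA.filter((id) => batchB.indexOf(id) !== -1);
console.log(duplicated);
```

Both ids end up in both batches, because the update step has no way to reject a doc that another replica has already taken.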

Fix

Reuse the claimBatch helper added in #820. Widen its fromStatus parameter to "pending" | "failed" and claim in retryFailed before dispatchBatch:

const claimed = await claimBatch(collection, failedDocs, "failed");
if (claimed.length === 0) return;
const targets = pendingConsumersForBatch(claimed, registry.names);
...

The CAS guard (status:"failed" in the claim filter) means a doc already claimed by one replica no longer matches for another, so each failed doc is retried by exactly one replica.
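As a rough illustration of the guard, here is a minimal sketch of a claim helper with the widened fromStatus type. It is not the claimBatch from #820, whose implementation is not shown in this issue. MemoryCollection and compareAndSetStatus are hypothetical; against MongoDB the equivalent step would presumably be an update filtered on both _id and status, with the claim counted only when a document was actually modified.

```typescript
type Status = "pending" | "processing" | "failed";
type ClaimableStatus = "pending" | "failed";

interface EventDoc {
  _id: number;
  status: Status;
  attempts: number;
}

// Hypothetical in-memory collection exposing a single compare-and-set.
class MemoryCollection {
  private docs: EventDoc[];

  constructor(docs: EventDoc[]) {
    this.docs = docs;
  }

  // Flips status to "processing" only if the doc still has `fromStatus`.
  // Resolves to true when this caller won the claim.
  async compareAndSetStatus(id: number, fromStatus: ClaimableStatus): Promise<boolean> {
    const doc = this.docs.filter((d) => d._id === id)[0];
    if (!doc || doc.status !== fromStatus) return false;
    doc.status = "processing";
    return true;
  }
}

// Sketch of a claim helper: returns only the docs this caller claimed.
async function claimBatch(
  collection: MemoryCollection,
  candidates: EventDoc[],
  fromStatus: ClaimableStatus,
): Promise<EventDoc[]> {
  const claimed: EventDoc[] = [];
  for (const doc of candidates) {
    if (await collection.compareAndSetStatus(doc._id, fromStatus)) {
      claimed.push({ ...doc, status: "processing" });
    }
  }
  return claimed;
}

// Two replicas that both read the same failed docs before either claims.
async function demo(): Promise<{ a: number[]; b: number[] }> {
  const failed: EventDoc[] = [
    { _id: 1, status: "failed", attempts: 1 },
    { _id: 2, status: "failed", attempts: 2 },
    { _id: 3, status: "failed", attempts: 1 },
  ];
  const collection = new MemoryCollection(failed.map((d) => ({ ...d })));
  const [a, b] = await Promise.all([
    claimBatch(collection, failed, "failed"),
    claimBatch(collection, failed, "failed"),
  ]);
  return { a: a.map((d) => d._id), b: b.map((d) => d._id) };
}
```

However the two claims interleave, every id lands in exactly one of the two result lists, so only the winning replica goes on to dispatch it.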

Notes

  • claimBatch is already exported and unit-tested in watcher_claim_test.ts; add a fromStatus:"failed" case.
  • Small, self-contained change once #820 lands.
02 Bog Flow
OPEN · TRIAGED · IN PROGRESS · SHIPPED

Open

6/26/2026, 3:20:34 AM

No activity in this phase yet.
