Skip to main content
← Back to list
01Issue
BugShippedSwamp Club
Assigneeskeeb

Relationships

#820 Telemetry watcher has no replica coordination: N replicas double-process the same batch (non-atomic find→updateMany claim)

Opened by keeb · 6/25/2026· Shipped 6/26/2026

Summary

The telemetry watcher runs unconditionally in every pod — startWatcher at services/telemetry/main.ts:137 has no leader election, lease, or singleton guard (confirmed: no leader/lease/claim coordination anywhere in services/telemetry/). The deployment now runs 2 replicas.

The batch claim is non-atomic: processPending does collection.find({ status: "pending" }).sort().limit() (services/telemetry/lib/watcher.ts:163) and then, as a separate statement, collection.updateMany({ _id: { $in } }, { $set: { status: "processing" } }) (:233). Between the find and the updateMany, a second replica can find the same pending docs — so both replicas mark the same batch processing and both dispatch it through the full consumer fan-out.

Why it doesn't corrupt data — but is still a bug

Correctness is held up entirely by per-consumer idempotency: insert_id dedup (stats/counter/metrics *_processed_inserts), deterministic queue _id (discord/scoring), upsert (extensions), at-least-once (s3/clickhouse). So no double-counts — but:

  • Wasted duplicate work: both replicas run the entire consumer fan-out for the same events — ~2x Mongo + S3 + ClickHouse load for zero gain.
  • Duplicate archive inserts: S3 + ClickHouse are at-least-once; clickhouse.ts already notes ~3% dup batches and relies on query-time dedup — this amplifies it.
  • Mongo write contention: concurrent updateMany(processing) / deleteMany on the same _ids.
  • Throughput ceiling: at high rate (post-#819) the two watchers race on the head of the FIFO constantly, so effective throughput is well below 2x replicas and a large fraction of work is duplicated. This becomes the next bottleneck the moment the consumers stop being the bottleneck.

Evidence (prod, 2026-06-25)

2 telemetry-api replicas logging batch processing concurrently against the same FIFO — e.g. pod lm8lj batches at 17:22:06 / 17:22:12, pod m6pg2 at 17:22:09 / 17:22:14 (interleaved). No coordination in code.

Proposed (pick one)

  1. Atomic claim — replace find-then-updateMany with a findOneAndUpdate (pending→processing) per doc/batch, or a claimed_by + claimed_at token so a batch is owned by exactly one worker. Keeps horizontal scaling.
  2. Sharded poll — each replica takes a disjoint partition (e.g. hash(_id) % replicaCount), so no two replicas see the same doc. Keeps horizontal scaling.
  3. Leader election / single-writer — a Mongo lease doc with TTL so only one replica runs the watcher; scale reads, not the writer. Simplest, but no write throughput gain from replicas.

Prefer (1) or (2) if the intent of running 2 replicas was throughput.

Environment

prod telemetry-api (DigitalOcean sfo3), 2 replicas, MongoDB Atlas. Related: #817 (throughput scaling), #819 (deferred per-user recompute). This race is the next ceiling once #819 lands.

02Bog Flow
OPENTRIAGEDIN PROGRESSSHIPPED+ 1 MOREASSIGNED+ 5 MOREREVIEW+ 3 MOREPR_MERGED+ 1 MORENOTIFICATION_SKIPPED

Shipped

6/26/2026, 4:07:12 AM

Click a lifecycle step above to view its details.

03Sludge Pulse
keeb assigned keeb6/25/2026, 5:37:49 PM
Editable. Press Enter to edit.

keeb commented 6/26/2026, 3:21:01 AM

Follow-ups for the deliberately out-of-scope races identified while fixing this (both still covered by idempotent consumers today): #827 — retry/failed path has the same non-atomic claim; #828 — recoverOrphaned startup race (created_at-based). Fix is in PR #749 (pending path only).

Sign in to post a ripple.