Skip to main content
← Back to list
01Issue
BugShippedSwamp CLI
Assigneeskeeb

Relationships

#817 Telemetry ingest is consumer-bound: counter & stats dedup via O(N) sequential insertOne, throughput stuck ~20 events/s regardless of BATCH_SIZE

Opened by keeb · 6/25/2026· Shipped 6/25/2026

Summary

The telemetry watcher's throughput is gated by two consumers (stats, counter) that do per-event sequential MongoDB round-trips. As a result ingest tops out around ~20 events/sec and does not scale with BATCH_SIZE or POLL_INTERVAL_MS — tuning those env vars has no effect.

Impact

On 2026-06-25 the watcher stalled (13:12Z) and accumulated a large pending backlog in telemetry_events. To drain faster we set BATCH_SIZE=500 / POLL_INTERVAL_MS=20 on the prod telemetry-api deployment and rolled it. Measured drain after the change: **21 events/sec** — essentially identical to the default (BATCH_SIZE=50/POLL_INTERVAL_MS=2000 ≈ 25/s). One 500-batch turns over in ~25s, so a ~200k backlog is hours, not minutes. Those env settings are the wrong lever.

Root cause

ConsumerRegistry.dispatchTo runs all consumers in parallel (Promise.allSettled, services/telemetry/lib/consumer.ts:53), so a batch's wall-time = the slowest consumer. The two slowest do work that is linear AND sequential in batch size:

  • services/telemetry/lib/consumers/stats.ts:121-137dedup-inserts does await inserts.insertOne(...) in a for loop, one awaited round-trip per event. The code even comments it is "the typical bottleneck" (stats.ts:107-108). Plus per-user sequential updateOne loops in upsertPerDistinctId (:282), updateIdentityMap (:417, two writes/user), and a full-history re-aggregate per affected username every batch (recomputeUsernameMetricsProjection, :477 — reads every device doc and re-sums).
  • services/telemetry/lib/consumers/counter.ts:143-161 — same per-event sequential insertOne dedup loop.

Because the work is N sequential awaits, a 10x batch = ~10x the time = same throughput. POLL_INTERVAL_MS=20 is moot since batch processing (~25s) >> the poll gap.

services/telemetry/lib/consumers/metrics.ts:353 already fixed this exact pattern with a single bulkWrite(ordered:false) for dedup (and :415 for rollups). counter and stats were never converted. (clickhouse and s3 are also fine — one bulk call per batch.)

Cross-cutting: dedup is done three separate times against three collections (processed_inserts, counter_processed_inserts, metrics_processed_inserts).

Proposed fixes (ranked)

  1. Convert counter + stats dedup to bulkWrite(ordered:false) — copy the metrics pattern. N round-trips -> 1. Makes BATCH_SIZE actually buy throughput. Low risk; pattern already in-tree.
  2. Dedup once in the watcher before fan-out instead of 3x per-consumer.
  3. Batch stats' per-user writes (bulkWrite over users, not sequential updateOne).
  4. Stop re-aggregating full per-user history every batch (recomputeUsernameMetricsProjection) — go incremental or move off the hot path.
  5. Take derived bookkeeping (per-user metrics, identity map, username projection, milestones) off the FIFO critical path. The archive (s3/clickhouse) is already one-shot; the slowness is all derived state piled into the synchronous fan-out. This also means one stuck consumer freezes the entire pipeline (as happened in the stall above).
02Bog Flow
OPENTRIAGEDIN PROGRESSSHIPPED+ 1 MOREASSIGNED+ 6 MOREREVIEW+ 3 MOREPR_MERGED+ 1 MORENOTIFICATION_SKIPPED

Shipped

6/25/2026, 4:59:03 PM

Click a lifecycle step above to view its details.

03Sludge Pulse
keeb assigned keeb6/25/2026, 4:01:12 PM

Sign in to post a ripple.