01 Issue
Bug · Shipped · Swamp Club
Assignees: keeb

#819 Telemetry drain still capped ~80-100/s in prod: per-username full-history re-aggregate is O(users) sequential per batch (deferred #817 fix #4)

Opened by keeb · 6/25/2026· Shipped 6/25/2026

Summary

The #817 merge (983e271) fixed the dedup O(N) loops (now bulkWrite) and batched the per-user metric + identity-map writes — both verified in prod. But fix #4 (stop re-aggregating full per-user history every batch) was deferred, and it is now the dominant bottleneck. Live prod drain measures ~80-100 events/sec, against the 1944 events/sec the in-PR throughput harness reported. The harness exercises the dedup path; the deferred per-username recompute gates the real backlog.

Measured (prod, 2026-06-25, after deploy of 983e271)

  • 2 telemetry-api replicas, BATCH_SIZE=500 / POLL_INTERVAL_MS=20.
  • Cursor (oldest pending created_at) advances ~1.5x event-time per wall-second — flipped from ~0.4x (falling behind) pre-fix to catching up, so the fix is a real improvement.
  • Estimated throughput is ~80-100 events/sec, versus 1944/s in the harness: a processing batch of ~600 docs spans ~11s of the dense cli_invocation region (~55 events per event-second), and ~55 x ~1.5x drain gives ~80-100/s.
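
The estimate above is plain arithmetic, reproduced here as a small sketch. The function and parameter names are made up for illustration; the inputs are the measured values from the list (600 docs, 11 event-seconds, 1.5x drain) and the harness figure of 1944/s.

```typescript
// Back-of-envelope drain estimate from the prod measurements.
// All names here are illustrative, not taken from the telemetry service.
function estimateDrainRate(
  docsPerBatch: number,        // ~600 docs in one processing batch
  eventSecondsSpanned: number, // ~11s of event-time covered by that batch
  drainMultiplier: number,     // cursor advance per wall-second (~1.5x)
): number {
  const eventsPerEventSecond = docsPerBatch / eventSecondsSpanned; // ~55
  return eventsPerEventSecond * drainMultiplier;
}

const estimatedRate = estimateDrainRate(600, 11, 1.5);
const harnessRate = 1944;

console.log(estimatedRate.toFixed(1));                 // "81.8"
console.log((harnessRate / estimatedRate).toFixed(1)); // "23.8"
```

On these inputs the harness figure is roughly 24x the low end of the observed prod rate.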

Remaining bottleneck (the deferred fix #4)

services/telemetry/lib/consumers/stats.ts recomputeUsernameMetricsProjection:

for (const username of affected) {
  await identityMap.findOne({ username });               // 1 read/user
  await userMetrics.find({ distinct_id: { $in: ids } });  // reads ALL device docs
  aggregateUserMetrics(readyDocs);                        // full-history re-sum
  await usernameMetrics.updateOne(...);                   // 1 write/user
}

Sequential, O(affected_usernames) x (3 round-trips + full re-aggregate), every batch. Dense backlog batches touch many unique authenticated users, so this dominates and masks the dedup bulkWrite win.
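
To make the scaling concrete, here is a rough cost model of the loop above. It counts only database round-trips and ignores the full-history re-sum, so it is a lower bound. The user count and round-trip time in the example are hypothetical values chosen for illustration, not measurements from prod.

```typescript
// Each affected username pays three sequential round-trips:
// identityMap.findOne, userMetrics.find, usernameMetrics.updateOne.
const ROUND_TRIPS_PER_USER = 3;

function sequentialRoundTrips(affectedUsernames: number): number {
  return affectedUsernames * ROUND_TRIPS_PER_USER;
}

// rttMs is an ASSUMED database round-trip time, not a measured one.
// The result excludes the time spent in aggregateUserMetrics.
function minLoopLatencyMs(affectedUsernames: number, rttMs: number): number {
  return sequentialRoundTrips(affectedUsernames) * rttMs;
}

// Hypothetical batch: 200 unique authenticated users, 2 ms per round-trip.
console.log(sequentialRoundTrips(200)); // 600
console.log(minLoopLatencyMs(200, 2));  // 1200
```

Because the awaits run one after another, this floor grows linearly with the number of unique users in a batch, regardless of how fast the dedup path is.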

What landed vs deferred (from #817)

  • #1 dedup bulkWrite — DONE (counter.ts:157, stats.ts:138)
  • #3 batch per-user writes — DONE for upsertPerDistinctId (stats.ts:371) and updateIdentityMap (stats.ts:472); upsertSignInDates (stats.ts:434) is still a sequential per-user updateOne loop (minor — daily_sign_in only)
  • #4 stop full re-aggregate per batch — NOT done (this issue)
  • #2 unify the 3 separate dedup passes/collections — NOT done
  • #5 move derived bookkeeping off the FIFO critical path — NOT done

Proposed

  • Make the username projection incremental (forward $inc) instead of re-aggregating full history every batch, or batch and debounce it out of the synchronous fan-out (ties into #5).
  • Also batch upsertSignInDates.
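
A minimal sketch of what a forward $inc projection could look like, under stated assumptions: the event shape, the `counts.<event>` field layout, and the function name `buildUsernameIncOps` are all hypothetical and not taken from stats.ts. It also assumes the batch has already been deduplicated, since $inc is not idempotent and a replayed event would be counted twice.

```typescript
// Hypothetical event shape; the real telemetry documents may differ.
interface TelemetryEvent {
  username: string;
  event: string; // e.g. "cli_invocation"; assumed to contain no dots
}

// Shape of one updateOne entry as accepted by a MongoDB bulkWrite call.
interface IncOp {
  updateOne: {
    filter: { username: string };
    update: { $inc: Record<string, number> };
    upsert: boolean;
  };
}

// Fold a batch into one $inc per username instead of re-reading history.
function buildUsernameIncOps(events: TelemetryEvent[]): IncOp[] {
  const deltas = new Map<string, Record<string, number>>();
  for (const e of events) {
    const inc = deltas.get(e.username) ?? {};
    const field = `counts.${e.event}`; // assumed field layout
    inc[field] = (inc[field] ?? 0) + 1;
    deltas.set(e.username, inc);
  }
  return Array.from(deltas.entries()).map(
    ([username, inc]): IncOp => ({
      updateOne: { filter: { username }, update: { $inc: inc }, upsert: true },
    }),
  );
}

const ops = buildUsernameIncOps([
  { username: "a", event: "cli_invocation" },
  { username: "a", event: "cli_invocation" },
  { username: "b", event: "daily_sign_in" },
]);
console.log(JSON.stringify(ops[0].updateOne.update)); // {"$inc":{"counts.cli_invocation":2}}
```

The resulting array could be passed to a single `usernameMetrics.bulkWrite(ops, { ordered: false })`, which would replace the per-user reads and writes with one round-trip per batch. Whether every projected metric can be expressed as a counter delta is something the real aggregateUserMetrics would have to confirm.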

Environment

prod telemetry-api (DigitalOcean sfo3, 2 replicas), MongoDB Atlas (swamp-club.brn8dk.mongodb.net). Follow-up to #817 (merged as 983e271); fix items #2, #4, and #5 from that issue remain open.

02 Bog Flow

Shipped: 6/25/2026, 5:56:51 PM

03 Sludge Pulse
keeb assigned keeb · 6/25/2026, 5:12:29 PM
