01 Issue
Bug · Shipped · Swamp CLI
Assignees: stack72

#225 S3 datastore hydrate has no fallback for buckets without .datastore-index.json

Opened by ynm · 5/4/2026 · Shipped 5/5/2026

Description

When a bucket holds data under standard prefixes (data/, workflow-runs/, outputs/, …) but has no .datastore-index.json at the bucket root, swamp datastore setup extension @swamp/s3-datastore --config '{…}' --skip-migration (the post-Lab-220 hydration path) reports Hydrated: 0 pulled. The local cache stays empty even though the bucket holds megabytes of data the contributor needs.

This is the residual case Lab #220's fix did not cover. We hit it on a real shared bucket whose contents were uploaded by an earlier flow (likely an older s3-datastore version, or via direct aws s3 cp from a CI workflow) that never wrote an index file. From the contributor's point of view the new hydrate step "succeeds" — setup exits 0 and swamp datastore status is healthy — but the cache is empty and follow-up reads (swamp data list, swamp workflow run search, etc.) return nothing.

Steps to reproduce

  1. Have an S3 bucket with workflow-runs/<id>/workflow-run-*.yaml, data/…, etc., but no .datastore-index.json. Verify:
    aws s3api list-objects-v2 --bucket <bucket> \
      --query 'Contents[?starts_with(Key, `.`)].Key'
    If only .datastore.lock shows up, you are in this case.
  2. On a fresh checkout, run:
    swamp datastore setup extension @swamp/s3-datastore \
      --config '{"bucket":"<bucket>","region":"<region>"}' \
      --skip-migration
  3. Output includes Hydrating cache from remote... followed by Hydrated: 0 pulled, even though the bucket contains data.
  4. swamp workflow run search, swamp data list, etc. all return empty.
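
The check in step 1 can also be expressed in code. This is a minimal TypeScript sketch that assumes the bucket's keys have already been listed into an array. `dotKeys` and `isUnindexedBucket` are hypothetical helpers for illustration, not part of swamp or @swamp/s3-datastore.

```typescript
// Hypothetical helpers for illustration; not part of swamp or @swamp/s3-datastore.
const INDEX_KEY = ".datastore-index.json";

// Mirrors the JMESPath filter Contents[?starts_with(Key, `.`)].Key from step 1.
function dotKeys(keys: string[]): string[] {
  return keys.filter((key) => key.charAt(0) === ".");
}

// True when the bucket holds objects but no index at its root,
// which is the situation this issue describes.
function isUnindexedBucket(keys: string[]): boolean {
  return keys.length > 0 && keys.indexOf(INDEX_KEY) === -1;
}
```

For a bucket in this state, `dotKeys` would return only `.datastore.lock`, matching the CLI check in step 1.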

Expected

When .datastore-index.json is absent (HeadObject 404), hydrate should fall back to a ListObjectsV2 walk of the bucket, treat every non-internal object as a new entry, download it, and persist a freshly-built index. After this self-healing pass, subsequent operations behave normally — and the bucket is back on the indexed-sync path for all future writers.
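
One way to recognise the "index absent" case is to classify the error thrown by the HeadObject call. The sketch below assumes the error shape used by the AWS SDK for JavaScript v3 (a `name` of `NotFound` or `NoSuchKey`, or a `$metadata.httpStatusCode` of 404). Whether @swamp/s3-datastore uses that SDK is not stated in this issue, and `isMissingObjectError` is a hypothetical name.

```typescript
// Hypothetical helper; assumes AWS SDK for JavaScript v3 style errors.
interface SdkLikeError {
  name?: string;
  $metadata?: { httpStatusCode?: number };
}

// Returns true only for "object does not exist" errors, so that other
// failures (permissions, network) are not mistaken for a missing index.
function isMissingObjectError(err: unknown): boolean {
  if (typeof err !== "object" || err === null) return false;
  const e = err as SdkLikeError;
  if (e.name === "NotFound" || e.name === "NoSuchKey") return true;
  return e.$metadata !== undefined && e.$metadata.httpStatusCode === 404;
}
```

Restricting the fallback to a real 404 matters: if a 403 or a timeout triggered the bucket walk, a permissions problem could end with a freshly written index that overwrites a valid one.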

Affected components

  • @swamp/s3-datastore — the hydrate code path is purely index-driven (see s3_cache_sync.ts pullChanged() / pullIndex()); there is no "if remote index is missing, list the bucket" fallback.
  • Any swamp install joining a bucket whose contents pre-date the indexed-sync model — including buckets bootstrapped via @swamp/s3-datastore-bootstrap whose first-write workload bypassed the indexed path.

Fix approach (high-level)

Add a fallback inside pullIndex() (or a new discoverUnindexed() step invoked from setup after a 404 on .datastore-index.json):

  • Paginated ListObjectsV2 walk of the bucket.
  • Skip isInternalCacheFile entries.
  • Build an index from the listing (key, size, ETag — multipart ETag fallback already handled by isMultipartETag).
  • PutObject the new .datastore-index.json so subsequent peers see it without each having to repeat the walk.
  • Then proceed with the normal hydrate against the freshly-built index.

This makes hydrate self-healing for any bucket regardless of how its contents got there, and removes the need for a one-time "rebuild your index" command as a prerequisite for adoption.
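
The fallback steps listed above could be sketched roughly as follows. This is an illustration under stated assumptions, not the shipped implementation: the index format, the `BucketIO` interface, `rebuildIndexFromListing`, and `isInternalKey` (a stand-in for isInternalCacheFile) are all hypothetical, because the real types in s3_cache_sync.ts are not shown in this issue. Storage access is injected so that the walk can run against an in-memory fake as well as a real S3 client wrapper.

```typescript
// Hypothetical shapes for illustration; the real index format and the
// signatures in s3_cache_sync.ts are not shown in this issue.
interface ListedObject { key: string; size: number; etag: string }
interface ListPage { objects: ListedObject[]; nextToken?: string }
interface IndexEntry { size: number; etag: string }
type DatastoreIndex = Record<string, IndexEntry>;

// Minimal storage interface: one page of a ListObjectsV2-style listing,
// plus a PutObject-style write.
interface BucketIO {
  listPage(continuationToken?: string): Promise<ListPage>;
  putObject(key: string, body: string): Promise<void>;
}

const INDEX_KEY = ".datastore-index.json";

// Stand-in for isInternalCacheFile, whose real rules are not shown here.
function isInternalKey(key: string): boolean {
  return key === INDEX_KEY || key === ".datastore.lock";
}

async function rebuildIndexFromListing(io: BucketIO): Promise<DatastoreIndex> {
  const index: DatastoreIndex = {};
  let token: string | undefined;
  do {
    // Follow continuation tokens until the listing is exhausted.
    const page: ListPage = await io.listPage(token);
    for (const obj of page.objects) {
      if (isInternalKey(obj.key)) continue; // skip lock and index files
      index[obj.key] = { size: obj.size, etag: obj.etag };
    }
    token = page.nextToken;
  } while (token !== undefined);

  // Persist the rebuilt index so later peers find it and skip the walk.
  await io.putObject(INDEX_KEY, JSON.stringify(index));
  return index;
}
```

The returned index would then feed the normal hydrate path. How this write should interact with the datastore lock when two peers start the walk at the same time is not covered by this sketch.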

Environment

  • swamp version: 20260504.152403.0-sha.60829024
  • @swamp/s3-datastore: 2026.04.28.4 (latest)
  • OS: macOS (Darwin 23.1.0)

Related issues

  • #220 — workflow run search empty after S3 setup; the hydration step added by its fix is the path this issue extends.
  • #213 — TLS panic during fs→S3 migration; forces --skip-migration onto the same setup invocation that hits this bug.
  • #218 — S3 datastore stale-lock loop (separate sync issue).
02 Bog Flow

Shipped: 5/5/2026, 12:01:19 AM


03 Sludge Pulse
stack72 assigned stack72 · 5/4/2026, 5:23:03 PM
