search for standard sites pub-search.waow.tech
search zig blog atproto

docs: update signal collection, remove manual backfill, clean up stack

- tap now signals on site.standard.document (not pub.leaflet.document)
- embeddings are auto-generated by backend worker since a4509a4
- consolidate stack section: Fly hosts Zig backend, Turso with Voyage vectors
- link tap from first mention in "how it works"
- add greengale to known platforms in CLAUDE.md

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

zzstoatzz d3fb6079 8f16fd6f

+5 -26
+1 -2
CLAUDE.md
@@ -17,7 +17,7 @@
 - **db**: Turso (source of truth) + local SQLite read replica (FTS queries)
 
 ## platforms
-- leaflet, pckt, offprint: known platforms (detected via basePath)
+- leaflet, pckt, offprint, greengale: known platforms (detected via basePath)
 - other: site.standard.* documents not from a known platform
 
 ## search ranking
@@ -30,5 +30,4 @@
 - see `docs/tap.md` for memory tuning and debugging
 
 ## common tasks
-- backfill embeddings: `./scripts/backfill-embeddings`
 - check indexing: `curl -s https://leaflet-search-backend.fly.dev/api/dashboard | jq`
+4 -24
README.md
@@ -10,7 +10,7 @@
 
 ## how it works
 
-1. **tap** syncs content from ATProto firehose (signals on `pub.leaflet.document`, filters `pub.leaflet.*` + `site.standard.*`)
+1. **[tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap)** syncs content from ATProto firehose (signals on `site.standard.document`, filters `pub.leaflet.*` + `site.standard.*`)
 2. **backend** indexes content into SQLite FTS5 via [Turso](https://turso.tech), serves search API
 3. **site** static frontend on Cloudflare Pages
 
@@ -60,33 +60,13 @@
 
 ## [stack](https://bsky.app/profile/zzstoatzz.io/post/3mbij5ip4ws2a)
 
-- [Fly.io](https://fly.io) hosts backend + tap
-- [Turso](https://turso.tech) cloud SQLite with vector support
-- [Voyage AI](https://voyageai.com) embeddings (voyage-3-lite)
+- [Fly.io](https://fly.io) hosts [Zig](https://ziglang.org) search API and content indexing
+- [Turso](https://turso.tech) cloud SQLite with [Voyage AI](https://voyageai.com) vector support
 - [tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) syncs content from ATProto firehose
-- [Zig](https://ziglang.org) HTTP server, search API, content indexing
 - [Cloudflare Pages](https://pages.cloudflare.com) static frontend
 
 ## embeddings
 
-documents are embedded using Voyage AI's `voyage-3-lite` model (512 dimensions). new documents from the firehose don't automatically get embeddings - they need to be backfilled periodically.
-
-### backfill embeddings
-
-requires `TURSO_URL`, `TURSO_TOKEN`, and `VOYAGE_API_KEY` in `.env`:
-
-```bash
-# check how many docs need embeddings
-./scripts/backfill-embeddings --dry-run
-
-# run the backfill (uses batching + concurrency)
-./scripts/backfill-embeddings --batch-size 50
-```
-
-the script:
-- fetches docs where `embedding IS NULL`
-- batches them to Voyage API (50 docs/batch default)
-- writes embeddings to Turso in batched transactions
-- runs 8 concurrent workers
+documents are embedded using Voyage AI's `voyage-3-lite` model (512 dimensions). the backend automatically generates embeddings for new documents via a background worker - no manual backfill needed.
 
 **note:** we use brute-force cosine similarity instead of a vector index. Turso's DiskANN index has ~60s write latency per row, making it impractical for incremental updates. brute-force on 3500 vectors runs in ~0.15s which is fine for this scale.
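
The note retained at the end of the README diff mentions brute-force cosine similarity in place of a vector index. As a rough illustration of that idea only, here is a minimal Python sketch. The actual backend is written in Zig and its code is not shown in this commit, so the function names, the in-memory list of `(doc_id, vector)` pairs, and the `k` parameter are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def brute_force_search(query, docs, k=3):
    """Score every document against the query and return the top-k, best first.

    docs is a list of (doc_id, vector) pairs; this layout is an assumption
    made for the sketch, not the project's storage format.
    """
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in docs]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Every query scans all stored vectors, so the cost grows linearly with the corpus. The trade-off is that a newly embedded document is searchable as soon as its vector is written, with no index to rebuild.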