search for standard sites pub-search.waow.tech
search zig blog atproto

docs: add docs/README.md overview, fix root README

accessible explainer of how the search engine works — covers keyword
(FTS5), semantic (voyage + turbopuffer), hybrid (RRF), content
extraction challenges, and what's custom vs off-the-shelf.

also drops the stale ~25k document count and fixes the v2 format fields
(offset → hasMore) in root README.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

zzstoatzz 400584ba 3cac527d

+80 -2
+2 -2
README.md
···
36 36 GET /health
37 37 ```
38 38
39 - search returns three entity types: `article` (document in a publication), `looseleaf` (standalone document), `publication` (newsletter itself). each result includes a `platform` field (leaflet, pckt, offprint, greengale, whitewind, or other). use `format=v2` for a wrapped response with `total`, `offset`, and `results` fields.
39 + search returns three entity types: `article` (document in a publication), `looseleaf` (standalone document), `publication` (newsletter itself). each result includes a `platform` field (leaflet, pckt, offprint, greengale, whitewind, or other). use `format=v2` for a wrapped response with `total`, `hasMore`, and `results` fields.
40 40
41 41 **modes**: `keyword` (default) uses FTS5 with BM25 + recency scoring. `semantic` uses voyage embeddings + [turbopuffer](https://turbopuffer.com) ANN. `hybrid` merges both via reciprocal rank fusion.
42 42
···
72 72
73 73 ## embeddings
74 74
75 - documents are embedded using Voyage AI's `voyage-4-lite` model (1024 dimensions). the backend automatically generates embeddings for new documents via a background worker — no manual backfill needed. similarity search uses turbopuffer's ANN index for fast nearest-neighbor queries across ~25k documents.
75 + documents are embedded using Voyage AI's `voyage-4-lite` model (1024 dimensions). the backend automatically generates embeddings for new documents via a background worker — no manual backfill needed. similarity search uses turbopuffer's ANN index for fast nearest-neighbor queries.
+78
docs/README.md
···
1 + # how pub search works
2 +
3 + a search engine for content published on the [AT Protocol](https://atproto.com) — the open network behind [Bluesky](https://bsky.app). it indexes posts from publishing platforms like [leaflet](https://leaflet.pub), [pckt](https://pckt.blog), [offprint](https://offprint.app), [greengale](https://greengale.app), and [whitewind](https://whtwnd.com), all of which use the [standard.site](https://standard.site) schema.
4 +
5 + **live at [pub-search.waow.tech](https://pub-search.waow.tech)**
6 +
7 + ## the big picture
8 +
9 + ```
10 + ATProto firehose (every post, everywhere)
11 + ↓ filtered by collection
12 + tap (firehose consumer)
13 + ↓ documents + publications
14 + backend (zig)
15 + ├── turso (cloud sqlite — source of truth)
16 + ├── local sqlite replica (fast keyword search via FTS5)
17 + ├── voyage AI embeddings → turbopuffer (semantic search)
18 + └── HTTP API
19 +
20 + static frontend (cloudflare pages)
21 + ```
22 +
23 + content flows in one direction: the firehose broadcasts every AT Protocol event in real-time, [tap](https://github.com/bluesky-social/indigo/tree/main/cmd/tap) filters for publishing-related records, and the backend indexes them.
24 +
25 + ## how searching works
26 +
27 + there are three search modes, each using different technology:
28 +
29 + ### keyword search
30 +
31 + uses [SQLite FTS5](https://www.sqlite.org/fts5.html) — a built-in full-text search engine. when a document is indexed, FTS5 builds an inverted index (a map from every word to every document containing it). queries use [BM25](https://en.wikipedia.org/wiki/Okapi_BM25) ranking — a standard relevance scoring algorithm that considers term frequency and document length. recent documents get a small boost.
32 +
33 + this is not something custom — FTS5 is a well-established tool built into SQLite. the custom part is building the index (deciding what to index, how to tokenize, how to rank) and the query syntax (OR between terms for recall, prefix matching on the last word for a type-ahead feel).
34 +
35 + keyword search runs against a **local SQLite replica** on the same machine as the backend, not over the network to the database. this keeps latency around ~9ms.
36 +
37 + ### semantic search
38 +
39 + uses [Voyage AI](https://voyageai.com) embeddings (voyage-4-lite, 1024 dimensions) to convert text into vectors — arrays of numbers that capture meaning. similar texts produce similar vectors, even if they don't share any words.
40 +
41 + these vectors are stored in [turbopuffer](https://turbopuffer.com), a vector database optimized for approximate nearest-neighbor (ANN) search. when you search semantically, your query is embedded into a vector, and turbopuffer finds the documents whose vectors are closest.
42 +
43 + this is how a search for `"loosely about cooking"` can find a post titled `"my grandmother's kitchen"` — keyword search would miss it entirely because the words don't overlap, but the meaning is close.
44 +
45 + ### hybrid search
46 +
47 + runs both keyword and semantic in parallel, then merges results using [reciprocal rank fusion](https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf) (RRF, k=60). documents found by both methods rank highest. each result is annotated with its source: `"keyword"`, `"semantic"`, or `"keyword+semantic"`.
48 +
49 + ## the content extraction problem
50 +
51 + every platform on standard.site stores document content differently. this is the most fiddly part of the system.
52 +
53 + - **pckt, offprint, greengale** provide a `textContent` field with pre-flattened plaintext — easy
54 + - **leaflet** omits `textContent` to save record size. content lives nested inside `content.pages[].blocks[].block.plaintext` — requires block-by-block extraction
55 + - **whitewind** stores markdown directly in a `content` string field
56 +
57 + the backend handles all of this in the [content extraction](content-extraction.md) layer, producing a uniform plaintext blob for indexing regardless of source platform.
58 +
59 + ## what's custom vs off-the-shelf
60 +
61 + | component | off-the-shelf | custom |
62 + |-----------|---------------|--------|
63 + | full-text matching | SQLite FTS5 (BM25 ranking, inverted index) | query construction, tokenization rules, recency scoring |
64 + | vector similarity | Voyage AI (embeddings), turbopuffer (ANN search) | hybrid fusion, result merging, snippet extraction |
65 + | firehose sync | tap (from bluesky-social/indigo) | content extraction per platform, deduplication |
66 + | data storage | Turso (cloud SQLite), local SQLite replica | schema design, sync logic, migration handling |
67 + | frontend | Cloudflare Pages (hosting) | the entire UI and search experience |
68 +
69 + the tools are popular and well-established. the assembly — wiring the firehose to content extraction to multi-modal search across heterogeneous publishing platforms — is very custom.
70 +
71 + ## further reading
72 +
73 + - [search-architecture.md](search-architecture.md) — FTS5 details, scaling considerations, future options
74 + - [content-extraction.md](content-extraction.md) — how content is extracted from each platform
75 + - [api.md](api.md) — API endpoint reference
76 + - [tap.md](tap.md) — firehose consumer setup, debugging, memory tuning
77 + - [turso-hrana.md](turso-hrana.md) — Turso's HTTP protocol for database queries
78 + - [performance-saga.md](performance-saga.md) — a debugging story about latency spikes
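
note on the hybrid mode described in docs/README.md: reciprocal rank fusion scores each document as the sum of `1 / (k + rank)` over every ranked list it appears in, with k=60 per the doc. the snippet below is a minimal Python sketch of that formula, not the backend's actual zig code. the function name, the input shape (two lists of document ids, best match first, no duplicates within a list), and the tie-breaking order are assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(keyword_ids, semantic_ids, k=60):
    """Merge two ranked id lists with RRF (hypothetical helper, for illustration).

    Each list is ordered best-first. A document's fused score is the sum of
    1 / (k + rank) over the lists that contain it, with rank starting at 1.
    Returns (doc_id, score, source) tuples sorted by descending score.
    """
    scores = defaultdict(float)
    sources = defaultdict(list)

    for name, ranked in (("keyword", keyword_ids), ("semantic", semantic_ids)):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
            sources[doc_id].append(name)

    # Documents present in both lists accumulate two terms and rise to the top.
    merged = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
    return [(doc_id, scores[doc_id], "+".join(sources[doc_id])) for doc_id in merged]


if __name__ == "__main__":
    for doc_id, score, source in reciprocal_rank_fusion(["a", "b", "c"], ["b", "d"]):
        print(doc_id, round(score, 5), source)
```

with these inputs, `b` ranks first and is labeled `keyword+semantic` because it appears in both lists, followed by `a`, `d`, and `c`. this matches the doc's claim that documents found by both methods rank highest.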