search for standard sites pub-search.waow.tech
docs: add constellation and bridgy fed documentation

- README: add constellation section with link and brief description
- docs/constellation.md: data pipeline, frontend, recomputation, future work
- docs/bridgy-fed.md: history of two failed attempts, why we exclude,
detection method, scripts, and what we'd need to reconsider

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

+98
+8
README.md
···

 the backend indexes multiple ATProto platforms - currently `pub.leaflet.*` and `site.standard.*` collections. platform is stored per-document and returned in search results.

+## constellation
+
+a 2D semantic map of the entire document index: [pub-search.waow.tech/constellation](https://pub-search.waow.tech/constellation)
+
+documents are projected from 1024-dim voyage embeddings to 2D via PCA → UMAP, then clustered with HDBSCAN at two granularities. each point is colored by platform. zoom in to see finer cluster labels and individual document titles.
+
+built with `scripts/build-constellation` (batch job, ~20s) → `site/constellation.json` → canvas renderer. see [docs/constellation.md](docs/constellation.md) for details.
+
 ## [stack](https://bsky.app/profile/zzstoatzz.io/post/3mbij5ip4ws2a)

 - [Fly.io](https://fly.io) hosts [Zig](https://ziglang.org) search API and content indexing
+36
docs/bridgy-fed.md
···
+# bridgy fed
+
+[Bridgy Fed](https://fed.brid.gy) bridges content from the fediverse (Mastodon, etc.) into ATProto. these documents show up with `platform='other'` because they use `site.standard.*` collections but aren't hosted on any known publishing platform.
+
+## why we exclude it
+
+we've tried including bridgy fed content twice. both times it caused problems:
+
+**attempt 1 (early 2026):** bridgy fed content flooded the index — tens of thousands of short fediverse posts mixed in with long-form articles. search results became polluted with content that wasn't meaningfully "published" in the way leaflet/whitewind/etc. content is. we added `is_bridgyfed` column to turso and marked all bridgy fed documents, then excluded them from search results.
+
+**attempt 2 (later):** even with search exclusion, the vectors remained in turbopuffer and polluted semantic search and the constellation visualization. had to run `scripts/purge-bridgyfed-vectors` to clean up ~26k orphan vectors.
+
+**current state:** bridgy fed content is now **dropped at ingest** in the backend. the tap still receives it (can't filter at the firehose level), but the backend's ingest pipeline checks the PDS endpoint and silently drops any DID hosted on `brid.gy`. this is the cleanest solution — no storage, no cleanup needed.
+
+## detection
+
+a DID is bridgy fed if its PDS endpoint (via `plc.directory` resolution) contains `brid.gy`. the scripts resolve this by:
+
+1. query turso for distinct DIDs with `platform='other'`
+2. resolve each DID's PDS via `https://plc.directory/{did}`
+3. check if `service[type=AtprotoPersonalDataServer].serviceEndpoint` contains `brid.gy`
+
+## scripts
+
+- `scripts/mark-bridgyfed` — marks existing bridgy fed rows in turso (`is_bridgyfed = 1`). dry run by default, `--apply` to update.
+- `scripts/purge-bridgyfed-vectors` — deletes bridgy fed vectors from turbopuffer. loops until all are removed (tpuf caps queries at 10k). dry run by default, `--apply` to delete.
+
+both scripts use pydantic-settings with `.env` file (dotenv takes priority over environment variables).
+
+## if we ever reconsider
+
+the fundamental issue is that bridgy fed content is qualitatively different from native ATProto publishing — it's short social media posts, not articles/essays. if bridgy fed ever supports long-form content or we want to include fediverse posts, we'd need:
+
+1. content-length filtering (min word count or similar)
+2. a separate `platform='fediverse'` designation so users can filter
+3. careful testing of search result quality before and after
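
As a rough illustration of the detection steps in `docs/bridgy-fed.md` (resolve the DID document, find the `AtprotoPersonalDataServer` service entry, check its endpoint for `brid.gy`), here is a minimal standard-library sketch. It is not the code from `scripts/mark-bridgyfed`; the function names `fetch_did_doc`, `pds_endpoint`, and `is_bridgy_fed` are hypothetical, and the turso query from step 1 is left out.

```python
import json
from typing import Any, Optional
from urllib.request import urlopen

BRIDGY_MARKER = "brid.gy"  # substring the doc says to look for
PDS_SERVICE_TYPE = "AtprotoPersonalDataServer"


def fetch_did_doc(did: str, timeout: float = 10.0) -> dict[str, Any]:
    """Fetch a DID document from plc.directory (network call, did:plc only)."""
    with urlopen(f"https://plc.directory/{did}", timeout=timeout) as resp:
        return json.load(resp)


def pds_endpoint(did_doc: dict[str, Any]) -> Optional[str]:
    """Return the serviceEndpoint of the first PDS service entry, if present."""
    for service in did_doc.get("service", []):
        if service.get("type") == PDS_SERVICE_TYPE:
            endpoint = service.get("serviceEndpoint")
            if isinstance(endpoint, str):
                return endpoint
    return None


def is_bridgy_fed(did_doc: dict[str, Any]) -> bool:
    """True if the DID's PDS endpoint contains the brid.gy marker."""
    endpoint = pds_endpoint(did_doc)
    return endpoint is not None and BRIDGY_MARKER in endpoint
```

Keeping the check as a pure function over an already-fetched DID document means it can be exercised without network access, for example `is_bridgy_fed(fetch_did_doc(did))` in a script and a hand-written dict in a test.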
+54
docs/constellation.md
···
+# constellation
+
+2D semantic map of the document index. each document is a point on a canvas, positioned by semantic similarity and colored by platform.
+
+**live:** [pub-search.waow.tech/constellation](https://pub-search.waow.tech/constellation)
+
+## data pipeline
+
+`scripts/build-constellation` is a batch python script (uv inline dependencies) that:
+
+1. **exports vectors** from turbopuffer — paginated query with `rank_by: ["id", "asc"]`, fetches all ~12k vectors + metadata
+2. **PCA 1024 → 50** — denoising pass, typically captures ~60% variance
+3. **UMAP 50 → 2** — cosine metric, `n_neighbors=15`, `min_dist=0.1`, `random_state=42`
+4. **HDBSCAN** at two granularities on the 2D coordinates:
+   - coarse: `min_cluster_size=100` (~30 clusters, zoomed-out labels)
+   - fine: `min_cluster_size=20` (~160 clusters, zoomed-in labels)
+   - outliers assigned to nearest cluster centroid
+5. **c-TF-IDF** on document titles per cluster → 3-term labels
+6. **outputs** `site/constellation.json` (~3MB, gitignored)
+
+run time: ~20s. dependencies: `umap-learn`, `hdbscan`, `scikit-learn`, `httpx`, `numpy`, `pydantic-settings`.
+
+```bash
+./scripts/build-constellation  # writes site/constellation.json
+./scripts/build-constellation -o out.json  # custom output path
+```
+
+## frontend
+
+`site/constellation.html` + `site/constellation.js` + `site/constellation.css`
+
+- **canvas 2D** renderer — no libraries, sprite-based (pre-rendered offscreen canvas per platform)
+- **pan/zoom** via wheel, drag, touch/pinch (max 15×)
+- **semantic zoom**: coarse labels → fine labels → document titles as you zoom in
+- **hover tooltip** with title, author, platform
+- **click** opens document URL
+- **theme support**: dark (default), light, system — synced with the rest of the site
+
+## recomputing
+
+the constellation is a point-in-time snapshot. rerun the build script when the index changes meaningfully:
+
+```bash
+./scripts/build-constellation
+cd site && wrangler pages deploy . --project-name leaflet-search
+```
+
+not yet automated — could be a GitHub Action or post-backfill hook.
+
+## future work
+
+- **hierarchical clustering**: replace the two-strata (coarse/fine) approach with a proper hierarchy (Ward linkage on HDBSCAN centroids + `cut_tree` at multiple levels) for smooth fractal zoom
+- **LLM-generated labels**: use Claude Haiku to produce coherent cluster names instead of c-TF-IDF keyword soup
+- **auto-update**: trigger rebuild after significant index changes
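
To make step 5 of the data pipeline in `docs/constellation.md` more concrete (c-TF-IDF on document titles per cluster, producing 3-term labels), here is a small pure-Python sketch. The commit does not say which c-TF-IDF variant `scripts/build-constellation` uses, so this assumes the BERTopic-style weighting `tf * log(1 + A / f_t)`, where `A` is the average word count per cluster and `f_t` is the term's total count across clusters. The names `tokenize` and `ctfidf_labels` are hypothetical, and there is no stop-word filtering.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(title: str) -> list[str]:
    """Lowercase a title and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", title.lower())


def ctfidf_labels(
    titles: list[str], cluster_ids: list[int], top_n: int = 3
) -> dict[int, list[str]]:
    """Return the top_n highest-scoring terms for each cluster id."""
    # Treat each cluster as one big document and count its terms.
    counts: dict[int, Counter] = defaultdict(Counter)
    for title, cid in zip(titles, cluster_ids):
        counts[cid].update(tokenize(title))
    if not counts:
        return {}

    # f_t: how often each term appears across all clusters.
    total_per_term: Counter = Counter()
    for term_counts in counts.values():
        total_per_term.update(term_counts)

    # A: average number of words per cluster.
    avg_words = sum(total_per_term.values()) / len(counts)

    labels: dict[int, list[str]] = {}
    for cid, term_counts in counts.items():
        n_words = sum(term_counts.values())
        scores = {
            term: (n / n_words) * math.log(1 + avg_words / total_per_term[term])
            for term, n in term_counts.items()
        }
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        labels[cid] = ranked[:top_n]
    return labels
```

Terms that are frequent inside one cluster but rare elsewhere score highest, which is why purely title-based labels can read as a list of keywords rather than a coherent name.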