a love letter to tangled (android, iOS, and a search API)


doc: search stabilization

+767 -41
+1
.gitignore
```diff
 /.idea
 /.ionic
 /.sass-cache
+/.sandbox
 /.sourcemaps
 /.versions
 /.vscode/*
```
+9 -1
docs/README.md
```diff
 Forward-looking designs for remaining work.

 - [`specs/data-sources.md`](specs/data-sources.md) — Constellation, Tangled XRPC, Tap, AT Protocol, Bluesky OAuth
-- [`specs/search.md`](specs/search.md) — Keyword, semantic, and hybrid search
+- [`specs/search.md`](specs/search.md) — Search stabilization, indexing, activity cache, and later ranking work
 - [`specs/app-features.md`](specs/app-features.md) — Remaining mobile app features
+
+## ADR Research
+
+Focused option analysis for pending architectural decisions.
+
+- [`adr/pg.md`](adr/pg.md) — PostgreSQL as a production backend option for Twister search
+- [`adr/turso.md`](adr/turso.md) — Turso/libSQL as a production backend option for Twister search
+- [`adr/storage.md`](adr/storage.md) — Accepted production storage decision for Twister search

 ## Roadmap
```
+126
docs/adr/pg.md
```markdown
---
title: ADR Research - PostgreSQL For Production Search
updated: 2026-03-25
status: research
---

## Summary

PostgreSQL is the strongest production candidate if Twister needs a conventional server database with mature operational tooling, strong concurrent write handling, and a full-text search system designed for configurable ranking and long-lived production workloads.

It is not the cheapest path from the current codebase. Moving to PostgreSQL would require rewriting the current SQLite/libSQL search layer, migrations, and indexing queries.

## Why Consider It

Twister's current search hardening plan is explicitly about reducing experimentation overhead. That does not answer the production backend question. PostgreSQL should stay in scope because it offers:

- native full-text search primitives (`tsvector`, `tsquery`)
- ranking and highlighting support
- GIN indexes for inverted-index style search
- mature backup, restore, replication, and managed-hosting options
- familiar operational patterns for multi-process or multi-host deployments

## Fit For Twister

### Strengths

#### Good Match For Write-Heavy Ingest

Twister has several write-producing paths:

- Tap ingestion
- read-through indexing jobs
- JetStream activity cache writes
- future embedding and reindex jobs

PostgreSQL is designed for this class of service. If Twister eventually runs API, workers, and background consumers as separate production processes, PostgreSQL is the least surprising option.

#### Mature Full-Text Search

PostgreSQL's text search stack includes:

- document parsing and query parsing
- configurable dictionaries and stop-word handling
- ranking functions
- headline generation
- GIN indexes for `tsvector` search

This means Twister can keep keyword search in-database without depending on a separate search engine.

#### Operational Predictability

For a service that may evolve into a more conventional always-on backend, PostgreSQL has the most established story for:

- backups and point-in-time recovery
- observability
- migration tooling
- managed production hosting
- separation of app processes from storage

## Costs And Risks

### Migration Cost Is Real

The current codebase is built around SQLite/libSQL semantics:

- SQLite FTS5
- current migration files
- current query behavior and ranking assumptions

Switching to PostgreSQL would require:

- replacing FTS5 queries with PostgreSQL full-text search queries
- rewriting migrations
- revisiting search ranking and snippets
- re-testing all search filters and pagination behavior

This is a bigger code migration than staying inside the SQLite/libSQL family.

### GIN Needs Operational Tuning

PostgreSQL GIN indexes are powerful, but they are not free. The docs note that heavy updates can accumulate pending entries and shift cleanup cost into vacuum or foreground cleanup if not tuned carefully.

For Twister, that means ingestion-heavy workloads would need explicit attention to:

- autovacuum behavior
- index maintenance settings
- bulk reindex or re-embed workflows

### Search Behavior Will Change

Even if the feature set remains the same, PostgreSQL text search will not behave identically to SQLite FTS5. Tokenization, ranking, and snippet behavior will need product-level review.

## Repo-Specific Implications

- The current API already uses `database/sql`, so adding a PostgreSQL driver is straightforward at the connection layer.
- The hard part is the search repository and migration layer, not connection management.
- PostgreSQL would decouple production deployment from the single-host, shared-disk assumptions introduced by the local experimentation workflow.

## When PostgreSQL Is The Better Choice

Choose PostgreSQL if most of the following become true:

- Twister needs multiple long-running workers and API instances writing concurrently.
- We want mainstream database operations over SQLite-family deployment tricks.
- Search hardening turns into a durable production service rather than a lightweight adjunct.
- The project can afford a query and migration rewrite now in exchange for simpler long-term production architecture.

## When PostgreSQL Is Probably Not Worth It

Avoid PostgreSQL for now if most of the following are true:

- The immediate goal is just to stabilize experimentation.
- We want the smallest migration from the current code.
- Single-host or single-writer production remains acceptable.
- Search behavior should stay as close as possible to current SQLite FTS5 behavior.

## Recommendation

PostgreSQL is the strongest long-term production-architecture candidate, but not the lowest-friction next step.

If the near-term goal is production hardening with minimal code churn, PostgreSQL should remain an evaluated option rather than the immediate default. If Twister grows into a multi-process, write-heavy service, PostgreSQL likely becomes the cleanest production destination.

## Sources

- [PostgreSQL Full Text Search docs](https://www.postgresql.org/docs/current/textsearch.html)
- [PostgreSQL GIN docs](https://www.postgresql.org/docs/current/gin.html)
```
+191
docs/adr/storage.md
```markdown
---
title: ADR - Choose Turso For Production Search Storage
updated: 2026-03-25
status: accepted
---

## Decision

Twister will use Turso as the production database backend for search and indexing.

This decision is based on current project constraints, not on a claim that Turso is universally superior to PostgreSQL. The goal is to ship a production-capable search service with the lowest migration cost from the current codebase while keeping room to revisit the decision later if the workload changes materially.

## Context

Twister's current search hardening work has two separate concerns:

1. make local experimentation cheaper and less messy
2. choose a production backend deliberately instead of letting the experimentation setup turn into production by accident

The codebase already relies on SQLite/libSQL-style behavior:

- SQLite FTS5 search
- SQLite-oriented migrations
- `database/sql` access via `github.com/tursodatabase/libsql-client-go`
- local experimentation through `file:` databases

The production candidates researched were:

- PostgreSQL
- Turso remote/libSQL
- Turso embedded-replica style deployment

The supporting research is recorded in:

- [PostgreSQL research](pg.md)
- [Turso research](turso.md)

## Why Turso

### 1. Lowest Migration Cost

Turso preserves the current SQLite/libSQL query model and avoids a full rewrite of:

- search queries
- migration files
- ranking behavior
- snippet generation behavior
- search regression expectations

PostgreSQL remains a credible long-term option, but adopting it now would force a larger rewrite at exactly the point where search hardening should focus on ingestion correctness, smoke tests, read-through indexing, and the activity cache.

### 2. Best Match For Current Priorities

The immediate work is not to invent a new search architecture. It is to stabilize:

- Tap ingestion
- read-through indexing
- JetStream activity caching
- local experimentation workflows
- end-to-end smoke testing

Turso lets the project do that without changing database families midstream.

### 3. Clear Path From Experimentation To Production

The local `file:` workflow remains the right choice for development and experimentation. For production, the chosen backend family is still Turso, which gives the project a cleaner transition than moving from local SQLite semantics to PostgreSQL semantics all at once.

### 4. Embedded Replicas Stay Optional

This ADR does not require Turso embedded replicas immediately.

The production choice is Turso as the backend family. The initial production shape can be plain remote libSQL if that is the least risky deployment path. Embedded replicas remain a future optimization if the Go driver and build constraints become acceptable.

## Why Not PostgreSQL Right Now

PostgreSQL was the strongest long-term alternative, but it loses on near-term fit.

Reasons not to choose it now:

- it requires rewriting the current FTS5-based search implementation
- it changes search behavior during a hardening phase where behavior stability matters
- it increases migration scope before the ingestion model itself is stabilized
- it solves an architectural future that the project has not yet fully reached

If Twister later becomes a larger multi-process, write-heavy service with operational requirements that outgrow Turso, PostgreSQL can be reconsidered with better evidence.

## Consequences

### Positive

- minimal code churn from the current search implementation
- fastest path to production-capable search hardening
- preserves current SQLite FTS behavior as the baseline
- keeps experimentation and production closer together conceptually

### Negative

- production remains in the SQLite/libSQL family, which may be less conventional than PostgreSQL for some operational teams
- embedded replicas are not a drop-in next step in the current Go setup
- a later move to PostgreSQL would still be a meaningful migration if Twister grows past Turso's sweet spot

## Production Shape

The production recommendation is:

1. keep local `file:` databases for experimentation and development
2. use Turso remote/libSQL as the default production target
3. evaluate embedded replicas only after the main search-hardening work is stable

This avoids coupling the production decision to a premature embedded-replica rollout.

## Follow-Up Work

- define the migration path from the experimental local DB to the production Turso database
- document backup and restore procedures for both local experimentation and production
- keep PostgreSQL as a revisit option if production requirements change
- explicitly evaluate embedded replicas later against Go driver and build constraints

## Experimental Local DB Procedures

The experimental local DB is a workflow aid, not a production artifact.

Operational rules:

1. Keep the database file out of git and treat it as disposable.
2. Use stop-and-copy backups for anything worth preserving.
3. Prefer restore-or-rebuild over repair if the DB becomes suspect.
4. Allow the file to grow during active experiments, then compact or delete it afterward.

The concrete local backup, restore, and disk-growth procedures live in [packages/api/README.md](../../packages/api/README.md).

## Migration Path To Production Turso

The migration path is intentionally code-first, not file-first.

Do not promote `twister-dev.db` directly into production. The experimental DB proves schema, queries, and workflow assumptions, but the production dataset should be rebuilt from authoritative upstream sources.

### Phase 1: Stabilize Local Behavior

- finalize schema changes in embedded migrations
- validate search behavior locally
- validate smoke tests against the local workflow

Exit condition:

- a fresh local database can be created from migrations and pass the smoke-test baseline

### Phase 2: Prepare Turso Production Target

- provision the production Turso database
- enable the required SQLite/libSQL features used by Twister
- configure production credentials and environment variables
- verify migrations apply cleanly to an empty production-shaped database

Exit condition:

- Twister can start against an empty Turso database and complete migrations successfully

### Phase 3: Rebuild The Dataset From Sources Of Truth

- start the indexer against Turso
- use Tap backfill and repo-resync paths to rebuild the searchable corpus
- let read-through indexing fill misses during verification
- build the JetStream activity cache from a recent timestamp cursor rather than from copied local state

Exit condition:

- the production Turso dataset is populated from Tap, repo recovery paths, and API-triggered indexing rather than from a copied experimental DB file

### Phase 4: Verify And Cut Over

- run the API smoke scripts against the Turso-backed environment
- confirm health, search, document fetches, indexing, and activity cache behavior
- switch app traffic only after the smoke-test baseline passes

Exit condition:

- production traffic points at the Turso-backed deployment and the local experimental DB is no longer part of the serving path

## Explicit Non-Goal For Migration

The migration plan does not include a direct file copy from local SQLite to production Turso as the default rollout path. If a one-off import becomes necessary later, it should be treated as a separate migration task with its own validation steps.

## Revisit Conditions

Re-open this ADR if any of the following become true:

- Twister needs multiple high-write production workers across separate hosts
- operational requirements start favoring standard PostgreSQL tooling over libSQL continuity
- embedded replicas prove impractical in the Go runtime the project wants to keep
- semantic and hybrid search work introduces storage requirements that fit PostgreSQL materially better
```
+121
docs/adr/turso.md
```markdown
---
title: ADR Research - Turso For Production Search
updated: 2026-03-25
status: research
---

## Summary

Turso is the lowest-migration production candidate because Twister already uses libSQL/SQLite-style storage and query patterns. It preserves the current mental model and minimizes rewrite cost.

The open question is not whether Turso can work, but which Turso mode fits production:

- remote libSQL primary
- local experimentation via plain `file:` SQLite
- Turso embedded-replica style local-read, remote-sync patterns

## Why Consider It

Twister already depends on:

- `github.com/tursodatabase/libsql-client-go`
- SQLite-style migrations
- SQLite FTS5 behavior

That makes Turso the shortest path from the current code to a production-capable deployment.

## Fit For Twister

### Strengths

#### Lowest Rewrite Cost

Staying with Turso/libSQL keeps Twister in the same family of database semantics it already uses. Compared with PostgreSQL, this means less work in:

- search query rewrites
- migration rewrites
- ranking behavior drift
- compatibility testing

#### Good Match For Local Experimentation

The current hardening plan already relies on local `file:` workflows to reduce the messiness and cost of experimentation. Turso and libSQL naturally support this style of development.

#### Embedded-Replica Model Is Relevant

Turso's embedded-replica story is directly relevant to Twister's workload because it allows:

- local reads from a file-backed database
- sync to a remote primary
- read-your-writes behavior for the initiating replica
- periodic background sync

On paper, this is a strong match for a search service that wants cheap local reads while keeping a remote production database.

## Costs And Risks

### Remote Turso Alone Does Not Solve The Current Pain

The current problem statement came from burning reads and writes during experimentation. A plain remote Turso deployment keeps the same basic cost surface, even if production operations are cleaner than ad hoc local experiments.

### Embedded Replicas Have Important Caveats

Turso's embedded replicas are promising, but the docs call out constraints that matter for Twister:

- they require a real filesystem
- they are not suitable for serverless environments without disk
- local DB files should not be opened while syncing
- sync behavior can amplify writes because replication is frame-based

This means the operational model has to be chosen carefully. It is not a free "best of both worlds" switch.

### Current Go Stack Makes The Best Turso Story Harder

This is the biggest repo-specific caveat.

The Turso Go quickstart notes that `github.com/tursodatabase/libsql-client-go/libsql` does not support embedded replicas. Twister currently uses that library for remote libSQL access, while local file mode is handled separately with `modernc.org/sqlite`.

Twister also currently builds with `CGO_ENABLED=0` in `packages/api/justfile`.

That means the cleanest embedded-replica path may require:

- changing drivers
- reconsidering the pure-Go build constraint
- accepting CGO in production builds, or waiting for a better pure-Go story

So while Turso embedded replicas are attractive in principle, they are not a drop-in upgrade for the current codebase.

## Repo-Specific Implications

- Remote Turso/libSQL is the easiest production continuation of the current code.
- Local `file:` mode is already useful for stabilizing experimentation.
- Embedded replicas are strategically interesting but would likely force deeper driver and build changes than the current roadmap implies.

## When Turso Is The Better Choice

Choose Turso if most of the following are true:

- minimizing migration cost is the top priority
- preserving current SQLite FTS behavior matters
- production can tolerate a simpler deployment model, especially early on
- we want to keep search and experimentation close to the current implementation

## When Turso Needs Extra Caution

Be careful with Turso if most of the following are true:

- Twister needs multiple production writers across separate hosts
- the system must avoid CGO and keep pure-Go builds
- we expect embedded replicas to be a near-term production feature
- operational simplicity matters more than minimizing query rewrites

## Recommendation

Turso is the best near-term production candidate if the goal is minimum code churn and continuity with the current search stack.

Remote Turso is the easiest short path. Embedded replicas are the most interesting medium-term Turso option, but they should be treated as additional engineering work rather than an assumption, especially given the current Go driver and build setup.

## Sources

- [Turso Go quickstart](https://docs.turso.tech/sdk/go/quickstart#local-only)
- [Turso embedded replicas docs](https://docs.turso.tech/features/embedded-replicas)
```
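The remote-versus-local split described above can be sketched as DSN-based driver routing. This assumes the `libsql` driver name registered by `github.com/tursodatabase/libsql-client-go` and the `sqlite` driver name registered by `modernc.org/sqlite`; in a real program both packages would be blank-imported so `database/sql` can open them.

```go
package main

import (
	"fmt"
	"strings"
)

// driverFor routes a DSN to a database/sql driver name: remote Turso
// URLs go to the pure-Go libsql client, local file: DSNs go to
// modernc.org/sqlite. The default-to-sqlite fallback is an assumption,
// not behavior taken from the Twister codebase.
func driverFor(dsn string) string {
	switch {
	case strings.HasPrefix(dsn, "libsql://"),
		strings.HasPrefix(dsn, "wss://"),
		strings.HasPrefix(dsn, "https://"):
		return "libsql"
	case strings.HasPrefix(dsn, "file:"):
		return "sqlite"
	default:
		return "sqlite"
	}
}

func main() {
	fmt.Println(driverFor("libsql://twister-prod.turso.io")) // libsql
	fmt.Println(driverFor("file:twister-dev.db"))            // sqlite
}
```

Because the `libsql` driver lacks embedded-replica support, this routing is also where a future driver swap would land if embedded replicas are ever adopted.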
+35 -1
docs/roadmap.md
```diff
 ---
 title: Roadmap
-updated: 2026-03-24
+updated: 2026-03-25
 ---

+## API: Search Stabilization
+
+Highest priority. This work blocks further investment in semantic search, hybrid ranking, and broader discovery features.
+
+- [ ] Stabilize local development and experimentation around a local `file:` database
+- [x] Document backup, restore, and disk-growth procedures for the experimental local DB
+- [x] Research production backend options: PostgreSQL, Turso remote/libSQL, and Turso embedded replicas
+- [x] Write a production storage decision record with workload and operational tradeoffs, using `docs/adr/pg.md` and `docs/adr/turso.md`
+- [x] Define the migration path from the experimental local setup to the chosen production backend
+- [ ] Add cURL smoke tests for `healthz`, `readyz`, `search`, `documents`, indexing, and activity in `scripts/api/`
+  - desertthunder.dev DID: `did:plc:xg2vq45muivyy3xwatcehspu`
+  - Twisted AT URI: `at://did:plc:xg2vq45muivyy3xwatcehspu/sh.tangled.repo/3mho6hukiei22`
+  - Profile AT URI: `at://did:plc:xg2vq45muivyy3xwatcehspu/sh.tangled.actor.profile/self`
+  - Follow AT URI (desertthunder.dev follows npmx): `at://did:plc:xg2vq45muivyy3xwatcehspu/sh.tangled.graph.follow/3mhofstanru22`
+  - Star AT URI (desertthunder.dev stars microcosm-rs): `at://did:plc:lulmyldiq4sb2ikags5sfb25/sh.tangled.repo/3lvsxzinfz222`
+  - ~~Add `just` targets for smoke-test runs locally and against a remote base URL~~ directly invoking the scripts is fine
+- [ ] Add a durable read-through indexing job queue for records fetched through the API
+  - [ ] Reuse the existing normalization and upsert path for on-demand indexing jobs
+  - [ ] Trigger indexing jobs from repo, issue, PR, profile, and similar fetch handlers
+  - [ ] Add dedupe, retries, and observability for indexing jobs
+- [ ] Add a JetStream cache consumer with a persisted timestamp cursor
+  - [ ] Seed the JetStream cursor to `now - 24h` on first boot and rewind slightly on reconnect
+  - [ ] Store and serve bounded recent activity from the local cache
+- [ ] Keep Tap as the authoritative indexing and bulk backfill path
+  - [ ] Define a controlled backfill and repo-resync playbook for recovery
+
 ## API: Constellation Integration

 Add a Constellation client to the Go API for enriching search results with social signals.
···
 Nomic Embed Text v1.5 via Railway template, async embedding pipeline.

+**Blocked on:** API: Search Stabilization
+
 - [ ] Deploy nomic-embed Railway template (`POST /api/embeddings` with Bearer auth)
 - [ ] Embedding client in Go API (`internal/embedding/`) calling the Nomic service
 - [ ] Embed-worker: consume `embedding_jobs` queue, generate 768-dim vectors, store in `document_embeddings`
···
 ## API: Hybrid Search

 Combine keyword and semantic results.
+
+**Blocked on:** API: Search Stabilization, API: Semantic Search Pipeline

 - [ ] Score normalization (keyword BM25 → [0,1], semantic cosine → [0,1])
 - [ ] Weighted merge (0.65 keyword + 0.35 semantic, configurable)
···
 ## API: Search Quality

+**Blocked on:** API: Search Stabilization
+
 - [ ] Field weight tuning based on real queries
 - [ ] Recency boost for recently updated content
 - [ ] Star count ranking signal (via Constellation)
···
 - [ ] Relevance test fixtures

 ## API: Observability
+
+**Depends on:** API: Search Stabilization

 - [ ] Structured metrics: ingestion rate, search latency, embedding throughput
 - [ ] Dashboard or log-based monitoring
```
+36 -8
docs/specs/data-sources.md
```diff
 ---
 title: Data Sources & Integration
-updated: 2026-03-24
+updated: 2026-03-25
 ---

-Twisted pulls data from four external sources and authenticates users via Bluesky OAuth. Each source has a distinct role — no single source is authoritative for everything.
+Twisted pulls data from five external sources and authenticates users via Bluesky OAuth. Each source has a distinct role — no single source is authoritative for everything.

 ## Source Overview

 | Source | What it provides | Access pattern |
 | --- | --- | --- |
 | **Tangled XRPC (Knots)** | Git data — file trees, blobs, commits, branches, diffs, tags | Direct XRPC calls to the knot hosting each repo |
 | **AT Protocol (PDS)** | User records — profiles, repos, issues, PRs, comments, stars, follows | `com.atproto.repo.getRecord` / `listRecords` on user's PDS |
 | **Constellation** | Social signals — star counts, follower counts, reaction counts, backlink lists | Public JSON API at `constellation.microcosm.blue` |
-| **Tap** | Real-time firehose of AT Protocol record events for indexing | WebSocket consumer, feeds our search index |
+| **Tap** | Real-time firehose of AT Protocol record events for authoritative indexing | WebSocket consumer, feeds the search index |
+| **JetStream** | Recent JSON activity stream for cached feed data | WebSocket consumer, feeds a bounded recent-activity cache |

 ## Constellation
···
 Stars, followers, reactions — Constellation handles counts and lists. We still process these events for graph discovery but don't need to maintain our own counters.

+### Role In The Search Plan
+
+Tap remains the authoritative ingestion and backfill path for searchable documents. If search correctness depends on complete historical coverage, Tap or a repo resync path is the right source.
+
 ### Tap Protocol

 - WebSocket connection with cursor-based resume
 - Events contain: operation (create/update/delete), DID, collection, rkey, CID, record payload
 - Acks required after processing each event
 - Backfill via `/repos/add` endpoint to request historical data for specific users
+
+## JetStream
+
+JetStream is a lighter JSON stream derived from the firehose. It is useful for recent activity and developer ergonomics, but it is not the authoritative source for search indexing.
+
+### Usage In Twisted
+
+- Recent activity cache for the Activity tab
+- Collection-filtered stream for `sh.tangled.*` events
+- Cursor-based resume using event timestamps
+
+### Constraints
+
+- Use JetStream for recent, cached activity only
+- Do not rely on it as the only historical backfill mechanism
+- Keep retention bounded and reconnect idempotent
+
+### Role In The Search Plan
+
+- Seed the cursor to roughly 24 hours ago on first boot
+- Persist the last processed timestamp and rewind slightly on reconnect
+- Cache normalized activity locally so clients do not each need a raw upstream stream
+- Keep Tap as the source of truth for search indexing and bulk backfill

 ## Bluesky OAuth
```
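The collection filter in the JetStream usage list above reduces, in the simplest case, to an NSID namespace prefix check. The event shape here is simplified to a bare collection string; a real consumer would also subscribe with the stream's server-side collection filter rather than filtering everything client-side.

```go
package main

import (
	"fmt"
	"strings"
)

// wantEvent keeps only events whose collection NSID is in the
// sh.tangled.* namespace, matching the cache's scope.
func wantEvent(collection string) bool {
	return strings.HasPrefix(collection, "sh.tangled.")
}

func main() {
	fmt.Println(wantEvent("sh.tangled.repo"))    // true: cached
	fmt.Println(wantEvent("app.bsky.feed.post")) // false: ignored
}
```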
+162 -28
docs/specs/search.md
··· 1 1 --- 2 2 title: Search 3 - updated: 2026-03-24 3 + updated: 2026-03-25 4 4 --- 5 5 6 - Search lets users find repos, issues, PRs, profiles, and code snippets across the Tangled network. The API supports three modes with progressive capability. 6 + > Warning: this document is pretty long. Look at the roadmap and ADR summaries for a 7 + > high-level overview, or jump to the relevant sections. 8 + 9 + Search now has two phases: 10 + 11 + 1. Stabilize indexing and activity caching so search is cheap and reliable. 12 + 2. Resume semantic and hybrid work only after the base pipeline is stable. 13 + 14 + ## Immediate Priority 15 + 16 + The current highest-priority search work is operational, not ranking: 17 + 18 + - Stabilize experimentation around a local `file:` database workflow. 19 + - Add cURL smoke tests for search, document fetches, indexing, and activity reads. 20 + - Enqueue background indexing when the API fetches records that are not yet searchable. 21 + - Cache recent JetStream activity server-side with a persisted 24-hour cursor. 22 + 23 + Production storage is Turso cloud. The reasoning is recorded in `docs/adr/storage.md`, with the comparison inputs in `docs/adr/pg.md` and `docs/adr/turso.md`. 24 + 25 + These tasks block further work on semantic and hybrid search. 26 + 27 + ## Planning Decisions 28 + 29 + ### Why This Comes First 30 + 31 + Search quality is currently constrained more by ingestion cost and freshness gaps than by ranking quality. 32 + The next iteration should make Twister cheaper to operate, resilient across restarts, and able to backfill misses on demand before any new semantic or hybrid work. 33 + 34 + ### Resolved Questions 35 + 36 + #### Local-Only Storage 37 + 38 + Twister can already run against a local `file:` database. That is useful for stabilizing development and experimentation while the indexing model is still changing. It should not automatically be treated as the final production architecture. 
39 + 40 + The production storage question remains open and should compare at least: 41 + 42 + - PostgreSQL with native full-text search and conventional operational tooling 43 + - Turso remote/libSQL 44 + - Turso with embedded replicas or similar local-read, remote-sync patterns 45 + 46 + That comparison has been completed, and the current production choice is Turso. 47 + 48 + #### Tangled First-Commit Timestamp 49 + 50 + The first Tangled commit timestamp is useful as a lower-bound hint for one-time experiments, but it should not become the default replay cursor. 51 + JetStream has to default to recent history (< 72 hours from now is what's possible) so bootstrap cost stays bounded. 7 52 8 - ## Modes 53 + #### Tap Versus JetStream 54 + 55 + Tap remains the authoritative indexing and bulk backfill path. JetStream should power only a bounded recent-activity cache. 56 + Read-through API indexing closes gaps when a user fetches a record before Tap has delivered it. 57 + 58 + ## Goals 59 + 60 + - Reduce search-related reads and writes enough that remote Turso cost is no longer the dominant constraint. 61 + - Keep indexed content fresh enough for browsing and search without requiring a full-network rebuild after routine restarts. 62 + - Serve recent activity cheaply from a local cache. 63 + - Add a smoke-test layer that verifies search and indexing behavior end to end. 64 + 65 + ## Current Search Mode 9 66 10 67 ### Keyword Search (Implemented) 11 68 12 - Full-text search powered by SQLite FTS5 with BM25 scoring. Queries are tokenized, matched against title, body, summary, repo name, author handle, and tags. Results are ranked by relevance with field-specific weights (title highest, then author handle, summary, body). 69 + Full-text search is powered by SQLite FTS5 with BM25 scoring. Queries match title, body, summary, repo name, author handle, and tags. Results are ranked with field-specific weights and snippets highlight matches with `<mark>` tags. 
13 70 
14 - Snippets are generated from the body field with match terms wrapped in `<mark>` tags. 
71 + ## Stabilization Plan 
15 72 
16 - ### Semantic Search (Planned) 
73 + ### Storage 
17 74 
18 - Vector similarity search using **Nomic Embed Text v1.5**, deployed on Railway via the [nomic-embed template](https://railway.com/deploy/nomic-embed). The template runs Ollama behind an authenticated Caddy proxy. 
75 + Twister should use a local `file:` database to stabilize experimentation and reduce iteration overhead while the indexing pipeline is being hardened. Production storage stayed explicitly undecided until the project had compared PostgreSQL and Turso-based options against the final workload. 
19 76 
20 - **Embedding service:** 
77 + Requirements: 
21 78 
22 - - Model: `nomic-embed-text:latest` (8192-token context, 768-dimensional vectors, Matryoshka support for variable dimensionality) 
79 + - keep local-file mode as the simplest path for development and experimentation 
23 - - Endpoint: `POST /api/embeddings` with Bearer token auth 
80 + - document what assumptions the local path makes about single-host or shared-disk execution 
24 - - Request: `{ "model": "nomic-embed-text:latest", "prompt": "text to embed" }` 
81 + - document backup, restore, and disk-growth procedures 
25 - - Deployed as a separate Railway service alongside the API and indexer 
82 + - produce a production storage decision record comparing PostgreSQL and Turso options, starting from `docs/adr/pg.md` and `docs/adr/turso.md` 
26 83 
27 - **Pipeline:** 
84 + Evaluation criteria for the production decision: 
28 85 
29 - - The embed-worker consumes the `embedding_jobs` queue, calls the Nomic Embed service, and stores 768-dim vectors in the `document_embeddings` table 
30 - - Documents are embedded asynchronously after indexing — the embed-worker runs independently of the ingestion loop 
31 - - Search queries are embedded at request time (single prompt, low latency) 
32 - - Vectors are matched via DiskANN 
cosine similarity index in Turso 86 + - write-heavy ingestion behavior 87 + - FTS quality and indexing ergonomics 88 + - operational complexity and backup story 89 + - latency for reads and writes 90 + - failure recovery and restore workflow 91 + - support for future semantic search requirements 33 92 34 - ### Hybrid Search (Planned) 93 + Acceptance: 35 94 36 - Weighted combination of keyword and semantic results. Default blend: 0.65 keyword + 0.35 semantic (configurable). Scores are normalized to [0, 1] before blending. Results are deduplicated by document ID with the higher score retained. Each result includes a `matched_by` field indicating which mode(s) contributed. 95 + - local development no longer depends on remote Turso for routine experimentation 96 + - the production backend choice is documented with explicit tradeoffs 97 + - the chosen production backend has a migration path from the experimental local setup 98 + 99 + The concrete local DB operating procedure lives in `packages/api/README.md`. 100 + The production migration path is documented in `docs/adr/storage.md`. 101 + 102 + ### Read-Through Indexing 103 + 104 + When the API fetches a repo, issue, PR, profile, or similar record directly from upstream, it should enqueue background indexing work if that record is not already searchable. Tap remains the primary ingest path; read-through indexing only closes gaps. 
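The durable, deduplicated job table can be sketched in SQLite terms; `index_jobs`, its columns, and the AT URI below are hypothetical names for illustration, not the real migration:

```sh
# Hypothetical read-through indexing queue; all names are illustrative.
db="$(mktemp)"
sqlite3 "$db" <<'SQL'
CREATE TABLE index_jobs (
  doc_id      TEXT PRIMARY KEY,                        -- stable document identity
  enqueued_at TEXT NOT NULL DEFAULT (datetime('now'))
);
-- First fetch of a not-yet-searchable record enqueues work...
INSERT INTO index_jobs(doc_id)
  VALUES ('at://did:plc:example/sh.tangled.repo/3abc')
  ON CONFLICT(doc_id) DO NOTHING;
-- ...and repeated page views of the same record are no-ops.
INSERT INTO index_jobs(doc_id)
  VALUES ('at://did:plc:example/sh.tangled.repo/3abc')
  ON CONFLICT(doc_id) DO NOTHING;
SELECT count(*) FROM index_jobs;  -- prints 1
SQL
rm -f "$db"
```

Keying the table on the stable document identity makes enqueueing idempotent, which is what keeps repeated page views from creating unbounded duplicate work.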
105 + 106 + Requirements: 107 + 108 + - add a durable job table for on-demand indexing 109 + - deduplicate jobs by stable document identity 110 + - reuse the existing normalization and upsert path 111 + - trigger jobs from the handlers that already fetch upstream records 112 + 113 + Acceptance: 114 + 115 + - a fetched-but-missing record becomes searchable shortly after the first successful API read 116 + - repeated page views do not create unbounded duplicate work 117 + - failures are visible through logs and smoke tests 118 + 119 + ### Activity Cache 120 + 121 + JetStream should back a recent-activity cache, not the main search index. The server should persist a timestamp cursor, seed it to `now - ~24h` on first boot, rewind slightly on reconnect, and expire old events aggressively. 122 + 123 + Requirements: 124 + 125 + - add a dedicated activity cache table 126 + - persist a separate JetStream consumer cursor 127 + - seed missing cursors to recent history, not full history 128 + - keep retention bounded by age and row count 129 + 130 + Acceptance: 131 + 132 + - common activity reads can be served from the cache 133 + - restarts resume from the stored timestamp cursor 134 + - reconnects are idempotent and tolerate a short rewind window 135 + 136 + ### Smoke Tests 137 + 138 + Twister needs cURL-based smoke tests covering: 139 + 140 + - `GET /healthz` 141 + - `GET /readyz` 142 + - `GET /search` 143 + - `GET /documents/{id}` 144 + - one fetch path that should enqueue indexing 145 + - one activity endpoint backed by the cache 146 + 147 + Acceptance: 148 + 149 + - one local command can verify the critical API surface 150 + - the same scripts can run against staging or production by changing the base URL 151 + 152 + ## Operational Model 153 + 154 + 1. Tap ingests the authoritative search corpus. 155 + 2. Direct API reads enqueue background indexing for misses. 156 + 3. JetStream fills only the recent-activity cache. 157 + 4. Smoke tests guard the critical paths. 
158 + 5. Semantic and hybrid search remain blocked until the base pipeline is stable. 
159 + 
160 + ## Backfill Strategy 
161 + 
162 + - Search index backfill should continue to use Tap admin backfill, firehose-driven repo sync, or repo-export-based resync. 
163 + - Activity cache bootstrap should use a recent JetStream timestamp cursor, defaulting to `now - 24h`. 
164 + - A manual cursor override can exist for one-time replay experiments, but it should not be the default startup path. 
37 165 
38 166 ## API Contract 
39 167 
··· 
95 223 
96 224 2. **Constellation supplements search results.** Star counts and follower counts from Constellation can be used as ranking signals without needing to index interaction records ourselves. 
97 225 
98 - 3. **Semantic search is additive.** It improves discovery for vague queries but isn't required for the app to be useful. It ships when the embedding pipeline is stable. 
226 + 3. **Read-through indexing closes freshness gaps.** If a user can fetch a record, the system should be able to make it searchable shortly afterward. 
99 227 
100 - 4. **Graceful degradation.** The mobile app treats the search API as optional. If Twister is unavailable, handle-based direct browsing still works. Search results link into the same browsing screens. 
228 + 4. **JetStream is for recent activity, not authoritative indexing.** Use it to power the cached feed, not to replace Tap or repo re-sync. 
229 + 
230 + 5. **Semantic search is additive.** It improves discovery for vague queries but is not required for the app to be useful. 
231 + 
232 + 6. **Graceful degradation.** The mobile app treats the search API as optional. If Twister is unavailable, handle-based direct browsing still works. Search results link into the same browsing screens. 
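The activity-cache cursor rules in this spec (seed missing cursors to `now - 24h`, rewind slightly on reconnect) can be sketched against a local SQLite file. The table name and the 5-minute rewind window are illustrative assumptions; the microsecond unit matches JetStream-style timestamp cursors:

```sh
# Hypothetical cursor persistence; table name and rewind window are illustrative.
db="$(mktemp)"
sqlite3 "$db" "CREATE TABLE jetstream_cursor (
  id        INTEGER PRIMARY KEY CHECK (id = 1),  -- single-row table
  cursor_us INTEGER NOT NULL                     -- unix micros, JetStream-style
);"

# First boot: seed to now - 24h only when no cursor is stored yet.
seed_us=$(( ( $(date +%s) - 24 * 3600 ) * 1000000 ))
sqlite3 "$db" "INSERT INTO jetstream_cursor (id, cursor_us)
  VALUES (1, $seed_us) ON CONFLICT(id) DO NOTHING;"

# Reconnect: rewind a short window (5 minutes here) so a brief gap around the
# disconnect is tolerated; replays within the window must be idempotent.
sqlite3 "$db" "UPDATE jetstream_cursor
  SET cursor_us = cursor_us - 5 * 60 * 1000000;"

sqlite3 "$db" "SELECT cursor_us FROM jetstream_cursor;"
rm -f "$db"
```

The `ON CONFLICT(id) DO NOTHING` seed is what makes restarts resume from the stored cursor instead of re-seeding to `now - 24h` every boot.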
101 233 102 234 ## Quality Improvements (Planned) 103 235 104 236 - Field weight tuning based on real query patterns 105 237 - Recency boost for recently updated content 106 - - Collection-aware ranking (repos weighted higher for short queries) 107 - - Star count as a ranking signal (via Constellation) 108 - - State filtering (exclude closed issues by default) 109 - - Better snippet generation with longer context windows 110 - - Relevance test fixtures for regression testing 238 + - Collection-aware ranking 239 + - Star count as a ranking signal 240 + - State filtering defaults 241 + - Better snippet generation 242 + - Relevance test fixtures 111 243 112 244 ## Mobile Integration 113 245 114 - The app calls the search API from the Explore tab. Results are displayed in segmented views (repos, users, issues/PRs). Each result links to the corresponding browsing screen (repo detail, profile, issue detail). 246 + The app calls the search API from the Explore tab. Results are displayed in segmented views (repos, users, issues/PRs). 247 + Each result links to the corresponding browsing screen (repo detail, profile, issue detail). 115 248 116 - When the search API is unavailable, the Explore tab shows an appropriate state rather than breaking. The Home tab's handle-based browsing is fully independent of search. 249 + When the search API is unavailable, the Explore tab shows an appropriate state rather than breaking. 250 + The Home tab's handle-based browsing is fully independent of search.
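The cURL smoke tests specified in this document can start as one small script. The endpoint paths follow the list above, while `TWISTER_BASE_URL`, the search query, the document id, and the activity path are placeholder assumptions rather than the final contract:

```sh
# Hypothetical smoke-test runner; endpoint list follows the spec, the rest
# (base URL, query, document id, activity path) is a placeholder assumption.
BASE_URL="${TWISTER_BASE_URL:-}"   # e.g. http://localhost:8080 or a staging URL
status=0
for path in \
  /healthz \
  /readyz \
  '/search?q=tangled' \
  /documents/example-doc-id \
  /activity/recent
do
  if [ -n "$BASE_URL" ]; then
    # -f turns HTTP error statuses into non-zero curl exits, so failures surface.
    curl -fsS "${BASE_URL}${path}" > /dev/null || { echo "FAIL ${path}"; status=1; }
  else
    # With no base URL set, print the plan so the endpoint list is reviewable offline.
    echo "would run: curl -fsS \$TWISTER_BASE_URL${path}"
  fi
done
[ "$status" -eq 0 ]
```

Because the base URL is the only environment-specific input, the same script satisfies the acceptance criterion of running against staging or production by changing one variable.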
+7 -3
docs/todo.md
··· 1 1 --- 2 2 title: Parking Lot 3 - updated: 2026-03-24 3 + updated: 2026-03-25 4 4 --- 5 5 6 - - Constellation requests would be a good opportunity to dispatch a job to index what 7 - was requested. 6 + Search stabilization is active roadmap work now, not parking-lot work. 7 + 8 + Still parked: 9 + 10 + - Semantic and hybrid search stay deferred until local-only storage, smoke tests, read-through indexing, and the JetStream cache are stable. 11 + - Revisit `com.atproto.sync.listReposByCollection` as a complementary backfill discovery source after the Tap-driven indexing path is reliable.
+79
packages/api/README.md
··· 18 18 19 19 The server listens on `:8080` by default. Logs are printed as text when `--local` is set. 20 20 21 + ## Experimental Local DB Operations 22 + 23 + The experimental local database lives at `packages/api/twister-dev.db` when you run Twister from `packages/api` with `--local`. 24 + 25 + This database is for local experimentation only. Treat it as disposable unless you explicitly back it up. 26 + 27 + ### Backup 28 + 29 + Recommended procedure: 30 + 31 + 1. Stop the Twister process using the local DB. 32 + 2. Copy the database file and any SQLite sidecar files if they exist. 33 + 34 + Example: 35 + 36 + ```sh 37 + cd packages/api 38 + mkdir -p backups 39 + timestamp="$(date +%Y%m%d-%H%M%S)" 40 + cp twister-dev.db "backups/twister-dev-${timestamp}.db" 41 + test -f twister-dev.db-wal && cp twister-dev.db-wal "backups/twister-dev-${timestamp}.db-wal" 42 + test -f twister-dev.db-shm && cp twister-dev.db-shm "backups/twister-dev-${timestamp}.db-shm" 43 + ``` 44 + 45 + For this experimental DB, stop-and-copy is preferred over hot backup complexity. 46 + 47 + ### Restore 48 + 49 + Recommended procedure: 50 + 51 + 1. Stop the Twister process. 52 + 2. Move the current local DB aside if you want to keep it. 53 + 3. Copy the backup file back to `twister-dev.db`. 54 + 4. Restore matching `-wal` and `-shm` files only if they were captured with the same backup set. 55 + 56 + Example: 57 + 58 + ```sh 59 + cd packages/api 60 + mv twister-dev.db "twister-dev.db.broken.$(date +%Y%m%d-%H%M%S)" 2>/dev/null || true 61 + cp backups/twister-dev-YYYYMMDD-HHMMSS.db twister-dev.db 62 + ``` 63 + 64 + After restore, restart Twister and let the app run migrations normally. 65 + 66 + ### Disk Growth 67 + 68 + The local DB will grow during experimentation because of: 69 + 70 + - indexed documents 71 + - FTS tables 72 + - activity cache rows 73 + - repeated backfill or reindex runs 74 + 75 + Recommended operating procedure: 76 + 77 + 1. Check file growth periodically. 78 + 2. 
Delete and rebuild the experimental DB freely when the dataset is no longer useful. 79 + 3. Run `VACUUM` only when you intentionally want to compact a long-lived local DB. 80 + 4. Keep old backups out of the repo and rotate them manually. 81 + 82 + Example inspection commands: 83 + 84 + ```sh 85 + cd packages/api 86 + du -h twister-dev.db* 87 + ls -lh twister-dev.db* 88 + ``` 89 + 90 + For experimental use, the simplest policy is usually: 91 + 92 + - back up anything worth keeping 93 + - remove the DB when the experiment is over 94 + - let Twister rebuild from migrations and backfill paths 95 + 96 + ### Failure Recovery Rule 97 + 98 + If the experimental DB becomes suspicious or inconsistent, prefer restore-or-rebuild over manual repair. This is a developer convenience database, not the source of truth. 99 + 21 100 ## Environment variables 22 101 23 102 Copy `.env.example` to `.env` in the repo root (or `packages/api/`). The server loads `.env`, `../.env`, and `../../.env` automatically.