···1313Forward-looking designs for remaining work.
14141515- [`specs/data-sources.md`](specs/data-sources.md) — Constellation, Tangled XRPC, Tap, AT Protocol, Bluesky OAuth
1616-- [`specs/search.md`](specs/search.md) — Keyword, semantic, and hybrid search
1616+- [`specs/search.md`](specs/search.md) — Search stabilization, indexing, activity cache, and later ranking work
1717- [`specs/app-features.md`](specs/app-features.md) — Remaining mobile app features
1818+1919+## ADR Research
2020+2121+Focused option analysis for pending architectural decisions.
2222+2323+- [`adr/pg.md`](adr/pg.md) — PostgreSQL as a production backend option for Twister search
2424+- [`adr/turso.md`](adr/turso.md) — Turso/libSQL as a production backend option for Twister search
2525+- [`adr/storage.md`](adr/storage.md) — Accepted production storage decision for Twister search
18261927## Roadmap
2028
+126
docs/adr/pg.md
···11+---
22+title: ADR Research - PostgreSQL For Production Search
33+updated: 2026-03-25
44+status: research
55+---
66+77+## Summary
88+99+PostgreSQL is the strongest production candidate if Twister needs a conventional server database with mature operational tooling, strong concurrent write handling, and a full-text search system designed for configurable ranking and long-lived production workloads.
1010+1111+It is not the cheapest path from the current codebase. Moving to PostgreSQL would require rewriting the current SQLite/libSQL search layer, migrations, and indexing queries.
1212+1313+## Why Consider It
1414+1515+Twister's current search hardening plan is explicitly about reducing experimentation overhead. That does not answer the production backend question. PostgreSQL should stay in scope because it offers:
1616+1717+- native full-text search primitives (`tsvector`, `tsquery`)
1818+- ranking and highlighting support
1919+- GIN indexes for inverted-index style search
2020+- mature backup, restore, replication, and managed-hosting options
2121+- familiar operational patterns for multi-process or multi-host deployments
2222+2323+## Fit For Twister
2424+2525+### Strengths
2626+2727+#### Good Match For Write-Heavy Ingest
2828+2929+Twister has several write-producing paths:
3030+3131+- Tap ingestion
3232+- read-through indexing jobs
3333+- JetStream activity cache writes
3434+- future embedding and reindex jobs
3535+3636+PostgreSQL is designed for this class of service. If Twister eventually runs API, workers, and background consumers as separate production processes, PostgreSQL is the least surprising option.
3737+3838+#### Mature Full-Text Search
3939+4040+PostgreSQL's text search stack includes:
4141+4242+- document parsing and query parsing
4343+- configurable dictionaries and stop-word handling
4444+- ranking functions
4545+- headline generation
4646+- GIN indexes for `tsvector` search
4747+4848+This means Twister can keep keyword search in-database without depending on a separate search engine.
4949+5050+#### Operational Predictability
5151+5252+For a service that may evolve into a more conventional always-on backend, PostgreSQL has the most established story for:
5353+5454+- backups and point-in-time recovery
5555+- observability
5656+- migration tooling
5757+- managed production hosting
5858+- separation of app processes from storage
5959+6060+## Costs And Risks
6161+6262+### Migration Cost Is Real
6363+6464+The current codebase is built around SQLite/libSQL semantics:
6565+6666+- SQLite FTS5
6767+- current migration files
6868+- current query behavior and ranking assumptions
6969+7070+Switching to PostgreSQL would require:
7171+7272+- replacing FTS5 queries with PostgreSQL full-text search queries
7373+- rewriting migrations
7474+- revisiting search ranking and snippets
7575+- re-testing all search filters and pagination behavior
7676+7777+This is a bigger code migration than staying inside the SQLite/libSQL family.
7878+7979+### GIN Needs Operational Tuning
8080+8181+PostgreSQL GIN indexes are powerful, but they are not free. The docs note that heavy updates can accumulate pending entries and shift cleanup cost into vacuum or foreground cleanup if not tuned carefully.
8282+8383+For Twister, that means ingestion-heavy workloads would need explicit attention to:
8484+8585+- autovacuum behavior
8686+- index maintenance settings
8787+- bulk reindex or reembed workflows
8888+8989+### Search Behavior Will Change
9090+9191+Even if the feature set remains the same, PostgreSQL text search will not behave identically to SQLite FTS5. Tokenization, ranking, and snippet behavior will need product-level review.
9292+9393+## Repo-Specific Implications
9494+9595+- The current API already uses `database/sql`, so adding a PostgreSQL driver is straightforward at the connection layer.
9696+- The hard part is the search repository and migration layer, not connection management.
9797+- PostgreSQL would decouple production deployment from single-host shared-disk assumptions introduced by the local experimentation workflow.
9898+9999+## When PostgreSQL Is The Better Choice
100100+101101+Choose PostgreSQL if most of the following become true:
102102+103103+- Twister needs multiple long-running workers and API instances writing concurrently.
104104+- We want mainstream database operations over SQLite-family deployment tricks.
105105+- Search hardening turns into a durable production service rather than a lightweight adjunct.
106106+- The project can afford a query and migration rewrite now in exchange for simpler long-term production architecture.
107107+108108+## When PostgreSQL Is Probably Not Worth It
109109+110110+Avoid PostgreSQL for now if most of the following are true:
111111+112112+- The immediate goal is just to stabilize experimentation.
113113+- We want the smallest migration from the current code.
114114+- Single-host or single-writer production remains acceptable.
115115+- Search behavior should stay as close as possible to current SQLite FTS5 behavior.
116116+117117+## Recommendation
118118+119119+PostgreSQL is the strongest long-term production architecture candidate, but not the lowest-friction next step.
120120+121121+If the near-term goal is production hardening with minimal code churn, PostgreSQL should remain an evaluated option rather than the immediate default. If Twister grows into a multi-process write-heavy service, PostgreSQL likely becomes the cleanest production destination.
122122+123123+## Sources
124124+125125+- [PostgreSQL Full Text Search docs](https://www.postgresql.org/docs/current/textsearch.html)
126126+- [PostgreSQL GIN docs](https://www.postgresql.org/docs/current/gin.html)
+191
docs/adr/storage.md
···11+---
22+title: ADR - Choose Turso For Production Search Storage
33+updated: 2026-03-25
44+status: accepted
55+---
66+77+## Decision
88+99+Twister will use Turso as the production database backend for search and indexing.
1010+1111+This decision is based on current project constraints, not on a claim that Turso is universally superior to PostgreSQL. The goal is to ship a production-capable search service with the lowest migration cost from the current codebase while keeping room to revisit the decision later if the workload changes materially.
1212+1313+## Context
1414+1515+Twister's current search hardening work has two separate concerns:
1616+1717+1. make local experimentation cheaper and less messy
1818+2. choose a production backend deliberately instead of letting the experimentation setup turn into production by accident
1919+2020+The codebase already relies on SQLite/libSQL-style behavior:
2121+2222+- SQLite FTS5 search
2323+- SQLite-oriented migrations
2424+- `database/sql` access via `github.com/tursodatabase/libsql-client-go`
2525+- local experimentation through `file:` databases
2626+2727+The production candidates researched were:
2828+2929+- PostgreSQL
3030+- Turso remote/libSQL
3131+- Turso embedded-replica style deployment
3232+3333+The supporting research is recorded in:
3434+3535+- [PostgreSQL research](pg.md)
3636+- [Turso research](turso.md)
3737+3838+## Why Turso
3939+4040+### 1. Lowest Migration Cost
4141+4242+Turso preserves the current SQLite/libSQL query model and avoids a full rewrite of:
4343+4444+- search queries
4545+- migration files
4646+- ranking behavior
4747+- snippet generation behavior
4848+- search regression expectations
4949+5050+PostgreSQL remains a credible long-term option, but adopting it now would force a larger rewrite at exactly the point where search hardening should focus on ingestion correctness, smoke tests, read-through indexing, and the activity cache.
5151+5252+### 2. Best Match For Current Priorities
5353+5454+The immediate work is not to invent a new search architecture. It is to stabilize:
5555+5656+- Tap ingestion
5757+- read-through indexing
5858+- JetStream activity caching
5959+- local experimentation workflows
6060+- end-to-end smoke testing
6161+6262+Turso lets the project do that without changing database families midstream.
6363+6464+### 3. Clear Path From Experimentation To Production
6565+6666+The local `file:` workflow remains the right choice for development and experimentation. For production, the chosen backend family is still Turso, which gives the project a cleaner transition than moving from local SQLite semantics to PostgreSQL semantics all at once.
6767+6868+### 4. Embedded Replicas Stay Optional
6969+7070+This ADR does not require Turso embedded replicas immediately.
7171+7272+The production choice is Turso as the backend family. The initial production shape can be plain remote libSQL if that is the least risky deployment path. Embedded replicas remain a future optimization if the Go driver and build constraints become acceptable.
7373+7474+## Why Not PostgreSQL Right Now
7575+7676+PostgreSQL was the strongest long-term alternative, but it loses on near-term fit.
7777+7878+Reasons not to choose it now:
7979+8080+- it requires rewriting the current FTS5-based search implementation
8181+- it changes search behavior during a hardening phase where behavior stability matters
8282+- it increases migration scope before the ingestion model itself is stabilized
8383+- it solves an architectural future that the project has not yet fully reached
8484+8585+If Twister later becomes a larger multi-process, write-heavy service with operational requirements that outgrow Turso, PostgreSQL can be reconsidered with better evidence.
8686+8787+## Consequences
8888+8989+### Positive
9090+9191+- minimum code churn from the current search implementation
9292+- fastest path to production-capable search hardening
9393+- preserves current SQLite FTS behavior as the baseline
9494+- keeps experimentation and production closer together conceptually
9595+9696+### Negative
9797+9898+- production remains in the SQLite/libSQL family, which may be less conventional than PostgreSQL for some operational teams
9999+- embedded replicas are not a drop-in next step in the current Go setup
100100+- a later move to PostgreSQL would still be a meaningful migration if Twister grows past Turso's sweet spot
101101+102102+## Production Shape
103103+104104+The production recommendation is:
105105+106106+1. keep local `file:` databases for experimentation and development
107107+2. use Turso remote/libSQL as the default production target
108108+3. evaluate embedded replicas only after the main search-hardening work is stable
109109+110110+This avoids coupling the production decision to a premature embedded-replica rollout.
111111+112112+## Follow-Up Work
113113+114114+- define the migration path from the experimental local DB to the production Turso database
115115+- document backup and restore procedures for both local experimentation and production
116116+- keep PostgreSQL as a revisit option if production requirements change
117117+- explicitly evaluate embedded replicas later against Go driver and build constraints
118118+119119+## Experimental Local DB Procedures
120120+121121+The experimental local DB is a workflow aid, not a production artifact.
122122+123123+Operational rules:
124124+125125+1. Keep the database file out of git and treat it as disposable.
126126+2. Use stop-and-copy backups for anything worth preserving.
127127+3. Prefer restore-or-rebuild over repair if the DB becomes suspect.
128128+4. Allow the file to grow during active experiments, then compact or delete it afterward.
129129+130130+The concrete local backup, restore, and disk-growth procedures live in [packages/api/README.md](../../packages/api/README.md).
131131+132132+## Migration Path To Production Turso
133133+134134+The migration path is intentionally code-first, not file-first.
135135+136136+Do not promote `twister-dev.db` directly into production. The experimental DB proves schema, queries, and workflow assumptions, but the production dataset should be rebuilt from authoritative upstream sources.
137137+138138+### Phase 1: Stabilize Local Behavior
139139+140140+- finalize schema changes in embedded migrations
141141+- validate search behavior locally
142142+- validate smoke tests against the local workflow
143143+144144+Exit condition:
145145+146146+- a fresh local database can be created from migrations and pass the smoke-test baseline
147147+148148+### Phase 2: Prepare Turso Production Target
149149+150150+- provision the production Turso database
151151+- enable the required SQLite/libSQL features used by Twister
152152+- configure production credentials and environment variables
153153+- verify migrations apply cleanly to an empty production-shaped database
154154+155155+Exit condition:
156156+157157+- Twister can start against an empty Turso database and complete migrations successfully
158158+159159+### Phase 3: Rebuild The Dataset From Sources Of Truth
160160+161161+- start the indexer against Turso
162162+- use Tap backfill and repo-resync paths to rebuild the searchable corpus
163163+- let read-through indexing fill misses during verification
164164+- build the JetStream activity cache from a recent timestamp cursor rather than from copied local state
165165+166166+Exit condition:
167167+168168+- the production Turso dataset is populated from Tap, repo recovery paths, and API-triggered indexing rather than from a copied experimental DB file
169169+170170+### Phase 4: Verify And Cut Over
171171+172172+- run the API smoke scripts against the Turso-backed environment
173173+- confirm health, search, document fetches, indexing, and activity cache behavior
174174+- switch app traffic only after the smoke-test baseline passes
175175+176176+Exit condition:
177177+178178+- production traffic points at the Turso-backed deployment and the local experimental DB is no longer part of the serving path
179179+180180+## Explicit Non-Goal For Migration
181181+182182+The migration plan does not include a direct file copy from local SQLite to production Turso as the default rollout path. If a one-off import becomes necessary later, it should be treated as a separate migration task with its own validation steps.
183183+184184+## Revisit Conditions
185185+186186+Re-open this ADR if any of the following become true:
187187+188188+- Twister needs multiple high-write production workers across separate hosts
189189+- operational requirements start favoring standard PostgreSQL tooling over libSQL continuity
190190+- embedded replicas prove impractical in the Go runtime the project wants to keep
191191+- semantic and hybrid search work introduces storage requirements that fit PostgreSQL materially better
+121
docs/adr/turso.md
···11+---
22+title: ADR Research - Turso For Production Search
33+updated: 2026-03-25
44+status: research
55+---
66+77+## Summary
88+99+Turso is the lowest-migration production candidate because Twister already uses libSQL/SQLite-style storage and query patterns. It preserves the current mental model and minimizes rewrite cost.
1010+1111+The open question is not whether Turso can work, but which Turso mode fits production:
1212+1313+- remote libSQL primary
1414+- local experimentation via plain `file:` SQLite
1515+- Turso embedded-replica style local-read, remote-sync patterns
1616+1717+## Why Consider It
1818+1919+Twister already depends on:
2020+2121+- `github.com/tursodatabase/libsql-client-go`
2222+- SQLite-style migrations
2323+- SQLite FTS5 behavior
2424+2525+That makes Turso the shortest path from current code to a production-capable deployment.
2626+2727+## Fit For Twister
2828+2929+### Strengths
3030+3131+#### Lowest Rewrite Cost
3232+3333+Staying with Turso/libSQL keeps Twister in the same family of database semantics it already uses. Compared with PostgreSQL, this means less work in:
3434+3535+- search query rewrites
3636+- migration rewrites
3737+- ranking behavior drift
3838+- compatibility testing
3939+4040+#### Good Match For Local Experimentation
4141+4242+The current hardening plan already relies on local `file:` workflows to reduce the messiness and cost of experimentation. Turso and libSQL naturally support this style of development.
4343+4444+#### Embedded-Replica Model Is Relevant
4545+4646+Turso's embedded replica story is directly relevant to Twister's workload because it allows:
4747+4848+- local reads from a file-backed database
4949+- sync to a remote primary
5050+- read-your-writes behavior for the initiating replica
5151+- periodic background sync
5252+5353+On paper, this is a strong match for a search service that wants cheap local reads while keeping a remote production database.
5454+5555+## Costs And Risks
5656+5757+### Remote Turso Alone Does Not Solve The Current Pain
5858+5959+The current problem statement came from burning reads and writes during experimentation. A plain remote Turso deployment keeps the same basic cost surface, even if production operations are cleaner than ad hoc local experiments.
6060+6161+### Embedded Replicas Have Important Caveats
6262+6363+Turso's embedded replicas are promising, but the docs call out constraints that matter for Twister:
6464+6565+- they require a real filesystem
6666+- they are not suitable for serverless environments without disk
6767+- local DB files should not be opened while syncing
6868+- sync behavior can amplify writes because replication is frame-based
6969+7070+This means the operational model has to be chosen carefully. It is not a free "best of both worlds" switch.
7171+7272+### Current Go Stack Makes The Best Turso Story Harder
7373+7474+This is the biggest repo-specific caveat.
7575+7676+The Turso Go quickstart notes that `github.com/tursodatabase/libsql-client-go/libsql` does not support embedded replicas. Twister currently uses that library for remote libSQL access, while local file mode is handled separately with `modernc.org/sqlite`.
7777+7878+Twister also currently builds with `CGO_ENABLED=0` in `packages/api/justfile`.
7979+8080+That means the cleanest embedded-replica path may require:
8181+8282+- changing drivers
8383+- reconsidering the pure-Go build constraint
8484+- accepting CGO in production builds, or waiting for a better pure-Go story
8585+8686+So while Turso embedded replicas are attractive in principle, they are not a drop-in upgrade for the current codebase.
8787+8888+## Repo-Specific Implications
8989+9090+- Remote Turso/libSQL is the easiest production continuation of the current code.
9191+- Local `file:` mode is already useful for stabilizing experimentation.
9292+- Embedded replicas are strategically interesting but would likely force deeper driver and build changes than the current roadmap implies.
9393+9494+## When Turso Is The Better Choice
9595+9696+Choose Turso if most of the following are true:
9797+9898+- minimizing migration cost is the top priority
9999+- preserving current SQLite FTS behavior matters
100100+- production can tolerate a simpler deployment model, especially early on
101101+- we want to keep search and experimentation close to the current implementation
102102+103103+## When Turso Needs Extra Caution
104104+105105+Be careful with Turso if most of the following are true:
106106+107107+- Twister needs multiple production writers across separate hosts
108108+- the system must avoid CGO and keep pure-Go builds
109109+- we expect embedded replicas to be a near-term production feature
110110+- operational simplicity matters more than minimizing query rewrites
111111+112112+## Recommendation
113113+114114+Turso is the best near-term production candidate if the goal is minimum code churn and continuity with the current search stack.
115115+116116+Remote Turso is the easiest short path. Embedded replicas are the most interesting medium-term Turso option, but they should be treated as additional engineering work rather than an assumption, especially given the current Go driver and build setup.
117117+118118+## Sources
119119+120120+- [Turso Go quickstart](https://docs.turso.tech/sdk/go/quickstart#local-only)
121121+- [Turso embedded replicas docs](https://docs.turso.tech/features/embedded-replicas)
+35-1
docs/roadmap.md
···11---
22title: Roadmap
33-updated: 2026-03-24
33+updated: 2026-03-25
44---
5566+## API: Search Stabilization
77+88+Highest priority. This work blocks further investment in semantic search, hybrid ranking, and broader discovery features.
99+1010+- [ ] Stabilize local development and experimentation around a local `file:` database
1111+- [x] Document backup, restore, and disk-growth procedures for the experimental local DB
1212+- [x] Research production backend options: PostgreSQL, Turso remote/libSQL, and Turso embedded replicas
1313+- [x] Write a production storage decision record with workload and operational tradeoffs, using `docs/adr/pg.md` and `docs/adr/turso.md`
1414+- [x] Define the migration path from the experimental local setup to the chosen production backend
1515+- [ ] Add cURL smoke tests for `healthz`, `readyz`, `search`, `documents`, indexing, and activity in `scripts/api/`
1616+ - desertthunder.dev DID: `did:plc:xg2vq45muivyy3xwatcehspu`
1717+ - Twisted AT URI: `at://did:plc:xg2vq45muivyy3xwatcehspu/sh.tangled.repo/3mho6hukiei22`
1818+ - Profile AT URI: `at://did:plc:xg2vq45muivyy3xwatcehspu/sh.tangled.actor.profile/self`
1919+ - Follow AT URI (desertthunder.dev follows npmx): `at://did:plc:xg2vq45muivyy3xwatcehspu/sh.tangled.graph.follow/3mhofstanru22`
2020+ - Star AT URI (desertthunder.dev stars microcosm-rs): `at://did:plc:lulmyldiq4sb2ikags5sfb25/sh.tangled.repo/3lvsxzinfz222`
2121+- ~~Add `just` targets for smoke-test runs locally and against a remote base URL~~ directly invoking the scripts is fine.
2222+- [ ] Add a durable read-through indexing job queue for records fetched through the API
2323+- [ ] Reuse the existing normalization and upsert path for on-demand indexing jobs
2424+- [ ] Trigger indexing jobs from repo, issue, PR, profile, and similar fetch handlers
2525+- [ ] Add dedupe, retries, and observability for indexing jobs
2626+- [ ] Add a JetStream cache consumer with a persisted timestamp cursor
2727+- [ ] Seed the JetStream cursor to `now - 24h` on first boot and rewind slightly on reconnect
2828+- [ ] Store and serve bounded recent activity from the local cache
2929+- [ ] Keep Tap as the authoritative indexing and bulk backfill path
3030+- [ ] Define a controlled backfill and repo-resync playbook for recovery
3131+632## API: Constellation Integration
733834Add a Constellation client to the Go API for enriching search results with social signals.
···17431844Nomic Embed Text v1.5 via Railway template, async embedding pipeline.
19454646+**Blocked on:** API: Search Stabilization
4747+2048- [ ] Deploy nomic-embed Railway template (`POST /api/embeddings` with Bearer auth)
2149- [ ] Embedding client in Go API (`internal/embedding/`) calling the Nomic service
2250- [ ] Embed-worker: consume `embedding_jobs` queue, generate 768-dim vectors, store in `document_embeddings`
···2654## API: Hybrid Search
27552856Combine keyword and semantic results.
5757+5858+**Blocked on:** API: Search Stabilization, API: Semantic Search Pipeline
29593060- [ ] Score normalization (keyword BM25 → [0,1], semantic cosine → [0,1])
3161- [ ] Weighted merge (0.65 keyword + 0.35 semantic, configurable)
···34643565## API: Search Quality
36666767+**Blocked on:** API: Search Stabilization
6868+3769- [ ] Field weight tuning based on real queries
3870- [ ] Recency boost for recently updated content
3971- [ ] Star count ranking signal (via Constellation)
···4274- [ ] Relevance test fixtures
43754476## API: Observability
7777+7878+**Depends on:** API: Search Stabilization
45794680- [ ] Structured metrics: ingestion rate, search latency, embedding throughput
4781- [ ] Dashboard or log-based monitoring
+36-8
docs/specs/data-sources.md
···11---
22title: Data Sources & Integration
33-updated: 2026-03-24
33+updated: 2026-03-25
44---
5566-Twisted pulls data from four external sources and authenticates users via Bluesky OAuth. Each source has a distinct role — no single source is authoritative for everything.
66+Twisted pulls data from five external sources and authenticates users via Bluesky OAuth. Each source has a distinct role — no single source is authoritative for everything.
7788## Source Overview
991010-| Source | What it provides | Access pattern |
1111-| ------------------------ | ------------------------------------------------------------------------------ | ---------------------------------------------------------- |
1212-| **Tangled XRPC (Knots)** | Git data — file trees, blobs, commits, branches, diffs, tags | Direct XRPC calls to the knot hosting each repo |
1313-| **AT Protocol (PDS)** | User records — profiles, repos, issues, PRs, comments, stars, follows | `com.atproto.repo.getRecord` / `listRecords` on user's PDS |
1414-| **Constellation** | Social signals — star counts, follower counts, reaction counts, backlink lists | Public JSON API at `constellation.microcosm.blue` |
1515-| **Tap** | Real-time firehose of AT Protocol record events for indexing | WebSocket consumer, feeds our search index |
1010+| Source | What it provides | Access pattern |
1111+| --- | --- | --- |
1212+| **Tangled XRPC (Knots)** | Git data — file trees, blobs, commits, branches, diffs, tags | Direct XRPC calls to the knot hosting each repo |
1313+| **AT Protocol (PDS)** | User records — profiles, repos, issues, PRs, comments, stars, follows | `com.atproto.repo.getRecord` / `listRecords` on user's PDS |
1414+| **Constellation** | Social signals — star counts, follower counts, reaction counts, backlink lists | Public JSON API at `constellation.microcosm.blue` |
1515+| **Tap** | Real-time firehose of AT Protocol record events for authoritative indexing | WebSocket consumer, feeds the search index |
1616+| **JetStream** | Recent JSON activity stream for cached feed data | WebSocket consumer, feeds a bounded recent-activity cache |
16171718## Constellation
1819···114115115116Stars, followers, reactions — Constellation handles counts and lists. We still process these events for graph discovery but don't need to maintain our own counters.
116117118118+### Role In The Search Plan
119119+120120+Tap remains the authoritative ingestion and backfill path for searchable documents. If search correctness depends on complete historical coverage, Tap or a repo resync path is the right source.
121121+117122### Tap Protocol
118123119124- WebSocket connection with cursor-based resume
120125- Events contain: operation (create/update/delete), DID, collection, rkey, CID, record payload
121126- Acks required after processing each event
122127- Backfill via `/repos/add` endpoint to request historical data for specific users
128128+129129+## JetStream
130130+131131+JetStream is a lighter JSON stream derived from the firehose. It is useful for recent activity and developer ergonomics, but it is not the authoritative source for search indexing.
132132+133133+### Usage In Twisted
134134+135135+- Recent activity cache for the Activity tab
136136+- Collection-filtered stream for `sh.tangled.*` events
137137+- Cursor-based resume using event timestamps
138138+139139+### Constraints
140140+141141+- Use JetStream for recent, cached activity only
142142+- Do not rely on it as the only historical backfill mechanism
143143+- Keep retention bounded and reconnect idempotent
144144+145145+### Role In The Search Plan
146146+147147+- Seed the cursor to roughly 24 hours ago on first boot
148148+- Persist the last processed timestamp and rewind slightly on reconnect
149149+- Cache normalized activity locally so clients do not each need a raw upstream stream
150150+- Keep Tap as the source of truth for search indexing and bulk backfill
123151124152## Bluesky OAuth
125153
+162-28
docs/specs/search.md
···11---
22title: Search
33-updated: 2026-03-24
33+updated: 2026-03-25
44---
5566-Search lets users find repos, issues, PRs, profiles, and code snippets across the Tangled network. The API supports three modes with progressive capability.
66+> Warning: this document is pretty long. Look at the roadmap and ADR summaries for a
77+> high-level overview, or jump to the relevant sections.
88+99+Search now has two phases:
1010+1111+1. Stabilize indexing and activity caching so search is cheap and reliable.
1212+2. Resume semantic and hybrid work only after the base pipeline is stable.
1313+1414+## Immediate Priority
1515+1616+The current highest-priority search work is operational, not ranking:
1717+1818+- Stabilize experimentation around a local `file:` database workflow.
1919+- Add cURL smoke tests for search, document fetches, indexing, and activity reads.
2020+- Enqueue background indexing when the API fetches records that are not yet searchable.
2121+- Cache recent JetStream activity server-side with a persisted 24-hour cursor.
2222+2323+Production storage is Turso cloud. The reasoning is recorded in `docs/adr/storage.md`, with the comparison inputs in `docs/adr/pg.md` and `docs/adr/turso.md`.
2424+2525+These tasks block further work on semantic and hybrid search.
2626+2727+## Planning Decisions
2828+2929+### Why This Comes First
3030+3131+Search quality is currently constrained more by ingestion cost and freshness gaps than by ranking quality.
3232+The next iteration should make Twister cheaper to operate, resilient across restarts, and able to backfill misses on demand before any new semantic or hybrid work.
3333+3434+### Resolved Questions
3535+3636+#### Local-Only Storage
3737+3838+Twister can already run against a local `file:` database. That is useful for stabilizing development and experimentation while the indexing model is still changing. It should not automatically be treated as the final production architecture.
4040+4141+The production storage comparison covered:
4141+4242+- PostgreSQL with native full-text search and conventional operational tooling
4343+- Turso remote/libSQL
4444+- Turso with embedded replicas or similar local-read, remote-sync patterns
4545+4646+That comparison has been completed, and the current production choice is Turso.
4747+4848+#### Tangled First-Commit Timestamp
4949+5050+The first Tangled commit timestamp is useful as a lower-bound hint for one-time experiments, but it should not become the default replay cursor.
5151+JetStream must default to recent history (replay is only possible for less than 72 hours back from now) so bootstrap cost stays bounded.
75288-## Modes
5353+#### Tap Versus JetStream
5454+5555+Tap remains the authoritative indexing and bulk backfill path. JetStream should power only a bounded recent-activity cache.
5656+Read-through API indexing closes gaps when a user fetches a record before Tap has delivered it.
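
A read-through handler can enqueue the same record many times before a worker runs, which is why the roadmap lists dedupe for indexing jobs. The sketch below shows one way to dedupe by AT URI. It is an in-memory stand-in only: the planned queue is durable, and the `IndexQueue` type and its methods are hypothetical names, not code from the repo.

```go
package main

import (
	"fmt"
	"sync"
)

// IndexQueue is an in-memory stand-in for the durable indexing queue.
// It keeps at most one pending job per AT URI.
type IndexQueue struct {
	mu      sync.Mutex
	pending map[string]struct{}
	order   []string
}

// NewIndexQueue returns an empty queue.
func NewIndexQueue() *IndexQueue {
	return &IndexQueue{pending: make(map[string]struct{})}
}

// Enqueue adds uri unless a job for it is already pending.
// It reports whether a new job was added.
func (q *IndexQueue) Enqueue(uri string) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if _, ok := q.pending[uri]; ok {
		return false
	}
	q.pending[uri] = struct{}{}
	q.order = append(q.order, uri)
	return true
}

// Dequeue removes and returns the oldest pending URI, if any.
func (q *IndexQueue) Dequeue() (string, bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if len(q.order) == 0 {
		return "", false
	}
	uri := q.order[0]
	q.order = q.order[1:]
	delete(q.pending, uri)
	return uri, true
}

func main() {
	q := NewIndexQueue()
	uri := "at://did:plc:example/sh.tangled.repo/abc" // placeholder URI
	fmt.Println(q.Enqueue(uri), q.Enqueue(uri))
}
```

In a durable version the same effect could come from a unique constraint on the job key, so that a repeated fetch becomes a no-op insert instead of a second job.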
5757+5858+## Goals
5959+6060+- Reduce search-related reads and writes enough that remote Turso cost is no longer the dominant constraint.
6161+- Keep indexed content fresh enough for browsing and search without requiring a full-network rebuild after routine restarts.
6262+- Serve recent activity cheaply from a local cache.
6363+- Add a smoke-test layer that verifies search and indexing behavior end to end.
6464+6565+## Current Search Mode

 ### Keyword Search (Implemented)

-Full-text search powered by SQLite FTS5 with BM25 scoring. Queries are tokenized, matched against title, body, summary, repo name, author handle, and tags. Results are ranked by relevance with field-specific weights (title highest, then author handle, summary, body).
+Full-text search is powered by SQLite FTS5 with BM25 scoring. Queries match title, body, summary, repo name, author handle, and tags. Results are ranked with field-specific weights and snippets highlight matches with `<mark>` tags.

-Snippets are generated from the body field with match terms wrapped in `<mark>` tags.
+## Stabilization Plan

-### Semantic Search (Planned)
+### Storage

-Vector similarity search using **Nomic Embed Text v1.5**, deployed on Railway via the [nomic-embed template](https://railway.com/deploy/nomic-embed). The template runs Ollama behind an authenticated Caddy proxy.
+Twister should use a local `file:` database to stabilize experimentation and reduce the messiness of iteration while the indexing pipeline is being hardened. Production storage should remain explicitly undecided until the project compares PostgreSQL and Turso-based options against the final workload.

-**Embedding service:**
+Requirements:

-- Model: `nomic-embed-text:latest` (8192-token context, 768-dimensional vectors, Matryoshka support for variable dimensionality)
-- Endpoint: `POST /api/embeddings` with Bearer token auth
-- Request: `{ "model": "nomic-embed-text:latest", "prompt": "text to embed" }`
-- Deployed as a separate Railway service alongside the API and indexer
+- keep local-file mode as the simplest path for development and experimentation
+- document what assumptions the local path makes about single-host or shared-disk execution
+- document backup, restore, and disk-growth procedures
+- produce a production storage decision record comparing PostgreSQL and Turso options, starting from `docs/adr/pg.md` and `docs/adr/turso.md`

-**Pipeline:**
+Evaluation criteria for the production decision:

-- The embed-worker consumes the `embedding_jobs` queue, calls the Nomic Embed service, and stores 768-dim vectors in the `document_embeddings` table
-- Documents are embedded asynchronously after indexing — the embed-worker runs independently of the ingestion loop
-- Search queries are embedded at request time (single prompt, low latency)
-- Vectors are matched via DiskANN cosine similarity index in Turso
+- write-heavy ingestion behavior
+- FTS quality and indexing ergonomics
+- operational complexity and backup story
+- latency for reads and writes
+- failure recovery and restore workflow
+- support for future semantic search requirements

-### Hybrid Search (Planned)
+Acceptance:

-Weighted combination of keyword and semantic results. Default blend: 0.65 keyword + 0.35 semantic (configurable). Scores are normalized to [0, 1] before blending. Results are deduplicated by document ID with the higher score retained. Each result includes a `matched_by` field indicating which mode(s) contributed.
+- local development no longer depends on remote Turso for routine experimentation
+- the production backend choice is documented with explicit tradeoffs
+- the chosen production backend has a migration path from the experimental local setup
+
+The concrete local DB operating procedure lives in `packages/api/README.md`.
+The production migration path is documented in `docs/adr/storage.md`.
+
+### Read-Through Indexing
+
+When the API fetches a repo, issue, PR, profile, or similar record directly from upstream, it should enqueue background indexing work if that record is not already searchable. Tap remains the primary ingest path; read-through indexing only closes gaps.
+
+Requirements:
+
+- add a durable job table for on-demand indexing
+- deduplicate jobs by stable document identity
+- reuse the existing normalization and upsert path
+- trigger jobs from the handlers that already fetch upstream records
+
+Acceptance:
+
+- a fetched-but-missing record becomes searchable shortly after the first successful API read
+- repeated page views do not create unbounded duplicate work
+- failures are visible through logs and smoke tests
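
One way to meet the deduplication requirement above is to key the job table on the document's stable identity and make the enqueue statement idempotent. The sketch below is an illustration under assumptions, not Twister's schema: the `index_jobs` table, its columns, and both helper functions are hypothetical names. It only prints SQL, so the output can be piped to the `sqlite3` CLI against the local `file:` database.

```sh
#!/bin/sh
# Hypothetical schema: table and column names are illustrative, not Twister's.
index_jobs_schema() {
  cat <<'SQL'
CREATE TABLE IF NOT EXISTS index_jobs (
  document_id TEXT PRIMARY KEY,              -- stable identity, e.g. an AT URI
  status      TEXT NOT NULL DEFAULT 'pending',
  enqueued_at TEXT NOT NULL DEFAULT (datetime('now'))
);
SQL
}

# Print an idempotent enqueue statement for one document identity.
# A second enqueue for the same identity hits the primary key and is a no-op.
enqueue_sql() {
  # Double any single quotes so the value is a safe SQL string literal.
  escaped=$(printf '%s' "$1" | sed "s/'/''/g")
  printf "INSERT INTO index_jobs (document_id) VALUES ('%s') ON CONFLICT(document_id) DO NOTHING;\n" "$escaped"
}

# Example (not run here):
#   { index_jobs_schema; enqueue_sql "at://did:example:alice/example.collection/abc"; } | sqlite3 twister-dev.db
```

Because the identity is the primary key, repeated page views for the same record collapse into one pending row, which is the "no unbounded duplicate work" acceptance criterion in miniature.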
+
+### Activity Cache
+
+JetStream should back a recent-activity cache, not the main search index. The server should persist a timestamp cursor, seed it to `now - ~24h` on first boot, rewind slightly on reconnect, and expire old events aggressively.
+
+Requirements:
+
+- add a dedicated activity cache table
+- persist a separate JetStream consumer cursor
+- seed missing cursors to recent history, not full history
+- keep retention bounded by age and row count
+
+Acceptance:
+
+- common activity reads can be served from the cache
+- restarts resume from the stored timestamp cursor
+- reconnects are idempotent and tolerate a short rewind window
137137+138138+Twister needs cURL-based smoke tests covering:
139139+140140+- `GET /healthz`
141141+- `GET /readyz`
142142+- `GET /search`
143143+- `GET /documents/{id}`
144144+- one fetch path that should enqueue indexing
145145+- one activity endpoint backed by the cache
146146+147147+Acceptance:
148148+149149+- one local command can verify the critical API surface
150150+- the same scripts can run against staging or production by changing the base URL
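
A smoke script along these lines could be as small as the sketch below. It is an assumption-heavy outline, not the project's script: the default `BASE_URL` follows the README's `:8080` default, the `q` query parameter for `/search` is a guess, and `/documents/{id}` plus the indexing and activity checks are left out because they need known fixture IDs and endpoints.

```sh
#!/bin/sh
# Assumptions: BASE_URL default and the `q` parameter name are illustrative.
BASE_URL=${BASE_URL:-http://localhost:8080}

# check PATH: succeed only if the endpoint answers with HTTP 200.
check() {
  code=$(curl -s -o /dev/null -w '%{http_code}' "$BASE_URL$1")
  if [ "$code" = "200" ]; then
    echo "ok   $1"
  else
    echo "FAIL $1 (HTTP $code)"
    return 1
  fi
}

# smoke: run every check, report all failures, return non-zero if any failed.
smoke() {
  failed=0
  for path in /healthz /readyz "/search?q=test"; do
    check "$path" || failed=1
  done
  return "$failed"
}

# Example: BASE_URL=https://staging.example.invalid sh smoke.sh   (with `smoke` called at the end)
```

Taking the base URL from the environment is what lets the same script cover local, staging, and production runs.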
+
+## Operational Model
+
+1. Tap ingests the authoritative search corpus.
+2. Direct API reads enqueue background indexing for misses.
+3. JetStream fills only the recent-activity cache.
+4. Smoke tests guard the critical paths.
+5. Semantic and hybrid search remain blocked until the base pipeline is stable.
+
+## Backfill Strategy
+
+- Search index backfill should continue to use Tap admin backfill, firehose-driven repo sync, or repo-export-based resync.
+- Activity cache bootstrap should use a recent JetStream timestamp cursor, defaulting to `now - 24h`.
+- A manual cursor override can exist for one-time replay experiments, but it should not be the default startup path.

 ## API Contract

···

 2. **Constellation supplements search results.** Star counts and follower counts from Constellation can be used as ranking signals without needing to index interaction records ourselves.

-3. **Semantic search is additive.** It improves discovery for vague queries but isn't required for the app to be useful. It ships when the embedding pipeline is stable.
+3. **Read-through indexing closes freshness gaps.** If a user can fetch a record, the system should be able to make it searchable shortly after.

-4. **Graceful degradation.** The mobile app treats the search API as optional. If Twister is unavailable, handle-based direct browsing still works. Search results link into the same browsing screens.
+4. **JetStream is for recent activity, not authoritative indexing.** Use it to power the cached feed, not to replace Tap or repo re-sync.
+
+5. **Semantic search is additive.** It improves discovery for vague queries but is not required for the app to be useful.
+
+6. **Graceful degradation.** The mobile app treats the search API as optional. If Twister is unavailable, handle-based direct browsing still works. Search results link into the same browsing screens.

 ## Quality Improvements (Planned)

 - Field weight tuning based on real query patterns
 - Recency boost for recently updated content
-- Collection-aware ranking (repos weighted higher for short queries)
-- Star count as a ranking signal (via Constellation)
-- State filtering (exclude closed issues by default)
-- Better snippet generation with longer context windows
-- Relevance test fixtures for regression testing
+- Collection-aware ranking
+- Star count as a ranking signal
+- State filtering defaults
+- Better snippet generation
+- Relevance test fixtures

 ## Mobile Integration

-The app calls the search API from the Explore tab. Results are displayed in segmented views (repos, users, issues/PRs). Each result links to the corresponding browsing screen (repo detail, profile, issue detail).
+The app calls the search API from the Explore tab. Results are displayed in segmented views (repos, users, issues/PRs).
+Each result links to the corresponding browsing screen (repo detail, profile, issue detail).

-When the search API is unavailable, the Explore tab shows an appropriate state rather than breaking. The Home tab's handle-based browsing is fully independent of search.
+When the search API is unavailable, the Explore tab shows an appropriate state rather than breaking.
+The Home tab's handle-based browsing is fully independent of search.
+7-3
docs/todo.md
···
 ---
 title: Parking Lot
-updated: 2026-03-24
+updated: 2026-03-25
 ---

-- Constellation requests would be a good opportunity to dispatch a job to index what
- was requested.
+Search stabilization is active roadmap work now, not parking-lot work.
+
+Still parked:
+
+- Semantic and hybrid search stay deferred until local-only storage, smoke tests, read-through indexing, and the JetStream cache are stable.
+- Revisit `com.atproto.sync.listReposByCollection` as a complementary backfill discovery source after the Tap-driven indexing path is reliable.
+79
packages/api/README.md
···

 The server listens on `:8080` by default. Logs are printed as text when `--local` is set.

+## Experimental Local DB Operations
+
+The experimental local database lives at `packages/api/twister-dev.db` when you run Twister from `packages/api` with `--local`.
+
+This database is for local experimentation only. Treat it as disposable unless you explicitly back it up.
+
+### Backup
+
+Recommended procedure:
+
+1. Stop the Twister process using the local DB.
+2. Copy the database file and any SQLite sidecar files if they exist.
+
+Example:
+
+```sh
+cd packages/api
+mkdir -p backups
+timestamp="$(date +%Y%m%d-%H%M%S)"
+cp twister-dev.db "backups/twister-dev-${timestamp}.db"
+test -f twister-dev.db-wal && cp twister-dev.db-wal "backups/twister-dev-${timestamp}.db-wal"
+test -f twister-dev.db-shm && cp twister-dev.db-shm "backups/twister-dev-${timestamp}.db-shm"
+```
+
+For this experimental DB, stop-and-copy is preferred over the added complexity of a hot backup.
+
+### Restore
+
+Recommended procedure:
+
+1. Stop the Twister process.
+2. Move the current local DB aside if you want to keep it.
+3. Copy the backup file back to `twister-dev.db`.
+4. Restore matching `-wal` and `-shm` files only if they were captured with the same backup set.
+
+Example:
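+
+```sh
+cd packages/api
+# Move the current DB and any stale sidecar files aside so an old WAL is never paired with the restored DB.
+for f in twister-dev.db twister-dev.db-wal twister-dev.db-shm; do
+  test -f "$f" && mv "$f" "$f.broken.$(date +%Y%m%d-%H%M%S)"
+done
+cp backups/twister-dev-YYYYMMDD-HHMMSS.db twister-dev.db
+```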
+
+After restore, restart Twister and let the app run migrations normally.
+
+### Disk Growth
+
+The local DB will grow during experimentation because of:
+
+- indexed documents
+- FTS tables
+- activity cache rows
+- repeated backfill or reindex runs
+
+Recommended operating procedure:
+
+1. Check file growth periodically.
+2. Delete and rebuild the experimental DB freely when the dataset is no longer useful.
+3. Run `VACUUM` only when you intentionally want to compact a long-lived local DB.
+4. Keep old backups out of the repo and rotate them manually.
+
+Example inspection commands:
+
+```sh
+cd packages/api
+du -h twister-dev.db*
+ls -lh twister-dev.db*
+```
+
+For experimental use, the simplest policy is usually:
+
+- back up anything worth keeping
+- remove the DB when the experiment is over
+- let Twister rebuild from migrations and backfill paths
+
+### Failure Recovery Rule
+
+If the experimental DB becomes suspicious or inconsistent, prefer restore-or-rebuild over manual repair. This is a developer convenience database, not the source of truth.
+
 ## Environment variables

 Copy `.env.example` to `.env` in the repo root (or `packages/api/`). The server loads `.env`, `../.env`, and `../../.env` automatically.