···

 ## 2. Stack

-| Layer | Technology |
-| ---------------- | --------------------------------------------------------------- |
-| Backend | Go |
-| Database | SQLite (3 files: users, articles, recs via `mattn/go-sqlite3` + `sqlite-vec` for vector search) |
-| Frontend | htmx + TailwindCSS |
-| Auth | AT Protocol OAuth / DID resolution (configurable PLC directory) |
-| AT Protocol role | AppView for `at.glean.*` lexicons |
-| Data source | AT Protocol Jetstream → SQLite index |
+| Layer | Technology |
+| ---------------- | ----------------------------------------------------------------------------------------------- |
+| Backend | Go |
+| Database | SQLite (3 files: users, articles, recs via `mattn/go-sqlite3` + `sqlite-vec` for vector search) |
+| Frontend | htmx + TailwindCSS |
+| Auth | AT Protocol OAuth / DID resolution (configurable PLC directory) |
+| AT Protocol role | AppView for `at.glean.*` lexicons |
+| Data source | AT Protocol Jetstream → SQLite index |

 ## 3. AT Protocol Lexicons

···

 ### 7.1 Signals

-| Signal | Source | Weight (default) | Description |
-| ------------ | ------------------------ | ---------------- | -------------------------------------------------- |
-| Subscription | `subscriptions` | 1.0 | Jaccard over subscriber sets between similar users |
-| Like | `likes` | 0.5 | Time-decayed like co-occurrence (30-day half-life) |
-| Tag | `annotations.tags` | 0.3 | Jaccard over annotation tag sets |
-| Social | `follow_distances` | 0.7 | Follow distance: 1-hop=1.0, 2-hop=0.3, 3-hop=0.1 |
-| Popularity | `feeds.subscriber_count` | 0.2 | `log(1 + subscribers) / log(1 + max)` |
-| Category | `subscriptions.category` | 0.4 | Boost feeds matching user's existing categories |
+| Signal | Source | Weight (default) | Description |
+| ------------ | ------------------------ | ---------------- | ------------------------------------------------------- |
+| Subscription | `subscriptions` | 1.0 | Jaccard over subscriber sets between similar users |
+| Like | `likes` | 0.5 | Time-decayed like co-occurrence (30-day half-life) |
+| Tag | `annotations.tags` | 0.3 | Jaccard over annotation tag sets |
+| Social | `follow_distances` | 0.7 | Follow distance: 1-hop=1.0, 2-hop=0.3, 3-hop=0.1 |
+| Popularity | `feeds.subscriber_count` | 0.2 | `log(1 + subscribers) / log(1 + max)` |
+| Category | `subscriptions.category` | 0.4 | Boost feeds matching user's existing categories |
 | Content | `article_embeddings` | 0.4 | Cosine similarity via embedding KNN (requires embedder) |
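
Review note: the per-signal formulas in the table can be sketched in Go as follows. This is a minimal illustration, not the project's code. The function names are hypothetical, and the exponential form of the like decay is an assumption derived only from the stated 30-day half-life. How the weighted signals are combined into a final score is not shown in this diff, so it is left out.

```go
package main

import (
	"fmt"
	"math"
)

// jaccard returns |A ∩ B| / |A ∪ B| for two string sets.
// Two empty sets are treated as having similarity 0.
func jaccard(a, b map[string]struct{}) float64 {
	if len(a) == 0 && len(b) == 0 {
		return 0
	}
	inter := 0
	for k := range a {
		if _, ok := b[k]; ok {
			inter++
		}
	}
	union := len(a) + len(b) - inter
	return float64(inter) / float64(union)
}

// likeDecay returns the weight of a like that is ageDays old, halving
// every halfLifeDays (assumed exponential decay).
func likeDecay(ageDays, halfLifeDays float64) float64 {
	return math.Pow(0.5, ageDays/halfLifeDays)
}

// popularity computes log(1 + subscribers) / log(1 + max) from the table.
// It returns 0 when maxSubscribers is not positive to avoid dividing by zero.
func popularity(subscribers, maxSubscribers int) float64 {
	if maxSubscribers <= 0 {
		return 0
	}
	return math.Log1p(float64(subscribers)) / math.Log1p(float64(maxSubscribers))
}

// followWeight maps a follow distance in hops to the weights in the table.
func followWeight(hops int) float64 {
	switch hops {
	case 1:
		return 1.0
	case 2:
		return 0.3
	case 3:
		return 0.1
	}
	return 0
}

func main() {
	a := map[string]struct{}{"feed1": {}, "feed2": {}, "feed3": {}}
	b := map[string]struct{}{"feed2": {}, "feed3": {}, "feed4": {}}
	fmt.Println("jaccard:", jaccard(a, b))
	fmt.Println("like weight after 30 days:", likeDecay(30, 30))
	fmt.Println("popularity:", popularity(40, 1000))
	fmt.Println("2-hop follow weight:", followWeight(2))
}
```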

 ### 7.2 Feed Co-occurrence (Jaccard Similarity)
···

 A background goroutine runs on a configurable schedule (`GLEAN_CLUSTER_INTERVAL`, default 10m):

-1. **Compute feed embeddings**: Embed new feed descriptions via OpenAI-compatible API into `feed_embeddings` table (skipped if no embedder configured)
+1. **Compute feed embeddings**: Embed new feed descriptions via embedding API into `feed_embeddings` table (skipped if no embedder configured)
 2. **Compute feed similarity**: Batch-update `feed_similarity` table (Jaccard over subscriber sets + embedding cosine similarity)
 3. **Compute user similarity**: Batch-update `user_similarity` table (subscription Jaccard + time-decayed likes + tags + follow boost)
-4. **Compute article embeddings**: Embed new articles' full content (`title + summary + full_content + content`) via OpenAI-compatible API into `article_embeddings` vec0 table (skipped if no embedder configured)
+4. **Compute article embeddings**: Embed new articles (`title + summary + content`, excluding `full_content` to stay within embedding model token limits) via embedding API into `article_embeddings` vec0 table (skipped if no embedder configured)
 5. **Compute follow distances**: Incremental BFS for dirty users (1-hop through 3-hop from `follows` table)
 6. **Compute signal profiles**: Per-user category/tag/like summaries
 7. **Auto-dismiss stale**: Dismiss items shown >=5 times over >5 days without action
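
Review note: step 5 above (follow distances, 1-hop through 3-hop) can be illustrated with a depth-bounded breadth-first search. The sketch below is an assumption-laden simplification: it works on an in-memory adjacency map rather than the `follows` table, recomputes from scratch instead of incrementally for dirty users, and `followDistances` is a hypothetical name.

```go
package main

import "fmt"

// followDistances returns the hop distance (1..maxHops) from start to every
// user reachable through the follows graph within maxHops. The start user
// itself and users farther than maxHops are omitted from the result.
func followDistances(follows map[string][]string, start string, maxHops int) map[string]int {
	dist := map[string]int{}
	visited := map[string]bool{start: true}
	frontier := []string{start}

	for hop := 1; hop <= maxHops && len(frontier) > 0; hop++ {
		var next []string
		for _, u := range frontier {
			for _, v := range follows[u] {
				if visited[v] {
					continue // already reached at an equal or shorter distance
				}
				visited[v] = true
				dist[v] = hop
				next = append(next, v)
			}
		}
		frontier = next
	}
	return dist
}

func main() {
	follows := map[string][]string{
		"alice": {"bob"},
		"bob":   {"carol", "alice"},
		"carol": {"dave"},
		"dave":  {"erin"},
	}
	// erin is 4 hops from alice, so she is not included with maxHops = 3.
	fmt.Println(followDistances(follows, "alice", 3))
}
```

Because BFS visits users in order of increasing distance, the first time a user is reached is also the shortest distance, which is why a single `visited` check is enough.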
···

 When `GLEAN_EMBED_BASE_URL` is configured, article text and feed descriptions are embedded into vectors stored in `sqlite-vec` virtual tables (`recs.feed_embeddings`, `recs.article_embeddings`). The vec0 extension provides native KNN vector search via `WHERE embedding MATCH ? AND k = ?`, replacing Go-side cosine similarity for large-scale lookups. Without embeddings, recommendations rely only on subscription overlap, like patterns, and social graph — no content-based signals.

-The embedder uses the official `github.com/openai/openai-go` SDK with `option.WithBaseURL()`, so any OpenAI-compatible `/v1/embeddings` endpoint works (OpenAI, Ollama, local inference servers).
+The embedder uses the official `github.com/openai/openai-go` SDK with `option.WithBaseURL()`, so any OpenAI-compatible `/v1/embeddings` endpoint works (OpenAI, Gemini, Ollama, local inference servers).

 vec0 tables are created dynamically at startup with the configured dimension (`GLEAN_EMBED_DIMENSION`, default 1536):

···
 );
 ```

-During cron, `ComputeArticleEmbeddings` embeds new articles in batches (using `title + summary + full_content + content` for maximum semantic coverage) and inserts them into the vec0 table. `ComputeFeedEmbeddings` embeds feed descriptions (`title || description`) and re-embeds when the source text changes (detected via `feed_embedding_meta`). During on-demand article recommendations, the user's liked article embeddings are averaged into an interest vector, then a vec0 KNN query finds the top-200 most semantically similar articles. For cold-start users (<5 subscriptions), their subscribed feed embeddings are averaged and a KNN query finds similar feeds.
+During cron, `ComputeArticleEmbeddings` embeds new articles in batches using `title + summary + content` (the scraped `full_content` is excluded to stay within model token limits — most embedding models cap at ~8k tokens). Text is truncated to 8000 characters as a safety net. Batches are capped at 100 inputs per API call. `ComputeFeedEmbeddings` embeds feed descriptions (`title || description`) and re-embeds when the source text changes (detected via `feed_embedding_meta`). During on-demand article recommendations, the user's liked article embeddings are averaged into an interest vector, then a vec0 KNN query finds the top-200 most semantically similar articles. For cold-start users (<5 subscriptions), their subscribed feed embeddings are averaged and a KNN query finds similar feeds.
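
Review note: the interest-vector step in the paragraph above (average the liked article embeddings, then rank candidates by similarity) can be sketched in plain Go. This is only an illustration under assumptions: in the described design the ranking is done by the vec0 KNN query, not in Go, and `meanVector` and `cosine` are hypothetical helpers that expect all vectors to have the same length.

```go
package main

import (
	"fmt"
	"math"
)

// meanVector averages a set of equal-length embeddings component-wise.
// It returns nil when there is nothing to average.
func meanVector(vecs [][]float32) []float32 {
	if len(vecs) == 0 {
		return nil
	}
	out := make([]float32, len(vecs[0]))
	for _, v := range vecs {
		for i := range out {
			out[i] += v[i]
		}
	}
	n := float32(len(vecs))
	for i := range out {
		out[i] /= n
	}
	return out
}

// cosine returns the cosine similarity of two equal-length vectors,
// or 0 when either vector has zero magnitude.
func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

func main() {
	// Toy 3-dimensional embeddings standing in for real model output.
	liked := [][]float32{{1, 0, 0}, {0, 1, 0}}
	interest := meanVector(liked)

	candidates := map[string][]float32{
		"close": {1, 1, 0},
		"far":   {0, 0, 1},
	}
	for name, emb := range candidates {
		fmt.Printf("%s: %.3f\n", name, cosine(interest, emb))
	}
}
```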

 ## 8. HTTP API / htmx Endpoints

+23-6
internal/cluster/article.go
···
 	vec "github.com/asg017/sqlite-vec-go-bindings/cgo"
 )

+// embedBatchSize caps how many texts are sent in a single embedding API call.
+// Most /v1/embeddings endpoints accept up to ~2048 inputs per request; lower
+// values reduce payload size and memory pressure.
 const embedBatchSize = 100

-// ComputeArticleEmbeddings embeds new articles (title + summary) into the
-// article_embeddings vec0 table. Skipped when no embedder is configured.
-// Existing embeddings for deleted articles are cleaned up. Articles already
-// embedded are not re-embedded.
+// maxEmbedChars truncates text sent to the embedding model. Most models have
+// a token context window (~4 chars/token); title+summary+content stays well
+// under this, but the cap is kept as a safety net.
+const maxEmbedChars = 8000
+
+func truncateForEmbed(s string) string {
+	if len(s) <= maxEmbedChars {
+		return s
+	}
+	return s[:maxEmbedChars]
+}
+
+// ComputeArticleEmbeddings embeds new articles (title + summary + content) into
+// the article_embeddings vec0 table. full_content (scraped body) is excluded
+// because embedding models have token context limits and the feed-provided
+// content already captures the topical signal needed for recommendation KNN.
+
 func (e *Engine) ComputeArticleEmbeddings(ctx context.Context) error {
 	if e.embedder == nil {
 		e.logger.Debug("article embeddings skipped, no embedder")
···
 	}

 	rows, err := conn.QueryContext(ctx, `
-		SELECT a.id, COALESCE(a.title, '') || ' ' || COALESCE(a.summary, '') || ' ' || COALESCE(a.full_content, '') || ' ' || COALESCE(a.content, '')
+		SELECT a.id, COALESCE(a.title, '') || ' ' || COALESCE(a.summary, '') || ' ' || COALESCE(a.content, '')
 		FROM articles.articles a
-		WHERE (COALESCE(a.title, '') != '' OR COALESCE(a.summary, '') != '' OR COALESCE(a.full_content, '') != '' OR COALESCE(a.content, '') != '')
+		WHERE (COALESCE(a.title, '') != '' OR COALESCE(a.summary, '') != '' OR COALESCE(a.content, '') != '')
 		  AND a.id NOT IN (SELECT article_id FROM recs.article_embeddings)
 		ORDER BY a.id
 	`)
···
 			rows.Close()
 			return err
 		}
+		a.text = truncateForEmbed(a.text)
 		batch = append(batch, a)
 	}
 	rows.Close()
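
Review note: `truncateForEmbed` compares `len(s)` and slices `s[:maxEmbedChars]`, both of which count bytes in Go, so the cap is 8000 bytes rather than 8000 characters, and the cut can land inside a multi-byte UTF-8 sequence. Whether a given embedding endpoint tolerates a trailing partial rune is not something this diff establishes. A rune-safe variant, together with the 100-inputs-per-call batching, might look like the sketch below; `truncateUTF8` and `chunk` are hypothetical names, not functions from the repository.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateUTF8 cuts s to at most limit bytes without splitting a
// multi-byte rune: if the cut would land on a continuation byte, it
// backs up to the start of that rune and drops it entirely.
func truncateUTF8(s string, limit int) string {
	if limit <= 0 {
		return ""
	}
	if len(s) <= limit {
		return s
	}
	cut := limit
	for cut > 0 && !utf8.RuneStart(s[cut]) {
		cut--
	}
	return s[:cut]
}

// chunk splits items into consecutive slices of at most size elements.
// It returns nil when size is not positive or items is empty.
func chunk(items []string, size int) [][]string {
	var out [][]string
	for size > 0 && len(items) > 0 {
		n := size
		if len(items) < n {
			n = len(items)
		}
		out = append(out, items[:n])
		items = items[n:]
	}
	return out
}

func main() {
	// "é" occupies two bytes, so a 2-byte cap must drop it whole.
	fmt.Printf("%q\n", truncateUTF8("héllo", 2))

	texts := make([]string, 250)
	for i, batch := range chunk(texts, 100) {
		fmt.Printf("batch %d: %d inputs\n", i, len(batch))
	}
}
```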